Client FAQ
“fail to get tablet …”
First, check whether any tablet server in the cluster is unexpectedly offline, or whether any online table is not readable or writable. It is recommended to use openmldb_tool for diagnosis, running its status (status --diff) and inspect online commands for these checks.
If the diagnostic tool detects abnormal conditions in offline or online tables, it will output warnings and suggest the next steps.
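For reference, a rough invocation of these two checks might look like the lines below; the exact way to point the tool at your cluster (and any extra flags) depends on your deployment, so consult the diagnostic tool's documentation:
openmldb_tool status --diff
openmldb_tool inspect online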
If manual inspection is required, follow these two steps:
1. Execute show components to check whether the servers are in the list. If the TaskManager is offline, it will not appear in the list. If a Tablet is offline, it will appear in the list but with a status of offline. If any server is offline, restart it and add it back to the cluster.
2. Execute show table status like '%' (if your version does not support like, query the system db and the user db separately) and check whether the “Warnings” for each table report any errors.
Common errors include messages such as real replica number X does not match the configured replicanum X. For detailed error information, please refer to SHOW TABLE STATUS. These errors indicate that the table is currently experiencing issues and cannot provide normal read and write functions, which is usually caused by Tablet issues.
Why Do I Receive Warnings of “Reached timeout …”?
rpc_client.h:xxx] request error. [E1008] Reached timeout=xxxms
This is because the timeout setting of the RPC request sent by the client is too short, and the client actively disconnects. Note that this is the RPC timeout, so you need to change the general request_timeout configuration.
- CLI: Configure --request_timeout_ms at startup.
- JAVA/Python SDK: Adjust SdkOption.requestTimeout in the Option or URL.
Note
This error usually does not occur with synchronous offline commands, because the timeout of a synchronous offline command is already set to the maximum time acceptable to the TaskManager.
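As an illustration for the Java SDK, here is a minimal sketch of raising the request timeout via SdkOption; the ZooKeeper address and path are placeholders, and the exact setter names may differ slightly between SDK versions:

```java
import com._4paradigm.openmldb.sdk.SdkOption;
import com._4paradigm.openmldb.sdk.impl.SqlClusterExecutor;

public class RequestTimeoutExample {
    public static void main(String[] args) throws Exception {
        SdkOption option = new SdkOption();
        option.setZkCluster("127.0.0.1:2181"); // placeholder ZooKeeper endpoint
        option.setZkPath("/openmldb");         // placeholder cluster root path
        option.setRequestTimeout(120000);      // raise the RPC request timeout to 120s
        SqlClusterExecutor executor = new SqlClusterExecutor(option);
        // run statements through the executor as usual; requests now wait up to 120s
    }
}
```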
Why Do I Receive Warnings of “Got EOF of Socket …”?
rpc_client.h:xxx] request error. [E1014]Got EOF of Socket{id=x fd=x addr=xxx} (xx)
This is because the addr end actively disconnected, and the addr address is most likely the TaskManager. This does not mean that the TaskManager is abnormal; rather, the TaskManager considers this connection to have been inactive for longer than keepAliveTime and actively closes the communication channel.
In versions 0.5.0 and later, you can increase the TaskManager's server.channel_keep_alive_time to improve tolerance for inactive channels. The default value is 1800 seconds (0.5 hours). Especially when using synchronous offline commands, this value may need to be adjusted appropriately.
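For example, assuming that doubling the default tolerance is sufficient, you could add the following line (value in seconds, purely illustrative) to conf/taskmanager.properties and restart the TaskManager:
server.channel_keep_alive_time=3600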
In versions before 0.5.0, this configuration cannot be changed. Please upgrade the TaskManager version.
Why is the Offline Query Result Displaying Chinese Characters as Garbled?
When using offline queries, results that contain Chinese characters may appear garbled. This is mainly related to the system's default encoding format and the encoding parameters of the Spark job.
If you encounter garbled characters, you can resolve this by adding the Spark advanced parameters spark.driver.extraJavaOptions=-Dfile.encoding=utf-8 and spark.executor.extraJavaOptions=-Dfile.encoding=utf-8.
For client configuration methods, you can refer to the Spark Client Configuration, or you can add this configuration to the TaskManager configuration file.
spark.default.conf=spark.driver.extraJavaOptions=-Dfile.encoding=utf-8;spark.executor.extraJavaOptions=-Dfile.encoding=utf-8
How to Configure TaskManager to Access a YARN Cluster with Kerberos Enabled?
If the YARN cluster has Kerberos authentication enabled, TaskManager can access it by adding the following configuration. Note that you should modify the keytab path and the principal account according to your actual environment.
spark.default.conf=spark.yarn.keytab=/tmp/test.keytab;spark.yarn.principal=test@EXAMPLE.COM
How to Configure Client’s Core Logs?
Client core logs mainly consist of two types: ZooKeeper logs and SDK logs (glog logs), and they are independent of each other.
ZooKeeper Logs:
- CLI: Configure --zk_log_level at startup to adjust the log level, and use --zk_log_file to specify the log file.
- JAVA/Python SDK: Use zkLogLevel to adjust the level and zkLogFile to specify the log file in the Option or URL.
zk_log_level (int, default=0, i.e., DISABLE_LOGGING): Prints logs at this level and below. 0 - disable all zk logs, 1 - error, 2 - warn, 3 - info, 4 - debug.
SDK Logs (glog Logs):
- CLI: Configure --glog_level at startup to adjust the level, and use --glog_dir to specify the log directory.
- JAVA/Python SDK: Use glogLevel to adjust the level and glogDir to specify the log directory in the Option or URL.
glog_level (int, default=1, i.e., WARNING): Prints logs at this level and above. INFO, WARNING, ERROR, and FATAL correspond to 0, 1, 2, and 3, respectively.
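For the Java SDK, a minimal sketch of configuring both log types through SdkOption might look like the block below; it assumes the setter names mirror the option names listed above (setZkLogLevel, setZkLogFile, setGlogLevel, setGlogDir), which may differ between SDK versions:

```java
import com._4paradigm.openmldb.sdk.SdkOption;

public class ClientLogExample {
    public static void main(String[] args) {
        SdkOption option = new SdkOption();
        option.setZkCluster("127.0.0.1:2181");       // placeholder ZooKeeper endpoint
        option.setZkPath("/openmldb");               // placeholder cluster root path
        // ZooKeeper logs: prints this level and below (0 disables them entirely)
        option.setZkLogLevel(3);                     // keep error/warn/info ZooKeeper logs
        option.setZkLogFile("/tmp/openmldb_zk.log"); // write them to a file instead of stderr
        // SDK (glog) logs: prints this level and above (default 1 = WARNING)
        option.setGlogLevel(0);                      // also keep INFO-level SDK logs
        option.setGlogDir("/tmp/openmldb_glog");     // directory for the glog files
        // pass the option to SqlClusterExecutor / the JDBC driver as usual
    }
}
```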
Insert Error with Log “please use getInsertRow with ... first”
When using InsertPreparedStatement for insertion in the JAVA client, or inserting with SQL and parameters in Python, the client relies on an internal cache. The first step, getInsertRow, generates and caches the SQL and returns it together with the parameter information to be filled in. The second step actually executes the insert and requires the SQL cache produced in the first step. Therefore, when multiple threads share the same client, frequent insertions and queries may update the cache table so often that the SQL cache you want to execute is evicted, making it appear as if the first step getInsertRow was never executed.
Currently, you can avoid this issue by increasing the maxSqlCacheSize configuration option. This option is only supported in the JAVA/Python SDKs.
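A minimal Java sketch of raising the cache size is shown below; it assumes SdkOption exposes a setter matching the maxSqlCacheSize option name, so verify the exact API for your SDK version:

```java
import com._4paradigm.openmldb.sdk.SdkOption;
import com._4paradigm.openmldb.sdk.impl.SqlClusterExecutor;

public class SqlCacheExample {
    public static void main(String[] args) throws Exception {
        SdkOption option = new SdkOption();
        option.setZkCluster("127.0.0.1:2181"); // placeholder ZooKeeper endpoint
        option.setZkPath("/openmldb");         // placeholder cluster root path
        option.setMaxSqlCacheSize(200);        // enlarge the per-client SQL cache (illustrative value)
        SqlClusterExecutor executor = new SqlClusterExecutor(option);
        // when many threads share this executor, a larger cache lowers the chance
        // that a cached insert SQL is evicted between getInsertRow and execution
    }
}
```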
Offline Command Error
java.lang.OutOfMemoryError: Java heap space
Container killed by YARN for exceeding memory limits. 5 GB of 5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
When encountering the aforementioned log messages, it indicates that the offline task requires more resources than the current configuration provides. This typically occurs in the following situations:
- The Spark configuration for the offline command is set to local[*], the machine has a large number of cores, and the concurrency is too high, resulting in excessive resource consumption.
- The memory configuration is too small.
If using local mode and the resources on a single machine are limited, consider reducing the concurrency. If you choose not to reduce concurrency, adjust the spark.driver.memory and spark.executor.memory Spark configuration options. You can write these configurations in the conf/taskmanager.properties file in the TaskManager's running directory and restart the TaskManager, or use the CLI client for configuration. For more information, refer to the Spark Client Configuration.
spark.default.conf=spark.driver.memory=16g;spark.executor.memory=16g
When the master is local, adjust the memory of the driver, not the executor. If you are unsure, you can adjust both.