Client FAQ#

“fail to get tablet …”#

First check whether any tablet server in the cluster is unexpectedly offline, or whether any online table is unreadable or unwritable. It is recommended to use openmldb_tool for diagnosis, using its status (status --diff) and inspect online commands. If the diagnostic tool detects an offline server or abnormal online tables, it will output warnings and suggest next steps. If manual inspection is required, follow these two steps:

  • Execute show components to check whether the servers are in the list. If a TaskManager is offline, it will not appear in the list. If a Tablet is offline, it will appear in the list with a status of offline. If any server is offline, restart it and add it back to the cluster.

  • Execute show table status like '%' (if your version is older and does not support like, query the system databases and the user databases separately), and check whether the “Warnings” field of each table reports any errors.

Common errors include messages like real replica number X does not match the configured replicanum X; for detailed error information, refer to SHOW TABLE STATUS. Such errors indicate that the table currently has problems and cannot provide normal read and write service, usually because of Tablet issues.
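For reference, the two manual checks above can be run directly in the OpenMLDB CLI:

show components;
show table status like '%';

If a component is missing or shows an offline status, or a table reports warnings, handle it as described in the corresponding step above.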

Why Do I Receive Warnings of “Reached timeout …”?#

rpc_client.h:xxx] request error. [E1008] Reached timeout=xxxms

This happens because the timeout set for the RPC request sent by the client is too short, so the client actively disconnects. Note that this is the RPC timeout; to change it, adjust the general request_timeout configuration.

  1. CLI: Configure --request_timeout_ms at startup.

  2. JAVA/Python SDK: Adjust SdkOption.requestTimeout in Option or URL.

Note

This error usually does not occur with synchronous offline commands, because the timeout of a synchronous offline command is set to the maximum time acceptable to the TaskManager.
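For illustration, here is a minimal Java SDK sketch of option 2 above. The class and setter names (SdkOption, setRequestTimeout, the SqlClusterExecutor constructor) are assumptions based on the option names mentioned above; verify them against the SDK version you use.

import com._4paradigm.openmldb.sdk.SdkOption;
import com._4paradigm.openmldb.sdk.impl.SqlClusterExecutor;

public class RequestTimeoutExample {
    public static void main(String[] args) throws Exception {
        SdkOption option = new SdkOption();
        option.setZkCluster("127.0.0.1:2181"); // ZooKeeper address of your cluster
        option.setZkPath("/openmldb");         // ZooKeeper root path of your cluster
        option.setRequestTimeout(120000);      // RPC request timeout, assumed to be in milliseconds
        SqlClusterExecutor executor = new SqlClusterExecutor(option);
        // ... run online requests that need the longer timeout ...
        executor.close();
    }
}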

Why Do I Receive Warnings of “Got EOF of Socket …”?#

rpc_client.h:xxx] request error. [E1014]Got EOF of Socket{id=x fd=x addr=xxx} (xx)

This is because the endpoint at addr actively closed the connection, and the addr address is most likely the TaskManager. This does not mean the TaskManager is abnormal; rather, the TaskManager considered this connection inactive for longer than keepAliveTime and actively closed the communication channel.

In versions 0.5.0 and later, you can increase the server.channel_keep_alive_time of the TaskManager to improve tolerance for inactive channels. The default value is 1800 seconds (0.5 hours). Especially when using synchronous offline commands, this value may need to be adjusted appropriately.

In versions before 0.5.0, this configuration cannot be changed. Please upgrade the TaskManager version.
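For example, assuming the TaskManager reads its configuration from conf/taskmanager.properties, the keep-alive time could be raised like this (the value is in seconds; 3600 is only an illustrative choice):

server.channel_keep_alive_time=3600

Restart the TaskManager after changing the configuration.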

Why is the Offline Query Result Displaying Chinese Characters as Garbled?#

When running offline queries, results that contain Chinese characters may appear garbled. This is mainly related to the system’s default encoding and the encoding parameters of the Spark tasks.

If you encounter garbled characters, you can resolve this by adding the Spark advanced parameters spark.driver.extraJavaOptions=-Dfile.encoding=utf-8 and spark.executor.extraJavaOptions=-Dfile.encoding=utf-8.

For client configuration methods, you can refer to the Spark Client Configuration, or you can add this configuration to the TaskManager configuration file.

spark.default.conf=spark.driver.extraJavaOptions=-Dfile.encoding=utf-8;spark.executor.extraJavaOptions=-Dfile.encoding=utf-8

How to Configure TaskManager to Access a YARN Cluster with Kerberos Enabled?#

If the YARN cluster has Kerberos authentication enabled, TaskManager can access the YARN cluster with Kerberos authentication by adding the following configuration. Please note to modify the keytab path and principal account according to the actual configuration.

spark.default.conf=spark.yarn.keytab=/tmp/test.keytab;spark.yarn.principal=test@EXAMPLE.COM

How to Configure Client’s Core Logs?#

Client core logs mainly consist of two types: ZooKeeper logs and SDK logs (glog logs), and they are independent of each other.

ZooKeeper Logs:

  1. CLI: Configure --zk_log_level during startup to adjust the log level, and use --zk_log_file to specify the log file.

  2. JAVA/Python SDK: Use zkLogLevel to adjust the level and zkLogFile to specify the log file in Option or URL.

  • zk_log_level (int, default=0, i.e., DISABLE_LOGGING): Prints logs at this level and below. 0 - disable all zk logs, 1 - error, 2 - warn, 3 - info, 4 - debug.

SDK Logs (glog Logs):

  1. CLI: Configure --glog_level during startup to adjust the level, and use --glog_dir to specify the log directory.

  2. JAVA/Python SDK: Use glogLevel to adjust the level and glogDir to specify the log directory in Option or URL.

  • glog_level (int, default=1, i.e., WARNING): Prints logs at this level and above. INFO, WARNING, ERROR, and FATAL logs correspond to 0, 1, 2, and 3, respectively.
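As a combined illustration, the sketch below sets both kinds of log options through the Java SDK. The setter names (setZkLogLevel, setZkLogFile, setGlogLevel, setGlogDir) are assumed from the option names above; check the SDK of your version for the exact API.

import com._4paradigm.openmldb.sdk.SdkOption;

public class ClientLogExample {
    // Build an SdkOption with explicit ZooKeeper-log and glog settings.
    public static SdkOption buildOption() {
        SdkOption option = new SdkOption();
        option.setZkCluster("127.0.0.1:2181");
        option.setZkPath("/openmldb");
        option.setZkLogLevel(3);                 // 3 = info; 0 disables all zk logs
        option.setZkLogFile("/tmp/zk.log");      // write zk logs to this file
        option.setGlogLevel(0);                  // 0 = INFO; higher values print fewer logs
        option.setGlogDir("/tmp/openmldb_glog"); // directory for SDK (glog) logs
        return option;
    }
}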

Insert Error with Log please use getInsertRow with ... first.#

When you insert with InsertPreparedStatement in the JAVA client, or with SQL plus parameters in Python, the client uses an internal SQL cache. The first step, getInsertRow, generates and caches the SQL and returns it together with the parameter information still to be filled in; the second step actually executes the insert and relies on the SQL cached in the first step. Therefore, when multiple threads share the same client, frequent inserts and queries may keep updating the cache and evict the SQL you are about to execute, making it look as if the first step getInsertRow was never executed.

Currently, you can avoid this issue by increasing the maxSqlCacheSize configuration option. This option is only supported in the JAVA/Python SDKs.
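Below is a minimal Java sketch of this pattern: enlarging the SQL cache and then inserting through a prepared statement. The names getInsertPreparedStmt and setMaxSqlCacheSize, as well as the demo database and table, are assumptions for illustration; check the SDK of your version for the exact API.

import java.sql.PreparedStatement;
import com._4paradigm.openmldb.sdk.SdkOption;
import com._4paradigm.openmldb.sdk.impl.SqlClusterExecutor;

public class InsertCacheExample {
    public static void main(String[] args) throws Exception {
        SdkOption option = new SdkOption();
        option.setZkCluster("127.0.0.1:2181");
        option.setZkPath("/openmldb");
        option.setMaxSqlCacheSize(200); // enlarge the per-client SQL cache to reduce eviction
        SqlClusterExecutor executor = new SqlClusterExecutor(option);

        // Step 1 (getInsertRow) happens inside getInsertPreparedStmt: the SQL is parsed and cached.
        PreparedStatement pstmt =
                executor.getInsertPreparedStmt("demo_db", "insert into t1 values (?, ?);");
        pstmt.setInt(1, 1);
        pstmt.setString(2, "a");
        // Step 2 executes the insert and relies on the SQL cached in step 1.
        pstmt.execute();
        pstmt.close();
        executor.close();
    }
}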

Offline Command Error#

java.lang.OutOfMemoryError: Java heap space
Container killed by YARN for exceeding memory limits. 5 GB of 5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

If you encounter log messages like those above, the offline task requires more resources than the current configuration provides. This typically occurs in the following situations:

  • The Spark configuration for the offline command is set to local[*], the machine has a high number of cores, and the concurrency is too high, resulting in excessive resource consumption.

  • The memory configuration is too small.

If you are using local mode and the resources on a single machine are limited, consider reducing the concurrency. If you choose not to reduce concurrency, adjust the spark.driver.memory and spark.executor.memory Spark configuration options. You can write these configurations in conf/taskmanager.properties in the TaskManager’s working directory and restart the TaskManager, or configure them through the CLI client. For more information, refer to the Spark Client Configuration.

spark.default.conf=spark.driver.memory=16g;spark.executor.memory=16g

When the master is local, adjust the memory of the driver, not the executor. If you are unsure, you can adjust both.