# Diagnostic Tool
## Overview
OpenMLDB provides a diagnostic tool that lets users diagnose cluster problems conveniently.
## Usage
Installation and usage:

```bash
pip install openmldb-tool  # use openmldb-tool[pb] if you need the rpc --hint feature
openmldb_tool              # note the underscore
```
The following subcommands can be used:

```
usage: openmldb_tool [-h] [--helpfull] {status,inspect,rpc,test,static-check} ...
```
Note: for `-c/--cluster`, the format is `<zk_cluster>/<zk_root_path>`, and the default is `127.0.0.1:2181/openmldb`. If a different OpenMLDB cluster is used, please specify this parameter. Other options differ for each subcommand; use `-h` to see the detailed documentation.
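For example, to point any subcommand at a non-default cluster (the ZooKeeper address below is illustrative):

```bash
# connect to a cluster whose ZooKeeper is at 172.17.0.2:2181 with root path /openmldb
openmldb_tool status --cluster=172.17.0.2:2181/openmldb
```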
## One-click Inspect
The `openmldb_tool inspect [--cluster=0.0.0.0:2181/openmldb]` command provides a comprehensive cluster status report in a single query. If a more specific perspective or additional diagnostic functionality is needed, use the other subcommands.
The report is divided into several sections. The “Ops and Partitions” sections will not be displayed if all tables are healthy. Users should first check the “Summary & Hint” at the end of the report. If there are instances marked as “server offline” (in red), it is necessary to restart the servers, especially TabletServers, to ensure they are online. After restarting the servers, the cluster may attempt automatic repairs, but this process could fail. Therefore, it is advisable to wait for some time before running the inspection again. If there are still unhealthy tables, users can check their status. Fatal tables need immediate attention as they may experience read and write failures. For Warn tables, users may consider postponing repairs. The repair procedures are detailed in the documentation provided at the end of the report.
The inspect command supports several configuration parameters. In addition to `--cluster/-c`, color output can be disabled with `--nocolor/-noc` for easy copying, `--table_width/-tw n` configures the table width, and `--offset_diff_thresh/-od n` sets the threshold for offset-diff alarms.
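A sketch of an invocation combining these flags (the numeric values are illustrative, not recommendations):

```bash
# plain-text report, wider partition tables, alarm when replica offsets diverge by more than 100
openmldb_tool inspect --cluster=127.0.0.1:2181/openmldb --nocolor --table_width 12 --offset_diff_thresh 100
```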
The report has the following structure:

```
diagnosing cluster xxx

Server Detail
{server map}
{server online/offline report}

Table Partitions Detail
tablet server order: {tablet ip -> idx}
{partition tables of unhealthy tables}
Example:
{a detailed description of partition table}

Ops Detail
> failed ops do not mean cluster is unhealthy, just for reference
last one op(check time): {}
last 10 ops != finished:
{op list}

==================
Summary & Hint
==================
Server:
{online | offline servers ['[tablet]xxx'], restart them first}

Table:
{all healthy | unhealthy tables desc}
[]Fatal/Warn table, {read/write may fail or still work}, {repair immediately or not}
{partition detail: if leader healthy, if has unhealthy replicas, if offset too large, related ops}
  Make sure all servers online, and no ops for the table is running.
  Repair table manually, run recoverdata, check https://openmldb.ai/docs/zh/main/maintain/openmldb_ops.html.
  Check 'Table Partitions Detail' above for detail.
```
## Other Common Commands
In addition to the one-click inspect command, we recommend the following diagnostic subcommands in these scenarios to help assess cluster status and simplify operations (a combined example follows the list):

- After deploying the cluster, use `test` to check whether the cluster is working correctly, without the need for manual testing. If issues are identified, use `inspect` for further diagnosis.
- If all components are online but timeouts or error messages indicate that a specific component cannot be reached, use `status --conn` to check the connection to each component and print the simple access time. It can also test the connection between the client host and the cluster, to promptly detect network isolation.
- When encountering issues with offline jobs, `SHOW JOBLOG <id>` can be used to view the logs. However, users with limited experience may find much of the log content irrelevant; in such cases, `inspect job` can extract the key information from job logs.
- When there are too many offline jobs and the CLI display becomes difficult to read, use `inspect offline` to filter all failed jobs, or `inspect job --state <state>` to filter jobs in a specific state.
- For more challenging problems, users may need to obtain information through RPC to help identify issues. `openmldb_tool rpc` assists in invoking RPCs quickly and easily, lowering the operational threshold.
- Without Prometheus monitoring, use `inspect online --dist` to obtain data distribution information.
- If your operations nodes have passwordless SSH access to the machines hosting the components, you can use `static-check` to verify the correctness of configuration files, ensure version uniformity, and avoid deployment failures. It can also collect logs from the entire cluster with one command, making them convenient to package and provide to developers for analysis.
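A minimal sketch tying these scenarios together (the cluster address and job id are illustrative):

```bash
# smoke-test a freshly deployed cluster, then run a full inspection if anything fails
openmldb_tool test --cluster=127.0.0.1:2181/openmldb
openmldb_tool inspect --cluster=127.0.0.1:2181/openmldb

# check connectivity from the client host to every component
openmldb_tool status --conn

# triage offline jobs: list failures, then parse one job's log
openmldb_tool inspect offline
openmldb_tool inspect job --id 1
```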
## Subcommand Details

### status
The `status` command shows the status of the OpenMLDB cluster, including the address, role, connect time, and state of each service component. It is equivalent to `SHOW COMPONENTS`. If you notice abnormal behavior in the cluster, check the real-time status of the service components with this command first.
```
openmldb_tool status -h
usage: openmldb_tool status [-h] [--helpfull] [--diff]

optional arguments:
  -h, --help  show this help message and exit
  --helpfull  show full help message and exit
  --diff      check if all endpoints in conf are in cluster. If set, need to set `-f,--conf_file`
  --conn      check network connection of all servers
```
Simple query for cluster status:

```bash
openmldb_tool status [--cluster=...]
```
The output will be similar to the table below:

```
+-----------------+-------------+---------------+--------+---------+
| Endpoint        | Role        | Connect_time  | Status | Ns_role |
+-----------------+-------------+---------------+--------+---------+
| localhost:10921 | tablet      | 1677398926974 | online | NULL    |
| localhost:10922 | tablet      | 1677398926978 | online | NULL    |
| localhost:7527  | nameserver  | 1677398927985 | online | master  |
| localhost:9902  | taskmanager | 1677398934773 | online | NULL    |
+-----------------+-------------+---------------+--------+---------+
```
Check and test the cluster connections and versions:

```bash
openmldb_tool status --conn
```
#### Check if Configuration Files Match Cluster State
If the `--diff` parameter is specified, the tool checks whether all nodes in the configuration file are part of the already-started cluster. Nodes that appear in the configuration file but not in the cluster are reported as exceptions; nodes that are in the cluster but not in the configuration file are not. You need to specify `-f, --conf_file`. For example, inside the image you can run this check as follows:

```bash
openmldb_tool status --diff -f=/work/openmldb/conf/hosts
```
### inspect
To check the overall cluster status, the one-click `inspect` command is recommended, as it produces a comprehensive inspection report. The `inspect` subcommands below are for more targeted checks.
```
openmldb_tool inspect -h
usage: openmldb_tool inspect [-h] [--helpfull] {online,offline,job} ...

positional arguments:
  {online,offline,job}
    online    only inspect online table.
    offline   only inspect offline jobs.
    job       show jobs by state, show joblog or parse joblog by id.
```
#### online
The `inspect online` command checks the health of online tables and outputs the anomalous ones, including their table status, partition information, replica information, and so on. It is equivalent to `SHOW TABLE STATUS` filtered down to the anomalous tables.
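A minimal invocation (uses the default cluster unless `-c/--cluster` is given):

```bash
# print only the unhealthy online tables
openmldb_tool inspect online
```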
##### Check Online Data Distribution
You can use `inspect online --dist` to check the distribution of online data. By default it checks all databases; use `--db` to specify the database to check, separating multiple database names with `,`. It outputs the data distribution across nodes for each database.
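For example, a sketch restricted to two databases (the database names are illustrative):

```bash
# show how data in db1 and db2 is distributed across the nodes
openmldb_tool inspect online --dist --db db1,db2
```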
#### offline
The `inspect offline` command checks for jobs whose final status is failure (it does not check jobs in the "running" state). It is equivalent to `SHOW JOBS` filtered down to the failed jobs. More features are to be added.
#### JOB Inspection
Job inspection is a more flexible offline task inspection command that allows filtering jobs based on conditions or analyzing logs for individual jobs.
##### Filter by state
You can use `inspect job` or `inspect job --state all` to query all jobs, equivalent to `SHOW JOBS` sorted by job_id. `inspect job --state <state>` filters jobs in specific states; separate multiple states with `,` to query several states at once. For example, `inspect offline` is equivalent to `inspect job --state failed,killed,lost`, which filters out all failed jobs.
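A sketch of state filtering (states as listed in the table below):

```bash
# list every job regardless of state
openmldb_tool inspect job --state all

# list only jobs that failed or were terminated
openmldb_tool inspect job --state failed,killed
```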
Here are some common states:

| state | Description |
| --- | --- |
| finished | Successfully completed task |
| running | Task currently running |
| failed | Failed task |
| killed | Terminated task |

For more information on states, please check the Spark State and Yarn State documentation.
##### Individual Job Logs
Use `inspect job --id <job_id>` to query the log of a specific job. The result filters out the main error messages based on a configuration file.

Parsing relies on this configuration file, which is downloaded automatically by default. If you need to update it, use `--conf-update`, which forces a download of the configuration file before parsing. If the default download source is not suitable, you can configure a mirror source with `--conf-url`; for example, use `--conf-url https://openmldb.ai/download/diag/common_err.yml` to configure a mirror inside China.

If you only need the complete log without parsing, use `--detail` to get the detailed output without printing the parsed results.
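For example (the job id is illustrative):

```bash
# refresh the parsing rules from a mirror, then parse job 1's log
openmldb_tool inspect job --id 1 --conf-update --conf-url https://openmldb.ai/download/diag/common_err.yml

# or dump the raw log without parsing
openmldb_tool inspect job --id 1 --detail
```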
### test
The `test` command executes some test SQL statements: creating a database, creating a table, inserting data online, querying data online, dropping the table, and dropping the database. If the TaskManager component is present, it also runs an offline query job; since no offline data is imported, the query result should be empty.

You can use `test` to check whether the cluster is working properly, especially after setting up a new cluster or when you notice abnormal cluster behavior.
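A minimal run (uses the default cluster address unless `-c/--cluster` is given):

```bash
# run the built-in create/insert/query/drop round trip
openmldb_tool test
```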
### static-check
The `static-check` command performs a static check based on the cluster deployment configuration file (specified by the `-f, --conf_file` parameter). It logs in to the deployment address of each service component, collects version information, configuration files, and log files, checks whether the versions are consistent, and analyzes the collected configuration and log files. This check can be performed before deploying the cluster, to avoid deployment failures caused by program version or configuration file errors, and when the cluster is in an abnormal state, to collect the distributed log files for easier troubleshooting.
```
openmldb_tool static-check -h
usage: openmldb_tool static-check [-h] [--helpfull] [--version] [--conf] [--log]

optional arguments:
  -h, --help     show this help message and exit
  --helpfull     show full help message and exit
  --version, -V  check version
  --conf, -C     check conf
  --log, -L      check log
```
#### Deployment Configuration File
The `-f, --conf_file` deployment configuration file can be in either hosts or YAML style. It describes which components are in the cluster, on which nodes they run, and in which directories they are deployed.
Hosts style (refer to the `conf/hosts` file in the release package):

```
[tablet]
localhost:10921 /tmp/openmldb/tablet-1
localhost:10922 /tmp/openmldb/tablet-2
[nameserver]
localhost:7527
[apiserver]
localhost:9080
[taskmanager]
localhost:9902
[zookeeper]
localhost:2181:2888:3888 /tmp/openmldb/zk-1
```
Cluster YAML:

```yaml
mode: cluster
zookeeper:
  zk_cluster: 127.0.0.1:2181
  zk_root_path: /openmldb
nameserver:
  -
    endpoint: 127.0.0.1:6527
    path: /work/ns1
tablet:
  -
    endpoint: 127.0.0.1:9527
    path: /work/tablet1
  -
    endpoint: 127.0.0.1:9528
    path: /work/tablet2
taskmanager:
  -
    endpoint: 127.0.0.1:9902
    path: /work/taskmanager1
```
For a distributed deployment, the diagnostic tool needs to fetch files from the deployment nodes, so passwordless SSH must be set up between the machines; you can follow the instructions here.

If some components in the hosts/YAML file have no path configured, the tool uses `--default_dir` as the deployment directory, with a default value of `/work/openmldb`. If your deployment directory is different, specify it with `--default_dir`.

For a onebox deployment, you can specify `--local`, and all nodes are treated as local nodes without attempting SSH login. If only some nodes are local, you must use the YAML format for the deployment configuration file and configure the local nodes with `is_local: true`. For example:
```yaml
nameserver:
  -
    endpoint: 127.0.0.1:6527
    path: /work/ns1
    is_local: true
```
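A sketch combining these options (the deployment directory is illustrative):

```bash
# onebox deployment under a non-default directory: skip SSH and override the deploy dir
openmldb_tool static-check -f conf/hosts --local --default_dir=/opt/openmldb -V
```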
#### Check Content
Flags can be combined to specify what to check. For example, `-V` checks only the version, `-CL` checks only the configuration files and logs, and `-VCL` checks everything.

- `-V, --version`: checks the versions, ensuring that all components run the same version. If they are inconsistent, the inconsistent components and versions are printed (due to the higher lookup complexity, the address of the openmldb-batch package may not be found, in which case it is ignored by the check; checking the batch package can be postponed, and replacing it is straightforward).
- `-C, --conf`: collects configuration files and checks, among other things, whether the ZooKeeper addresses in the configuration files of the components are consistent.
- `-L, --log`: collects logs and outputs the logs at WARNING level and above.
When checking configuration files or logs, the collected files are saved in `--collect_dir`, which defaults to `/tmp/diag_collect`. You can also browse this directory to view the collected configurations and logs for further analysis.
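For example (the destination directory is illustrative):

```bash
# collect conf and logs into a custom directory for offline analysis
openmldb_tool static-check -f conf/hosts -CL --collect_dir=/tmp/my_diag
```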
#### Check Example
In the containerized image, you can perform a static check as follows:

```bash
openmldb_tool static-check --conf_file=/work/openmldb/conf/hosts -VCL --local
```
### RPC Interface
`openmldb_tool` also provides an RPC interface, which makes it easier to send RPCs: there is no need to look up the server's IP, concatenate the RPC method's URL path, or hunt down the RPC methods and their input structures. The usage is `openmldb_tool rpc`; for example, `openmldb_tool rpc ns ShowTable --field '{"show_all":true}'` invokes the `ShowTable` interface of the `nameserver` to obtain the table state.
For the component argument, you can use role names directly instead of IP addresses. Since there is only one active NameServer and one active TaskManager, `ns` and `tm` denote these two components. There can be multiple TabletServers, so use `tablet1`, `tablet2`, and so on to specify a particular TabletServer, numbered from 1. The order can be viewed through `openmldb_tool rpc` or `openmldb_tool status`.
If you are not familiar with the methods or input parameters of an RPC service, use `openmldb_tool rpc <component> [method] --hint` to view help information. The hint feature is an optional component that must be installed via `pip install openmldb-tool[pb]`. It also requires additional pb files for parsing the input parameters; these are read from `/tmp/diag_cache` by default and downloaded automatically if absent. If you already have the corresponding files or downloaded them manually, specify their directory with `--pbdir`. For compiling the pb files yourself, refer to the OpenMLDB tool development documentation.
For example:

```
$ openmldb_tool rpc ns ShowTable --hint
...
server proto version is 0.7.0-e1d35fcf6
hint use pb2 files from /tmp/diag_cache
You should input json like this, ignore round brackets in the key and double quotation marks in the value: --field '{
    "(optional)name": "string",
    "(optional)db": "string",
    "(optional)show_all": "bool"
}'
```
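After consulting the hint, fill in a concrete JSON value and invoke the method:

```bash
# list all tables across databases via the nameserver's ShowTable RPC
openmldb_tool rpc ns ShowTable --field '{"show_all": true}'
```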
## Additional Information
Use `openmldb_tool --helpfull` to view all configuration options. For example, `--sdk_log` prints the SDK logs (zk, glog) for debugging purposes.
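For example, assuming the flag can be combined with any subcommand:

```bash
# run a status query with verbose SDK logging enabled
openmldb_tool status --sdk_log
```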