# Diagnostic Tool
## Overview
OpenMLDB provides a diagnostic tool that lets users diagnose cluster problems conveniently.
## Usage
Installation and usage:

```bash
pip install openmldb-tool  # use openmldb-tool[pb] if you need the rpc --hint feature
openmldb_tool              # note the underscore
```
The following subcommands can be used:

```
usage: openmldb_tool [-h] [--helpfull] {status,inspect,rpc,test,static-check} ...
```
Note: for `-c/--cluster`, the format is `<zk_cluster>/<zk_root_path>`, and the default is `127.0.0.1:2181/openmldb`. If a different OpenMLDB cluster is used, please specify this parameter. Other options differ for each subcommand; use `-h` to see the detailed documentation.
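For example, to point any subcommand at a non-default cluster (the ZooKeeper address below is illustrative):

```bash
# connect to a cluster whose ZooKeeper is at 172.17.0.2:2181 with root path /openmldb
openmldb_tool status --cluster=172.17.0.2:2181/openmldb
```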
## One-click Inspect
The `openmldb_tool inspect [--cluster=0.0.0.0:2181/openmldb]` command provides a comprehensive cluster status report in a single query. If a more specific perspective or additional diagnostic functionality is needed, use the other subcommands.
The report is divided into several sections. The “Ops and Partitions” sections will not be displayed if all tables are healthy. Users should first check the “Summary & Hint” at the end of the report. If there are instances marked as “server offline” (in red), it is necessary to restart the servers, especially TabletServers, to ensure they are online. After restarting the servers, the cluster may attempt automatic repairs, but this process could fail. Therefore, it is advisable to wait for some time before running the inspection again. If there are still unhealthy tables, users can check their status. Fatal tables need immediate attention as they may experience read and write failures. For Warn tables, users may consider postponing repairs. The repair procedures are detailed in the documentation provided at the end of the report.
The inspect command supports several configuration parameters. In addition to `--cluster/-c`, color output can be disabled with `--nocolor/-noc` for easy copying, `--table_width/-tw n` configures the table width, and `--offset_diff_thresh/-od n` sets the threshold for offset-diff alarms.
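A sketch of an invocation combining these flags (the numeric values are illustrative, not recommendations):

```bash
# plain-text report, wider partition tables, alarm when replica offsets diverge by more than 100
openmldb_tool inspect --cluster=127.0.0.1:2181/openmldb --nocolor --table_width 12 --offset_diff_thresh 100
```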
The report has the following structure:

```
diagnosing cluster xxx

Server Detail
{server map}
{server online/offline report}

Table Partitions Detail
tablet server order: {tablet ip -> idx}
{partition tables of unhealthy tables}
Example:
{a detailed description of partition table}

Ops Detail
> failed ops do not mean cluster is unhealthy, just for reference
last one op(check time): {}
last 10 ops != finished:
{op list}

==================
Summary & Hint
==================
Server:
{online | offline servers ['[tablet]xxx'], restart them first}

Table:
{all healthy | unhealthy tables desc}
[]Fatal/Warn table, {read/write may fail or still work}, {repair immediately or not}
{partition detail: if leader healthy, if has unhealthy replicas, if offset too large, related ops}
  Make sure all servers online, and no ops for the table is running.
  Repair table manually, run recoverdata, check https://openmldb.ai/docs/zh/main/maintain/openmldb_ops.html.
  Check 'Table Partitions Detail' above for detail.
```
## Other Common Commands
In addition to the one-click inspect command, we recommend the following diagnostic subcommands in these scenarios to help assess cluster status and simplify operations (a combined example follows the list):

- After deploying the cluster, use `test` to check whether the cluster is working correctly, without the need for manual testing. If issues are identified, use `inspect` for further diagnosis.
- If all components are online but timeouts or error messages indicate that a specific component cannot be reached, use `status --conn` to check the connection to each component and print the simple access time. It can also test the connection between the client host and the cluster, to promptly detect network isolation.
- When encountering issues with offline jobs, `SHOW JOBLOG <id>` can be used to view the logs. However, users with limited experience may find much of the log content irrelevant; in such cases, `inspect job` can extract the key information from job logs.
- When there are too many offline jobs and the CLI display becomes difficult to read, use `inspect offline` to filter all failed jobs, or `inspect job --state <state>` to filter jobs in a specific state.
- For more challenging problems, users may need to obtain information through RPC to help identify issues. `openmldb_tool rpc` assists in invoking RPCs quickly and easily, lowering the operational threshold.
- Without Prometheus monitoring, use `inspect online --dist` to obtain data distribution information.
- If your operations nodes have passwordless SSH access to the machines hosting the components, you can use `static-check` to verify the correctness of configuration files, ensure version uniformity, and avoid deployment failures. It can also collect logs from the entire cluster with one command, making them convenient to package and provide to developers for analysis.
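A minimal sketch tying these scenarios together (the cluster address and job id are illustrative):

```bash
# smoke-test a freshly deployed cluster, then run a full inspection if anything fails
openmldb_tool test --cluster=127.0.0.1:2181/openmldb
openmldb_tool inspect --cluster=127.0.0.1:2181/openmldb

# check connectivity from the client host to every component
openmldb_tool status --conn

# triage offline jobs: list failures, then parse one job's log
openmldb_tool inspect offline
openmldb_tool inspect job --id 1
```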
## Subcommand Details

### status
The `status` command shows the status of the OpenMLDB cluster, including the address, role, connect time, and state of each service component. It is equivalent to `SHOW COMPONENTS`. If you notice abnormal behavior in the cluster, check the real-time status of the service components with this command first.
```
openmldb_tool status -h
usage: openmldb_tool status [-h] [--helpfull] [--diff]

optional arguments:
  -h, --help  show this help message and exit
  --helpfull  show full help message and exit
  --diff      check if all endpoints in conf are in cluster. If set, need to set `-f,--conf_file`
  --conn      check network connection of all servers
```
Simple query for cluster status:

```bash
openmldb_tool status [--cluster=...]
```
The output will be similar to the table below:

```
+-----------------+-------------+---------------+--------+---------+
| Endpoint        | Role        | Connect_time  | Status | Ns_role |
+-----------------+-------------+---------------+--------+---------+
| localhost:10921 | tablet      | 1677398926974 | online | NULL    |
| localhost:10922 | tablet      | 1677398926978 | online | NULL    |
| localhost:7527  | nameserver  | 1677398927985 | online | master  |
| localhost:9902  | taskmanager | 1677398934773 | online | NULL    |
+-----------------+-------------+---------------+--------+---------+
```
Check and test the cluster connections and versions:

```bash
openmldb_tool status --conn
```
#### Check if Configuration Files Match Cluster State
If the `--diff` parameter is specified, the tool checks whether all nodes in the configuration file are part of the already-started cluster. Nodes that appear in the configuration file but not in the cluster are reported as exceptions; nodes that are in the cluster but not in the configuration file are not. You need to specify `-f, --conf_file`. For example, inside the image you can run this check as follows:

```bash
openmldb_tool status --diff -f=/work/openmldb/conf/hosts
```
### inspect
To check the overall cluster status, the one-click `inspect` command is recommended, as it produces a comprehensive inspection report. The `inspect` subcommands below are for more targeted checks.
```
openmldb_tool inspect -h
usage: openmldb_tool inspect [-h] [--helpfull] {online,offline,job} ...

positional arguments:
  {online,offline,job}
    online    only inspect online table.
    offline   only inspect offline jobs.
    job       show jobs by state, show joblog or parse joblog by id.
```
#### online
The `inspect online` command checks the health of online tables and outputs the anomalous ones, including their table status, partition information, replica information, and so on. It is equivalent to `SHOW TABLE STATUS` filtered down to the anomalous tables.
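A minimal invocation (uses the default cluster unless `-c/--cluster` is given):

```bash
# print only the unhealthy online tables
openmldb_tool inspect online
```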
##### Check Online Data Distribution
You can use `inspect online --dist` to check the distribution of online data. By default it checks all databases; use `--db` to specify the database to check, separating multiple database names with `,`. It outputs the data distribution across nodes for each database.
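For example, a sketch restricted to two databases (the database names are illustrative):

```bash
# show how data in db1 and db2 is distributed across the nodes
openmldb_tool inspect online --dist --db db1,db2
```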
#### offline
The `inspect offline` command checks for jobs whose final status is failure (it does not check jobs in the "running" state). It is equivalent to `SHOW JOBS` filtered down to the failed jobs. More features are to be added.
#### JOB Inspection
Job inspection is a more flexible offline task inspection command that allows filtering jobs based on conditions or analyzing logs for individual jobs.
##### Filter by state
You can use `inspect job` or `inspect job --state all` to query all jobs, equivalent to `SHOW JOBS` sorted by job_id. `inspect job --state <state>` filters jobs in specific states; separate multiple states with `,` to query several states at once. For example, `inspect offline` is equivalent to `inspect job --state failed,killed,lost`, which filters out all failed jobs.
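A sketch of state filtering (states as listed in the table below):

```bash
# list every job regardless of state
openmldb_tool inspect job --state all

# list only jobs that failed or were terminated
openmldb_tool inspect job --state failed,killed
```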
Here are some common states:

| state | Description |
| --- | --- |
| finished | Successfully completed task |
| running | Task currently running |
| failed | Failed task |
| killed | Terminated task |

For more information on states, please check the Spark State and Yarn State documentation.
##### Individual Job Logs
Use `inspect job --id <job_id>` to query the log of a specific job. The result filters out the main error messages based on a configuration file.

Parsing relies on this configuration file, which is downloaded automatically by default. If you need to update it, use `--conf-update`, which forces a download of the configuration file before parsing. If the default download source is not suitable, you can configure a mirror source with `--conf-url`; for example, use `--conf-url https://openmldb.ai/download/diag/common_err.yml` to configure a mirror inside China.

If you only need the complete log without parsing, use `--detail` to get the detailed output without printing the parsed results.
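For example (the job id is illustrative):

```bash
# refresh the parsing rules from a mirror, then parse job 1's log
openmldb_tool inspect job --id 1 --conf-update --conf-url https://openmldb.ai/download/diag/common_err.yml

# or dump the raw log without parsing
openmldb_tool inspect job --id 1 --detail
```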
### test
The `test` command executes some test SQL statements: creating a database, creating a table, inserting data online, querying data online, dropping the table, and dropping the database. If the TaskManager component is present, it also runs an offline query job; since no offline data is imported, the query result should be empty.

You can use `test` to check whether the cluster is working properly, especially after setting up a new cluster or when you notice abnormal cluster behavior.
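A minimal run (uses the default cluster address unless `-c/--cluster` is given):

```bash
# run the built-in create/insert/query/drop round trip
openmldb_tool test
```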
### static-check
The `static-check` command performs a static check based on the cluster deployment configuration file (specified by the `-f, --conf_file` parameter). It logs in to the deployment address of each service component, collects version information, configuration files, and log files, checks whether the versions are consistent, and analyzes the collected configuration and log files. This check can be performed before deploying the cluster, to avoid deployment failures caused by program version or configuration file errors, and when the cluster is in an abnormal state, to collect the distributed log files for easier troubleshooting.
```
openmldb_tool static-check -h
usage: openmldb_tool static-check [-h] [--helpfull] [--version] [--conf] [--log]

optional arguments:
  -h, --help     show this help message and exit
  --helpfull     show full help message and exit
  --version, -V  check version
  --conf, -C     check conf
  --log, -L      check log
```
#### Deployment Configuration File
The `-f, --conf_file` deployment configuration file can be in either hosts or YAML style. It describes which components are in the cluster, on which nodes they run, and in which directories they are deployed.
Hosts style (refer to the `conf/hosts` file in the release package):

```
[tablet]
localhost:10921 /tmp/openmldb/tablet-1
localhost:10922 /tmp/openmldb/tablet-2
[nameserver]
localhost:7527
[apiserver]
localhost:9080
[taskmanager]
localhost:9902
[zookeeper]
localhost:2181:2888:3888 /tmp/openmldb/zk-1
```
Cluster YAML:

```yaml
mode: cluster
zookeeper:
  zk_cluster: 127.0.0.1:2181
  zk_root_path: /openmldb
nameserver:
  -
    endpoint: 127.0.0.1:6527
    path: /work/ns1
tablet:
  -
    endpoint: 127.0.0.1:9527
    path: /work/tablet1
  -
    endpoint: 127.0.0.1:9528
    path: /work/tablet2
taskmanager:
  -
    endpoint: 127.0.0.1:9902
    path: /work/taskmanager1
```
For a distributed deployment, the diagnostic tool needs to fetch files from the deployment nodes, so passwordless SSH must be set up between the machines; you can follow the instructions here.

If some components in the hosts/YAML file have no path configured, the tool uses `--default_dir` as the deployment directory, with a default value of `/work/openmldb`. If your deployment directory is different, specify it with `--default_dir`.

For a onebox deployment, you can specify `--local`, and all nodes are treated as local nodes without attempting SSH login. If only some nodes are local, you must use the YAML format for the deployment configuration file and configure the local nodes with `is_local: true`. For example:
```yaml
nameserver:
  -
    endpoint: 127.0.0.1:6527
    path: /work/ns1
    is_local: true
```
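A sketch combining these options (the deployment directory is illustrative):

```bash
# onebox deployment under a non-default directory: skip SSH and override the deploy dir
openmldb_tool static-check -f conf/hosts --local --default_dir=/opt/openmldb -V
```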
#### Check Content
Flags can be combined to specify what to check. For example, `-V` checks only the version, `-CL` checks only the configuration files and logs, and `-VCL` checks everything.

- `-V, --version`: checks the versions, ensuring that all components run the same version. If they are inconsistent, the inconsistent components and versions are printed (due to the higher lookup complexity, the address of the openmldb-batch package may not be found, in which case it is ignored by the check; checking the batch package can be postponed, and replacing it is straightforward).
- `-C, --conf`: collects configuration files and checks, among other things, whether the ZooKeeper addresses in the configuration files of the components are consistent.
- `-L, --log`: collects logs and outputs the logs at WARNING level and above.
When checking configuration files or logs, the collected files are saved in `--collect_dir`, which defaults to `/tmp/diag_collect`. You can also browse this directory to view the collected configurations and logs for further analysis.
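For example (the destination directory is illustrative):

```bash
# collect conf and logs into a custom directory for offline analysis
openmldb_tool static-check -f conf/hosts -CL --collect_dir=/tmp/my_diag
```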
#### Check Example
In the containerized image, you can perform a static check as follows:

```bash
openmldb_tool static-check --conf_file=/work/openmldb/conf/hosts -VCL --local
```
### RPC Interface
`openmldb_tool` also provides an RPC interface, which makes it easier to send RPCs: there is no need to look up the server's IP, concatenate the RPC method's URL path, or hunt down the RPC methods and their input structures. The usage is `openmldb_tool rpc`; for example, `openmldb_tool rpc ns ShowTable --field '{"show_all":true}'` invokes the `ShowTable` interface of the `nameserver` to obtain the table state.
For the component argument, you can use role names directly instead of IP addresses. Since there is only one active NameServer and one active TaskManager, `ns` and `tm` denote these two components. There can be multiple TabletServers, so use `tablet1`, `tablet2`, and so on to specify a particular TabletServer, numbered from 1. The order can be viewed through `openmldb_tool rpc` or `openmldb_tool status`.
If you are not familiar with the methods or input parameters of an RPC service, use `openmldb_tool rpc <component> [method] --hint` to view help information. The hint feature is an optional component that must be installed via `pip install openmldb-tool[pb]`. It also requires additional pb files for parsing the input parameters; these are read from `/tmp/diag_cache` by default and downloaded automatically if absent. If you already have the corresponding files or downloaded them manually, specify their directory with `--pbdir`. For compiling the pb files yourself, refer to the OpenMLDB tool development documentation.
For example:

```
$ openmldb_tool rpc ns ShowTable --hint
...
server proto version is 0.7.0-e1d35fcf6
hint use pb2 files from /tmp/diag_cache
You should input json like this, ignore round brackets in the key and double quotation marks in the value: --field '{
    "(optional)name": "string",
    "(optional)db": "string",
    "(optional)show_all": "bool"
}'
```
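After consulting the hint, fill in a concrete JSON value and invoke the method:

```bash
# list all tables across databases via the nameserver's ShowTable RPC
openmldb_tool rpc ns ShowTable --field '{"show_all": true}'
```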
## Additional Information
Use `openmldb_tool --helpfull` to view all configuration options. For example, `--sdk_log` prints the SDK logs (zk, glog) for debugging purposes.
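For example, assuming the flag can be combined with any subcommand:

```bash
# run a status query with verbose SDK logging enabled
openmldb_tool status --sdk_log
```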