Introduction

OpenMLDB now connects to Hive, the open-source data warehouse software. Having already integrated with real-time data sources such as Kafka, Pulsar, and RocketMQ, OpenMLDB continues to build out its offline data ecosystem. The goal is a more comprehensive ecosystem that lowers the barrier to entry while attracting more users.

OpenMLDB Hive Connector

Background

The OpenMLDB Hive Connector solves the problem that Hive data sources could not easily be connected to and used from OpenMLDB. With the connector, Hive data can be read and written with simple statements, improving the user experience and reducing time costs.

Advantages

  • Once the connector is set up, you can read and write Hive tables from OpenMLDB using the hive:// path prefix, with only a small amount of Hive configuration.
  • Low learning cost and easy to use: Hive is the SQL interface on Apache Hadoop, so you can process and compute data simply by writing SQL statements.

Note

  • Before using the Hive Connector, the Hive Metastore Service (HMS) must be started.
  • Currently, the connector only supports importing an entire table; partition-level import is not supported.

Configuration

OpenMLDB version: v0.6.7 and above

Key processes

  1. Start Hive and HMS
  2. Configure the Hive metastore URI in OpenMLDB
  3. Start OpenMLDB cluster
  4. Test — Import data from Hive to OpenMLDB, export data from OpenMLDB to Hive table

Steps

  1. Pull and run the OpenMLDB Docker image (the download is about 1 GB, about 1.7 GB after decompression):
docker run -it 4pdosc/openmldb:0.6.7 bash
  2. Install dependencies

a. Install jdk8: the Docker image ships with jdk11 by default, but Hive does not support jdk11, so it is recommended to install jdk8:

apt-get update && apt-get install -y gnupg2
curl -s https://adoptopenjdk.jfrog.io/adoptopenjdk/api/gpg/key/public | apt-key add -
apt-get update && apt-get install -y adoptopenjdk-8-hotspot
export JAVA_HOME=/usr/lib/jvm/adoptopenjdk-8-hotspot-amd64
export PATH=$JAVA_HOME/bin:$PATH

b. Install Hadoop:

curl -SLO https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz
tar zxf hadoop-3.2.3.tar.gz
export HADOOP_HOME=`pwd`/hadoop-3.2.3
curl -SLO http://mirror.bit.edu.cn/apache/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar zxf apache-hive-3.1.3-bin.tar.gz
export HIVE_HOME=`pwd`/apache-hive-3.1.3-bin

c. Hadoop and Hive bundle different Guava versions; run the following commands to align them, then initialize the metastore schema:

rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
$HIVE_HOME/bin/schematool -dbType derby -initSchema
  3. Create a table using the Hive client

Use the Hive client to create the table htable (the table is created in the default database; if you want to use another database, make sure it already exists, as OpenMLDB will not create a Hive database for you automatically). Insert a few rows of test data and then query them back to verify:

$HIVE_HOME/bin/hive
CREATE TABLE htable(key INT, value STRING NOT NULL);
INSERT INTO htable VALUES (NULL, "n"), (1,"a"), (2,"b");
SELECT * FROM htable;
  4. Start HMS (the default port is 9083):
$HIVE_HOME/bin/hive --service metastore > metastore.log 2>&1 &
  5. Configure the Hive connection in OpenMLDB and start the OpenMLDB cluster

There are several ways to configure the Hive connection; see the Hive configuration documentation (https://openmldb.ai/docs/zh/main/reference/sql/dml/LOAD_DATA_STATEMENT.html#id1) for details. Here we add the Hive connection configuration to the taskmanager configuration file.

a. Open the taskmanager configuration file:

vim openmldb/conf/taskmanager.properties

b. In the spark.default.conf configuration item, append spark.hadoop.hive.metastore.uris:

spark.default.conf=...;spark.hadoop.hive.metastore.uris=thrift://localhost:9083
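For a non-interactive setup, the same change can be scripted. The sketch below uses a temporary file with hypothetical contents standing in for the real openmldb/conf/taskmanager.properties (whose existing spark.default.conf value will differ):

```shell
# Sketch: append the metastore URI to the spark.default.conf entry.
# CONF is a throwaway demo file; in the container, point it at
# openmldb/conf/taskmanager.properties instead.
CONF=$(mktemp)
echo 'spark.default.conf=spark.executor.memory=2g' > "$CONF"

# Append ;spark.hadoop.hive.metastore.uris=... to the existing value
# (& in the replacement re-inserts the whole matched line; GNU sed -i).
sed -i 's|^spark.default.conf=.*|&;spark.hadoop.hive.metastore.uris=thrift://localhost:9083|' "$CONF"

grep '^spark.default.conf=' "$CONF"
```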

c. Start the OpenMLDB cluster:

/work/init.sh
  6. Import data from Hive into OpenMLDB

You first need to create the OpenMLDB table otable. With the LOAD DATA statement, you can then import data from a Hive table into the OpenMLDB table by using the hive:// prefix and specifying the Hive table name. Both online and offline import are supported. Run the following statements to import offline:

CREATE DATABASE db;
USE db;
CREATE TABLE otable(key INT, value STRING NOT NULL);
SET @@sync_job=true;
SET @@job_timeout=600000;
LOAD DATA INFILE 'hive://htable' INTO TABLE otable OPTIONS(mode='overwrite');
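The import statements can also be run non-interactively by saving them to a script file and piping it into the OpenMLDB SQL client. The client path and ZooKeeper address in the comment below are assumptions based on the Docker image defaults; verify them for your deployment (the actual invocation is left commented out here):

```shell
# Write the offline-import statements to a script file.
cat > import_hive.sql <<'EOF'
USE db;
SET @@sync_job=true;
SET @@job_timeout=600000;
LOAD DATA INFILE 'hive://htable' INTO TABLE otable OPTIONS(mode='overwrite');
EOF

# Feed the script to the SQL client (path and addresses are assumptions
# from the Docker image; adjust for your cluster before running):
# /work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 \
#     --zk_root_path=/openmldb --role=sql_client < import_hive.sql
```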
  7. Export OpenMLDB table data to Hive

With the SELECT INTO statement, the output path likewise only needs the hive:// prefix and a table name (the table does not have to exist; it will be created automatically).

USE db;
SET @@sync_job=true;
SET @@job_timeout=600000;
SELECT * FROM otable INTO OUTFILE 'hive://hsaved' OPTIONS(mode='overwrite');

You can query the hsaved table in Hive. Since this demo uses Derby as the metastore database, stop the metastore process before querying with $HIVE_HOME/bin/hive:

SELECT * FROM hsaved;

You can also inspect the hsaved table directly in the Hive warehouse directory:

ls /user/hive/warehouse/hsaved
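The warehouse directory holds the result as one or more part files written by Spark. The sketch below simulates that layout with a mock directory (the file names and CSV-style contents are illustrative assumptions, not the guaranteed on-disk format) and shows how to inspect the exported rows:

```shell
# Mock up the layout of an exported table directory; a real export under
# /user/hive/warehouse/hsaved contains part-* data files plus a _SUCCESS
# marker (names and contents here are assumptions for illustration).
DIR=$(mktemp -d)
printf '1,a\n2,b\n' > "$DIR/part-00000"
printf '\\N,n\n'    > "$DIR/part-00001"
touch "$DIR/_SUCCESS"

# Concatenate all part files to view the exported rows.
cat "$DIR"/part-*
```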

Conclusion

About OpenMLDB

OpenMLDB is an open-source machine learning database dedicated to solving, in a closed loop, the data governance challenges of putting AI engineering into production. It provides enterprises with a full-stack, low-barrier platform for feature data computation and management. OpenMLDB covers all the functions of a Feature Store and goes further, offering a complete full-stack FeatureOps solution: beyond feature storage, it provides a low-barrier, SQL-based database development experience, an OpenMLDB Spark distribution optimized for feature computation, index structures optimized for real-time feature computation, online feature serving, and enterprise-grade operations and management. It returns feature engineering development to its essence: a focus on high-quality feature computation scripts.

About Hive

Hive is a Hadoop-based data warehouse tool developed and open-sourced by Facebook. Hive essentially converts SQL statements into MapReduce jobs, allowing users unfamiliar with the MapReduce programming model to use HiveQL to process and compute structured data stored on HDFS, making it well suited to offline batch computation. Hive fits traditional data warehouse workloads, not low-latency interactive access.

OpenMLDB ecosystem

To lower the barrier for developers using OpenMLDB, the OpenMLDB community will continue to build an ecosystem of upstream and downstream technology components, providing developers with more easy-to-use ecosystem connectors.

  • The online data sources: Pulsar, Kafka, RocketMQ, Flink, RabbitMQ, etc.
  • The offline data sources: HDFS, Hive, MaxCompute, HBase, Cassandra, S3, etc.
  • The algorithms and frameworks for model building: XGBoost, LightGBM, TensorFlow, PyTorch, OneFlow, ScikitLearn, etc.
  • The scheduling frameworks and deployment tools for the whole machine learning modeling process: DolphinScheduler, Airflow, Byzer, Kubeflow, Prometheus, Grafana, etc.

Related links

OpenMLDB documentation: https://openmldb.ai/docs/zh/main/

OpenMLDB Hive Connector product documentation: http://openmldb.ai/docs/zh/main/reference/sql/dml/LOAD_DATA_STATEMENT.html#hive

If you have any questions, feel free to join the Slack group: https://openmldb.slack.com/join/shared_invite/zt-ozu3llie-K~hn9Ss1GZcFW2~K_L5sMg#/shared-invite/email