A key challenge in putting artificial intelligence into production is supporting real-time batch prediction and real-time model updating in real business scenarios. Turning online real-time data into AI-ready features better and faster makes AI applications quicker to deploy and more effective. To this end, OpenMLDB and Apache Pulsar jointly launched the OpenMLDB Pulsar Connector, which provides stable streaming integration and a clear path from real-time data to feature engineering.

About OpenMLDB

OpenMLDB is an open-source machine learning database committed to solving the data governance problems of AI engineering in a closed loop. Since it was open-sourced in June 2021, OpenMLDB has focused first on feature data governance. Building on SQL as its development interface, it provides enterprises with a full-stack, low-threshold platform for feature data computation and management.

OpenMLDB includes all the functions of a Feature Store and provides a more complete, full-stack FeatureOps solution. In addition to feature storage, it offers a low-threshold, SQL-based database development experience, the OpenMLDB Spark distribution optimized for feature computing, an index structure optimized for real-time feature computing, online feature serving, and enterprise-level operation, maintenance, and management functions. As a result, feature engineering returns to its essence: developers focus on writing high-quality feature computing scripts and are no longer held back by engineering implementation concerns.

About Apache Pulsar

Apache Pulsar is a next-generation cloud-native messaging and streaming platform. It graduated in September 2018 to become a top-level project of the Apache Software Foundation. Since its birth in 2012, Apache Pulsar has adopted a forward-looking cloud-native architecture that separates storage from computing and relies on layering and segmentation, which greatly reduces the scaling, operation, and maintenance difficulties that users encounter in messaging systems.

Through careful design and abstraction, Pulsar supports two message consumption modes, Stream and Queue, in a unified way, retaining the high performance of Stream mode and the flexibility of Queue mode. While ensuring the performance and throughput expected of a big data messaging system, Pulsar provides more enterprise-level features, including convenient operation, maintenance, and scaling, a flexible messaging model, multi-language APIs, multi-tenancy, geo-replication (multi-site backup), and strong data persistence and consistency, which address many shortcomings of existing open-source messaging systems. This design is also very container-friendly, making Pulsar an ideal choice for a stream-native platform.

OpenMLDB-Pulsar Connector

[Connector Summary]

Orientation

The OpenMLDB Pulsar Connector efficiently connects real-time data to feature engineering. It greatly improves data utilization efficiency, helps developers build real-time data pipelines, and lets enterprises explore the commercial value of data with more focus and efficiency.

In a workflow that combines Pulsar with OpenMLDB, the Connector (as shown in the figure below) helps developers easily link the Pulsar messaging system to the open-source machine learning database OpenMLDB and make the most of Pulsar's real-time capabilities for machine learning.

Functions

  • Pulsar uses connectors to link to other systems. A source connector lets data from other systems flow into Pulsar, and a sink connector lets messages flow out of Pulsar to other systems.

  • Pulsar connectors can be managed (created, updated, started, stopped, restarted, reloaded, deleted, and so on) through the Connector Admin CLI with its sources and sinks subcommands.

  • The JDBC OpenMLDB Connector implements the sink function, so Pulsar messages can be written to OpenMLDB online storage.
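
The management operations listed above can be sketched as calls to the `pulsar-admin sinks` subcommand. The snippet below only builds and prints the command lines; it does not run them. The CLI path, the sink name `openmldb-test-sink`, and the config file path are assumptions chosen for illustration, not values from this article.

```python
import shlex

# Assumed location of the CLI inside a Pulsar installation.
PULSAR_ADMIN = "bin/pulsar-admin"


def sink_command(action, name=None, config_file=None,
                 tenant="public", namespace="default"):
    """Build the argv list for one `pulsar-admin sinks <action>` call."""
    argv = [PULSAR_ADMIN, "sinks", action]
    if config_file is not None:
        # The whole sink definition can be read from a config file.
        argv += ["--sink-config-file", config_file]
    elif name is not None:
        # Otherwise address an existing sink by tenant, namespace, and name.
        argv += ["--tenant", tenant, "--namespace", namespace, "--name", name]
    return argv


# Hypothetical sink name and config path, used only for illustration.
commands = [
    sink_command("create", config_file="connectors/openmldb-sink.yaml"),
    sink_command("status", name="openmldb-test-sink"),
    sink_command("stop", name="openmldb-test-sink"),
    sink_command("start", name="openmldb-test-sink"),
    sink_command("delete", name="openmldb-test-sink"),
]

for argv in commands:
    print(shlex.join(argv))
```

Building argument lists instead of shell strings keeps quoting explicit; the lists could be passed to `subprocess.run` on a machine where Pulsar is installed.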

Advantages

To ensure stable streaming integration between OpenMLDB and Pulsar, the OpenMLDB Pulsar Connector offers many advantages, including but not limited to:

  • Easy to use. You do not need to write any code. A simple configuration is enough to make Pulsar messages flow into OpenMLDB through the OpenMLDB Pulsar Connector. The simplified data import process can greatly improve the data utilization efficiency of enterprises.

  • Easy to scale. Depending on business needs, you can run the OpenMLDB Pulsar Connector in standalone or cluster mode to help enterprises build real-time data pipelines.

  • Sustainable. The simple installation and deployment process of the OpenMLDB Pulsar Connector lets enterprises explore the commercial value of data with more focus and efficiency.

Connector Download Address

OpenMLDB Pulsar Connector: https://github.com/4paradigm/OpenMLDB/releases/download/v0.4.4/pulsar-io-jdbc-openmldb-2.11.0-SNAPSHOT.nar

[Connector Demonstration]

Introduction Process

  • Before creating the connector, start the OpenMLDB cluster and create the tables.

  • Start Pulsar in standalone mode and create the sink. The sink configuration uses the JDBC address of the OpenMLDB cluster. Also create a schema for parsing messages.

  • Send a message to Pulsar to test whether it is automatically written to OpenMLDB.

Key Steps

Only the key steps of using this connector are listed below. For complete details, please refer to The Use of Pulsar OpenMLDB Connector.

Step 1 | Start OpenMLDB and create tables

Start the OpenMLDB cluster, either with Docker or locally, and create the table that the connector will write into. As shown in the figure, we create the table connector_test in the pulsar_test database, with a schema consistent with the taxi trip demo.
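
As a rough sketch of this step, the snippet below generates the SQL statements that could be run in the OpenMLDB CLI to create the database and table. The database and table names come from the text above; the column list is a hypothetical subset of taxi trip fields chosen for illustration, not the full schema from the demo.

```python
DATABASE = "pulsar_test"
TABLE = "connector_test"

# Hypothetical subset of taxi trip columns as (name, OpenMLDB SQL type).
COLUMNS = [
    ("id", "string"),
    ("vendor_id", "int"),
    ("pickup_datetime", "timestamp"),
    ("passenger_count", "int"),
    ("trip_duration", "int"),
]


def build_ddl(database, table, columns):
    """Return the statements that create the database and the sink table."""
    column_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return [
        f"CREATE DATABASE {database};",
        f"USE {database};",
        f"CREATE TABLE {table} ({column_sql});",
    ]


statements = build_ddl(DATABASE, TABLE, COLUMNS)
print("\n".join(statements))
```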

Step 2 | Start Pulsar, create sink and schema

After Pulsar standalone has started successfully, you can use the connector to create a sink. If the connector is preloaded into Pulsar, you can use the sink type “jdbc-openmldb”; otherwise, use “archive” to specify the path of the connector NAR file. This demonstration uses the latter. The sink configuration must specify the address of the OpenMLDB table, so the configuration is as follows:
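
A sink configuration of this shape can be sketched as follows. The snippet renders a YAML file body from a Python dictionary. The topic, database, and table names come from this article; the sink name, the NAR path under `connectors/`, and the ZooKeeper address and path in the JDBC URL are assumptions that depend on your deployment.

```python
import json


def build_sink_config(archive, topic, jdbc_url, table):
    """Describe a Pulsar sink that writes one topic into one OpenMLDB table."""
    return {
        "tenant": "public",
        "namespace": "default",
        "name": "openmldb-test-sink",  # hypothetical sink name
        "archive": archive,            # path to the connector NAR file
        "inputs": [topic],             # topics the sink consumes from
        "configs": {"jdbcUrl": jdbc_url, "tableName": table},
    }


def to_yaml(config, indent=0):
    """Render a small nested dict as YAML lines (scalars, lists, dicts only)."""
    pad = " " * indent
    lines = []
    for key, value in config.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 2))
        else:
            # JSON double-quoted strings and flow-style lists are valid YAML.
            lines.append(f"{pad}{key}: {json.dumps(value)}")
    return lines


sink = build_sink_config(
    archive="connectors/pulsar-io-jdbc-openmldb-2.11.0-SNAPSHOT.nar",
    topic="test_openmldb",
    # Assumed ZooKeeper address and path of the OpenMLDB cluster.
    jdbc_url="jdbc:openmldb:///pulsar_test?zk=localhost:6181&zkPath=/openmldb",
    table="connector_test",
)
yaml_lines = to_yaml(sink)
print("\n".join(yaml_lines))
```

The printed text could be saved to a file and passed to `pulsar-admin sinks create --sink-config-file <file>`.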

Because data in an OpenMLDB table follows a schema, you also need to configure a schema for the topic “test_openmldb” in Pulsar, so that each message can be parsed into data that conforms to the OpenMLDB schema and then inserted.

We configure a JSON schema for “test_openmldb” in which the properties of each column are consistent with those of the OpenMLDB table. The configuration file is as follows:
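
A schema definition file of this kind might be generated as in the sketch below. It follows the general layout of a Pulsar schema definition (a `type`, an Avro-style record serialized as a string in `schema`, and `properties`). The record name `OpenMLDBSchema`, the field subset, and the mapping of the timestamp column to `long` are assumptions for illustration.

```python
import json

# Hypothetical subset of columns as (name, Avro type); the timestamp
# column is assumed to travel as a long (milliseconds since epoch).
FIELDS = [
    ("id", "string"),
    ("vendor_id", "int"),
    ("pickup_datetime", "long"),
    ("passenger_count", "int"),
    ("trip_duration", "int"),
]


def build_schema_definition(record_name, fields):
    """Build a Pulsar JSON schema definition for the given record fields."""
    avro_schema = {
        "type": "record",
        "name": record_name,
        "fields": [{"name": name, "type": avro_type}
                   for name, avro_type in fields],
    }
    return {
        "type": "JSON",
        # Pulsar expects the record schema as an escaped JSON string.
        "schema": json.dumps(avro_schema),
        "properties": {},
    }


definition = build_schema_definition("OpenMLDBSchema", FIELDS)
print(json.dumps(definition, indent=2))
```

The printed JSON could be saved to a file and uploaded with `pulsar-admin schemas upload test_openmldb -f <file>`.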

Once both the sink and the schema have been created, any message sent to the topic “test_openmldb” (the “inputs” item in the sink configuration) is automatically written to the OpenMLDB cluster.

Step 3 | Test

After the first two steps are completed, you can run the test. The key code of the Producer used for testing is as follows:

We use the Producer program to write two JSON messages to Pulsar, and then query these two rows in OpenMLDB online storage. This shows that the connector works normally and automatically sinks messages flowing into Pulsar into OpenMLDB online storage.
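
As a minimal sketch of the test data, the snippet below builds two JSON message payloads and checks them against the expected field names before they would be sent. The field names and values are hypothetical and must match whatever schema you uploaded for “test_openmldb”. Sending the payloads requires a Pulsar client and a running broker, which this sketch leaves out.

```python
import json

# Hypothetical field names; they must match the schema uploaded for the topic.
EXPECTED_FIELDS = ["id", "vendor_id", "pickup_datetime",
                   "passenger_count", "trip_duration"]


def build_message(row):
    """Check a row against the expected fields and encode it as JSON bytes."""
    missing = [name for name in EXPECTED_FIELDS if name not in row]
    if missing:
        raise ValueError(f"row is missing fields: {missing}")
    ordered = {name: row[name] for name in EXPECTED_FIELDS}
    return json.dumps(ordered).encode("utf-8")


# Two made-up taxi trip rows; timestamps are milliseconds since epoch.
rows = [
    {"id": "1", "vendor_id": 1, "pickup_datetime": 1467302350000,
     "passenger_count": 1, "trip_duration": 455},
    {"id": "2", "vendor_id": 2, "pickup_datetime": 1467302360000,
     "passenger_count": 3, "trip_duration": 663},
]

payloads = [build_message(row) for row in rows]
for payload in payloads:
    print(payload.decode("utf-8"))
```

After the messages are sent to the topic, a query such as `SELECT * FROM connector_test;` in OpenMLDB online mode should return the two rows if the sink is working.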

In Closing

OpenMLDB Upstream and Downstream Ecosystem

To further lower the threshold for developers to use OpenMLDB, the OpenMLDB community will continue to build an ecosystem around upstream and downstream technology components and provide developers with more simple, easy-to-use ecosystem connectors (as shown in the figure below):

  • For the online data ecosystem, such as Pulsar, Kafka, Flink, RabbitMQ, RocketMQ, etc.

  • For the offline data ecosystem, such as HDFS, HBase, Cassandra, S3, etc.

  • For model building algorithms and frameworks, such as XGBoost, LightGBM, TensorFlow, PyTorch, scikit-learn, etc.

  • For scheduling frameworks and deployment tools covering the whole machine learning modeling process, such as Airflow, Kubeflow, DolphinScheduler, Prometheus, Grafana, etc.

OpenMLDB Roadmap

v0.5.0

The OpenMLDB community will release version v0.5.0 ([dev] OpenMLDB v0.5 roadmap · Issue #1506 · 4paradigm/OpenMLDB · GitHub) at the end of April. OpenMLDB will then have the following new features:

  • Window pre-aggregation technology that greatly improves the performance of long-window aggregation

  • Improved monitoring, tracing, and profiling capabilities that greatly increase stability, observability, and analyzability in enterprise application environments

  • A pluggable online storage engine to meet different business requirements. It can use a high-performance memory-based storage engine, a large-capacity, low-cost storage engine based on external storage, or a persistent-memory-based storage engine that balances performance and cost

  • User-defined function (UDF) support, greatly improving ease of use and applicability

  • Ecosystem integration with upstream and downstream data sources, providing Kafka and Pulsar connectors for online data sources

The progress of AI takes the effort of many people, and open collaboration is a key part of it. We look forward to contributions from developers and welcome you to the OpenMLDB community. Scan the QR code below to join the community's technical exchange WeChat group.

Alternatively, you can find the entry to the group at the bottom of the community section of the official website: https://openmldb.ai/community/

Related Articles

https://github.com/4paradigm/OpenMLDB/issues/1506

(OpenMLDB Pulsar Connector)

https://openmldb.ai/docs/zh/v0.4/about/index.html

(OpenMLDB Document)

Built-in connector · Apache Pulsar

(Apache Pulsar connector documentation; the position of the OpenMLDB Pulsar Connector is as shown in the figure)