Taxi Trip Duration Prediction (OpenMLDB+LightGBM)
This article will use New York City Taxi Trip Duration from Kaggle as an example to demonstrate how to use the combination of OpenMLDB and LightGBM to create a complete machine-learning application.
Please note that this document uses a pre-compiled Docker image. If you wish to run the test in your own self-compiled OpenMLDB environment, you will need to configure and use the Spark distribution optimized for feature engineering. Refer to the Spark Distribution for OpenMLDB Optimization documentation and the Installation and Deployment documentation for more details.
Preparation and Preliminary Knowledge#
This article centers on development and deployment with the OpenMLDB CLI. To begin, download the sample data and start the OpenMLDB CLI. We recommend using the Docker image for a streamlined experience.
Docker version: >= 18.03
Pull Image#
Execute the following command from the command line to pull the OpenMLDB image and start the Docker container:
docker run -it 4pdosc/openmldb:0.8.5 bash
This image comes pre-installed with OpenMLDB and encompasses all the scripts, third-party libraries, open-source tools, and training data necessary for this case.
Note
Keep in mind that the demonstration commands in the OpenMLDB section of this tutorial are executed by default within the Docker container that has been started.
Initialize Environment#
./init.sh
cd taxi-trip
The init.sh script provided within the image helps users initialize the environment quickly, including:
Configuring Zookeeper
Starting the cluster version of OpenMLDB
Start OpenMLDB CLI#
/work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client
Preliminary Knowledge: Asynchronous Tasks#
Some OpenMLDB commands are asynchronous, such as the LOAD DATA, SELECT, and SELECT INTO commands in online/offline mode. After submitting a task, you can use commands such as SHOW JOBS and SHOW JOB to view its progress. For details, please refer to the Offline Task Management document.
Machine Learning Process#
Step 1: Create Database and Table#
Create database demo_db and table t1:
--OpenMLDB CLI
CREATE DATABASE demo_db;
USE demo_db;
CREATE TABLE t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int);
Step 2: Import Offline Data#
First, switch to offline execution mode. Then, import the sample data /work/taxi-trip/data/taxi_tour_table_train_simple.snappy.parquet as offline data for offline feature computation.
--OpenMLDB CLI
USE demo_db;
SET @@execute_mode='offline';
LOAD DATA INFILE '/work/taxi-trip/data/taxi_tour_table_train_simple.snappy.parquet' INTO TABLE t1 options(format='parquet', header=true, mode='append');
Note
LOAD DATA is an asynchronous task. Please use the SHOW JOBS command to check the task's running status. Wait for the task to run successfully (its state transitions to FINISHED) before proceeding to the next step.
Step 3: Feature Design#
Typically, before designing features, users need to analyze the data based on machine learning goals and then design and explore features based on this analysis. However, this article does not cover data analysis and feature research in machine learning. It assumes that users have a basic understanding of machine learning theory, the ability to solve machine learning problems, proficiency in SQL syntax, and the capacity to construct features using SQL syntax. In this case, it is assumed that the user has designed the following features through analysis and research:
| Feature Name | Feature Meaning | SQL Feature Representation |
|---|---|---|
| trip_duration | Travel time for a single trip | `trip_duration` |
| passenger_count | Number of passengers | `passenger_count` |
| vendor_sum_pl | Cumulative pickup_latitude for taxis of the same brand within the last 1-day window | `sum(pickup_latitude) OVER w` |
| vendor_max_pl | Maximum pickup_latitude for taxis of the same brand within the last 1-day window | `max(pickup_latitude) OVER w` |
| vendor_min_pl | Minimum pickup_latitude for taxis of the same brand within the last 1-day window | `min(pickup_latitude) OVER w` |
| vendor_avg_pl | Average pickup_latitude for taxis of the same brand within the last 1-day window | `avg(pickup_latitude) OVER w` |
| pc_sum_pl | Cumulative pickup_latitude for trips with the same passenger count within the last 1-day window | `sum(pickup_latitude) OVER w2` |
| pc_max_pl | Maximum pickup_latitude for trips with the same passenger count within the last 1-day window | `max(pickup_latitude) OVER w2` |
| pc_min_pl | Minimum pickup_latitude for trips with the same passenger count within the last 1-day window | `min(pickup_latitude) OVER w2` |
| pc_avg_pl | Average pickup_latitude for trips with the same passenger count within the last 1-day window | `avg(pickup_latitude) OVER w2` |
| pc_cnt | Total number of trips with the same passenger count within the last 1-day window | `count(vendor_id) OVER w2` |
| vendor_cnt | Total number of trips for taxis of the same brand within the last 1-day window | `count(vendor_id) OVER w` |
In practice, feature research is iterative: data scientists experiment with features repeatedly to find the set that works best for the model. The cycle of "feature design -> offline feature extraction -> model training" is therefore usually repeated several times, with features continuously adjusted until the desired results are achieved.
Step 4: Offline Feature Extraction#
Users perform feature extraction in offline mode and save the feature results in the /tmp/feature_data directory for future model training. The SELECT command below corresponds to the SQL feature computation script generated from the feature design above.
--OpenMLDB CLI
USE demo_db;
SET @@execute_mode='offline';
SELECT trip_duration, passenger_count,
sum(pickup_latitude) OVER w AS vendor_sum_pl,
max(pickup_latitude) OVER w AS vendor_max_pl,
min(pickup_latitude) OVER w AS vendor_min_pl,
avg(pickup_latitude) OVER w AS vendor_avg_pl,
sum(pickup_latitude) OVER w2 AS pc_sum_pl,
max(pickup_latitude) OVER w2 AS pc_max_pl,
min(pickup_latitude) OVER w2 AS pc_min_pl,
avg(pickup_latitude) OVER w2 AS pc_avg_pl,
count(vendor_id) OVER w2 AS pc_cnt,
count(vendor_id) OVER w AS vendor_cnt
FROM t1
WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW),
w2 AS (PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW) INTO OUTFILE '/tmp/feature_data';
Note
SELECT INTO is an asynchronous task. Please use the SHOW JOBS command to check the task's running status and wait for it to run successfully (transitioning to the FINISHED state) before proceeding to the next step.
Step 5: Model Training#
Model training is not completed within OpenMLDB. Therefore, exit the OpenMLDB CLI first using the quit command:
quit;
On the regular command line, run train.py (located in the /work/taxi-trip directory), which uses the open-source training tool LightGBM to train the model based on the offline feature table generated in the previous step. The training results are saved in /tmp/model.txt.
python3 train.py /tmp/feature_data /tmp/model.txt
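For reference, the following is a minimal sketch of the same idea in plain Python; it is not the bundled train.py. It assumes the offline features from Step 4 were exported as CSV part files with a header row, uses trip_duration as the label, and the hyperparameters are purely illustrative.
# train_sketch.py -- illustrative only; the bundled train.py may differ.
import glob
import sys

import lightgbm as lgb
import pandas as pd

feature_dir, model_path = sys.argv[1], sys.argv[2]   # e.g. /tmp/feature_data /tmp/model.txt

# SELECT ... INTO OUTFILE writes one or more part files into the output directory;
# this assumes the default CSV format with a header row.
frames = [pd.read_csv(f) for f in glob.glob(feature_dir + "/*.csv")]
df = pd.concat(frames, ignore_index=True)

label = df.pop("trip_duration")                      # the label column designed in Step 3
train_set = lgb.Dataset(df, label=label)

# Hypothetical hyperparameters -- adjust to your own experiments.
params = {"objective": "regression", "metric": "rmse", "learning_rate": 0.1}
bst = lgb.train(params, train_set, num_boost_round=100)
bst.save_model(model_path)                           # consumed by the prediction service in Step 8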
Step 6: Launch Feature Extraction SQL Script#
Assuming that the features designed in Step 3: Feature Design produced the expected model in the training above, the next step is to deploy the feature extraction SQL script online to provide an online feature extraction service.
Restart the OpenMLDB CLI for SQL online deployment:
/work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client
Execute online deployment:
--OpenMLDB CLI
USE demo_db;
SET @@execute_mode='online';
DEPLOY demo SELECT trip_duration, passenger_count,
sum(pickup_latitude) OVER w AS vendor_sum_pl,
max(pickup_latitude) OVER w AS vendor_max_pl,
min(pickup_latitude) OVER w AS vendor_min_pl,
avg(pickup_latitude) OVER w AS vendor_avg_pl,
sum(pickup_latitude) OVER w2 AS pc_sum_pl,
max(pickup_latitude) OVER w2 AS pc_max_pl,
min(pickup_latitude) OVER w2 AS pc_min_pl,
avg(pickup_latitude) OVER w2 AS pc_avg_pl,
count(vendor_id) OVER w2 AS pc_cnt,
count(vendor_id) OVER w AS vendor_cnt
FROM t1
WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW),
w2 AS (PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW);
BIAS OPTIONS may be needed in the DEPLOY statement here because data files imported into online storage are never updated: measured against the current time, their timestamps can already exceed the TTL (Time-To-Live) of a table index created by DEPLOY, so those rows would be expired from the table. Time-based expiration depends only on the ts column and ttl of each index: if a row's value in that column is less than (current time - abs_ttl), the row is expired for that index, irrespective of anything else, and different indexes do not affect each other. If your data does not carry newly generated real-time timestamps, you should also consider adding BIAS OPTIONS.
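To make the expiration rule above concrete, here is a tiny illustrative check in plain Python (not OpenMLDB code), assuming both the ts value and abs_ttl are expressed in milliseconds:
# ttl_sketch.py -- illustrates the expiration rule described above.
import time

ONE_DAY_MS = 24 * 60 * 60 * 1000

def expired_for_index(ts_ms, abs_ttl_ms, now_ms=None):
    # A row is expired for an index iff its ts value is older than (current time - abs_ttl).
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return ts_ms < now_ms - abs_ttl_ms

# A pickup_datetime two days in the past is expired under a 1-day abs_ttl ...
print(expired_for_index(int(time.time() * 1000) - 2 * ONE_DAY_MS, ONE_DAY_MS))  # True
# ... while a fresh timestamp is not.
print(expired_for_index(int(time.time() * 1000), ONE_DAY_MS))                   # False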
Step 7: Import Online Data#
First, switch to online execution mode. Then, in online mode, import the sample data /work/taxi-trip/data/taxi_tour_table_train_simple.csv as online data for online feature computation.
--OpenMLDB CLI
USE demo_db;
SET @@execute_mode='online';
LOAD DATA INFILE 'file:///work/taxi-trip/data/taxi_tour_table_train_simple.csv' INTO TABLE t1 options(format='csv', header=true, mode='append');
Note
LOAD DATA is an asynchronous task. Please use the SHOW JOBS command to monitor the task's progress and wait for it to complete successfully (transition to the FINISHED state) before proceeding to the next step.
Step 8: Start Prediction Service#
If you have not already done so, exit the OpenMLDB CLI first.
quit;
Start the prediction service on the regular command line:
./start_predict_server.sh 127.0.0.1:9080 /tmp/model.txt
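For reference, below is a minimal, illustrative sketch of how a model file like /tmp/model.txt could be served over HTTP; it is not the script shipped in the image. The bundled service additionally performs online feature extraction through OpenMLDB before scoring, which is omitted here, and the JSON payload shape and port 8887 are assumptions made for the example.
# predict_server_sketch.py -- illustrative only; not the script shipped in the image.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import lightgbm as lgb
import numpy as np

bst = lgb.Booster(model_file="/tmp/model.txt")        # model produced in Step 5

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length))
        ins = np.array(body["ins"], dtype=float)      # a batch of feature rows, e.g. [[... 11 values ...]]
        pred = bst.predict(ins)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"trip_duration": pred.tolist()}).encode())

if __name__ == "__main__":
    # 8887 is an arbitrary illustrative port, not necessarily the one used by the image.
    HTTPServer(("0.0.0.0", 8887), PredictHandler).serve_forever()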
Step 9: Send Prediction Request#
Execute the built-in predict.py script from the regular command line. The script sends a row of request data to the prediction service, receives the prediction result, and prints it.
# Run inference with a HTTP request
python3 predict.py
# The following output is expected (the numbers might be slightly different)
----------------ins---------------
[[ 2. 40.774097 40.774097 40.774097 40.774097 40.774097 40.774097
40.774097 40.774097 1. 1. ]]
---------------predict trip_duration -------------
848.014745715936 s
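For reference, below is a minimal, illustrative sketch of the request flow behind such a prediction: it asks the deployed SQL from Step 6 for the online features of one raw row through the OpenMLDB API server's HTTP interface, then scores those features with the model from Step 5. The endpoint path, the response layout, and the sample row values are assumptions and may differ from the bundled predict.py and from your OpenMLDB version.
# request_flow_sketch.py -- illustrative only; the bundled predict.py may differ.
import lightgbm as lgb
import numpy as np
import requests

# Assumed API server address (the same endpoint passed to start_predict_server.sh in Step 8).
DEPLOYMENT_URL = "http://127.0.0.1:9080/dbs/demo_db/deployments/demo"

# One made-up raw row in the column order of table t1; timestamps are epoch milliseconds.
row = ["id_demo_0001", 1, 1467302350000, 1467304896000, 2,
       -73.873093, 40.774097, -73.926704, 40.856739, "N", 1]

# 1. Online feature extraction through the deployed SQL (Step 6).
resp = requests.post(DEPLOYMENT_URL, json={"input": [row]})
features = resp.json()["data"]["data"][0]    # response layout may vary by version

# 2. Drop the label column (trip_duration is the first output column) and score.
bst = lgb.Booster(model_file="/tmp/model.txt")
ins = np.array([features[1:]], dtype=float)
print("predicted trip_duration:", bst.predict(ins)[0], "s")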