# Amazon S3

## Introduction

Amazon S3 is an object storage service on the AWS cloud. OpenMLDB supports using S3 as an offline storage engine for importing and exporting feature computation data.

## Configuration

### Configure TaskManager

To read data from and write data to Amazon S3, you first need to register an AWS account and obtain a usable AccessKey and SecretKey.

Then add the following configuration to the TaskManager configuration file `taskmanager.properties`, making sure to replace the AccessKey and SecretKey values with the credentials obtained above.

```properties
spark.default.conf=spark.hadoop.fs.s3a.access.key=xxx;spark.hadoop.fs.s3a.secret.key=xxx
```
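If your bucket is not in the default region, or you are using S3-compatible object storage, Spark's Hadoop S3A connector can also be pointed at a specific endpoint via `fs.s3a.endpoint`. The snippet below is a minimal sketch; the key values and the endpoint are placeholders that you should replace with your own.

```properties
# Minimal sketch of the S3 settings in taskmanager.properties.
# The key values are placeholders; fs.s3a.endpoint is optional and only
# needed for non-default regions or S3-compatible object storage.
spark.default.conf=spark.hadoop.fs.s3a.access.key=xxx;spark.hadoop.fs.s3a.secret.key=xxx;spark.hadoop.fs.s3a.endpoint=s3.us-east-1.amazonaws.com
```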

## Usage

### Import Data

After the TaskManager is successfully configured and restarted, you can import data from the corresponding bucket and path by executing the following command.

```sql
LOAD DATA INFILE 's3a://bucket/path/' INTO TABLE t1 OPTIONS (header=true, mode='append', deep_copy=true);
```

The S3 data source supports two values for the `deep_copy` option: `true` (hard copy) and `false` (soft link). Only when `deep_copy=true` will the data be copied from S3 into OpenMLDB offline storage.

Please note that only `overwrite` mode can be used when `deep_copy` is set to `false`. If a soft link was imported earlier, it will be cleared. If you want to have both hard copies and soft links, you must make the hard copies first (`deep_copy=true`) and then create the soft links (`deep_copy=false`), as sketched below. In addition, since the offline data format is parquet, the soft-linked data must also be parquet files with the same schema.
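As an illustration of that ordering, a hard copy followed by a soft link might look like the following sketch; the bucket paths and table name are placeholders, and the `format='parquet'` option reflects the parquet requirement above.

```sql
-- 1. Hard copy: physically copy the S3 data into OpenMLDB offline storage.
LOAD DATA INFILE 's3a://bucket/hard_copy_path/' INTO TABLE t1
OPTIONS (header=true, mode='append', deep_copy=true);

-- 2. Soft link: record the S3 path instead of copying the data.
--    Only 'overwrite' mode is allowed here, and the linked files must be
--    parquet with the same schema as t1.
LOAD DATA INFILE 's3a://bucket/soft_link_path/' INTO TABLE t1
OPTIONS (format='parquet', mode='overwrite', deep_copy=false);
```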