# LOAD DATA INFILE
The `LOAD DATA INFILE` statement loads data efficiently from a file into a table. `LOAD DATA INFILE` is complementary to `SELECT ... INTO OUTFILE`: to export data from a table to a file, use `SELECT ... INTO OUTFILE`.
## Syntax

```
LoadDataInfileStmt ::= 'LOAD' 'DATA' 'INFILE' filePath 'INTO' 'TABLE' tableName LoadDataInfileOptionsList

filePath ::= URI

tableName ::= string_literal

LoadDataInfileOptionsList ::= 'OPTIONS' '(' LoadDataInfileOptionItem (',' LoadDataInfileOptionItem)* ')'

LoadDataInfileOptionItem ::= 'DELIMITER' '=' string_literal
                           | 'HEADER' '=' bool_literal
                           | 'NULL_VALUE' '=' string_literal
                           | 'FORMAT' '=' string_literal
                           | 'QUOTE' '=' string_literal
                           | 'MODE' '=' string_literal
                           | 'DEEP_COPY' '=' bool_literal
                           | 'LOAD_MODE' '=' string_literal
                           | 'THREAD' '=' int_literal

URI ::= 'file://FilePathPattern'
      | 'hdfs://FilePathPattern'
      | 'hive://[db.]table'
      | 'FilePathPattern'

FilePathPattern ::= string_literal
```
The `FilePathPattern` supports the wildcard character `*`, with the same matching rules as `ls FilePathPattern`.
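For example, a wildcard path can match several files at once. A minimal sketch (the path and table name are illustrative):

```sql
-- Illustrative only: load every matching csv part file into table t1
LOAD DATA INFILE 'file:///data/parts/part-*.csv' INTO TABLE t1 OPTIONS(delimiter=',');
```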
Loading data from Hive is supported, but it needs extra settings; see Hive Support below.
The following table introduces the parameters of `LOAD DATA INFILE`.

| Parameter | Type | Default Value | Note |
| --- | --- | --- | --- |
| delimiter | String | `,` | It defines the column separator; the default value is `,`. |
| header | Boolean | true | It indicates whether the file to import has a header. If the value is `true`, the first line of the file is treated as the header. |
| null_value | String | null | It defines the string that will be used to replace the `NULL` values. |
| format | String | csv | It defines the format of the input file. `csv` and `parquet` are supported. |
| quote | String | `"` | It defines the string surrounding the input data. The string length should be <= 1. |
| mode | String | `error_if_exists` | It defines the input mode: `error_if_exists`, `overwrite`, or `append`. |
| deep_copy | Boolean | true | It defines whether to perform a deep copy. If `false`, only a soft link to the source data is created. |
| load_mode | String | cluster | With `cluster`, the data is loaded by a batch job; with `local`, it is loaded locally. |
| thread | Integer | 1 | It only works for data loading locally, i.e., when `load_mode='local'`; it sets the number of loading threads. |
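Several of these options can be combined in one statement. A minimal sketch (the path and table name are illustrative):

```sql
-- Illustrative only: overwrite t1 with a parquet file instead of the default csv
LOAD DATA INFILE 'hdfs://namenode:9000/user/openmldb/data.parquet'
INTO TABLE t1 OPTIONS(format='parquet', mode='overwrite');
```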
Note

In the cluster version, the specified execution mode (defined by `execute_mode`) determines whether data is imported to online or offline storage when the `LOAD DATA INFILE` statement is executed. In the standalone version, there is no difference in storage mode, and the `deep_copy` option is not supported. As mentioned in the table above, the online execution mode only supports the `append` input mode.

When `deep_copy=false`, OpenMLDB does not support modifying the data behind the soft link. Therefore, if the current offline data comes from a soft link, `append` import is no longer supported. Moreover, if the current offline data is a soft copy, a hard copy with `overwrite` will not delete the data behind the soft link.
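The restriction can be illustrated as follows (the paths are hypothetical):

```sql
set @@execute_mode='offline';
-- Soft copy: t1's offline data now points at the source without copying it
LOAD DATA INFILE 'hdfs://path/to/data' INTO TABLE t1 OPTIONS(deep_copy=false);
-- Appending to offline data that comes from a soft link is not supported,
-- so a statement like the following would be rejected:
-- LOAD DATA INFILE 'hdfs://path/to/more' INTO TABLE t1 OPTIONS(mode='append');
```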
Warning

INFILE Path

In the cluster version, if `load_mode='cluster'`, the `INFILE` path is read by a batch job. If it is a relative path, it needs to be accessible from wherever the job runs. However, in a production environment the execution of batch jobs is usually scheduled by a yarn cluster, so it cannot be determined in advance which machine will actually perform the task. Likewise, in a testing environment with a multi-machine deployment, there is no way to know where the batch job will run.

Therefore, you are advised to use absolute paths. In the standalone version, a local file path starts with `file://`. In a production environment, it is recommended to use a file system such as HDFS.
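For example, absolute paths with explicit schemes look like this (the hosts and paths are illustrative):

```sql
-- Standalone version: absolute local path with the file:// scheme
LOAD DATA INFILE 'file:///work/data.csv' INTO TABLE t1;
-- Cluster version in production: a path on HDFS
LOAD DATA INFILE 'hdfs://namenode:9000/user/openmldb/data.csv' INTO TABLE t1;
```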
## SQL Statement Template

```sql
LOAD DATA INFILE 'file_name' INTO TABLE 'table_name' OPTIONS (key = value, ...);
```
## Example

The following SQL example imports data from a file `data.csv` into a table `t1` using online storage. `data.csv` uses `,` as the column separator.

```sql
set @@execute_mode='online';
LOAD DATA INFILE 'data.csv' INTO TABLE t1 OPTIONS( delimiter = ',' );
```
The following SQL example imports data from the file `data.csv` into the table `t1`. `data.csv` uses `,` as the column delimiter. Null values will be replaced by the string "NA".

```sql
LOAD DATA INFILE 'data.csv' INTO TABLE t1 OPTIONS( delimiter = ',', null_value='NA');
```
The following example shows a soft copy.

```sql
set @@execute_mode='offline';
LOAD DATA INFILE 'data_path' INTO TABLE t1 OPTIONS(deep_copy=false);
```
## Hive Support

### Hive Data Format
The Hive data formats below are supported; others (e.g. Binary) are unsupported.
| OpenMLDB Data Format | Hive Data Format |
| --- | --- |
| BOOL | BOOLEAN |
| SMALLINT | SMALLINT |
| INT | INT |
| BIGINT | BIGINT |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| DATE | DATE |
| TIMESTAMP | TIMESTAMP |
### Enable Hive Support
To support Hive, Spark needs the Hive dependencies and a Hive configuration.
#### Hive dependencies in Spark

OpenMLDB Spark v0.6.7 and above ship with the Hive dependencies. If you use another Spark release, you should build Spark with Hive support; the dependencies will be in `assembly/target/scala-xx/jars`. Add them to the Spark class path.

```bash
./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
```
#### Hive Conf

Connecting to Hive through the metastore service is supported. There are two ways to configure it:

- spark.conf: You can set `spark.hadoop.hive.metastore.uris` in the Spark conf, in either of two places:
  - taskmanager.properties: add `spark.hadoop.hive.metastore.uris=thrift://...` to `spark.default.conf`, then restart the taskmanager.
  - CLI: add it to the ini conf and use `--spark_conf` when starting the CLI; see Spark Client Configuration.
- hive-site.xml: You can set `hive.metastore.uris` in `hive-site.xml` and add the file to the Spark home `conf/` directory. An example `hive-site.xml`:

```xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Make sure that <value> points to the Hive Metastore URI in your cluster -->
    <value>thrift://localhost:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>
```
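Once Hive support is enabled, data can be imported from a Hive table with the `hive://` URI from the syntax above. A minimal sketch (the database and table names are illustrative):

```sql
-- Illustrative only: import the Hive table db1.t1 into the OpenMLDB table t1
LOAD DATA INFILE 'hive://db1.t1' INTO TABLE t1;
```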
## CSV Source Data Format
Both csv and parquet are supported, but be careful with the csv format. Here is an example:

```
c1, c2
,
"",""
ab,cd
"ef","gh"
null,null
```
The first data row in the csv contains two blank values.

- cluster mode: they load as `null`, whatever the null_value is.
- local mode: they load as empty strings; see issue 3015.

The second data row contains two pairs of double quotation marks.

- cluster mode: the default quote is `"`, so they load as two empty strings.
- local mode: the default quote is `\0`, so they load as two strings each consisting of two double quotation marks. You can set quote to `"` in local mode, but the escape rule is different from Spark: `""` means `"`; see issue 3015.
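To illustrate the last point, setting the quote character explicitly in local mode might look like this (the file name is illustrative):

```sql
-- Illustrative only: local load that treats " as the quote character,
-- where "" inside a quoted field unescapes to a single "
LOAD DATA INFILE 'data.csv' INTO TABLE t1 OPTIONS(load_mode='local', quote='"');
```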