# LOAD DATA INFILE
The `LOAD DATA INFILE` statement loads data efficiently from a file into a table. `LOAD DATA INFILE` is complementary to `SELECT ... INTO OUTFILE`: to export data from a table to a file, use `SELECT ... INTO OUTFILE`.
## Syntax
```
LoadDataInfileStmt
    ::= 'LOAD' 'DATA' 'INFILE' filePath 'INTO' 'TABLE' tableName LoadDataInfileOptionsList

filePath
    ::= URI

tableName
    ::= string_literal

LoadDataInfileOptionsList
    ::= 'OPTIONS' '(' LoadDataInfileOptionItem (',' LoadDataInfileOptionItem)* ')'

LoadDataInfileOptionItem
    ::= 'DELIMITER' '=' string_literal
      | 'HEADER' '=' bool_literal
      | 'NULL_VALUE' '=' string_literal
      | 'FORMAT' '=' string_literal
      | 'QUOTE' '=' string_literal
      | 'MODE' '=' string_literal
      | 'DEEP_COPY' '=' bool_literal
      | 'LOAD_MODE' '=' string_literal
      | 'THREAD' '=' int_literal

URI
    ::= 'file://FilePathPattern'
      | 'hdfs://FilePathPattern'
      | 'hive://[db.]table'
      | 'FilePathPattern'

FilePathPattern
    ::= string_literal
```
The `FilePathPattern` supports the wildcard character `*`, with the same matching rules as `ls FilePathPattern`.

Loading data from Hive is supported, but it needs extra settings; see Hive Support below.
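For instance, a minimal sketch of a wildcard load (the directory `/tmp/data/` and the file pattern are illustrative assumptions):

```sql
-- Load every CSV file under the hypothetical directory /tmp/data/
-- that matches the wildcard pattern, in a single statement.
LOAD DATA INFILE 'file:///tmp/data/*.csv' INTO TABLE t1 OPTIONS (delimiter = ',');
```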
The following table introduces the parameters of `LOAD DATA INFILE`.

| Parameter | Type | Default Value | Note |
|---|---|---|---|
| delimiter | String | `,` | It defines the column separator; the default value is `,`. |
| header | Boolean | true | It indicates whether the file to import has a header. If the value is `true`, the first line of the file is parsed as the header. |
| null_value | String | null | It defines the string that will be used to replace the `NULL` value. |
| format | String | csv | It defines the format of the input file; the default is `csv`. |
| quote | String | `""` | It defines the string surrounding the input data. The string length should be <= 1. The default is `""`, which means that the string surrounding the input data is empty. When the surrounding string is configured, the content surrounded by a pair of the quote characters will be parsed as a whole. For example, if the surrounding string is `#`, then `#a,b#` will be parsed as the single field `a,b`. |
| mode | String | `'error_if_exists'` | It defines the input mode. `error_if_exists` (the default) reports an error if the target storage already contains data; `overwrite` replaces the existing data; `append` adds the new data to the existing data. Online execution mode only supports `append`. |
| deep_copy | Boolean | true | It defines whether a deep copy is performed. If `false`, a soft copy is made: only the source path is recorded and the data itself is not copied. |
| load_mode | String | cluster | `cluster` reads the file in a batch job; `local` loads the file from the client process. |
| thread | Integer | 1 | It only works for data loading locally, i.e., when `load_mode='local'`; it defines the number of threads used for loading. |
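To illustrate how several of these options combine, here is a hedged sketch (the file name, quote character, and mode are illustrative choices, not taken from a real deployment). It imports a headerless CSV whose fields are quoted with `#` and appends the rows to `t1`:

```sql
-- Hypothetical example: headerless CSV with '#'-quoted fields,
-- appended to the data already stored for t1.
LOAD DATA INFILE 'data.csv' INTO TABLE t1
OPTIONS (header = false, quote = '#', mode = 'append');
```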
**Note**

In the cluster version, the specified execution mode (defined by `execute_mode`) determines whether to import data to online or offline storage when the `LOAD DATA INFILE` statement is executed. For the standalone version, there is no difference in storage mode, and the `deep_copy` option is not supported. As mentioned in the above table, the online execution mode only supports the `append` input mode.

When `deep_copy=false`, OpenMLDB does not support modifying the data behind the soft link. Therefore, if the current offline data comes from a soft link, `append` import is no longer supported. Moreover, if the current offline data is a soft copy, performing a hard copy with `overwrite` will not delete the data that the soft link points to.
**Warning: INFILE Path**

In the cluster version, if `load_mode='cluster'`, the reading of the INFILE path is done by a batch job. If the path is relative, it needs to be accessible to that job. However, in a production environment, the execution of batch jobs is usually scheduled by a yarn cluster, so it is not deterministic which node will actually perform the task. In a testing environment with a multi-machine deployment, it is likewise impossible to determine where the batch job runs.

Therefore, you are advised to use absolute paths. In the standalone version, the local file path starts with `file://`. In a production environment, it is recommended to use a file system such as HDFS.
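Following that advice, a minimal sketch with an absolute HDFS path (the namenode host, port, and path are placeholder assumptions):

```sql
-- An absolute HDFS path is resolvable from whichever node runs the batch job.
LOAD DATA INFILE 'hdfs://namenode:9000/warehouse/data.csv' INTO TABLE t1
OPTIONS (delimiter = ',');
```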
## SQL Statement Template

```sql
LOAD DATA INFILE 'file_name' INTO TABLE 'table_name' OPTIONS (key = value, ...);
```
## Example

The following SQL example imports data from the file `data.csv` into the table `t1` using online storage. `data.csv` uses `,` as the column separator.

```sql
set @@execute_mode='online';
LOAD DATA INFILE 'data.csv' INTO TABLE t1 OPTIONS( delimiter = ',' );
```
The following SQL example imports data from the file `data.csv` into the table `t1`. `data.csv` uses `,` as the column delimiter, and the null value will be replaced by the string `NA`.

```sql
LOAD DATA INFILE 'data.csv' INTO TABLE t1 OPTIONS( delimiter = ',', null_value='NA');
```
The following example shows a soft copy.

```sql
set @@execute_mode='offline';
LOAD DATA INFILE 'data_path' INTO TABLE t1 OPTIONS(deep_copy=false);
```
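As a further sketch, and only under the assumption that your deployment supports client-side loading as described in the parameter table, `load_mode='local'` can be combined with `thread` to load a CSV that lives on the client machine:

```sql
-- Hypothetical: load a client-local CSV with 4 loading threads
-- (thread only takes effect when load_mode='local').
LOAD DATA INFILE 'file:///tmp/data.csv' INTO TABLE t1
OPTIONS (load_mode = 'local', thread = 4);
```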
## Hive Support

### Hive Data Format

The Hive data formats listed below are supported; others (e.g. BINARY) are unsupported.
| OpenMLDB Data Format | Hive Data Format |
|---|---|
| BOOL | BOOLEAN |
| SMALLINT | SMALLINT |
| INT | INT |
| BIGINT | BIGINT |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| DATE | DATE |
| TIMESTAMP | TIMESTAMP |
### Enable Hive Support

To support Hive, you need the Hive dependencies and the Hive configuration.

#### Hive dependencies in Spark

OpenMLDB Spark v0.6.7 and above ship with the Hive dependencies. If you use another Spark release, you need to build the Hive dependencies in Spark yourself; the built jars are in `assembly/target/scala-xx/jars`. Add them to the Spark class path.

```bash
./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
```
#### Hive Conf

Connecting to Hive through the metastore service is supported.
- `spark.conf`: You can set `spark.hadoop.hive.metastore.uris` in the Spark conf, in one of two ways:
  - `taskmanager.properties`: add `spark.hadoop.hive.metastore.uris=thrift://...` to `spark.default.conf`, then restart the taskmanager.
  - CLI: add it to the ini conf and use `--spark_conf` to start the CLI, see Spark Client Configuration.
- `hive-site.xml`: You can set `hive.metastore.uris` in `hive-site.xml` and add the file to the Spark home `conf/` directory. A `hive-site.xml` example:

```xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Make sure that <value> points to the Hive Metastore URI in your cluster -->
    <value>thrift://localhost:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>
```
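With the metastore configured, loading from Hive uses the `hive://` URI form shown in the Syntax section. A minimal sketch (the Hive database and table names in `db1.t1` are illustrative assumptions):

```sql
-- Import the Hive table db1.t1 (hypothetical name) into OpenMLDB table t1.
set @@execute_mode='offline';
LOAD DATA INFILE 'hive://db1.t1' INTO TABLE t1;
```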