# LOAD DATA INFILE
The `LOAD DATA INFILE` statement loads data efficiently from a file into a table. `LOAD DATA INFILE` is complementary to `SELECT ... INTO OUTFILE`: to export data from a table to a file, use `SELECT ... INTO OUTFILE`.
## Syntax

```
LoadDataInfileStmt ::= 'LOAD' 'DATA' 'INFILE' filePath 'INTO' 'TABLE' tableName LoadDataInfileOptionsList

filePath ::= URI

tableName ::= string_literal

LoadDataInfileOptionsList ::= 'OPTIONS' '(' LoadDataInfileOptionItem (',' LoadDataInfileOptionItem)* ')'

LoadDataInfileOptionItem ::= 'DELIMITER' '=' string_literal
                           | 'HEADER' '=' bool_literal
                           | 'NULL_VALUE' '=' string_literal
                           | 'FORMAT' '=' string_literal
                           | 'QUOTE' '=' string_literal
                           | 'MODE' '=' string_literal
                           | 'DEEP_COPY' '=' bool_literal
                           | 'LOAD_MODE' '=' string_literal
                           | 'THREAD' '=' int_literal

URI ::= 'file://FilePathPattern'
      | 'hdfs://FilePathPattern'
      | 'hive://[db.]table'
      | 'FilePathPattern'

FilePathPattern ::= string_literal
```
The `FilePathPattern` supports the wildcard character `*`, with the same matching rules as `ls FilePathPattern`.
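For example, a wildcard path can match several files at once. A minimal sketch (the path and table name are illustrative):

```sql
-- Illustrative only: load every matching csv part file into table t1
LOAD DATA INFILE 'file:///data/parts/part-*.csv' INTO TABLE t1 OPTIONS(delimiter=',');
```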
Loading data from Hive is supported, but it needs extra settings; see Hive Support below.
The following table introduces the parameters of `LOAD DATA INFILE`.

| Parameter | Type | Default Value | Note |
| --- | --- | --- | --- |
| delimiter | String | `,` | It defines the column separator; the default value is `,`. |
| header | Boolean | true | It indicates whether the file to import has a header. If the value is `true`, the first line of the file is treated as the header. |
| null_value | String | null | It defines the string that will be used to replace the `NULL` values. |
| format | String | csv | It defines the format of the input file. `csv` and `parquet` are supported. |
| quote | String | `"` | It defines the string surrounding the input data. The string length should be <= 1. |
| mode | String | `error_if_exists` | It defines the input mode: `error_if_exists`, `overwrite`, or `append`. |
| deep_copy | Boolean | true | It defines whether to perform a deep copy. If `false`, only a soft link to the source data is created. |
| load_mode | String | cluster | With `cluster`, the data is loaded by a batch job; with `local`, it is loaded locally. |
| thread | Integer | 1 | It only works for data loading locally, i.e., when `load_mode='local'`; it sets the number of loading threads. |
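Several of these options can be combined in one statement. A minimal sketch (the path and table name are illustrative):

```sql
-- Illustrative only: overwrite t1 with a parquet file instead of the default csv
LOAD DATA INFILE 'hdfs://namenode:9000/user/openmldb/data.parquet'
INTO TABLE t1 OPTIONS(format='parquet', mode='overwrite');
```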
Note

In the cluster version, the specified execution mode (defined by `execute_mode`) determines whether data is imported to online or offline storage when the `LOAD DATA INFILE` statement is executed. In the standalone version, there is no difference in storage mode, and the `deep_copy` option is not supported. As mentioned in the table above, the online execution mode only supports the `append` input mode.

When `deep_copy=false`, OpenMLDB does not support modifying the data behind the soft link. Therefore, if the current offline data comes from a soft link, `append` import is no longer supported. Moreover, if the current offline data is a soft copy, a hard copy with `overwrite` will not delete the data behind the soft link.
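The restriction can be illustrated as follows (the paths are hypothetical):

```sql
set @@execute_mode='offline';
-- Soft copy: t1's offline data now points at the source without copying it
LOAD DATA INFILE 'hdfs://path/to/data' INTO TABLE t1 OPTIONS(deep_copy=false);
-- Appending to offline data that comes from a soft link is not supported,
-- so a statement like the following would be rejected:
-- LOAD DATA INFILE 'hdfs://path/to/more' INTO TABLE t1 OPTIONS(mode='append');
```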
Warning

INFILE Path

In the cluster version, if `load_mode='cluster'`, the `INFILE` path is read by a batch job. If it is a relative path, it needs to be accessible from wherever the job runs. However, in a production environment the execution of batch jobs is usually scheduled by a yarn cluster, so it cannot be determined in advance which machine will actually perform the task. Likewise, in a testing environment with a multi-machine deployment, there is no way to know where the batch job will run.

Therefore, you are advised to use absolute paths. In the standalone version, a local file path starts with `file://`. In a production environment, it is recommended to use a file system such as HDFS.
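For example, absolute paths with explicit schemes look like this (the hosts and paths are illustrative):

```sql
-- Standalone version: absolute local path with the file:// scheme
LOAD DATA INFILE 'file:///work/data.csv' INTO TABLE t1;
-- Cluster version in production: a path on HDFS
LOAD DATA INFILE 'hdfs://namenode:9000/user/openmldb/data.csv' INTO TABLE t1;
```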
## SQL Statement Template

```sql
LOAD DATA INFILE 'file_name' INTO TABLE 'table_name' OPTIONS (key = value, ...);
```
## Example

The following SQL example imports data from a file `data.csv` into a table `t1` using online storage. `data.csv` uses `,` as the column separator.

```sql
set @@execute_mode='online';
LOAD DATA INFILE 'data.csv' INTO TABLE t1 OPTIONS( delimiter = ',' );
```
The following SQL example imports data from the file `data.csv` into the table `t1`. `data.csv` uses `,` as the column delimiter. Null values will be replaced by the string "NA".

```sql
LOAD DATA INFILE 'data.csv' INTO TABLE t1 OPTIONS( delimiter = ',', null_value='NA');
```
The following example shows a soft copy.

```sql
set @@execute_mode='offline';
LOAD DATA INFILE 'data_path' INTO TABLE t1 OPTIONS(deep_copy=false);
```
## Hive Support

### Hive Data Format
The Hive data formats below are supported; others (e.g. Binary) are unsupported.
| OpenMLDB Data Format | Hive Data Format |
| --- | --- |
| BOOL | BOOLEAN |
| SMALLINT | SMALLINT |
| INT | INT |
| BIGINT | BIGINT |
| FLOAT | FLOAT |
| DOUBLE | DOUBLE |
| DATE | DATE |
| TIMESTAMP | TIMESTAMP |
### Enable Hive Support
To support Hive, Spark needs the Hive dependencies and a Hive configuration.
#### Hive dependencies in Spark

OpenMLDB Spark v0.6.7 and above ship with the Hive dependencies. If you use another Spark release, you should build Spark with Hive support; the dependencies will be in `assembly/target/scala-xx/jars`. Add them to the Spark class path.

```bash
./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
```
#### Hive Conf

Connecting to Hive through the metastore service is supported. There are two ways to configure it:

- spark.conf: You can set `spark.hadoop.hive.metastore.uris` in the Spark conf, in either of two places:
  - taskmanager.properties: add `spark.hadoop.hive.metastore.uris=thrift://...` to `spark.default.conf`, then restart the taskmanager.
  - CLI: add it to the ini conf and use `--spark_conf` when starting the CLI; see Spark Client Configuration.
- hive-site.xml: You can set `hive.metastore.uris` in `hive-site.xml` and add the file to the Spark home `conf/` directory. An example `hive-site.xml`:

```xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <!-- Make sure that <value> points to the Hive Metastore URI in your cluster -->
    <value>thrift://localhost:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>
```
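Once Hive support is enabled, data can be imported from a Hive table with the `hive://` URI from the syntax above. A minimal sketch (the database and table names are illustrative):

```sql
-- Illustrative only: import the Hive table db1.t1 into the OpenMLDB table t1
LOAD DATA INFILE 'hive://db1.t1' INTO TABLE t1;
```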
## CSV Source Data Format
Both csv and parquet are supported, but be careful with the csv format. Here is an example:

```
c1, c2
,
"",""
ab,cd
"ef","gh"
null,null
```
The first data row in the csv contains two blank values.

- cluster mode: they load as `null`, whatever the null_value is.
- local mode: they load as empty strings; see issue 3015.

The second data row contains two pairs of double quotation marks.

- cluster mode: the default quote is `"`, so they load as two empty strings.
- local mode: the default quote is `\0`, so they load as two strings each consisting of two double quotation marks. You can set quote to `"` in local mode, but the escape rule is different from Spark: `""` means `"`; see issue 3015.
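To illustrate the last point, setting the quote character explicitly in local mode might look like this (the file name is illustrative):

```sql
-- Illustrative only: local load that treats " as the quote character,
-- where "" inside a quoted field unescapes to a single "
LOAD DATA INFILE 'data.csv' INTO TABLE t1 OPTIONS(load_mode='local', quote='"');
```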