# LOAD DATA INFILE

The `LOAD DATA INFILE` statement loads data efficiently from a file into a table. `LOAD DATA INFILE` is complementary to `SELECT ... INTO OUTFILE`. To export data from a table to a file, use [SELECT...INTO OUTFILE](../dql/SELECT_INTO_STATEMENT.md).

## Syntax

```sql
LoadDataInfileStmt
    ::= 'LOAD' 'DATA' 'INFILE' filePath 'INTO' 'TABLE' tableName LoadDataInfileOptionsList

filePath
    ::= URI

tableName
    ::= string_literal

LoadDataInfileOptionsList
    ::= 'OPTIONS' '(' LoadDataInfileOptionItem (',' LoadDataInfileOptionItem)* ')'

LoadDataInfileOptionItem
    ::= 'DELIMITER' '=' string_literal
      | 'HEADER' '=' bool_literal
      | 'NULL_VALUE' '=' string_literal
      | 'FORMAT' '=' string_literal
      | 'QUOTE' '=' string_literal
      | 'MODE' '=' string_literal
      | 'DEEP_COPY' '=' bool_literal
      | 'LOAD_MODE' '=' string_literal
      | 'THREAD' '=' int_literal

URI
    ::= 'file://FilePathPattern'
      | 'hdfs://FilePathPattern'
      | 'hive://[db.]table'
      | 'FilePathPattern'

FilePathPattern
    ::= string_literal
```

`FilePathPattern` supports the wildcard character `*`, with the same matching rules as `ls FilePathPattern`.

Loading data from Hive is supported, but it requires extra settings; see [Hive Support](#hive-support).

The following table introduces the parameters of `LOAD DATA INFILE`.

| Parameter  | Type    | Default Value     | Note |
| ---------- | ------- | ----------------- | ---- |
| delimiter  | String  | ,                 | It defines the column separator; the default value is `,`. |
| header     | Boolean | true              | It indicates whether the file to import has a header. If the value is `true`, the file has a header. |
| null_value | String  | null              | It defines the string that will be used to replace `NULL` values when loading data. |
| format     | String  | csv               | It defines the format of the input file.<br />`csv` is the default format.<br />`parquet` format is supported in the cluster version. |
| quote      | String  | "                 | It defines the string surrounding the input data. The string length should be <= 1.<br />`load_mode='cluster'`: the default is `"`, and the content surrounded by a pair of quote characters is parsed as a whole. For example, if the surrounding string is `#`, the original data `1, 1.0, #This is a string, with comma#, normal_string` will be converted into four fields: the first field is the integer 1, the second is the float 1.0, the third is the string "This is a string, with comma", and the fourth is "normal_string" even though it is not quoted.<br />`load_mode='local'`: the default is `\0`, which means that the string surrounding the input data is empty. |
| mode       | String  | "error_if_exists" | It defines the input mode.<br />`error_if_exists` is the default mode, which indicates that an error will be thrown if the offline table already has data. This input mode is only supported by the offline execution mode.<br />`overwrite` indicates that if the file already exists, the data will overwrite the contents of the original file. This input mode is only supported by the offline execution mode.<br />`append` indicates that if the table already exists, the data will be appended to the original table. Both offline and online execution modes support this input mode. |
| deep_copy  | Boolean | true              | It defines whether `deep_copy` is used. Only offline load supports `deep_copy=false`; you can specify the `INFILE` path as the offline storage address of the table to avoid a hard copy. |
| load_mode  | String  | cluster           | `load_mode='local'` only supports loading local `csv` files into `online` storage; it loads the data synchronously through the client process.<br />`load_mode='cluster'` is only supported in the cluster version. It loads the data via Spark, synchronously or asynchronously. |
| thread     | Integer | 1                 | It only works for local data loading, i.e., when `load_mode='local'` or in the standalone version; it defines the number of threads used for data loading. The max value is `50`. |
```{note}
- In the cluster version, the specified execution mode (defined by `execute_mode`) determines whether to import data to online or offline storage when the `LOAD DATA INFILE` statement is executed. For the standalone version, there is no difference in storage mode, and the `deep_copy` option is not supported.
- As mentioned in the above table, the online execution mode only supports the `append` input mode.
- When `deep_copy=false`, OpenMLDB does not support modifying the data in the soft link. Therefore, if the current offline data comes from a soft link, `append` import is no longer supported. Moreover, if the current offline data is a soft copy, a hard copy with `overwrite` will not delete the data of the soft link.
```
```{warning} INFILE Path
:class: warning

In the cluster version, if `load_mode='cluster'`, the reading of the `INFILE` path is done by a batch job. If it is a relative path, it needs to be a path accessible to that job. However, in a production environment, the execution of batch jobs is usually scheduled by a yarn cluster, so it is not deterministic which machine will actually perform the task. In a testing environment with a multi-machine deployment, it is likewise impossible to determine where the batch job runs. Therefore, you are advised to use absolute paths. In the standalone version, the local file path starts with `file://`. In a production environment, it is recommended to use a file system such as *HDFS*.
```

## SQL Statement Template

```sql
LOAD DATA INFILE 'file_name' INTO TABLE 'table_name' OPTIONS (key = value, ...);
```

### Examples

The following SQL example imports data from the file `data.csv` into the table `t1` using online storage. `data.csv` uses `,` as the column separator.

```sql
set @@execute_mode='online';
LOAD DATA INFILE 'data.csv' INTO TABLE t1 OPTIONS( delimiter = ',' );
```

The following SQL example imports data from the file `data.csv` into the table `t1`. `data.csv` uses `,` as the column separator. `NULL` values will be replaced by the string "NA".

```sql
LOAD DATA INFILE 'data.csv' INTO TABLE t1 OPTIONS( delimiter = ',', null_value='NA');
```
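The `load_mode` and `thread` options from the table above can be combined for a purely client-side import. The following is a minimal sketch, assuming the same online table `t1` as in the previous examples:

```sql
-- Hypothetical example: load a local csv into online storage through
-- the client process, using 10 threads (load_mode='local' only
-- supports loading csv files into online storage).
set @@execute_mode='online';
LOAD DATA INFILE 'data.csv' INTO TABLE t1
OPTIONS( load_mode='local', thread=10 );
```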
The following example shows a soft copy.

```sql
set @@execute_mode='offline';
LOAD DATA INFILE 'data_path' INTO TABLE t1 OPTIONS(deep_copy=false);
```
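The `format` and `mode` options are set the same way. Below is a minimal sketch, assuming a parquet file at a hypothetical HDFS path, that replaces the existing offline data of `t1`:

```sql
-- Hypothetical example: overwrite the offline data of t1 with a
-- parquet file. Both format='parquet' and mode='overwrite' are
-- offline-only, cluster-version features (see the options table).
set @@execute_mode='offline';
LOAD DATA INFILE 'hdfs://path/to/data.parquet' INTO TABLE t1
OPTIONS( format='parquet', mode='overwrite' );
```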
## Hive Support

### Hive Data Format

We support the Hive data formats below; others (e.g. Binary) are unsupported.

| OpenMLDB Data Format | Hive Data Format |
| -------------------- | ---------------- |
| BOOL                 | BOOLEAN          |
| SMALLINT             | SMALLINT         |
| INT                  | INT              |
| BIGINT               | BIGINT           |
| FLOAT                | FLOAT            |
| DOUBLE               | DOUBLE           |
| DATE                 | DATE             |
| TIMESTAMP            | TIMESTAMP        |

### Enable Hive Support

To support Hive, we need the Hive dependencies and the Hive configuration.

#### Hive dependencies in Spark

[OpenMLDB Spark](../../../tutorial/openmldbspark_distribution.md) v0.6.7 and above include the Hive dependencies. If you use another Spark release, you should build the Hive dependencies in Spark; the built dependencies are in `assembly/target/scala-xx/jars`. Add them to the Spark class path.

```
./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
```

#### Hive Conf

We support connecting to Hive through the metastore service.

- spark.conf: you can set `spark.hadoop.hive.metastore.uris` in the Spark conf, in either of two ways:
  - taskmanager.properties: add `spark.hadoop.hive.metastore.uris=thrift://...` to `spark.default.conf`, then restart the taskmanager.
  - CLI: add it to the ini conf and start the CLI with `--spark_conf`; see [Spark Client Configuration](../../client_config/client_spark_config.md).
- hive-site.xml: you can set `hive.metastore.uris` in `hive-site.xml` and add the file to the Spark home `conf/` directory. An example of `hive-site.xml`:

```xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>URI for client to contact metastore server</description>
  </property>
</configuration>
```
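Once the metastore URI is configured, Hive data can be imported with the `hive://` URI form shown in the syntax above. A minimal sketch, where `hive_db.hive_tbl` is a hypothetical Hive database and table:

```sql
-- Hypothetical example: import the Hive table hive_db.hive_tbl into
-- the offline storage of t1, using the 'hive://[db.]table' URI form
-- from the syntax section.
set @@execute_mode='offline';
LOAD DATA INFILE 'hive://hive_db.hive_tbl' INTO TABLE t1;
```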
## CSV Source Data Format

We support csv and parquet, but be careful with the csv format. Here is an example:

```
c1, c2
,
"",""
ab,cd
"ef","gh"
null,null
```

The first data row contains two blank values.
- cluster mode: a blank value is loaded as `null`, whatever the `null_value` is.
- local mode: a blank value is loaded as an empty string; see [issue3015](https://github.com/4paradigm/OpenMLDB/issues/3015).

The second data row contains two pairs of double quotation marks.
- cluster mode: the default quote is `"`, so this row is loaded as two empty strings.
- local mode: the default quote is `\0`, so this row is loaded as two strings, each consisting of two double quotation marks.

You can set the quote to `"` in local mode, but the escape rule is different from Spark's: `""` means `"`; see [issue3015](https://github.com/4paradigm/OpenMLDB/issues/3015).
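As a reference for the quote behavior above, here is a minimal sketch of a local-mode load that sets the quote to `"`, assuming the online table `t1` from the earlier examples:

```sql
-- Hypothetical example: local-mode load with quote set to '"', so a
-- field like "ef" in the csv above is parsed as the string ef rather
-- than a quoted literal (note the escape rule "" -> " differs from Spark).
set @@execute_mode='online';
LOAD DATA INFILE 'data.csv' INTO TABLE t1
OPTIONS( load_mode='local', quote='"' );
```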