AutoFE

AutoFE selects the top features for a given dataset and generates the SQL for feature extraction. You can use the generated SQL directly as your feature extraction script.

Usage

git clone https://github.com/4paradigm/OpenMLDB.git
cd OpenMLDB/python/openmldb_autofe
pip install .
openmldb_autofe <yaml_path>

YAML Configuration

More detailed configuration options can be found in the AutoFE test YAML file.

The required options are shown below:

apiserver: 127.0.0.1:9080 # AutoFE connects to OpenMLDB through the APIServer
db: demo_db # database that AutoFE uses for feature selection
tables:
  - table: t1
    schema: "id string, vendor_id int, ..., trip_duration int" # table schema
    file_path: file://... # feature selection computes real feature values, so the table data is required

  - table: t2
    ...

main_table: t1 # the only table if there is just one; otherwise, the table chosen as the main table
label: trip_duration # label column in the main table

windows:
  - name: w1 # main table time window
    partition_by: vendor_id
    order_by: pickup_datetime
    window_type: rows_range
    start: 1d PRECEDING
    end: CURRENT ROW

  - name: w2 # union time window; UNION currently supports only tables with the same schema
    union: t2
    partition_by: vendor_id
    order_by: pickup_datetime
    window_type: rows_range
    start: 1d PRECEDING
    end: CURRENT ROW

# offline_feature_path: # defaults to file:///tmp/autofe_offline_feature if not set. On a distributed OpenMLDB cluster, make sure both the TaskManager and the AutoFE process can read this path

topk: 10 # number of top features to select
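
To make the configuration above more concrete, here is a minimal Python sketch of two checks you might run on a config before passing it to `openmldb_autofe`. It assumes the YAML has already been parsed into a plain dict (for example with a YAML library). The helper names `missing_required` and `window_clause` are hypothetical and are not part of AutoFE. The list of required keys is simply read off the sample above, and the rendered window text follows OpenMLDB's SQL window syntax; the SQL that AutoFE actually generates may differ.

```python
# Hypothetical helpers for illustration only; not part of openmldb_autofe.

# Top-level keys that appear uncommented in the sample config above (assumption).
REQUIRED_KEYS = ("apiserver", "db", "tables", "main_table", "label", "windows", "topk")


def missing_required(config):
    """Return the required top-level keys that are absent from the config dict."""
    return [key for key in REQUIRED_KEYS if key not in config]


def window_clause(window):
    """Render one `windows` entry as an OpenMLDB-style SQL window definition."""
    # A union window also reads rows from another table with the same schema.
    union = f"UNION {window['union']} " if "union" in window else ""
    frame = window["window_type"].upper()  # rows_range -> ROWS_RANGE
    return (
        f"{window['name']} AS ({union}"
        f"PARTITION BY {window['partition_by']} "
        f"ORDER BY {window['order_by']} "
        f"{frame} BETWEEN {window['start']} AND {window['end']})"
    )


# Sample config mirroring the YAML above, with `topk` left out on purpose.
config = {
    "apiserver": "127.0.0.1:9080",
    "db": "demo_db",
    "tables": [{"table": "t1"}, {"table": "t2"}],
    "main_table": "t1",
    "label": "trip_duration",
    "windows": [
        {"name": "w1", "partition_by": "vendor_id", "order_by": "pickup_datetime",
         "window_type": "rows_range", "start": "1d PRECEDING", "end": "CURRENT ROW"},
        {"name": "w2", "union": "t2", "partition_by": "vendor_id",
         "order_by": "pickup_datetime", "window_type": "rows_range",
         "start": "1d PRECEDING", "end": "CURRENT ROW"},
    ],
}

print(missing_required(config))  # ['topk']
for w in config["windows"]:
    print(window_clause(w))
```

For `w2`, the sketch prints `w2 AS (UNION t2 PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW)`, which shows how the `union` key changes the window definition relative to `w1`.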