In this competition, you will make predictions on different datasets in CSV format.
Each dataset contains the following columns:
line_id — an Id for each line
target — target variable (only in the train dataset): a continuous variable for regression tasks and binary labels (0/1) for classification
<type>_<feature> — feature columns, where type is one of:
    number — numeric feature (may be continuous, categorical or binary)
    string — string feature
    datetime — date feature in 2010-01-01 or 2010-01-01 10:10:10 format
    id — Id (special-purpose categorical variable)
The model, compressed into a ZIP file, should be submitted to the evaluation system. Submissions run in a local environment using Docker; time and resources for testing are limited. In most cases, participants do not need any Docker experience.
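The <type>_<feature> naming convention above makes it possible to split a dataset's columns into feature groups automatically. A minimal stdlib sketch (the column names below are hypothetical examples, not a real dataset):

```python
# Split a CSV header into feature groups by the <type>_<feature> prefix
def split_columns(columns):
    groups = {"number": [], "string": [], "datetime": [], "id": []}
    for col in columns:
        prefix = col.split("_", 1)[0]
        if prefix in groups:
            groups[prefix].append(col)
    return groups

# Hypothetical header following the naming scheme
header = ["line_id", "target", "number_age", "string_city",
          "datetime_registered", "id_user"]
groups = split_columns(header)
# "line_id" and "target" are service columns and are skipped
```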
The root folder of the zip archive must contain a metadata.json file of the following form:
{
  "image": "sberbank/python",
  "entry_points": {
    "train_classification": "python train.py --mode classification --train-csv {train_csv} --model-dir {model_dir}",
    "train_regression": "python train.py --mode regression --train-csv {train_csv} --model-dir {model_dir}",
    "predict": "python predict.py --test-csv {test_csv} --prediction-csv {prediction_csv} --model-dir {model_dir}"
  }
}
Here:
image — the name of the Docker image that will run the submission;
entry_points — the commands that run the submission (train_* — train the models for classification and regression, predict — make predictions with the trained models).
The root directory for the submission will be the root directory of the zip archive.
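Building such an archive can be scripted; a sketch using Python's zipfile module (train.py and predict.py are the file names from the example metadata.json, stand-ins for your actual solution files):

```python
import json
import zipfile

# Write the metadata.json from the example above
metadata = {
    "image": "sberbank/python",
    "entry_points": {
        "train_classification": "python train.py --mode classification --train-csv {train_csv} --model-dir {model_dir}",
        "train_regression": "python train.py --mode regression --train-csv {train_csv} --model-dir {model_dir}",
        "predict": "python predict.py --test-csv {test_csv} --prediction-csv {prediction_csv} --model-dir {model_dir}",
    },
}
with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("metadata.json")  # must sit at the root of the archive
    # zf.write("train.py"); zf.write("predict.py")  # add your solution files
```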
The commands should contain the following placeholders, which will be replaced with the actual values during execution in the test system:
{train_csv}, {test_csv} — path to the CSV file containing the train or test data
{model_dir} — path to a pre-created directory that must hold the trained model
{prediction_csv} — path to the file that will contain the predictions
When the submission runs, the environment variable TIME_LIMIT is set to the maximum time (in seconds) allowed for the model execution.
It is guaranteed that the model will have at least 300 sec for train and prediction, however for big datasets this limit will be extended.
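Inside the container, the time budget can be read from the TIME_LIMIT variable. A common pattern (the fallback of 300 seconds mirrors the guaranteed minimum; the 30-second reserve is an arbitrary example) is to keep a reserve for saving the model:

```python
import os
import time

start = time.time()
# Fall back to the guaranteed minimum of 300 s if the variable is unset
time_limit = int(os.environ.get("TIME_LIMIT", 300))

def time_left(reserve=30):
    # Seconds remaining, keeping a reserve for saving the model
    return time_limit - (time.time() - start) - reserve

# A training loop could stop fitting extra models once time_left() < 0
```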
The following environments can be used to run a submission:
sberbank/python — Python 3 with many libraries preinstalled (details)
gcc — for C/C++ submissions
node — for JavaScript
openjdk — for Java
mono — for C#
Any other image available on DockerHub can be used as well. If needed, you can build your own image with the required software and libraries (see the instructions); you have to publish it on DockerHub in order to use it.
Here is how to run a submission in a container locally.
This example is based on the baseline solution and the public kernel by vlarine.
Public datasets for local validation: sdsj2018_automl_check_datasets.zip
First of all, install Docker for your OS (details). Then pull the Docker image from DockerHub (this takes some time):
docker pull sberbank/python
Please note that you will need about 20 GB of free disk space on your HDD.
If you are on a Mac, Docker has memory limits, which can be changed in Docker preferences on the "Advanced" tab.
Here is how to run model training on the first dataset:
docker run \
-v {workspace_dir}:/workspace_0001 \
-v {train_csv}:/data/input/train.csv:ro \
-v {model_dir}:/data/output/model \
-w /workspace_0001 \
-e TIME_LIMIT=300 \
--memory 12g \
--name solution_0001_train \
sberbank/python \
python train.py --mode classification --train-csv /data/input/train.csv --model-dir /data/output/model
Here:
{workspace_dir} — directory which contains metadata.json;
{model_dir} — directory which will contain the trained model (must be created beforehand);
{train_csv} — file containing the train dataset.
Every path must be absolute.
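For reference, a minimal train.py skeleton matching the command above. The constant "model" (the mean of the target) is only a placeholder to make the sketch self-contained; a real solution would fit an actual estimator:

```python
import argparse
import csv
import os
import pickle
import tempfile

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["classification", "regression"],
                        required=True)
    parser.add_argument("--train-csv", required=True)
    parser.add_argument("--model-dir", required=True)
    args = parser.parse_args(argv)

    # Placeholder "training": remember the mean of the target column
    with open(args.train_csv, newline="") as f:
        targets = [float(row["target"]) for row in csv.DictReader(f)]
    model = {"mode": args.mode, "prediction": sum(targets) / len(targets)}

    with open(os.path.join(args.model_dir, "model.pkl"), "wb") as f:
        pickle.dump(model, f)

# Self-contained demo with a tiny two-row train set
workdir = tempfile.mkdtemp()
train_csv = os.path.join(workdir, "train.csv")
with open(train_csv, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["line_id", "target"])
    writer.writerows([[1, 0], [2, 1]])
main(["--mode", "classification", "--train-csv", train_csv,
      "--model-dir", workdir])
```

In the container the same main() would be invoked through the entry-point command (python train.py --mode classification ...), so a real script would add an if __name__ == "__main__": main() guard instead of the demo section.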
After training, this container should be stopped and removed, otherwise the next command with the same container name will not run:
docker stop solution_0001_train
docker rm solution_0001_train
The same applies to solution_0001_test.
Now run the prediction as follows ({prediction_dir} — directory which will contain the file with predictions):
docker run \
-v {workspace_dir}:/workspace_0001 \
-v {test_csv}:/data/input/test.csv:ro \
-v {model_dir}:/data/input/model \
-v {prediction_dir}:/data/output \
-w /workspace_0001 \
-e TIME_LIMIT=300 \
--memory 12g \
--name solution_0001_test \
sberbank/python \
python predict.py --test-csv /data/input/test.csv --model-dir /data/input/model --prediction-csv /data/output/prediction.csv
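A matching predict.py skeleton, again with the placeholder constant model. The output layout with line_id and prediction columns is an assumption about the expected prediction.csv format:

```python
import argparse
import csv
import os
import pickle
import tempfile

def predict(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--test-csv", required=True)
    parser.add_argument("--prediction-csv", required=True)
    parser.add_argument("--model-dir", required=True)
    args = parser.parse_args(argv)

    with open(os.path.join(args.model_dir, "model.pkl"), "rb") as f:
        model = pickle.load(f)

    with open(args.test_csv, newline="") as fin, \
         open(args.prediction_csv, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(["line_id", "prediction"])  # assumed output header
        for row in csv.DictReader(fin):
            writer.writerow([row["line_id"], model["prediction"]])

# Self-contained demo: a fake saved model plus a tiny test set
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "model.pkl"), "wb") as f:
    pickle.dump({"mode": "classification", "prediction": 0.5}, f)
test_csv = os.path.join(workdir, "test.csv")
with open(test_csv, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["line_id", "number_x"])
    writer.writerows([[1, 3.0], [2, 4.0]])
prediction_csv = os.path.join(workdir, "prediction.csv")
predict(["--test-csv", test_csv, "--prediction-csv", prediction_csv,
         "--model-dir", workdir])
```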
After that, compare the prediction with the corresponding test-target.csv file, calculate the score and evaluate your model.
This was an example for the first dataset; the validation process for the other datasets is the same.
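For local scoring it is enough to join the two files on line_id and compute the metric. An RMSE sketch for regression tasks (classification would use a different metric, e.g. ROC AUC; the values below are made-up):

```python
import math

def rmse_score(predictions, targets):
    # RMSE between {line_id: value} dicts read from prediction.csv
    # and test-target.csv
    errors = [(predictions[i] - targets[i]) ** 2 for i in targets]
    return math.sqrt(sum(errors) / len(errors))

predictions = {1: 2.0, 2: 4.0}
targets = {1: 1.0, 2: 3.0}
score = rmse_score(predictions, targets)  # sqrt((1 + 1) / 2) = 1.0
```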
Description of the winners' submissions: Chalearn AutoML challenge (2017-2018) and the archive with these submissions
Some articles regarding AutoML from fast.ai
A book about AutoML
Open repositories of SDSJ participants