The goal of this benchmark is to provide an open and interactive interface for evaluating AutoML systems on a wide range of tasks and datasets. We design our benchmark around both academic datasets and real-world industrial datasets in order to better understand the current state of AutoML system performance. The benchmark is extensible: additional dataset groups and updated versions of existing ones will be added according to the benchmark roadmap.
Benchmark solutions are end-to-end AutoML systems that both automatically build ML models on a given dataset and use the best-fitted model for inference on that dataset's test data. Solutions are submitted to the automatic testing system and evaluated on groups of datasets (see the Dataset section).
Each solution is a code archive that runs in a Docker container environment. Solution archives are submitted to the automatic testing system for evaluation.
Each solution receives the following information:
- task_type: “binary” for binary classification, “multiclass” for multiclass classification, or “reg” for regression
- train_data: path to the training dataset
- test_data: path to the test dataset, without the target variable
- output_path: path where the system must save predictions on the test_data
Solution examples with baselines are available on our GitHub.
Solution evaluation consists of 3 phases:
- Check. The solution is evaluated on a single small dataset. This phase verifies that the solution runs correctly and provides participants with a detailed error log.
- Live Test. Solutions are evaluated on a small yet representative subset of datasets from each dataset group (OpenML, Finance, ODS). These evaluations can run at any time and provide live feedback on the leaderboard. Participants see the score and time consumption on each of these datasets, as well as their total Live Test score, on the leaderboard.
- Large Test. Solutions are evaluated on the complete set of datasets from each group. Because this evaluation is resource-intensive, it is carried out on a monthly schedule. Participants are provided with the score and time consumption for each dataset, as well as their total score for each dataset group, on the Large Test leaderboard. After every Large Test run, each participant's detailed per-dataset scores are also published for further analysis by the community.
Important: Participants must choose up to 2 of their submissions to be eligible for evaluation on the Large Test (the same mechanism as choosing final submissions in regular competitions).
Roadmap
- September: familiarizing participants with the submission system; evaluation on the open group of datasets (OpenML CC 18); crowdsourcing of ODS datasets.
- October: release of the Finance group of datasets; first run of the Large Test; end of dataset crowdsourcing for 2021.
- November: complete runs on all dataset groups; integration of the benchmark into an open AutoML course.
Metrics and scores
The complete scoring process for each AutoML solution consists of the following 3 steps:
Step 1. For each dataset in every dataset group, evaluate the respective metric_value on the test-data predictions:
- Binary classification: ROC-AUC
- Multiclass classification: ROC-AUC (one-vs-all)
- Regression: RMSE
Step 2. For each metric value on each dataset, calculate the relative dataset_score against the metric value of the linear baseline:
dataset_score = metric_value / metric_baseline
Step 3. For each dataset group, calculate its group_score as the average dataset_score within that group. The total_score is the average dataset_score across all datasets in the current benchmark.
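The three scoring steps above can be sketched as follows. This is a minimal illustration: Step 1 (computing ROC-AUC or RMSE) is assumed to have already produced a metric_value and the linear baseline's metric_baseline for each dataset, and the function and variable names are ours, not the benchmark's.

```python
from statistics import fmean


def dataset_score(metric_value: float, metric_baseline: float) -> float:
    # Step 2: score relative to the linear baseline's metric on the same dataset.
    return metric_value / metric_baseline


def group_and_total_scores(groups: dict[str, list[tuple[float, float]]]):
    """groups maps a dataset-group name (e.g. "OpenML") to a list of
    (metric_value, metric_baseline) pairs, one pair per dataset.
    Returns per-group scores and the total score."""
    per_dataset = {
        g: [dataset_score(v, b) for v, b in pairs] for g, pairs in groups.items()
    }
    # Step 3: group_score is the average dataset_score within the group.
    group_scores = {g: fmean(scores) for g, scores in per_dataset.items()}
    # total_score averages over all datasets, not over group averages.
    all_scores = [s for scores in per_dataset.values() for s in scores]
    return group_scores, fmean(all_scores)
```

Note that total_score is a flat average over datasets, so groups with more datasets carry proportionally more weight than in an average of group_scores.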
Resource limits
- 12 GB memory
- 4 vCPU
- 50 MB solution archive size
- 5 minutes per dataset