The first open benchmark for recommender system evaluation, organized as a container-based competition on a group of datasets in the style of AutoML challenges
With this benchmark we aim to advance modern recommender systems (RecSys) and their best practices by providing an open, interactive interface for automated evaluation on a wide range of datasets. The benchmark places extra emphasis on computing a broad set of metrics: classic ranking metrics such as NDCG@k and MAP@k, discovery metrics such as surprisal@k and coverage@k, and technical metrics such as evaluation time. The goal is to promote rigorous evaluation of recommender systems from different angles. It is also our attempt to standardise RecSys evaluation methodology through a common competition-based interface and an open set of evaluation metrics, making RecSys evaluation more reproducible.
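To make the metric families above concrete, here is a minimal Python sketch of NDCG@k, average precision at k (which is averaged over users to give MAP@k), and coverage@k. It assumes binary relevance, recommendation lists without duplicates, and one common normalisation for each metric. The benchmark's exact definitions are not given in this text and may differ, and all function names are hypothetical.

```python
import math


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG@k for one user (assumed definition)."""
    relevant = set(relevant)
    # Discounted gain of each hit; rank is 0-based, so the discount is log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    # Ideal DCG: all relevant items (at most k) placed at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def average_precision_at_k(recommended, relevant, k):
    """AP@k for one user, normalised by min(|relevant|, k) (one common variant)."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this cut-off
    denom = min(len(relevant), k)
    return score / denom if denom > 0 else 0.0


def coverage_at_k(all_recommendations, n_items, k):
    """Share of the catalogue that appears in at least one user's top-k list."""
    shown = set()
    for recs in all_recommendations.values():
        shown.update(recs[:k])
    return len(shown) / n_items


# Toy data, invented for illustration only.
recs = {"u1": [1, 2, 3], "u2": [3, 4, 5]}
truth = {"u1": [1, 3], "u2": [6]}

map_at_3 = sum(average_precision_at_k(recs[u], truth[u], 3) for u in recs) / len(recs)
mean_ndcg_at_3 = sum(ndcg_at_k(recs[u], truth[u], 3) for u in recs) / len(recs)
catalogue_coverage = coverage_at_k(recs, n_items=10, k=3)
```

Accuracy metrics like the first two reward placing held-out items near the top of the list, while coverage@k looks at the whole set of recommendations at once, which is why a benchmark can usefully report both kinds.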
Our benchmark is extensible and starts with the most common RecSys task: item2user recommendation on an implicit matrix of user interactions (without additional features). It can be extended with extra groups of datasets as well as extra groups of metrics. Other types of RecSys tasks can be added in the future through new benchmark expansions.
We hope that this benchmark will benefit both academic RecSys researchers and industrial RecSys practitioners. It welcomes pure method-based solutions as well as ensemble approaches (subject to the resource constraints), thus advancing fair, transparent and reproducible RecSys evaluation.
Each solution is an archive with code that runs in a Docker container environment. Solution archives are submitted to the automatic testing system for evaluation.
For RecSys benchmark evaluation we organize the submission system in a way similar to AutoML competition designs, which are suited to testing end2end solutions on a variety of datasets.
After receiving the given information, each submitted solution must run its algorithms, adapt or train its RecSys model on the given training data, and write predictions for the test data, for each dataset in the given group of datasets.
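The per-dataset workflow described above can be sketched as a small Python entry point. This is only an illustration under stated assumptions: the directory layout, the file names `train.csv` and `predictions.csv`, the column names `user_id` and `item_id`, and the popularity baseline are all hypothetical and are not the benchmark's actual interface. For simplicity the sketch also produces recommendations for the users found in the training data, whereas a real solution would follow whatever test-user specification the system provides.

```python
import csv
from collections import Counter, defaultdict
from pathlib import Path


def load_interactions(train_path):
    """Read implicit interactions into {user_id: set(item_id)} (hypothetical CSV format)."""
    seen = defaultdict(set)
    with open(train_path, newline="") as f:
        for row in csv.DictReader(f):
            seen[row["user_id"]].add(row["item_id"])
    return seen


def recommend_popular(seen, k):
    """Popularity baseline: the most-interacted items each user has not seen yet."""
    popularity = Counter(item for items in seen.values() for item in items)
    ranked = [item for item, _ in popularity.most_common()]
    return {
        user: [item for item in ranked if item not in items][:k]
        for user, items in seen.items()
    }


def run_solution(input_root, output_root, k=10):
    """Train and predict once per dataset directory under input_root (assumed layout)."""
    dataset_dirs = sorted(p for p in Path(input_root).iterdir() if p.is_dir())
    for dataset_dir in dataset_dirs:
        seen = load_interactions(dataset_dir / "train.csv")
        recs = recommend_popular(seen, k)

        out_dir = Path(output_root) / dataset_dir.name
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(out_dir / "predictions.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["user_id", "item_id", "rank"])
            for user, items in recs.items():
                for rank, item in enumerate(items, start=1):
                    writer.writerow([user, item, rank])


# Tiny in-memory example, invented for illustration.
toy = {"a": {"x", "y"}, "b": {"x"}, "c": {"x", "z"}}
toy_recs = recommend_popular(toy, k=2)
```

Keeping the loop over datasets separate from the model code means the same solution can be run unchanged on any group of datasets, which is the property an AutoML-style evaluation relies on.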
Solution evaluation consists of 3 phases:
Important: Participants must select up to 2 of their submissions to be evaluated on the Large Leaderboard. This works the same way as choosing your final submissions in regular competitions.
We encourage participants to submit their algorithms in a fair and reproducible manner suited for benchmarking:
For each submitted solution on each given dataset we evaluate the following metrics:
Leaderboard scores are calculated as follows: