
RecSys benchmark

The first open benchmark for recommender system evaluation, organized as a container-based competition over a group of datasets in an AutoML-style design

Benchmark description 

With this benchmark we aim to advance modern recommender systems (RecSys) and their best practices by providing an open, interactive interface for automated evaluation on a wide range of datasets. The benchmark places particular emphasis on computing a broad set of metrics: classic ranking metrics such as NDCG@k and MAP@k, discovery metrics such as Surprisal@k and Coverage@k, and technical metrics such as evaluation time. Our goal is to promote rigorous evaluation of recommender systems from different angles and to standardise RecSys evaluation methodology behind a common competition-based interface with an open set of evaluation metrics, thus making RecSys evaluation as a whole more reproducible.

Our benchmark is extensible and starts with the most common RecSys task: item2user recommendation on an implicit matrix of user interactions (without additional features). It can be extended both with additional groups of datasets and with additional groups of metrics. Other types of RecSys tasks can be organized in future benchmark expansions.
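For illustration, here is a minimal sketch of how such an implicit interaction matrix could be assembled, assuming interactions arrive as (user_id, item_id) pairs; the actual dataset format used by the benchmark may differ.

```python
# A minimal sketch, assuming interactions arrive as (user_id, item_id) pairs;
# the benchmark's actual dataset format may differ.
import numpy as np
from scipy.sparse import csr_matrix

def build_implicit_matrix(user_ids, item_ids, n_users, n_items):
    """Build a binary user-item interaction matrix from implicit feedback."""
    data = np.ones(len(user_ids), dtype=np.float32)
    return csr_matrix((data, (user_ids, item_ids)), shape=(n_users, n_items))

# Example: 3 users, 4 items, 5 observed interactions
interactions = build_implicit_matrix([0, 0, 1, 2, 2], [1, 3, 0, 1, 2], 3, 4)
print(interactions.toarray())
```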

We hope that this benchmark will benefit both academic RecSys research and industrial RecSys practitioners. It welcomes both pure method-based solutions and ensemble approaches (provided they respect the resource constraints), thus advancing fair and transparent RecSys reproducibility.

Submission format

Each solution is an archive with code that runs in a Docker container environment. Solution archives are submitted to the automatic testing system for evaluation.

For RecSys benchmark evaluation, the submission system is organized similarly to AutoML competition designs, suited to testing end-to-end solutions on a variety of datasets.

After receiving the input data, each submitted solution must execute its algorithm, fit or adapt its RecSys model on the provided training data, and write predictions for the test data, for each dataset in the given group of datasets.
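As an illustration, below is a minimal sketch of what a solution's entry point might look like, using a toy popularity baseline. The file names, column names, and output format (user_id, item_id, rank) are assumptions made for the example, not the benchmark's actual interface.

```python
# Hypothetical entry point sketch; paths, columns and TOP_K are illustrative.
import pandas as pd

TOP_K = 50  # maximum list length required by the @k metrics

def recommend(train: pd.DataFrame, test_users: list[int]) -> dict[int, list[int]]:
    """Toy popularity baseline: recommend the globally most popular items."""
    top_items = train["item_id"].value_counts().index[:TOP_K].tolist()
    return {user: top_items for user in test_users}

def run(train_path: str, test_path: str, predictions_path: str) -> None:
    train = pd.read_csv(train_path)                          # columns: user_id, item_id
    test_users = pd.read_csv(test_path)["user_id"].unique().tolist()
    recs = recommend(train, test_users)
    rows = [(u, item, rank)
            for u, items in recs.items()
            for rank, item in enumerate(items, start=1)]
    pd.DataFrame(rows, columns=["user_id", "item_id", "rank"]).to_csv(
        predictions_path, index=False)

if __name__ == "__main__":
    run("train.csv", "test.csv", "predictions.csv")
```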

Evaluation procedure

Solution evaluation consists of 3 phases:

  1. Check. The solution is evaluated on a single small dataset. This step verifies that the solution runs correctly. The Check phase provides participants with a detailed error log.
  2. Live Leaderboard. Solutions are evaluated on a small yet representative subset of datasets. This phase runs 24/7, allows up to 5 submissions per day per participant, and provides live feedback on the leaderboard. Participants receive detailed metrics and time consumption for each dataset as well as their aggregated scores on the Live Leaderboard.
  3. Large Leaderboard. Solutions are evaluated on the complete set of datasets, forming the actual benchmark results. This evaluation is resource-intensive and is carried out on a monthly schedule. In addition to the same set of scores provided by the Live Leaderboard, after every Large Leaderboard run all participants receive detailed per-dataset scores for further analysis and ablation studies by the community.

Important: Participants must select up to 2 of their submissions as eligible for evaluation on the Large Leaderboard. This is the same mechanic as choosing final submissions in regular competitions.

We encourage participants to submit their algorithms in a fair and reproducible manner suited for benchmarking:

  • Hyperparameter tuning is allowed either globally or in an automated way. However, hardcoding hyperparameters for each individual dataset does not benefit research reproducibility and is therefore prohibited.
  • Hardcoding in general is not allowed, whether of weights/model parameters, features, or anything else. Top solutions will be reviewed by human experts from the benchmark organizing committee to ensure benchmarking fairness among participants.
  • Cases of potentially unfair benchmark entries will be investigated. If a violation of fairness is proven, the participants and their results are subject to disqualification from the benchmark.
  • Providing open repositories for your solutions is highly appreciated. Upon request, we can promote your public repositories and open-source solutions to be acknowledged as special open-baseline leaderboard entries.

Metrics & scores

For each submitted solution on each given dataset we evaluate the following metrics:

  • NDCG@k, for k = 5, 10, 50
  • MAP@k, for k = 5, 10, 50
  • Hitrate@k, for k = 5, 10, 50
  • Coverage@k, for k = 5, 10, 50
  • Unexp@k, for k = 5, 10, 50
  • Surprisal@k, for k = 5, 10, 50
  • Working time
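
For reference, the sketch below shows common formulations of two of the listed metrics, NDCG@k and Coverage@k. The benchmark's exact definitions (for example, the ideal-DCG normalisation or the coverage denominator) may differ.

```python
# Common formulations of NDCG@k and Coverage@k; the benchmark's exact
# definitions may differ from these.
import numpy as np

def ndcg_at_k(recommended: list[int], relevant: set[int], k: int) -> float:
    """Binary-relevance NDCG@k for a single user."""
    gains = [1.0 if item in relevant else 0.0 for item in recommended[:k]]
    dcg = sum(g / np.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

def coverage_at_k(all_recommendations: list[list[int]], n_items: int, k: int) -> float:
    """Share of the catalogue that appears in at least one top-k list."""
    shown = {item for recs in all_recommendations for item in recs[:k]}
    return len(shown) / n_items
```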

Leaderboard scores are calculated as follows:

  1. For each given dataset, calculate every metric for every value of k on the test set. If the solution did not fit within the resource constraints, or its execution was interrupted with an error on a given dataset, the solution receives a score of 0 for all metrics on that dataset.
  2. For both the Live and Large Leaderboards, each of the metrics above is averaged across all respective datasets. For the Large Leaderboard, a detailed table of all scores across all datasets is provided for download and analysis.
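
The aggregation rule above can be illustrated with the following sketch, assuming per-dataset metric dictionaries where a failed run is represented as None and therefore contributes 0 to every metric.

```python
# Sketch of the leaderboard aggregation: average each metric across datasets,
# counting failed runs (None) as 0 for every metric.
from collections import defaultdict

def aggregate(per_dataset: dict[str, dict[str, float] | None],
              metric_names: list[str]) -> dict[str, float]:
    totals = defaultdict(float)
    for scores in per_dataset.values():
        for name in metric_names:
            totals[name] += scores.get(name, 0.0) if scores else 0.0
    n = len(per_dataset)
    return {name: totals[name] / n for name in metric_names}

# Example: the solution failed on "dataset_b", so it scores 0 there.
print(aggregate(
    {"dataset_a": {"NDCG@10": 0.31, "MAP@10": 0.18},
     "dataset_b": None},
    ["NDCG@10", "MAP@10"]))
```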
