Ended 4 years ago
145 participants
903 submissions

Data

A training dataset is provided for fitting models. It contains the following columns:

  • id - identifier (has an auxiliary role);
  • fullname - the original full name from the questionnaire (may not have a patronymic);
  • country - a country from the questionnaire;
  • target - the target variable;
  • fullname_true - the correct full name (present only in rows of the "there are typos" class).

A test dataset is provided for quality assessment; it does not contain the target and fullname_true columns.

The data for the competition is synthetic. It was generated by substituting the most common first names, surnames and patronymics from various countries and then adding common typos.

Any resemblance to real names is coincidental. The labels were created manually, so some full names may be labeled incorrectly.

Participants may only use datasets from the regularly updated list on the forum. To use a dataset that is not on the list, post a link to it on the forum, and it will be added to the list.

Submission Format

You must submit a CSV file with predictions. It must contain the following columns:

  • id;
  • target;
  • fullname_true (may be omitted for rows predicted as any class other than "there are typos").
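
As an illustration of the format above, here is a minimal sketch that writes such a file with Python's standard csv module. The function name write_submission, the integer label TYPO_CLASS = 1 and the sample rows are assumptions made for this example; the real label encoding is defined by the competition data, and the organizers' example file remains the reference.

```python
import csv
import io

# Hypothetical label for the "there are typos" class; the actual
# encoding used in the competition data is not stated on this page.
TYPO_CLASS = 1


def write_submission(rows, out):
    """Write (id, target, corrected_name) tuples as a submission CSV.

    The corrected name is written only for rows predicted as TYPO_CLASS;
    for every other class the fullname_true cell is left empty.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "target", "fullname_true"])
    for row_id, target, corrected in rows:
        name = corrected if target == TYPO_CLASS and corrected else ""
        writer.writerow([row_id, target, name])


# Made-up predictions, written to an in-memory buffer instead of a file.
buf = io.StringIO()
write_submission(
    [(1, 0, None), (2, TYPO_CLASS, "Ivan Petrov"), (3, 0, "ignored")],
    buf,
)
print(buf.getvalue())
```

To produce a real file, pass a handle opened with open(path, "w", newline="", encoding="utf-8") instead of the buffer.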

Participants are also provided with an example of a correctly formatted file, as well as a baseline solution from the organizers.

Evaluation System

For the task of determining whether a full name was typed correctly, the target metric is F1 with macro averaging.

For the task of correcting typos, accuracy is calculated (the percentage of properly corrected full names). Only objects of the "there are typos" class are counted. A typo counts as properly corrected only if its presence is also predicted: you have to predict both the "there are typos" class and the correct full name.

The final result is the arithmetic mean of the metrics for each of the tasks. Participants are provided with a repository with an extended description of the task from the organizers and code for calculating the competition metric.
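
The scoring described in this section can be sketched in plain Python. This is only an illustration and not the organizers' code from the repository: the Row type, the function names, the integer label TYPO_CLASS = 1 and the exact string comparison of names are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical label for the "there are typos" class (an assumption).
TYPO_CLASS = 1

# One row of either the ground truth or a submission.
Row = namedtuple("Row", ["target", "fullname_true"])


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def correction_accuracy(truth, preds):
    """Share of true typo rows with both the class and the name predicted."""
    typo_rows = [(t, p) for t, p in zip(truth, preds) if t.target == TYPO_CLASS]
    if not typo_rows:
        return 0.0
    hits = sum(
        p.target == TYPO_CLASS and p.fullname_true == t.fullname_true
        for t, p in typo_rows
    )
    return hits / len(typo_rows)


def final_score(truth, preds):
    """Arithmetic mean of the classification and correction metrics."""
    f1 = macro_f1([t.target for t in truth], [p.target for p in preds])
    return (f1 + correction_accuracy(truth, preds)) / 2


# Made-up example: four rows, two of which truly contain typos.
truth = [Row(0, None), Row(1, "Ivan Petrov"), Row(1, "Anna Smirnova"), Row(0, None)]
preds = [Row(0, None), Row(1, "Ivan Petrov"), Row(1, "Anna Smirnov"), Row(1, "x")]
print(round(final_score(truth, preds), 4))
```

In this example one of the two typo rows is corrected properly, so the correction accuracy is 0.5 even though both were assigned the right class; a wrong class on a typo row would count as a miss regardless of the name supplied.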