aij2019-data-check.zip - test examples for training and debugging (256,000 MB)
The examination test is passed to the solution in JSON format. A test consists of a set of question tasks, resource and time constraints, and metainformation (such as the test language).
Each question task object in the test contains the following fields (an illustrative example follows the list):
text - question task text; may contain markdown-style formatting. The text can reference attachment files, e.g. graphic illustrations for the task.
attachments - set of attached files (with their id and mime-type).
meta - metainformation: arbitrary key-value pairs available to the solution and the testing system. Used to provide structured information about the task, e.g. the question source or the originating exam topic.
answer - format description for the expected answer type. Multiple question types are supported, each with its own specific parameters and fields.
score - maximum number of points for the task. Based on this field, solutions can prioritise computational resources between tasks.
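A minimal sketch of what a single task object might look like, assembled from the fields above. The concrete values, the exact key layout, and the attachment link scheme are illustrative assumptions, not the official schema:

```json
{
  "id": "task_001",
  "text": "Read the text and choose the correct statement. See [illustration](attachment://img1).",
  "attachments": [
    {"id": "img1", "mime-type": "image/png"}
  ],
  "meta": {
    "source": "demo variant",
    "topic": "reading comprehension"
  },
  "answer": {
    "type": "choice",
    "options": ["1", "2", "3", "4"]
  },
  "score": 1
}
```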
1. Check-phase
The solution is evaluated on a publicly available set of questions with known answers. This phase is important for testing solutions for potential errors and for checking interaction with the evaluation system. Evaluation results and stdout/stderr output are fully available to the participant.
2. Public Test
The solution is evaluated on a hidden set of questions available only to the organisers. Tasks, and the answer options within tasks, are randomly shuffled on each evaluation.
3. Private Test
The solution is evaluated on the final set of questions. Results on the private test determine the competition winners.
Solution containers are isolated from the outside world: no internet access, no communication with other parties.
RAM: 16 GB;
Maximum solution archive size: 20 GB;
Maximum Docker image size (publicly available): 20 GB;
Time limit on solution initialization (before task inference): 10 minutes. This time is allocated for loading models into memory;
Time limit on providing an answer for a single request: 30 minutes.
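A sketch of a solution entry point that respects this contract: load models once inside the 10-minute initialization window, then use the score field to spend effort on the highest-value tasks first. The file paths, the top-level "tasks" key, and the answer layout are assumptions for illustration; the real I/O interface is defined by the competition's Docker contract:

```python
import json

# Hypothetical I/O paths; the actual interface is set by the evaluation system.
TEST_PATH = "test.json"
ANSWERS_PATH = "answers.json"

def load_models():
    # All model loading should fit inside the 10-minute initialization window.
    return {}

def solve(task, models):
    # Placeholder: return an empty answer for every task.
    return {"id": task.get("id"), "answer": ""}

def main():
    models = load_models()
    with open(TEST_PATH) as f:
        test = json.load(f)
    # Prioritise high-score tasks first; the `score` field exists for this.
    tasks = sorted(test["tasks"], key=lambda t: t.get("score", 0), reverse=True)
    answers = [solve(t, models) for t in tasks]
    with open(ANSWERS_PATH, "w") as f:
        json.dump(answers, f, ensure_ascii=False)

if __name__ == "__main__":
    main()
```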
Each question task is evaluated by a metric relevant to its task type:
The total solution score is the sum of scores across all question tasks. Per-task scores are converted to a 100-point scale based on the official score correspondence table.
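A toy illustration of this aggregation: raw per-task points are summed, then the total is mapped to the 100-point scale via a correspondence table. The table values below are made up; the official table is defined by the organisers:

```python
# Made-up correspondence table: raw total -> scaled 100-point score.
CORRESPONDENCE = {0: 0, 10: 30, 20: 55, 30: 80, 40: 100}

def scale(raw_total: int) -> int:
    # Take the highest table entry not exceeding the raw total.
    key = max(k for k in CORRESPONDENCE if k <= raw_total)
    return CORRESPONDENCE[key]

raw = sum([1, 2, 0, 3])  # example per-task scores
print(scale(raw))        # -> 0 (6 raw points fall below the 10-point threshold)
```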
Solution evaluation on essay tasks comprises two stages: automatic scoring and manual assessment by human experts.
The automatic procedure evaluates basic surface-level indicators of the generated texts:
Automatic scoring is returned immediately and is not the final score; it is a helpful utility for participants.
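A sketch of what a surface-level check might look like. The concrete indicators here (non-empty content, a minimum word count) are assumptions for illustration; the actual automatic criteria are defined by the organisers:

```python
def surface_check(essay: str, min_words: int = 150) -> bool:
    # Hypothetical surface-level indicator: the essay must be non-empty
    # and reach a minimum length in words.
    words = essay.split()
    if not words:
        return False  # an empty essay cannot pass
    return len(words) >= min_words
```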
Manual essay assessment is carried out by professional experts who follow the official grading standards of exam essays.
Results of manual essay assessment are posted to the competition leaderboard 1-2 times a week.
If automatic scoring indicates that manual essay assessment would yield 0 points, the participant is informed and invited to prepare a new solution for human assessment.
Participants are provided with a fully functional baseline solution for this competition:
The models are provided both as a technical example and as an internal baseline against which participants' stronger solutions are validated.
The baseline essay model passes the formal evaluation criteria but does not pass meaningful human assessment.