
Materials

  • aij2019-data-check.zip - test examples for training and debugging (256,000 MB)

Data format

The examination test is passed to the solution in JSON format. A test consists of a set of question tasks, resource and time constraints, and metainformation (such as the test language).

Each question task object in the test contains the following fields (a minimal reading sketch in Python follows the list):

  • text - the question task text, formatted in markdown style. The text may contain links to attachment files, e.g. graphic illustrations for the task.
  • attachments - the set of attached files (with their id and mime-type).
  • meta - metainformation: arbitrary key-value pairs available to the solution and the testing system. Used to provide structured information about the task, e.g. the question source or the originating exam topic.
  • answer - description of the expected answer format. Several question types are supported, each with its own parameters and fields:
    • choice - choosing one option from the list;
    • multiple_choice - choosing a subset of options from the list;
    • order - arranging options from the list in the correct order;
    • matching - correct matching of objects from two sets;
    • text - answer in the form of arbitrary text.
  • score - the maximum number of points for the task. Based on this field, a solution can prioritise computational resources between tasks.
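
A minimal sketch of reading such a test in Python, assuming the question tasks sit under a top-level "tasks" key (the exact top-level layout is an assumption; the per-task field names follow the list above):

```python
import json

def load_tasks(path):
    """Read the examination test JSON and yield its question tasks.

    Assumes the test object stores question tasks under a top-level
    "tasks" key; the per-task field names follow the list above.
    """
    with open(path, encoding="utf-8") as f:
        test = json.load(f)
    for task in test.get("tasks", []):
        yield {
            "text": task["text"],                        # markdown-style question text
            "attachments": task.get("attachments", []),  # attached files (id, mime-type)
            "meta": task.get("meta", {}),                # arbitrary key-value metadata
            "answer": task["answer"],                    # expected answer type and parameters
            "score": task.get("score", 1),               # maximum points for this task
        }

# Hypothetical usage: spend compute on high-score tasks first.
# tasks = sorted(load_tasks("test.json"), key=lambda t: -t["score"])
```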

Evaluation procedure

1. Check Phase

The solution is evaluated on a publicly available set of questions with known answers. This phase is intended for catching potential errors and issues in the solution's interaction with the evaluation system. The evaluation result and the stdout/stderr output are fully available to the participant.

2. Public Test

The solution is evaluated on a hidden set of questions available only to the organisers. Tasks, and the answer options within tasks, are randomly shuffled on every evaluation.

3. Private Test

The solution is evaluated on the final set of questions. Results on the private test determine the competition winners.

Technical constraints

  • Solution containers are isolated from the outside world: no internet access, no communication with other parties.

  • RAM: 16 GB;

  • Maximum solution archive size: 20 GB;

  • Maximum Docker image size (publicly available): 20 GB;

  • Time limit on solution initialization (before task inference): 10 minutes.

    This time is allocated for loading models into memory.

  • Time limit on providing an answer for a single request: 30 minutes (a time-budget sketch follows this list).
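
To stay within the 30-minute per-request limit, a solution can enforce its own softer time budget and fall back to a cheap default answer. The sketch below is illustrative only; solve() and default_answer() are hypothetical helpers, not part of the competition interface:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

PER_REQUEST_BUDGET = 25 * 60  # seconds; kept below the 30-minute hard limit

def answer_with_budget(solve, default_answer, task):
    """Run a (hypothetical) solve() under a soft time budget.

    If solve() does not finish in time, return a cheap fallback answer
    so the container never hits the evaluator's hard limit. Note that
    the worker thread is not killed; this is a soft budget only.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(solve, task)
        try:
            return future.result(timeout=PER_REQUEST_BUDGET)
        except TimeoutError:
            return default_answer(task)
```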

Evaluation criteria

Each question task is evaluated with a metric appropriate to its type (a sketch of these metrics follows the list):

  • choice - accuracy;
  • multiple_choice - intersection over union of the predicted and correct option sets;
  • order - the proportion of correctly ordered pairs;
  • matching - the proportion of correctly matched pairs;
  • text - a special evaluation function, followed by a request for human-expert assessment.
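
The metrics for the closed question types can be written down directly from the descriptions above; this is a sketch, not the organisers' exact scoring code:

```python
from itertools import combinations

def choice_score(pred, gold):
    """Single choice: 1 if the selected option matches the key, else 0."""
    return float(pred == gold)

def multiple_choice_score(pred, gold):
    """Intersection over union (Jaccard) of predicted and gold option sets."""
    pred, gold = set(pred), set(gold)
    return len(pred & gold) / len(pred | gold) if pred | gold else 1.0

def order_score(pred, gold):
    """Proportion of option pairs that appear in the same relative order as in the key."""
    pos = {item: i for i, item in enumerate(pred)}
    pairs = list(combinations(gold, 2))
    correct = sum(pos[a] < pos[b] for a, b in pairs)
    return correct / len(pairs) if pairs else 1.0

def matching_score(pred, gold):
    """Proportion of objects matched to the same counterpart as in the key."""
    return sum(pred.get(k) == v for k, v in gold.items()) / len(gold)
```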

The total solution score is the sum of scores across all question tasks. Per-task scores are converted to the 100-point scale based on the official score correspondence table.

Essay evaluation

Evaluation of essay tasks consists of two stages: automatic scoring and manual human-expert assessment.

The automatic procedure evaluates basic surface-level indicators of the generated texts (a toy illustration follows the list):

  • no plagiarism;
  • correspondence to the original topic;
  • orthography (spelling);
  • sentence connectivity, tautology;
  • language errors (slang, swearing);
  • paragraph structure;
  • text volume (not too short/long).
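
As a toy illustration only (the actual checks and thresholds are not published), the length and paragraph-structure checks might look like this:

```python
def surface_checks(essay, min_words=150, max_words=400, min_paragraphs=3):
    """Toy surface-level checks: word-count window and paragraph count.

    The real automatic scoring also covers plagiarism, topic
    correspondence, spelling and language errors; the thresholds here
    are placeholders, not the organisers' values.
    """
    words = essay.split()
    paragraphs = [p for p in essay.split("\n\n") if p.strip()]
    return {
        "length_ok": min_words <= len(words) <= max_words,
        "paragraphs_ok": len(paragraphs) >= min_paragraphs,
    }
```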

Automatic scoring is returned immediately and is not the final score; it is provided as a convenience for participants.

Manual essay assessment is carried out by professional experts who follow the official grading standards of exam essays.

Results of manual essay assessments are published to the competition leaderboard 1-2 times a week.

If automatic scoring indicates that manual essay assessment would result in 0 points, the participant is notified and invited to submit a new solution for human assessment.

Baseline

Participants are provided with a fully functional baseline solution for this competition:

  • a question task classifier (tasks 1-27);
  • 27 separate per-task models, covering both the question tasks and the essay.

The models are provided both as a technical example and as a reference point for internally validating participants' stronger solutions; a conceptual dispatch sketch follows.
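
Conceptually, the baseline routes each incoming task through the classifier and then to the matching per-task model. The sketch below uses hypothetical classifier and model objects to show the wiring only; it is not the published baseline code:

```python
class BaselinePipeline:
    """Route each question task to one of 27 per-task models (including the essay model)."""

    def __init__(self, classifier, task_models):
        self.classifier = classifier    # hypothetical: predicts the exam task number (1-27)
        self.task_models = task_models  # hypothetical: dict of task number -> model

    def answer(self, task):
        task_number = self.classifier.predict(task["text"])
        return self.task_models[task_number].solve(task)
```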

The baseline essay model passes the formal (automatic) evaluation criteria, but does not produce essays that score meaningfully under human assessment.
