This dataset consists of the following files:
Taxonomy definition for task T2.1. The file contains 103 lines of the form
is_a(A,B)
which means that A is a type of B. All concepts descend from the “root” concept, and concept names consist of lower-case ASCII letters and the underscore character ('_').
List of 1106 named entities for task T2.1 (t21.nelist), one per line.
List of 96 target concepts for task T2.1 (t21.colist), one per line.
List of 1107 named entities for task T2.2 (t22.nelist), one per line.
List of 437 named entities for task T2.3 (t23.nelist), one per line.
Pruned taxonomy definition for task T2.3 (t23.tax).
The entire corpus of 1801 documents, one document per line. Each line starts with “filename: ” and is followed by a plain-text version of the document with newlines removed.
An archive of the text-only version of the corpus, one file per document.
An informative picture of the taxonomy used in task T2.1.
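The is_a(A,B) lines of the taxonomy file and the one-document-per-line corpus file can each be read with a few lines of code. The following Python sketch is only an illustration. The function names, the sample lines, the assumption that every concept has exactly one parent, and the reading of a corpus line as a file name followed by ": " and the document text are all assumptions, not part of the dataset definition.

```python
import re

# One taxonomy line such as "is_a(country,area)". Concept names use only
# lower-case ASCII letters and '_'. Tolerating surrounding whitespace and a
# trailing period is an assumption about the file, not a documented fact.
IS_A = re.compile(r"\s*is_a\(\s*([a-z_]+)\s*,\s*([a-z_]+)\s*\)\s*\.?\s*")


def parse_taxonomy(lines):
    """Map each concept to its parent (assumes one parent per concept)."""
    parent = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        match = IS_A.fullmatch(line)
        if match is None:
            raise ValueError(f"not an is_a line: {line!r}")
        child, par = match.groups()
        parent[child] = par
    return parent


def ancestors(parent, concept):
    """Follow is_a links upwards and return the ancestors, nearest first."""
    chain = []
    while concept in parent:
        concept = parent[concept]
        if concept in chain:
            raise ValueError(f"cycle in taxonomy at {concept!r}")
        chain.append(concept)
    return chain


def split_corpus_line(line):
    """Split one corpus line into (file name, text) at the first ': '.

    Assumes the line looks like '<file name>: <text>'.
    """
    name, sep, text = line.rstrip("\n").partition(": ")
    if not sep:
        raise ValueError("corpus line has no ': ' separator")
    return name, text


# Hypothetical sample data, not taken from the real files.
sample_tax = ["is_a(area,root)", "is_a(country,area)", "is_a(city,area)"]
tax = parse_taxonomy(sample_tax)
print(ancestors(tax, "country"))
print(split_corpus_line("doc_0001.txt: Plain text of one document.\n"))
```

Storing only the parent of each concept keeps the sketch short. If the real taxonomy turns out to allow several parents per concept, a mapping from concept to a set of parents would be needed instead.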
For task T2.1, your file should contain one line for each entity from t21.nelist. Each line should consist of the name of the entity, a tab character, and the name of the concept (from t21.colist) to which your solution has assigned this entity.
Example:
Acre	city
Adriatic Sea	sea
Africa	continent
Agatha Christie	person
...
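A file in this format could be produced roughly as follows. This is a minimal Python sketch, not a required tool: the function name `format_t21` and the in-memory lists are hypothetical, and in practice the entities and concepts would be read from t21.nelist and t21.colist.

```python
def format_t21(assignments, entities, concepts):
    """Return the submission text: one 'entity<TAB>concept' line per entity.

    Raises ValueError if an entity has no concept or the concept is not
    one of the allowed target concepts.
    """
    allowed = set(concepts)
    lines = []
    for entity in entities:
        if entity not in assignments:
            raise ValueError(f"no concept assigned to {entity!r}")
        concept = assignments[entity]
        if concept not in allowed:
            raise ValueError(f"{concept!r} is not a target concept")
        lines.append(f"{entity}\t{concept}")
    return "\n".join(lines) + "\n"


# Small illustration using the entities and concepts from the example above.
entities = ["Acre", "Adriatic Sea", "Africa", "Agatha Christie"]
concepts = ["city", "sea", "continent", "person"]
assignments = {
    "Acre": "city",
    "Adriatic Sea": "sea",
    "Africa": "continent",
    "Agatha Christie": "person",
}
t21_text = format_t21(assignments, entities, concepts)
print(t21_text, end="")
```

Iterating over the entity list rather than over the assignments makes it easy to notice an entity that was left without a concept.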
For task T2.2, provide two files. The first file should contain one line for each concept you have formed. Each line should consist of the identifier of the concept, a tab character, and a comma-separated list of the named entities (from t22.nelist) contained in the extension of this concept. Concept identifiers may contain only ASCII alphanumeric characters and the underscore ('_') character.
Example:
Concept_1	Bulgaria, Honduras, Algeria, Albania, ...
Concept_2	Andaman Sea, Adriatic Sea, Libyan sea, ...
Concept_3	Basilica di San Pietro, Church of St Gregory, Govinda Temple, ...
...
The second file should likewise contain one line for each concept. This line should contain the identifier of the concept followed by a tab character and the label of the concept. The labels will be evaluated manually by a set of human judges.
Example:
Concept_1	country
Concept_2	sea
Concept_3	place of worship
...
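The two files can be generated from the same in-memory data, which also makes it easy to check the identifier rule. The Python sketch below is an illustration only: `format_t22` and the sample dictionaries are hypothetical names, and the ", " separator simply mirrors the examples above.

```python
import re

# Identifiers may contain only ASCII letters, digits and '_'.
ID_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def format_t22(extensions, labels):
    """Return (extension file text, label file text) for the same concepts."""
    if set(extensions) != set(labels):
        raise ValueError("every concept needs both an extension and a label")
    ext_lines = []
    label_lines = []
    for concept_id, members in extensions.items():
        if ID_PATTERN.fullmatch(concept_id) is None:
            raise ValueError(f"invalid concept identifier: {concept_id!r}")
        ext_lines.append(concept_id + "\t" + ", ".join(members))
        label_lines.append(concept_id + "\t" + labels[concept_id])
    return "\n".join(ext_lines) + "\n", "\n".join(label_lines) + "\n"


# Illustration with two of the concepts from the examples above.
extensions = {
    "Concept_1": ["Bulgaria", "Honduras", "Algeria", "Albania"],
    "Concept_3": ["Basilica di San Pietro", "Church of St Gregory", "Govinda Temple"],
}
labels = {"Concept_1": "country", "Concept_3": "place of worship"}
ext_text, label_text = format_t22(extensions, labels)
print(ext_text, end="")
print(label_text, end="")
```

Building both files in one pass ensures that each identifier appears in both of them and in the same order.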
For task T2.3, provide three files. The first should contain concept identifiers and the corresponding lists of named entities (from t23.nelist), and the second should contain concept identifiers and their human-readable labels. The format of these two files is the same as for task T2.2.
The third file should contain one line for each concept you have formed, consisting of the identifier of this concept, a tab character, and the identifier of its immediate parent concept (from the pruned taxonomy, t23.tax).
Example (third file only; for the first two files, see the examples for task T2.2):
Concept_1	area
Concept_2	person
Concept_3	sight
...
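Before submitting the third file, it may help to check that every parent really is a concept of the pruned taxonomy. The following Python sketch shows one way to do this. The function name `format_t23_parents` and the small set of taxonomy concepts are hypothetical; in practice the set would be collected from t23.tax.

```python
def format_t23_parents(parents, taxonomy_concepts):
    """Return the third file's text: one 'concept_id<TAB>parent' line each.

    Raises ValueError if a parent is not a concept of the pruned taxonomy.
    """
    allowed = set(taxonomy_concepts)
    lines = []
    for concept_id, parent in parents.items():
        if parent not in allowed:
            raise ValueError(f"{parent!r} is not in the pruned taxonomy")
        lines.append(f"{concept_id}\t{parent}")
    return "\n".join(lines) + "\n"


# Hypothetical subset of the pruned taxonomy, for illustration only.
taxonomy_concepts = {"root", "area", "person", "sight"}
parents = {"Concept_1": "area", "Concept_2": "person", "Concept_3": "sight"}
parents_text = format_t23_parents(parents, taxonomy_concepts)
print(parents_text, end="")
```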