Annotation Guidelines for GROBID NER

Principle

Creating an annotated corpus for Named Entity Recognition involves identifying the Named Entities in a text and classifying them, based on the context, into a set of classes (27 classes in the case of grobid-ner).

As with GROBID's other CRF models, grobid-ner can bootstrap training data: it can automatically generate training data from any text file, labeling each token with a named entity class based on the existing model. A human annotator then corrects the generated training data by modifying the labels produced for each token. This curated training data can be added to the existing training data and used to train a new, improved model.

Format

The training data currently follows the CoNLL 2003 NER format, which is an n-column, tab-separated text file. In our case, the first column is the token, the second column is the NER class, and an optional third column gives a word sense:

token0           O         0
token1           B-CLASS   word_sense1
token2           CLASS     word_sense2
token3           O         0

Non-named entity tokens are labeled with the default label O (the capital letter O, for "outside").

Word senses are optional and correspond to a WordNet synset. They are only indicated for Named Entity tokens.

The B- prefix indicates the beginning of an entity. This B- marker is necessary when several entities of the same NER class immediately follow one another. As the prefix is really needed only in these rare cases, it can be omitted by default.
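The B- rule can be illustrated with a minimal Python sketch that groups a sentence's labels into entity spans. The function name `group_entities` and the span representation are illustrative assumptions, not part of grobid-ner; the sketch also assumes the default label is the capital letter O.

```python
from typing import List, Tuple

OUTSIDE = "O"  # assumed default label for tokens outside any entity


def group_entities(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, class) spans for one sentence's labels.

    Hypothetical helper: a new entity starts on an explicit B- marker,
    or whenever the class differs from the class of the previous token.
    """
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        begins = label.startswith("B-")
        cls = label[2:] if begins else label
        # Close the open entity on a B- marker or on any change of class
        # (this includes reaching an outside token).
        if current is not None and (begins or cls != current):
            spans.append((start, i, current))
            start, current = None, None
        # Open a new entity if this token carries a class and none is open.
        if cls != OUTSIDE and current is None:
            start, current = i, cls
    if current is not None:
        spans.append((start, len(labels), current))
    return spans
```

With this reading, `["EVENT", "EVENT", "EVENT"]` and `["B-EVENT", "EVENT", "EVENT"]` both yield a single entity, whereas `["B-EVENT", "B-EVENT", "B-EVENT"]` yields three, which is why the marker only matters for adjacent entities of the same class.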

The end of a sentence is marked by an empty line.
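As an illustration of the format, the following Python sketch reads such a file into sentences. It is not grobid-ner's own reader: the name `parse_training_data` is hypothetical, and the sketch splits on any whitespace (tabs or spaces), which assumes that tokens never contain whitespace.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, str, Optional[str]]  # (token, label, word sense or None)


def parse_training_data(text: str) -> List[List[Row]]:
    """Split training data into sentences of (token, label, sense) rows."""
    sentences: List[List[Row]] = []
    current: List[Row] = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        if len(columns) < 2:
            raise ValueError(f"line {number}: expected a token and a label")
        # The third column (word sense) is optional.
        sense = columns[2] if len(columns) > 2 else None
        current.append((columns[0], columns[1], sense))
    if current:
        sentences.append(current)
    return sentences
```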

Tokens and tokenization must not be modified during manual correction. Only the labels can be changed.
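This constraint can be checked mechanically before the corrected file is added to the training set. The sketch below is one possible check, written in Python under the assumption that the generated and corrected files are available as strings; the function names are hypothetical and not part of grobid-ner.

```python
from typing import List


def tokens_of(text: str) -> List[str]:
    """First column of every line; an empty string marks a sentence break."""
    return [line.split()[0] if line.strip() else "" for line in text.splitlines()]


def find_token_changes(generated: str, corrected: str) -> List[str]:
    """List the places where tokens or sentence breaks were altered."""
    before, after = tokens_of(generated), tokens_of(corrected)
    problems = []
    if len(before) != len(after):
        problems.append(f"line count differs: {len(before)} vs {len(after)}")
    # Compare line by line; only the label columns are allowed to differ.
    for number, (old, new) in enumerate(zip(before, after), start=1):
        if old != new:
            problems.append(f"line {number}: {old!r} became {new!r}")
    return problems
```

An empty result means that only labels (or word senses) were edited, which is the expected outcome of a manual correction.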

Classes

The list of NER classes, with examples, is given in the classes page.

Largest entity mention

Entities with more than one token can embed sub-entities. The approach currently followed by grobid-ner is to annotate only the largest entity mention and not the sub-entities. See the "largest entity mention" section.
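The policy can be pictured as a filter over candidate mentions. The Python sketch below assumes a hypothetical representation of mentions as (start, end_exclusive, class) token spans, which is not something grobid-ner itself exposes, and keeps only the mentions that are not contained in another one.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, class), illustrative only


def keep_largest_mentions(spans: List[Span]) -> List[Span]:
    """Drop every span that lies entirely inside an already kept span."""
    kept: List[Span] = []
    # Visit the longest spans first so that containers are kept
    # before the sub-entities they embed are examined.
    for span in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        start, end, _ = span
        if not any(k[0] <= start and end <= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)
```

For instance, given a four-token EVENT span (0, 4) that embeds a one-token LOCATION span (2, 3), only the EVENT span is kept and annotated.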

Practical example of correction

For example, take the sentence:

World War I (WWI) was a global war centred in Europe that began on 28 July 1914 and lasted until 11 November 1918.

The training data automatically generated by grobid-ner is as follows:

World       B-EVENT
War         B-EVENT
I           B-EVENT
(           O
WWI         O
)           O
was         O
a           O
global      O
war         O
centred     O
in          O
Europe      B-LOCATION
that        O
began       O
on          O
28          B-PERIOD
July        B-PERIOD
1914        PERIOD
and         O
lasted      O
until       O
11          B-MEASURE
November    B-PERIOD
1918        PERIOD
.           O

Annotation process:

  1. The first tokens World War I are correctly marked as a Named Entity of class EVENT, but incorrectly labeled as three independent entities (note the B- at the beginning of each label). The correction is:

World        B-EVENT
War          EVENT
I            EVENT

Note that as the entity is not adjacent to any other entity, the B- marker is optional.

  2. WWI is not marked as a Named Entity, but it is an acronym for the previous entity and should be tagged along with it as part of the same EVENT:

(          EVENT
WWI        EVENT
)          EVENT

  3. Europe refers to the European continent, therefore the class LOCATION is correct.

  4. The tokens 28 July 1914 correspond to a single PERIOD and not two:

28        B-PERIOD
July      PERIOD
1914      PERIOD

  5. Lastly, the tokens 11 November 1918 have been wrongly identified as two entities (a MEASURE followed by a PERIOD) instead of a single PERIOD:

11          B-PERIOD
November    PERIOD
1918        PERIOD

The result is as follows:

World       B-EVENT
War         EVENT
I           EVENT
(           EVENT
WWI         EVENT
)           EVENT
was         O
a           O
global      O
war         O
centred     O
in          O
Europe      B-LOCATION
that        O
began       O
on          O
28          B-PERIOD
July        PERIOD
1914        PERIOD
and         O
lasted      O
until       O
11          B-PERIOD
November    PERIOD
1918        PERIOD
.           O