grobid-ner identifies named entities and classifies them into 27 classes, compared to the 4-class or 7-class models of most existing open source NER tools (usually based on the Reuters/CoNLL 2003 annotated corpus or the MUC annotated corpus).

In addition, the entities are often enriched with WordNet sense annotations to help further disambiguation and resolution of the entity. GROBID NER has been developed for the purpose of disambiguating and resolving entities against knowledge bases such as Wikipedia and FreeBase. Sense information can help to disambiguate an entity because it refines the entity class based on contextual clues.

Named entity classes

Classes quick overview

The following table describes the 27 named entity classes produced by the model.

Class name | Description | Examples
ACRONYM | acronym that doesn't belong to another class | DIY, BYOD, IMHO
ANIMAL | individual name of an animal | Hachikō, Jappeloup
ARTIFACT | human-made object, including software | FIAT 634, Microsoft Word
AWARD | award for art, science, sport, etc. | Ballon d'or, Nobel prize
BUSINESS | company / commercial organisation | Air Canada, Microsoft
CONCEPT | abstract concept not included in another class | English (as language), Communism, Zionism, FTSE 100, CAC40
CONCEPTUAL | entity relating to a concept | Greek myths, eurosceptic doctrine
CREATION | artistic creation, such as song, movie, book, TV show, etc. | Mona Lisa, Mulholland Drive, Kitchen Nightmares, EU Referendum: The Great Debate, Europe: The Final Debate
EVENT | event | World War 2, Battle of France, Brexit referendum
IDENTIFIER | systematized identifier such as phone number, email address, ISBN | 2081396505, weirdturtle@gmail.com
INSTALLATION | structure built by humans | Strasbourg Cathedral, Sforza Castle, Auschwitz camp
INSTITUTION | organization of people and a location or structure that share the same name | Yale University, European Patent Office, the British government, European Union, City Police, Eurozone
LEGAL | legal mentions such as article of law, convention, cases, treaty, etc. | European Patent Convention, Maastricht Treaty, Article 52(2)(c) and (3), Roe v. Wade 410 U.S.113 (1973), European Union Referendum Act 2015
LOCATION | physical location, including planets and galaxies | Los Angeles, Northern Madagascar, Southern Thailand, Channel Islands, Earth, Milky Way, West Mountain, Warsaw Ghetto
MEASURE | numerical amount, including an optional unit of measure | 1 500, six million, 72%, 50°2′9″N 19°10′42″E, AA+
MEDIA | media organization or publication | Le Monde, The New York Times
NATIONAL | relating to a location | North American, German, British
ORGANISATION | organized group of people, with some sort of legal entity and concrete membership | Alcoholics Anonymous, Jewish resistance, Polish underground
PERIOD | date, historical era or other time period, time expressions | January, the 2nd half of 2010, 1985-1989, from 1930 to 1945, since 1918, the first four years
PERSON | first, middle, last names and aliases of people and fictional characters | John Smith
PERSON_TYPE | person type or role classified according to group membership | African-American, Asian, Conservative, Liberal, Jews, Communist, the British people
PLANT | name of a plant | Ficus religiosa
SPORT_TEAM | sport group or organisation | The Yankees
SUBSTANCE | natural substance | HCN, hydrogen cyanide, gold, asbestos
TITLE | personal or honorific title, for a person | Mr., Dr., General, President, chairman, doctor, Secretary of State, MP, Prime Minister
UNKNOWN | entity not belonging to any previous classes | Plan Marshall, ParSiTi, Horizon 2020
WEBSITE | website URL or name | Wikipedia, http://www.inria.fr
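
When processing annotated files, it can help to check that every label is one of the 27 classes above. The following is a minimal sketch in Python; the names ENTITY_CLASSES and check_label are hypothetical helpers introduced for illustration and are not part of grobid-ner.

```python
# The 27 class names from the table above, in the same order.
# ENTITY_CLASSES and check_label are illustrative names, not a grobid-ner API.
ENTITY_CLASSES = (
    "ACRONYM", "ANIMAL", "ARTIFACT", "AWARD", "BUSINESS", "CONCEPT",
    "CONCEPTUAL", "CREATION", "EVENT", "IDENTIFIER", "INSTALLATION",
    "INSTITUTION", "LEGAL", "LOCATION", "MEASURE", "MEDIA", "NATIONAL",
    "ORGANISATION", "PERIOD", "PERSON", "PERSON_TYPE", "PLANT",
    "SPORT_TEAM", "SUBSTANCE", "TITLE", "UNKNOWN", "WEBSITE",
)


def check_label(label: str) -> str:
    """Return the label unchanged if it is one of the 27 classes, else raise ValueError."""
    if label not in ENTITY_CLASSES:
        raise ValueError(f"unknown entity class: {label!r}")
    return label
```

Note that the class is spelled ORGANISATION, so a label such as ORGANIZATION would be rejected by this check.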

Class-specific guidelines

ACRONYM

Acronyms that don't belong to another class. For example:

  • DIY: ACRONYM

but

  • the United Nations (UN): United Nations and UN are tagged ORGANISATION
  • WW1: EVENT

issue #20


ANIMAL


ARTIFACT

Human-made object, including software.

issue #16


AWARD


BUSINESS


CONCEPT

➡ Sometimes an entity, in isolation, can be ambiguous, for example British. When it refers to the British English language, it's annotated CONCEPT. (issues #29 and #30).

➡ Economic indexes and bonds are CONCEPT, for example US Dow Jones Industrial Average, CAC40, CETES (issue #54).


CONCEPTUAL

➡ entity relating to a concept

➡ Disambiguation between PERSON_TYPE and CONCEPTUAL:

 | PERSON_TYPE | CONCEPTUAL
Criteria | refers to people, a folk, a group of people | modifies a common noun
Examples | the eurosceptics held a protest; the Communists held a protest; the Jewish held a protest | the eurosceptic doctrine; the communist doctrine; Zionist newspapers; Greek myths; Christian newspapers

issue #29


CREATION

➡ Artistic creation, such as song, movie, book, TV show, etc. (issue #19).

➡ Full bibliographical references are not annotated (issue #48).


EVENT

➡ PERIOD vs. EVENT: an event defines a period, but a period is not necessarily an event, so an expression that denotes both is annotated as EVENT, the more specific class. For example:

  • during the time of the Nazi occupation: EVENT
  • during the Czarist regime: EVENT

issue #13


IDENTIFIER


INSTALLATION

➡ Sometimes a LOCATION name refers to an INSTALLATION name. In that case it's annotated as INSTALLATION. For example Nazi camps (issue #42):

- <ENAMEX type="INSTALLATION">Auschwitz</ENAMEX>
- <ENAMEX type="INSTALLATION">Lager Nordhausen</ENAMEX>
- <ENAMEX type="INSTALLATION">Mittelbau-Dora</ENAMEX>
- <ENAMEX type="INSTALLATION">Mauthausen-Gusen concentration camp</ENAMEX>
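
The inline ENAMEX markup used in the examples of this page can be read back programmatically. The sketch below, in Python, assumes annotations are flat (not nested) and always written as <ENAMEX type="CLASS">mention</ENAMEX>; the function name extract_entities is a hypothetical helper, not a grobid-ner API.

```python
import re
from typing import List, Tuple

# Assumption: ENAMEX elements are not nested and carry a single type attribute.
ENAMEX_RE = re.compile(r'<ENAMEX type="([A-Z_]+)">(.*?)</ENAMEX>', re.DOTALL)


def extract_entities(annotated: str) -> List[Tuple[str, str]]:
    """Return (class, mention) pairs in reading order.

    Whitespace inside a mention (including line breaks) is collapsed
    to single spaces.
    """
    return [
        (match.group(1), " ".join(match.group(2).split()))
        for match in ENAMEX_RE.finditer(annotated)
    ]
```

For instance, applied to the first example above, extract_entities returns the single pair ("INSTALLATION", "Auschwitz").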

INSTITUTION

➡ Criteria to distinguish between ORGANISATION and INSTITUTION:

ORGANISATION | INSTITUTION
organised group of persons | group of persons which share a structure/a location
group of people within an institution | entity representing -on its own- a stable institution
random subset of an organization/institution | something established with some autonomy (ex. city police, train police, auxiliary police)

issue #22

➡ INSTITUTION vs LOCATION: an INSTITUTION entity is defined as a set of legal entities and not a fixed location. Examples:

European Union, which is not defined by a (fixed) location, but as a set of legal entities with a treaty and particular instances. It may become a fixed location after a long time of integration (like the USA, where the Federal State is an institution).

Eurozone, which is a group of nations bound by a treaty, a monetary union.

issues #29 and #12

➡ There is no ambiguity between INSTITUTION and PERSON_TYPE. Therefore, even if an INSTITUTION entity applies to a group of people, it won't be annotated PERSON_TYPE, for example:

<ENAMEX type="INSTITUTION">European Union</ENAMEX> citizens

issue #30


LEGAL

➡ Legal mentions such as articles of law, conventions, cases, treaties, etc.

➡ There is a gradation from terms that are too general to more specific ones, for example:

  • Nazi policies, Nazi social policies: too general, not to be annotated (except for Nazi as PERSON_TYPE)
  • anti-Jewish legislation, anti-semitic policy: more specific, annotated LEGAL

issue #17


LOCATION

➡ LOCATION vs INSTITUTION: an INSTITUTION entity is defined as a set of legal entities (for example European Union) and not a fixed location (issue #29).

➡ There is no disambiguation at this level between the different uses of country names (as location, government, army, etc.) (issue #29). For example in:

Austria invaded Italy

they're both annotated LOCATION even though here Austria refers to Austria's army and Italy to the location.

➡ When there are modifiers (geographical, political, etc.) along the location, they are included in the entity, as long as the result refers to a territory. For example:

- <ENAMEX type="LOCATION">Suvalkų area</ENAMEX>
- <ENAMEX type="LOCATION">Pakruojis local rural district</ENAMEX>
- <ENAMEX type="LOCATION">coast of Honolulu</ENAMEX>
- <ENAMEX type="LOCATION">German-occupied Poland</ENAMEX>
- <ENAMEX type="LOCATION">Nazi Germany</ENAMEX>

issues #21 and #32

➡ The articles and prepositions (from, the) are not included in the entity.

➡ In some cases surrounding elements are not included in the entity, for example "west of the" in:

They established safe zones west of the <ENAMEX type="LOCATION">Rocky Mountains</ENAMEX>.

issue #21


MEASURE

➡ Markers of intervals like over or more are included in the MEASURE tag, for example (issue #43):

<ENAMEX type="MEASURE">Over 7,000</ENAMEX> shops and <ENAMEX type="MEASURE">more than 1,200</ENAMEX> synagogues were damaged or destroyed.

➡ MEASURE is an exception to the Longest Entity Match convention (issue #32): a MEASURE entity is annotated separately only if it is at the beginning of the noun phrase, for example:

- <ENAMEX type="MEASURE">45</ENAMEX> <ENAMEX type="PERSON">presidents of the USA</ENAMEX>
- <ENAMEX type="MEASURE">900</ENAMEX> <ENAMEX type="PERSON_TYPE">Jews</ENAMEX>
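
The exception above can be sketched as a small splitting step. This is only an illustration in Python, assuming the quantity is written with digits and that the class of the rest of the noun phrase is already known; split_leading_measure and LEADING_NUMBER are hypothetical names.

```python
import re
from typing import List, Tuple

# Digits at the very start of the noun phrase, optionally grouped with
# commas, dots or spaces (e.g. "900", "7,000", "1 500").
LEADING_NUMBER = re.compile(r"^(\d[\d,. ]*\d|\d)\s+(.+)$")


def split_leading_measure(noun_phrase: str, head_class: str) -> List[Tuple[str, str]]:
    """Split a leading number off as MEASURE; otherwise keep the longest match."""
    match = LEADING_NUMBER.match(noun_phrase)
    if match is None:
        # No number at the beginning: the whole phrase stays one entity.
        return [(head_class, noun_phrase)]
    return [("MEASURE", match.group(1)), (head_class, match.group(2))]
```

Quantities written in words (six million) or preceded by interval markers (over, more than) are not handled by this sketch.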

Ordinals (ex. first, second) (issue #14)

  • They should be annotated as MEASURE, as long as they indicate a numerical order in a scale or quantify something (size, date, etc.) that we can enumerate. For example:

    • The history can be divided into four periods: the first, from 1919 to 1940

    • there occurred a boycott of Jewish businesses, which was the first national antisemitic campaign (the "first campaign" is the boycott)

    • second place in the 2009 European elections and first place in the 2014 European elections

    • this was the first time since the 1910 general election

  • But referring expressions, or ordinals not really ordering or quantifying, should not be annotated MEASURE. For example:

    • Phrases like among the first to be sent to concentration camps, or one of the first where there is no notion of scale but rather of "beginning".

    • Plurals like in the first Jews to be deported, or These were their first elected MPs

      => in these examples it's impossible to enumerate precisely what is "first". Furthermore, it can't really be replaced by "second" or "third".

➡ Expressions measuring nothing are not to be annotated, for example (issue #14):

One of the founders of the Revisionist movement

➡ GPS coordinates are a MEASURE (numerical amounts + units), example 50°2′9″N 19°10′42″E. (issue #44)

➡ Credit ratings like AA1, AA+ are MEASURE (issue #54).


MEDIA

➡ NATIONAL vs. MEDIA: in British TV and Lithuanian TV, British and Lithuanian are annotated NATIONAL, whereas BBC or CNN are annotated MEDIA. For example:

<ENAMEX type="NATIONAL">Lithuanian</ENAMEX> TV
<ENAMEX type="MEDIA">BBC</ENAMEX>

NATIONAL

➡ Disambiguation between PERSON_TYPE and NATIONAL:

 | PERSON_TYPE | NATIONAL
Criteria | refers to people, a folk, a group of people | refers to a LOCATION
Examples | the British are great people; the British emigrants; the British people are not great | a British newspaper; a British historian

issue #30


ORGANISATION

➡ Ethnic communities are not included in the class ORGANISATION, but in PERSON_TYPE (issue #28).

➡ Criteria to distinguish between ORGANISATION and INSTITUTION:

ORGANISATION | INSTITUTION
organised group of persons | group of persons which share a structure/a location
group of people within an institution | entity representing -on its own- a stable institution
random subset of an organization/institution | something established with some autonomy (ex. city police, train police, auxiliary police)

issue #22

➡ Sometimes another entity type is included in the ORGANISATION, according to the largest entity match principle (issue #15), for example:

- <ENAMEX type="ORGANISATION">Zionist movement</ENAMEX>
- <ENAMEX type="ORGANISATION">Central Committee of the Zionist Union</ENAMEX>


PERIOD

➡ Date, historical era or other time period, including time measurements like a week, one day, which are quantified measures of time (a PERIOD is a MEASURE but the opposite is not always true, so PERIOD, more specific, wins). (issue #41)

➡ The PERIOD may include precise elements like:

<ENAMEX type="PERIOD">mid afternoon on 27 June 2016</ENAMEX>

➡ Surrounding elements must be included in the NE only if they qualify the range of the period and/or change the period type:

  • since 1930: entirely PERIOD.
  • from 1930, from 1930 to 1945: both entirely PERIOD.
  • after 1930, before 1930: both entirely PERIOD.
  • next decade, last decade: both entirely PERIOD.
  • 7 years after the war: entirely PERIOD.
  • between 2010 and 2015: entirely PERIOD.

but

  • as early as the 1930s: only 1930s is tagged PERIOD, because as early as doesn't change the period (the 1930s).
  • during 1930 and in 1930: the prepositions don't change the period interval, only 1930 is tagged PERIOD.
  • seven-year low: only seven-year is tagged PERIOD, since low has no impact on the PERIOD
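
The leading-marker part of this rule can be illustrated with a small sketch in Python. It only covers the leading markers listed in the examples above (as early as, during, in are dropped; anything else is kept), it does not handle trailing words such as low, and it ignores the EVENT-related restriction on after and since. The names NEUTRAL_PREFIXES and period_span are hypothetical.

```python
# Leading markers that, per the examples above, do not change the period
# and are therefore left outside the PERIOD entity.
NEUTRAL_PREFIXES = ("as early as ", "during ", "in ")


def period_span(expression: str) -> str:
    """Return the part of a time expression to be tagged PERIOD."""
    lowered = expression.lower()
    for prefix in NEUTRAL_PREFIXES:
        if lowered.startswith(prefix):
            rest = expression[len(prefix):]
            # A leading article is also left outside the entity.
            if rest.lower().startswith("the "):
                rest = rest[len("the "):]
            return rest
    # Markers such as since, from, after, before, between, next, last
    # qualify the range and stay inside the entity.
    return expression
```

With this sketch, "as early as the 1930s" yields "1930s", while "since 1930" and "between 2010 and 2015" are returned unchanged.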

➡ some terms may be too vague to annotate them as PERIOD, for example the adjective prewar. We may annotate it with other elements, for example LOCATION in the following case:

<ENAMEX type="LOCATION">prewar Nazi Germany</ENAMEX>

➡ Intervals of time defined with surrounding elements like after or since are only considered PERIODs if they are defined relative to an EVENT. For example these entities are all PERIODs (surrounding element + EVENT):

- <ENAMEX type="PERIOD">7 years after the war</ENAMEX> there was a great boom

- <ENAMEX type="PERIOD">Ten years after the official end of the zombie war</ENAMEX>

but these aren't (surrounding element and its dependencies excluded):

- withhold social benefits to new immigrants for the <ENAMEX type="PERIOD">first four years</ENAMEX> after they arrived

- The Treaties shall cease to apply to the State in question (...) <ENAMEX type="PERIOD">two years</ENAMEX> after the notification referred to in paragraph 2

- <ENAMEX type="PERIOD">Seven years</ENAMEX> after the outbreak began

➡ PERIOD vs. EVENT: an event defines a period, but a period is not necessarily an event, so an expression that denotes both is annotated as EVENT, the more specific class. For example:

  • during the time of the Nazi occupation: EVENT
  • during the Czarist regime: EVENT

issues #13 and #25

PERSON


PERSON_TYPE

➡ Even though it's an approximation, entities like Jewry (which means Jewish community) are included in this class. (issue #28)

➡ Some entities, in isolation, may belong to several classes, depending on the context. For example British in isolation can be labelled:

  • NATIONAL when introducing a relation to Great Britain (LOCATION):
    A <ENAMEX type="NATIONAL">British</ENAMEX> historian
  • PERSON_TYPE when it is clear that it refers to the folks, not just in relation to a location:
    The <ENAMEX type="PERSON_TYPE">British</ENAMEX> are great people.
    The <ENAMEX type="PERSON_TYPE">British</ENAMEX> emigrants
    The <ENAMEX type="PERSON_TYPE">British</ENAMEX> people are not great
  • CONCEPT when referring to the British language

  • CONCEPTUAL when referring to a CONCEPT, for example here the British culture:

    The <ENAMEX type="CONCEPTUAL">British</ENAMEX> folklore.

issue #30

➡ Disambiguation between PERSON_TYPE and NATIONAL:

 | PERSON_TYPE | NATIONAL
Criteria | refers to people, a folk, a group of people | refers to a LOCATION
Examples | the British are great people; the British emigrants; the British people are not great | a British newspaper; a British historian

issue #30

➡ Disambiguation between PERSON_TYPE and CONCEPTUAL:

 | PERSON_TYPE | CONCEPTUAL
Criteria | refers to people, a folk, a group of people | modifies a common noun
Examples | the eurosceptics held a protest; the Communists held a protest; the Jewish held a protest | the eurosceptic doctrine; the communist doctrine; Zionist newspapers; Greek myths; Christian newspapers

issue #29

➡ If an entity in isolation cannot be PERSON_TYPE, it's not annotated PERSON_TYPE even if it fits the criteria above. For example the INSTITUTION European Union cannot be PERSON_TYPE when alone, so EU citizens is annotated:

<ENAMEX type="INSTITUTION">EU</ENAMEX> citizens

issue #30


PLANT


SPORT_TEAM


SUBSTANCE


TITLE

➡ Personal or honorific title, applied to a person, with a relatively loose definition. The Wikipedia page examples can be useful. For example the following entities are annotated as TITLE: chairman, president, captain.

➡ Generally, job names (ex. economist, carpenter) are not annotated. For some terms, the context determines the annotation: engineer, for example, can be a TITLE or not depending on the country:

  • In France or Germany it is linked with a specific diploma so it's annotated as TITLE if the term is linked to these countries.

  • In UK or USA, it refers to the job, so it's not annotated.

➡ To decide between TITLE and PERSON:

  • if only the TITLE is mentioned, it's annotated TITLE, even though it refers to a person. Examples of entities annotated TITLE:

    • He is the President of the United States.
    • He is CEO of this company.
    • The Chinese Prime minister said this.
    • The Queen

  • In case of the largest entity match of TITLE + PERSON, the priority goes to PERSON. For example The President of the United States Barack Obama as a whole is annotated PERSON.
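
The priority of PERSON over TITLE can be sketched as a merge step over spans. This is an illustrative Python sketch, assuming spans are given in reading order as (class, text) pairs and that consecutive pairs are directly adjacent in the text; merge_title_person is a hypothetical name.

```python
from typing import List, Tuple

Span = Tuple[str, str]  # (entity class, mention text)


def merge_title_person(spans: List[Span]) -> List[Span]:
    """Merge each TITLE immediately followed by a PERSON into one PERSON span."""
    merged: List[Span] = []
    for label, text in spans:
        if label == "PERSON" and merged and merged[-1][0] == "TITLE":
            # Largest entity match: the TITLE is absorbed into the PERSON.
            _, title_text = merged.pop()
            merged.append(("PERSON", title_text + " " + text))
        else:
            merged.append((label, text))
    return merged
```

A TITLE that is not followed by a PERSON is left as it is, in line with the first bullet above.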

➡ Various examples

- under the direction of the <ENAMEX type="TITLE">National State Archivist</ENAMEX>
(who holds his office in the <ENAMEX type="INSTITUTION">National Archives</ENAMEX>)

- <ENAMEX type="TITLE">Wehrmacht officer</ENAMEX>
- <ENAMEX type="TITLE">Wehrmacht officers</ENAMEX>
- <ENAMEX type="TITLE">German SS officers</ENAMEX>
- <ENAMEX type="TITLE">senior military officers</ENAMEX>
TITLE not TITLE
member - Member of Parliament
- Member of Congress
- Board member
- members of the British Royal Family
- the Eurozone members
- members of the SS
- party member
leader - Great Leader of North Korea
- Supreme Leader of Iran
- leader of the Zionist movement
- Nazi leader
- council leaders of the ghetto
- Jewish resistance leaders

issues #12 and #33


UNKNOWN

➡ Entities not covered by another class.

➡ Examples:

  • Plan Marshall
  • Horizon 2020 (a funding programme)
  • Antisemitism Yellowbadge logo
  • Yellow badge
  • Aktion T4 euthanasia programme
  • Aktion T4

issue #39


WEBSITE


Miscellaneous

➡ the classes may apply to fictive entities, for example:

- a multipurpose hand tool, the <ENAMEX type="ARTIFACT">"Lobotomizer"</ENAMEX> or
 <ENAMEX type="ARTIFACT">"Lobo"</ENAMEX> (...), for close-quarters combat.

- a tactic re-invented (...) during the "<ENAMEX type="EVENT">Great Panic</ENAMEX>"

issue #24

➡ There is no specific class for foreign words. They are annotated in one of the existing classes, if relevant (whether they are written in Latin or non-Latin characters). Otherwise they are not annotated. In all cases, they are identified in parallel by another attribute, orthogonal to the entity class (issue #37).

➡ When foreign-word entities are translated, the translation may be annotated together with the original entity. This depends on whether the translation is presented as a Named Entity or is, on the contrary, more explanatory / descriptive (issue #27). A few examples:

TRANSLATION NOT ANNOTATED
- the existence of a <ENAMEX type="CONCEPT">Volksgemeinschaft</ENAMEX> ("people's community")
- they required more <ENAMEX type="CONCEPT">Lebensraum</ENAMEX> ("living space")
- the politician was taken to the <ENAMEX type="INSTITUTION">Questura di Milano</ENAMEX> (central police station) for questioning

TRANSLATION ANNOTATED
- people use the <ENAMEX type="INSTITUTION">Sécurité Sociale (Social Security)</ENAMEX>
- The <ENAMEX type="INSTITUTION">Archives Générales du Royaume (National Archives of Belgium)</ENAMEX>
- <ENAMEX type="INSTITUTION">Archives de l’État dans les Provinces (State Archives in the Provinces)</ENAMEX>

➡ Generic terms in referring expressions are not annotated, even if they refer to a named entity. Example:

  • Germany was losing the war (refers to an EVENT)
  • broader trends in world history (refers to a LOCATION)
  • bringing the first credible news to the world of the mass murder that was taking place there (refers to a LOCATION)

issue #45

➡ Punctuation marks (like quotation marks) are left outside the tags, for example: "<ENAMEX type="PERSON_TYPE">socialists</ENAMEX>" (issue #26).
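
As an illustration of this convention, the following Python sketch wraps a mention in an ENAMEX tag while keeping surrounding quotation marks outside of it. The names QUOTES and tag_mention are hypothetical, and only quotation marks are handled, since other punctuation can belong to the entity (for example the period in Mr.).

```python
# Quotation marks that are kept outside the ENAMEX tag (illustrative list).
QUOTES = "\"'“”‘’«»"


def tag_mention(mention: str, label: str) -> str:
    """Wrap the mention in an ENAMEX tag, leaving outer quotation marks outside."""
    start = len(mention) - len(mention.lstrip(QUOTES))
    core = mention.strip(QUOTES)
    end = start + len(core)
    return f'{mention[:start]}<ENAMEX type="{label}">{core}</ENAMEX>{mention[end:]}'
```

Quotation marks inside the mention, such as the apostrophe in Ballon d'or, are left untouched because only the two ends of the string are stripped.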

➡ Currencies alone (pound sterling, US dollar) should not be annotated (issue #23).

➡ When there is a dash, it can be considered a space, for example Nobel prize-winning economist is annotated (issue #31):

<ENAMEX type="AWARD">Nobel prize</ENAMEX>-winning economist

Out of scope

➡ Specific but common concepts already enumerated in Wikipedia, for example patient zero. Indeed, named entity classes correspond more to particular classes of entities that cannot be enumerated exhaustively in advance.

➡ Specialist terminology (biomedical, for example). Other specialized NER are used.

➡ Tables from Wikipedia have been removed from the annotated corpus (issues #49 and #50).

➡ Wikipedia references are deleted, both the markers in the body of the article (for example [44] or [112]) and the full bibliographical strings at the end (issue #48).

Sense information

When possible, sense information is also assigned to entities in the form of one or several WordNet synsets.