GROBID NER identifies named entities and classifies them into 27 classes, compared with the 4-class or 7-class models of most existing open-source NER tools (usually trained on the Reuters/CoNLL 2003 annotated corpus or the MUC annotated corpus).
In addition, entities are often enriched with WordNet sense annotations to support further disambiguation and resolution. GROBID NER was developed to disambiguate and resolve entities against knowledge bases such as Wikipedia and Freebase. Sense information helps to disambiguate an entity because it refines the entity class based on contextual clues.
Named entity classes
Classes quick overview
The following table describes the 27 named entity classes produced by the model.
Class name | Description | Examples |
---|---|---|
ACRONYM | acronym that doesn't belong to another class | DIY, BYOD, IMHO |
ANIMAL | individual name of an animal | Hachikō, Jappeloup |
ARTIFACT | human-made object, including software | FIAT 634, Microsoft Word |
AWARD | award for art, science, sport, etc. | Ballon d'or, Nobel prize |
BUSINESS | company / commercial organisation | Air Canada, Microsoft |
CONCEPT | abstract concept not included in another class | English (as language), Communism, Zionism, FTSE 100, CAC40 |
CONCEPTUAL | entity relating to a concept | Greek myths, eurosceptic doctrine |
CREATION | artistic creation, such as song, movie, book, TV show, etc. | Monna Lisa, Mulholland Drive, Kitchen Nightmares, EU Referendum: The Great Debate, Europe: The Final Debate |
EVENT | event | World War 2, Battle of France, Brexit referendum |
IDENTIFIER | systematized identifier such as phone number, email address, ISBN | 2081396505, weirdturtle@gmail.com |
INSTALLATION | structure built by humans | Strasbourg Cathedral, Sforza Castle, Auschwitz camp |
INSTITUTION | organization of people and a location or structure that share the same name | Yale University, European Patent Office, the British government, European Union, City Police, Eurozone |
LEGAL | legal mentions such as article of law, convention, cases, treaty, etc. | European Patent Convention, Maastricht Treaty, Article 52(2)(c) and (3), Roe v. Wade 410 U.S.113 (1973), European Union Referendum Act 2015 |
LOCATION | physical location, including planets and galaxies. | Los Angeles, Northern Madagascar, Southern Thailand, Channel Islands, Earth, Milky Way, West Mountain, Warsaw Ghetto |
MEASURE | numerical amount, including an optional unit of measure | 1 500, six million, 72%, 50°2′9″N 19°10′42″E, AA+ |
MEDIA | media organization or publication | Le monde, The New York Times |
NATIONAL | relating to a location | North American, German, British |
ORGANISATION | organized group of people, with some sort of legal entity and concrete membership | Alcoholics Anonymous, Jewish resistance, Polish underground |
PERIOD | date, historical era or other time period, time expressions | January, the 2nd half of 2010, 1985-1989, from 1930 to 1945, since 1918, the first four years |
PERSON | first, middle, last names and aliases of people and fictional characters | John Smith |
PERSON_TYPE | person type or role classified according to group membership | African-American, Asian, Conservative, Liberal, Jews, Communist, the British people |
PLANT | name of a plant | Ficus religiosa |
SPORT_TEAM | sport group or organisation | The Yankees |
SUBSTANCE | natural substance | HCN, hydrogen cyanide, gold, asbestos |
TITLE | personal or honorific title, for a person | Mr., Dr., General, President, chairman, doctor, Secretary of State, MP, Prime Minister |
UNKNOWN | entity not belonging to any previous classes | Plan Marshall, ParSiTi, Horizon 2020 |
WEBSITE | website URL or name | Wikipedia, http://www.inria.fr |
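For tooling that consumes these labels, the class names from the table above can be kept in one place and used to validate annotations. The following is a minimal sketch; `NER_CLASSES` and `is_valid_class` are illustrative names and are not part of GROBID NER.

```python
# Hypothetical helper, not part of GROBID NER:
# the 27 class names listed in the overview table.
NER_CLASSES = frozenset({
    "ACRONYM", "ANIMAL", "ARTIFACT", "AWARD", "BUSINESS", "CONCEPT",
    "CONCEPTUAL", "CREATION", "EVENT", "IDENTIFIER", "INSTALLATION",
    "INSTITUTION", "LEGAL", "LOCATION", "MEASURE", "MEDIA", "NATIONAL",
    "ORGANISATION", "PERIOD", "PERSON", "PERSON_TYPE", "PLANT",
    "SPORT_TEAM", "SUBSTANCE", "TITLE", "UNKNOWN", "WEBSITE",
})


def is_valid_class(label: str) -> bool:
    """Return True if `label` is one of the 27 class names (case-sensitive)."""
    return label in NER_CLASSES
```

Note that the class is spelled ORGANISATION, so a check like this also catches the common ORGANIZATION misspelling.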
Classes Specific guidelines
ACRONYM
Acronyms that don't belong to another class. For example:
- DIY: ACRONYM
but
- the United Nations (UN): United Nations and UN are tagged ORGANISATION
- WW1: EVENT
ANIMAL
ARTIFACT
Human-made object, including software.
AWARD
BUSINESS
CONCEPT
➡ Sometimes an entity, in isolation, can be ambiguous, for example British. When it refers to the British English language, it's annotated CONCEPT. (issues #29 and #30).
➡ Economic indexes and bonds are CONCEPT, for example US Dow Jones Industrial Average, CAC40, CETES (issue #54).
CONCEPTUAL
➡ Entity relating to a concept.
➡ Disambiguation between PERSON_TYPE and CONCEPTUAL
 | PERSON_TYPE | CONCEPTUAL |
---|---|---|
Criteria | refers to people, a folk, a group of people | modifies a common noun |
Examples | the eurosceptics held a protest; the Communists held a protest; the Jewish held a protest | the eurosceptic doctrine; the communist doctrine; Zionist newspapers; Greek myths; Christian newspapers |
CREATION
➡ Artistic creation, such as song, movie, book, TV show, etc (issue #19).
➡ Full bibliographical references are not annotated (issue #48).
EVENT
➡ PERIOD vs. EVENT: an event defines a period, but a period is not necessarily an event. When an expression denotes a period through an event, it is annotated as EVENT, for example:
- during the time of the Nazi occupation: EVENT
- during the Czarist regime: EVENT
IDENTIFIER
INSTALLATION
➡ Sometimes a LOCATION name refers to an INSTALLATION name. In that case it's annotated as INSTALLATION. For example Nazi camps (issue #42):
- <ENAMEX type="INSTALLATION">Auschwitz</ENAMEX>
- <ENAMEX type="INSTALLATION">Lager Nordhausen</ENAMEX>
- <ENAMEX type="INSTALLATION">Mittelbau-Dora</ENAMEX>
- <ENAMEX type="INSTALLATION">Mauthausen-Gusen concentration camp</ENAMEX>
INSTITUTION
➡ Criteria to distinguish between ORGANISATION and INSTITUTION:
ORGANISATION | INSTITUTION |
---|---|
organised group of persons | group of persons which share a structure/a location |
group of people within an institution | entity representing -on its own- a stable institution |
random subset of an organization/institution | something established with some autonomy (ex. city police, train police, auxiliary police) |
➡ INSTITUTION vs LOCATION: an INSTITUTION entity is defined as a set of legal entities and not as a fixed location. Examples:
European Union, which is not defined by a (fixed) location, but as a set of legal entities with a treaty and particular instances. It may become a fixed location after a long period of integration (like the USA, where the Federal State is an institution).
Eurozone, which is a group of nations bound by a treaty, a monetary union.
➡ There is no ambiguity between INSTITUTION and PERSON_TYPE. Therefore, even if an INSTITUTION entity applies to a group of people, it won't be annotated PERSON_TYPE, for example:
<ENAMEX type="INSTITUTION">European Union</ENAMEX> citizens
LEGAL
➡ Legal mentions such as article of law, convention, cases, treaties, etc.
➡ There is a gradation between too general and more specific terms, for example:
- Nazi policies, Nazi social policies: too general, not to be annotated (except for Nazi as PERSON_TYPE)
- anti-Jewish legislation, anti-semitic policy: more specific, annotated LEGAL
LOCATION
➡ LOCATION vs INSTITUTION: an INSTITUTION entity is defined as a set of legal entities (for example European Union) and not a fixed location (issue #29).
➡ There is no disambiguation at this level between the different uses of country names (as location, government, army, etc.) (issue #29). For example in:
Austria invaded Italy
they're both annotated LOCATION even though here Austria refers to Austria's army and Italy to the location.
➡ When there are modifiers (geographical, political, etc.) alongside the location, they are included in the entity, as long as the result refers to a territory. For example:
- <ENAMEX type="LOCATION">Suvalkų area</ENAMEX>
- <ENAMEX type="LOCATION">Pakruojis local rural district</ENAMEX>
- <ENAMEX type="LOCATION">coast of Honolulu</ENAMEX>
- <ENAMEX type="LOCATION">German-occupied Poland</ENAMEX>
- <ENAMEX type="LOCATION">Nazi Germany</ENAMEX>
➡ The articles and prepositions (from, the) are not included in the entity.
➡ In some cases surrounding elements are not included in the entity, for example "west of the" in:
They established safe zones west of the <ENAMEX type="LOCATION">Rocky Mountains</ENAMEX>.
MEASURE
➡ Markers of intervals like over or more are included in the MEASURE tag, example (issue #43):
<ENAMEX type="MEASURE">Over 7,000</ENAMEX> shops and <ENAMEX type="MEASURE">more than 1,200</ENAMEX> synagogues were damaged or destroyed.
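The inline ENAMEX notation used throughout these guidelines can be read back with a short script, for instance to check that interval markers such as Over and more than end up inside the MEASURE span. The following is a minimal sketch, assuming annotations are never nested; `ENAMEX_RE` and `extract_entities` are illustrative names, not part of GROBID NER.

```python
import re

# Matches one inline annotation: <ENAMEX type="CLASS">text</ENAMEX>.
# Assumption: annotations are flat (no ENAMEX inside another ENAMEX).
ENAMEX_RE = re.compile(r'<ENAMEX type="([A-Z_]+)">(.*?)</ENAMEX>', re.DOTALL)


def extract_entities(annotated: str) -> list[tuple[str, str]]:
    """Return (class, surface text) pairs, collapsing internal whitespace."""
    return [
        (match.group(1), " ".join(match.group(2).split()))
        for match in ENAMEX_RE.finditer(annotated)
    ]


sentence = (
    '<ENAMEX type="MEASURE">Over 7,000</ENAMEX> shops and '
    '<ENAMEX type="MEASURE">more than 1,200</ENAMEX> synagogues '
    'were damaged or destroyed.'
)
print(extract_entities(sentence))
# [('MEASURE', 'Over 7,000'), ('MEASURE', 'more than 1,200')]
```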
➡ MEASURE is an exception to the Longest Entity Match convention (issue #32): a MEASURE entity is annotated separately only if it is at the beginning of the noun phrase, for example:
- <ENAMEX type="MEASURE">45</ENAMEX> <ENAMEX type="PERSON">presidents of the USA</ENAMEX>
- <ENAMEX type="MEASURE">900</ENAMEX> <ENAMEX type="PERSON_TYPE">Jews</ENAMEX>
➡ Ordinals (ex. first, second) (issue #14)
- They should be annotated as MEASURE, as long as they indicate a numerical order in a scale or quantify something (size, date, etc.) that we can enumerate. For example:
  - The history can be divided into four periods: the first, from 1919 to 1940
  - there occurred a boycott of Jewish businesses, which was the first national antisemitic campaign (the "first campaign" is the boycott)
  - second place in the 2009 European elections and first place in the 2014 European elections
  - this was the first time since the 1910 general election
- But referring expressions, or ordinals not really ordering or quantifying, should not be annotated MEASURE. For example:
  - Phrases like among the first to be sent to concentration camps, or one of the first, where there is no notion of scale but rather of "beginning".
  - Plurals like in the first Jews to be deported, or These were their first elected MPs
  => in these examples it's impossible to enumerate precisely what is "first". Furthermore, it can't really be replaced by "second" or "third".
➡ Expressions measuring nothing are not to be annotated, for example (issue #14):
One of the founders of the Revisionist movement
➡ GPS coordinates are a MEASURE (numerical amounts + units), for example 50°2′9″N 19°10′42″E (issue #44).
➡ Credit ratings like AA1, AA+ are MEASURE (issue #54).
MEDIA
➡ To distinguish between NATIONAL and MEDIA: in British TV and Lithuanian TV, British and Lithuanian should be annotated as NATIONAL, whereas BBC or CNN should be annotated as MEDIA. Examples:
<ENAMEX type="NATIONAL">Lithuanian</ENAMEX> TV
<ENAMEX type="MEDIA">BBC</ENAMEX>
NATIONAL
➡ Disambiguation between PERSON_TYPE and NATIONAL
 | PERSON_TYPE | NATIONAL |
---|---|---|
Criteria | refers to people, a folk, a group of people | refers to a LOCATION |
Examples | the British are great people; the British emigrants; the British people are not great | a British newspaper; a British historian |
ORGANISATION
➡ Ethnic communities are not included in the class ORGANISATION, but in PERSON_TYPE (issue #28).
➡ Criteria to distinguish between ORGANISATION and INSTITUTION:
ORGANISATION | INSTITUTION |
---|---|
organised group of persons | group of persons which share a structure/a location |
group of people within an institution | entity representing -on its own- a stable institution |
random subset of an organization/institution | something established with some autonomy (ex. city police, train police, auxiliary police) |
➡ Sometimes another entity type is included in the ORGANISATION, according to the largest entity match principle, for example:
<ENAMEX type="ORGANISATION">Zionist movement</ENAMEX>
<ENAMEX type="ORGANISATION">Central Committee of the Zionist Union</ENAMEX>
(issue #15)
PERIOD
➡ Date, historical era or other time period, including time measurements like a week, one day, which are quantified measures of time (a PERIOD is a MEASURE but the opposite is not always true, so PERIOD, more specific, wins). (issue #41)
➡ The PERIOD may include precise elements like:
<ENAMEX type="PERIOD">mid afternoon on 27 June 2016</ENAMEX>
➡ Surrounding elements are included in the entity only if they qualify the range of the period and/or change the period type. In each of the following, the whole expression is tagged PERIOD:
- since 1930
- from 1930, from 1930 to 1945
- after 1930, before 1930
- next decade, last decade
- 7 years after the war
- between 2010 and 2015
but
- as early as the 1930s: only 1930s is tagged PERIOD, because as early as doesn't change the period (the 1930s).
- during 1930 and in 1930: the prepositions don't change the period interval, only 1930 is tagged PERIOD.
- seven-year low: only seven-year is tagged PERIOD, since low has no impact on the PERIOD
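The inclusion rule illustrated by the examples above can be sketched as a lookup on the surrounding element. This is only an illustration under stated assumptions: the marker list is taken from the examples in this section and is not exhaustive, and `RANGE_MARKERS` and `period_span` are hypothetical names, not part of GROBID NER.

```python
# Surrounding elements that, in the examples above, change the range or
# type of the period and are therefore kept inside the PERIOD entity.
# Assumption: this list only covers the examples of this section.
RANGE_MARKERS = {"since", "from", "after", "before", "next", "last", "between"}


def period_span(marker: str, core: str) -> str:
    """Return the text to tag as PERIOD for `marker` followed by `core`.

    The marker is kept only when it qualifies the range of the period;
    otherwise only the core time expression is tagged.
    """
    if marker.lower() in RANGE_MARKERS:
        return f"{marker} {core}"
    return core


print(period_span("since", "1930"))        # since 1930
print(period_span("during", "1930"))       # 1930
print(period_span("as early as", "1930s")) # 1930s
```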
➡ Some terms may be too vague to be annotated as PERIOD, for example the adjective prewar. They may be annotated together with other elements, for example as part of a LOCATION in the following case:
<ENAMEX type="LOCATION">prewar Nazi Germany</ENAMEX>
➡ Intervals of time defined with surrounding elements like after or since are only considered PERIODs if they are defined relative to an EVENT. For example these entities are all PERIODs (surrounding element + EVENT):
- <ENAMEX type="PERIOD">7 years after the war</ENAMEX> there was a great boom
- <ENAMEX type="PERIOD">Ten years after the official end of the zombie war</ENAMEX>
but these aren't (surrounding element and its dependencies excluded):
- withhold social benefits to new immigrants for the <ENAMEX type="PERIOD">first four years</ENAMEX> after they arrived
- The Treaties shall cease to apply to the State in question (...) <ENAMEX type="PERIOD">two years</ENAMEX> after the notification referred to in paragraph 2
- <ENAMEX type="PERIOD">Seven years</ENAMEX> after the outbreak began
➡ PERIOD vs. EVENT: an event defines a period, but a period is not necessarily an event. When an expression denotes a period through an event, it is annotated as EVENT, for example:
- during the time of the Nazi occupation: EVENT
- during the Czarist regime: EVENT
PERSON
PERSON_TYPE
➡ Even though it's an approximation, entities like Jewry (which means Jewish community) are included in this class. (issue #28)
➡ Some entities, in isolation, may belong to several classes, depending on the context. For example British in isolation can be labelled:
- NATIONAL when introducing a relation to Great Britain (LOCATION):
A <ENAMEX type="NATIONAL">British</ENAMEX> historian
- PERSON_TYPE when it is clear that it refers to the folks, not just in relation to a location:
The <ENAMEX type="PERSON_TYPE">British</ENAMEX> are great people.
The <ENAMEX type="PERSON_TYPE">British</ENAMEX> emigrants
The <ENAMEX type="PERSON_TYPE">British</ENAMEX> people are not great
- CONCEPT when referring to the British language
- CONCEPTUAL when referring to a CONCEPT, for example here the British culture:
The <ENAMEX type="CONCEPTUAL">British</ENAMEX> folklore.
➡ Disambiguation between PERSON_TYPE and NATIONAL
 | PERSON_TYPE | NATIONAL |
---|---|---|
Criteria | refers to people, a folk, a group of people | refers to a LOCATION |
Examples | the British are great people; the British emigrants; the British people are not great | a British newspaper; a British historian |
➡ Disambiguation between PERSON_TYPE and CONCEPTUAL
 | PERSON_TYPE | CONCEPTUAL |
---|---|---|
Criteria | refers to people, a folk, a group of people | modifies a common noun |
Examples | the eurosceptics held a protest; the Communists held a protest; the Jewish held a protest | the eurosceptic doctrine; the communist doctrine; Zionist newspapers; Greek myths; Christian newspapers |
➡ If an entity in isolation cannot be PERSON_TYPE, it's not annotated PERSON_TYPE even if it fits the criteria above. For example the INSTITUTION European Union cannot be PERSON_TYPE when alone, so EU citizens is annotated:
<ENAMEX type="INSTITUTION">EU</ENAMEX> citizens
PLANT
SPORT_TEAM
SUBSTANCE
TITLE
➡ Personal or honorific title, applied to a person, with a relatively loose definition. The Wikipedia page examples can be useful. For example the following entities are annotated as TITLE: chairman, president, captain.
➡ Generally, job names (ex. economist, carpenter) are not annotated. For some terms, the context determines the annotation. For example, engineer can be a TITLE or not depending on the country:
- In France or Germany it is linked to a specific diploma, so it's annotated as TITLE when the term relates to these countries.
- In the UK or USA, it refers to the job, so it's not annotated.
➡ To decide between TITLE and PERSON:
- If only the TITLE is mentioned, it's annotated TITLE, even though it refers to a person. Examples of entities annotated TITLE:
  - He is the President of the United States.
  - He is CEO of this company.
  - The Chinese Prime minister said this.
  - The Queen
- In case of the largest entity match of TITLE + PERSON, the priority goes to PERSON. For example The President of the United States Barack Obama as a whole is annotated PERSON.
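The TITLE + PERSON priority can be expressed as a small post-processing step over entity spans. The following is a minimal sketch, assuming spans are given as token offsets and that a TITLE directly followed by a PERSON should be merged; `Span` and `merge_title_person` are hypothetical names, not part of GROBID NER.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # index of the first token
    end: int    # index just after the last token
    label: str  # entity class, e.g. "TITLE" or "PERSON"


def merge_title_person(spans: list[Span]) -> list[Span]:
    """Merge a TITLE span immediately followed by a PERSON span into one PERSON span."""
    merged: list[Span] = []
    for span in sorted(spans):
        prev = merged[-1] if merged else None
        if (prev is not None and prev.label == "TITLE"
                and span.label == "PERSON" and prev.end == span.start):
            # Largest entity match: the whole sequence becomes a PERSON.
            merged[-1] = Span(prev.start, span.end, "PERSON")
        else:
            merged.append(span)
    return merged


# Tokens: President(0) of(1) the(2) United(3) States(4) Barack(5) Obama(6)
spans = [Span(0, 5, "TITLE"), Span(5, 7, "PERSON")]
print(merge_title_person(spans))
# [Span(start=0, end=7, label='PERSON')]
```

A TITLE that stands alone, as in He is the President of the United States, is left untouched by this step.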
➡ Various examples
- under the direction of the <ENAMEX type="TITLE">National State Archivist</ENAMEX> (who holds his office in the <ENAMEX type="INSTITUTION">National Archives</ENAMEX>)
- <ENAMEX type="TITLE">Wehrmacht officer</ENAMEX>
- <ENAMEX type="TITLE">Wehrmacht officers</ENAMEX>
- <ENAMEX type="TITLE">German SS officers</ENAMEX>
- <ENAMEX type="TITLE">senior military officers</ENAMEX>
 | TITLE | not TITLE |
---|---|---|
member | Member of Parliament; Member of Congress; Board member | members of the British Royal Family; the Eurozone members; members of the SS; party member |
leader | Great Leader of North Korea; Supreme Leader of Iran | leader of the Zionist movement; Nazi leader; council leaders of the ghetto; Jewish resistance leaders |
UNKNOWN
➡ Entities not covered by another class.
➡ Examples:
- Plan Marshall
- Horizon 2020 (a funding programme)
- Antisemitism Yellowbadge logo
- Yellow badge
- Aktion T4 euthanasia programme
- Aktion T4
WEBSITE
Miscellaneous
➡ The classes may apply to fictional entities, for example:
- a multipurpose hand tool, the <ENAMEX type="ARTIFACT">"Lobotomizer"</ENAMEX> or <ENAMEX type="ARTIFACT">"Lobo"</ENAMEX> (...), for close-quarters combat.
- a tactic re-invented (...) during the "<ENAMEX type="EVENT">Great Panic</ENAMEX>"
➡ There is no specific class for foreign words. They are annotated in one of the existing classes, if relevant (whether they are written in Latin or non-Latin characters). Otherwise they are not annotated. In all cases, they are identified in parallel by another attribute, orthogonal to the entity class (issue #37).
➡ When foreign-word entities are translated, the translation may be annotated together with the original entity. This depends on whether the translation is presented as a named entity or is, on the contrary, more explanatory / descriptive (issue #27). A few examples:
TRANSLATION NOT ANNOTATED
- the existence of a <ENAMEX type="CONCEPT">Volksgemeinschaft</ENAMEX> ("people's community")
- they required more <ENAMEX type="CONCEPT">Lebensraum</ENAMEX> ("living space")
- the politician was taken to the <ENAMEX type="INSTITUTION">Questura di Milano</ENAMEX> (central police station) for questioning
TRANSLATION ANNOTATED
- people use the <ENAMEX type="INSTITUTION">Sécurité Sociale (Social Security)</ENAMEX>
- The <ENAMEX type="INSTITUTION">Archives Générales du Royaume (National Archives of Belgium)</ENAMEX>
- <ENAMEX type="INSTITUTION">Archives de l’État dans les Provinces (State Archives in the Provinces)</ENAMEX>
➡ Generic terms in referring expressions are not annotated, even if they refer to a named entity. Example:
- Germany was losing the war (refers to an EVENT)
- broader trends in world history (refers to a LOCATION)
- bringing the first credible news to the world of the mass murder that was taking place there (refers to a LOCATION)
➡ Punctuation marks (such as quotation marks) are left outside the tags, for example: "<ENAMEX type="PERSON_TYPE">socialists</ENAMEX>" (issue #26).
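When annotations are stored as character offsets, this convention can be enforced by shrinking a span until it no longer starts or ends with a quotation mark. The following is a minimal sketch; the helper name `trim_punctuation` and the exact set of characters in `QUOTES` are assumptions made for illustration.

```python
# Quotation marks to keep outside the tags (an illustrative, non-exhaustive set).
QUOTES = '"“”«»'


def trim_punctuation(text: str, start: int, end: int) -> tuple[int, int]:
    """Shrink the span [start, end) so it neither begins nor ends with a quote or space."""
    strip_chars = QUOTES + " "
    while start < end and text[start] in strip_chars:
        start += 1
    while end > start and text[end - 1] in strip_chars:
        end -= 1
    return start, end


text = 'the "socialists" won'
start, end = trim_punctuation(text, 4, 16)  # span initially covers "socialists" with its quotes
print(text[start:end])
# socialists
```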
➡ Currencies alone (pound sterling, US dollar) should not be annotated (issue #23).
➡ When there is a dash, it can be considered a space, for example Nobel prize-winning economist is annotated (issue #31):
<ENAMEX type="AWARD">Nobel prize</ENAMEX>-winning economist
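One way to apply this convention in a tokenizer is to emit each dash as its own token, so that an entity boundary can fall right before it. The following is a minimal sketch with a hypothetical helper `tokenize`. It treats every hyphen as a separator, which is an assumption: hyphenated names such as Mittelbau-Dora are then split into several tokens, although they can still be covered by a single entity span.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split on whitespace and emit every dash as a separate token."""
    return re.findall(r"[^\s\-]+|-", text)


tokens = tokenize("Nobel prize-winning economist")
print(tokens)
# ['Nobel', 'prize', '-', 'winning', 'economist']

# The AWARD entity can now end at the token just before the dash.
award = " ".join(tokens[0:2])
print(award)
# Nobel prize
```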
Out of scope
➡ Specific but common concepts already enumerated in Wikipedia, for example patient zero. Indeed, named entity classes correspond more to particular classes of entities that cannot be enumerated exhaustively in advance.
➡ Specialist terminology (biomedical, for example). Other specialized NER tools are used for these.
➡ Tables from Wikipedia have been removed from the annotated corpus (issues #49 and #50).
➡ Wikipedia references are deleted, both the markers in the body of the article (for example [44] or [112]) and the full bibliographical strings at the end (issue #48).
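Removing the in-text reference markers is a simple pattern substitution. The following is a minimal sketch, assuming markers always have the form of a number in square brackets as in the examples above; `strip_reference_markers` is a hypothetical helper and does not reflect the script actually used to prepare the corpus.

```python
import re

# A reference marker is assumed to be digits in square brackets, e.g. [44].
REF_MARKER_RE = re.compile(r"\[\d+\]")


def strip_reference_markers(text: str) -> str:
    """Remove Wikipedia-style numeric reference markers from `text`."""
    return REF_MARKER_RE.sub("", text)


print(strip_reference_markers("The war ended in 1945.[44] It began in 1939[112]."))
# The war ended in 1945. It began in 1939.
```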
Sense information
When possible, sense information is also assigned to entities in the form of one or several WordNet synsets.