A named entity is a widespread concept; you encounter examples of it every day. Your name is a named entity. Our platform's name is also a named entity. The names of your city and country are named entities, too. All proper names are named entities, but named entities are not only proper names: today's date, your mobile number, a URL, your home address, and the outside temperature are all named entities.
Named entities surround us. That's why it is crucial to know how to detect them.
Named Entity Recognition
Named Entity Recognition (NER) is the process of detecting named entities in a text. NER can help you find, for example, all organizations mentioned in a text, or occurrences of a particular keyword. Sometimes you want to work with a specific type of entity. If you know how to find all named entities of a particular type in a text, you can use this data in many other NLP tasks: text simplification, named entity linking, machine translation, and so on.
That's why we divide all named entities into classes. The table below features 18 classes, following the Ontonotes classification:
| TYPE | DESCRIPTION (with examples in quotes) |
|---|---|
| PERSON | People, including fictional characters: "John." |
| NORP | Nationalities, religious and political groups: "Jewish," "Buddhist." |
| TIME | Times shorter than a day: "2 PM" |
| ORG | Companies, agencies, institutions: "JetBrains." |
| GPE | Countries, cities, regions (districts): "Texas." |
| LOC | Geographical entities: "the Ganges," "the Himalayas." |
| PRODUCT | Objects, vehicles, foods: "Coca-Cola," "Samsung Galaxy Z." |
| EVENT | Named battles, wars, sports events, catastrophes: "Seven Years' War," "Battle of Poitiers." |
| WORK_OF_ART | Titles of books, songs, and other works of art: "Mona Lisa." |
| PERCENT | Percentage, including "%" |
| LAW | Documents passed as laws |
| DATE | Absolute (e.g., "July 5, 2000") or relative (e.g., "a month ago") dates or periods |
| MONEY | Monetary values: "$300" |
| QUANTITY | Measurements: "10 kg," "200 km" |
| ORDINAL | Ordinal numbers: "first," "third." |
| CARDINAL | Numerals that do not fall under another type |
| FAC | Facilities: buildings, airports, roads: "Alexanderplatz," "Tower Bridge." |
| LANGUAGE | Any named language: "English," "Mandarin Chinese" |
The CoNLL-2003 classification, introduced by the University of Antwerp, offers just four classes: PER, LOC, ORG, and MISC (other).
Both CoNLL-2003 and Ontonotes are annotated English datasets. They are just two of the many datasets used to train NER models for the English language.
NER formats
There are two common tagging formats for NER: BIO (a.k.a. IOB) and BIOES. Both use small sets of symbols, either a simple tag or a prefix, to mark an entity's class and its boundaries. These formats can also mark nested named entities. A nested named entity is a named entity that contains other named entities inside.
Let's examine the formats in more detail:
CONSIDERING that on 29 March 2017 the United Kingdom of Great Britain and Northern Ireland ("United Kingdom")...
For example, 29 March 2017 is a nested named entity, as it has some other entities inside: 29, March, 2017. Additional prefixes help us and the computer realize that 29 March 2017 is one inseparable entity while keeping in mind that there are some other entities inside.
With BIO, there are two prefixes and one simple tag. The B-prefix denotes the beginning of a named entity, the I-prefix means that the token is inside a named entity, and the O-tag means that the corresponding word is not an entity. So, in our example, the token CONSIDERING gets O, which means it is not a named entity; 29 gets the tag B-DATE, March is I-DATE, and 2017 is also I-DATE.
The other format, BIOES, has the same prefixes and tags as BIO plus two additional prefixes. The E-prefix denotes the ending of a named entity, and the S-prefix marks a named entity that consists of only one token. In this format, the date 29 March 2017 will have the following tags: 29 --> B-DATE, March --> I-DATE, 2017 --> E-DATE, while a single-token entity like Euratom would be tagged S-ORG.
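Converting between the two schemes is mechanical. Here is a minimal plain-Python sketch of a BIO-to-BIOES conversion (simplified: it only checks the prefix of the following tag, not whether the entity type matches):

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence to BIOES (simplified: prefix check only)."""
    bioes = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("B-") and not nxt.startswith("I-"):
            bioes.append("S-" + tag[2:])   # single-token entity
        elif tag.startswith("I-") and not nxt.startswith("I-"):
            bioes.append("E-" + tag[2:])   # last token of the entity
        else:
            bioes.append(tag)              # O, B-, and inner I- stay as they are
    return bioes

print(bio_to_bioes(["O", "O", "O", "B-DATE", "I-DATE", "I-DATE"]))
# ['O', 'O', 'O', 'B-DATE', 'I-DATE', 'E-DATE']
```

Note how the same tag sequence changes only at entity boundaries: the last token of a multi-token entity gets E-, and a lone B- becomes S-.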
Legal implementation
NER is a massive step for a machine towards understanding a text. Legal documents tend to contain more named entities, so the computer must find them. Let's take, for instance, a fragment from the "Agreement on the withdrawal of the UK from the EU" (Brexit, 2019):
CONSIDERING that on 29 March 2017 the United Kingdom of Great Britain and Northern Ireland ("United Kingdom"), following the outcome of a referendum held in the United Kingdom and its sovereign decision to leave the European Union, notified its intention to withdraw from the European Union ("Union") and the European Atomic Energy Community ("Euratom") in accordance with Article 50 of the Treaty on European Union ("TEU"), which applies to Euratom by virtue of Article 106a of the Treaty establishing the European Atomic Energy Community ("Euratom Treaty")
Suppose you need to find all organizations mentioned in the text: you can do it just by setting the ORG class. In this text, the following organizations are mentioned: the European Union, the European Atomic Energy Community, Euratom.
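Once entities are recognized, filtering by class is a one-line comprehension. A minimal sketch over hand-written (text, label) pairs standing in for a library's output:

```python
# (text, label) pairs as a NER library might return them (hand-written here)
ents = [("29 March 2017", "DATE"), ("the United Kingdom", "GPE"),
        ("the European Union", "ORG"), ("Euratom", "ORG")]

# keep only the organizations
orgs = [text for text, label in ents if label == "ORG"]
print(orgs)  # ['the European Union', 'Euratom']
```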
Information extraction and machine translation
You can use NER to extract crucial information from a text. Imagine you have a dataset for all air flights of an airport. In this dataset, you have columns for the flight number, destination, and time. Take this text, for example:
Emirates flight EK897 departs to Manila at 2 PM.
You can extract the information from this sentence and fill in your dataset. You will end up with the named entities of the location class in the destination column, time class entities in the time column, and so on.
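Once a NER step has produced (text, label) pairs, filling a dataset row is a matter of mapping entity classes to columns. A minimal sketch with hypothetical column names and hand-written entity pairs (the labels are what an Ontonotes-style model would typically assign; a real model's output may differ):

```python
# Entity (text, label) pairs for the sentence
# "Emirates flight EK897 departs to Manila at 2 PM." (hand-written here)
ents = [("Emirates", "ORG"), ("EK897", "PRODUCT"),
        ("Manila", "GPE"), ("2 PM", "TIME")]

# map entity classes to dataset columns (hypothetical column names)
column_for = {"PRODUCT": "flight_number", "GPE": "destination", "TIME": "time"}

row = {column_for[label]: text for text, label in ents if label in column_for}
print(row)  # {'flight_number': 'EK897', 'destination': 'Manila', 'time': '2 PM'}
```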
Or suppose you would like to machine-translate this text into French. Then there would be a problem with how to translate the token Emirates. Emirates is a short form of United Arab Emirates. If this word is identified as a geopolitical entity, your program will translate it into French as Émirats. But if it is classified as an organization (the airline), your program will transliterate the token, because entities like ORG or FAC are generally transliterated. Stanza, for example, correctly classifies this token as ORG, while Spacy classifies it as GPE. Depending on how you recognize and classify a named entity, a machine translation program decides what to do with a particular token. You will face such problems everywhere in NLP.
NER in Spacy
Make sure that you have SpaCy installed. We will work with the en_core_web_sm model for English. First, download it:
pip install spacy
python -m spacy download en_core_web_sm
Don't forget to import SpaCy and load the model:
import spacy
nlp = spacy.load("en_core_web_sm")
We will test a fragment from the "Agreement on the withdrawal of the UK from the EU" in Spacy. You can print out all named entities in the text using the following code:
text = """CONSIDERING that on 29 March 2017 the United Kingdom of Great Britain and Northern Ireland ("United Kingdom"), following the outcome of a referendum held in the United Kingdom and its sovereign decision to leave the European Union, notified its intention to withdraw from the European Union ("Union") and the European Atomic Energy Community ("Euratom") in accordance with Article 50 of the Treaty on European Union ("TEU"), which applies to Euratom by virtue of Article 106a of the Treaty establishing the European Atomic Energy Community ("Euratom Treaty")"""
doc = nlp(text)  # we use a part of the Brexit agreement
for ent in doc.ents:
    print(ent.text)  # print out each entity
Check this code on your computer. Spacy will identify all named entities. Notably, Spacy recognizes "the United Kingdom of Great Britain and Northern Ireland" as two separate entities: "the United Kingdom of Great Britain" and "Northern Ireland". This is because SpaCy assumes we are listing several objects, with the conjunction and acting as a separator. It is a common problem, since many proper names contain an and: Republic of Trinidad and Tobago, Antigua and Barbuda, and so on.
Now let's visualize our entity recognizer. Import displacy, an official part of the SpaCy core library. In the render function, specify the style parameter style='ent'. If you work in Jupyter Notebook or Google Colab, also specify jupyter=True.
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)
You can also use the serve function to get the visualization on a web page. You can set a specific port or make the port selection automatic with the auto_select_port argument. This function is useful if you are not working in Python notebooks.
displacy.serve(doc, style="ent", auto_select_port=True)
You will get this:
Displacy highlights all entities. Here you can also see how Spacy classifies the given entities. You may notice that most of them are classified correctly, while some are not: Euratom and Euratom Treaty are wrongly identified as WORK_OF_ART, while the correct class is LAW.
NER in NLTK
NER implementation in NLTK is somewhat poorer. It's rarely used but is worth mentioning anyway. The default NE chunker in NLTK is a maximum entropy chunker trained on the ACE corpus. It cannot recognize dates, times, and so on.
First, you need to download 4 NLTK packages:
import nltk
nltk.download('words')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
Unlike SpaCy, you need to word-tokenize and POS-tag your text in NLTK before implementing NER:
from nltk import word_tokenize, pos_tag
tagged = pos_tag(word_tokenize(text))
Finally, search for entities in the text:
print(nltk.ne_chunk(tagged))
# (S
# CONSIDERING/NN
# that/WDT
# on/IN
# 29/CD
# March/NNP
# 2017/CD
# the/DT
# (ORGANIZATION United/NNP Kingdom/NNP)
# of/IN
# (GPE Great/NNP Britain/NNP)
# and/CC
# (GPE Northern/NNP Ireland/NNP)
# (/(
# ``/``
# (GPE United/NNP Kingdom/NNP)
# ...
You will get a sentence tree where each token has its own POS tag. If there is a named entity, NLTK will group it into a subtree labeled with the entity class. You can see that NLTK cannot identify 29 March 2017 as a named entity, although Great Britain is correctly classified as a geopolitical entity. You can also notice that NLTK, for some reason, recognized the United Kingdom as an organization.
And with the following code, you will get a tree where named entities are simply marked as NE, without class labels (that's what binary=True does):
nltk.ne_chunk(nltk.tag.pos_tag(text.split()), binary=True)
We do not show you the output because the tree is wide. You can still try building the syntax tree by yourself.
NER in Stanza
Now let's try the same thing in Stanza. Again, make sure you have it installed and imported:
pip install stanza
import stanza
stanza.download('en')
nlp = stanza.Pipeline(lang="en")
In Stanza, the list of all entities in the text is stored in doc.ents. To print all of the entities one by one, use a for loop and the .text attribute. To see an entity's class, use the .type attribute.
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ':', ent.type)
You will get this:
# 29 March 2017 : DATE
# the United Kingdom : GPE
# Great Britain : GPE
# Northern Ireland : GPE
# United Kingdom" : GPE
# the United Kingdom : GPE
# the European Union : ORG
# the European Union : ORG
# Union : ORG
# the European Atomic Energy Community : ORG
# Euratom : ORG
# Article 50 of the Treaty on European Union ("TEU") : LAW
# Euratom : GPE
# Article 106a of the Treaty establishing the European Atomic Energy Community ("Euratom Treaty : LAW
Here we see that Stanza identified Euratom correctly as ORG but only in the first case.
NER in Flair
Flair is another powerful NLP library, developed at the Humboldt University of Berlin. It has two state-of-the-art NER models for English: a four-class model (CoNLL-2003) and an eighteen-class one (Ontonotes). We will check out the second one, but first, install the library:
pip install flair
In Flair, process your text through the Python class Sentence():
import flair
from flair.data import Sentence
sentence = Sentence(text)
After that, load your model with the SequenceTagger class. Don't forget that we use the flair/ner-english-ontonotes-large model with 18 entity classes.
from flair.models import SequenceTagger
tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")
tagger.predict(sentence)
Finally, we can print out the result.
for entity in sentence.get_spans('ner'):
    print(entity)
You will get an output like this:
Span[3:6]: "29 March 2017" → DATE (1.0)
Span[6:15]: "the United Kingdom of Great Britain and Northern Ireland" → GPE (1.0)
Span[16:18]: "United Kingdom" → GPE (1.0)
Span[28:31]: "the United Kingdom" → GPE (1.0)
Span[37:40]: "the European Union" → ORG (1.0)
Span[47:50]: "the European Union" → ORG (1.0)
Span[51:52]: "Union" → ORG (1.0)
Span[54:59]: "the European Atomic Energy Community" → ORG (1.0)
Span[60:61]: "Euratom" → ORG (1.0)
Span[65:73]: "Article 50 of the Treaty on European Union" → LAW (0.9995)
Span[74:75]: "TEU" → LAW (0.999)
Span[80:81]: "Euratom" → ORG (0.9999)
Span[84:95]: "Article 106a of the Treaty establishing the European Atomic Energy Community" → LAW (0.9997)
Span[96:98]: "Euratom Treaty" → LAW (0.9546)
Span[x:y] contains the token indices of the named entity: x is the index of its first token, and y is the index one past its last token. The number in parentheses is the model's confidence that the identified class is correct for the given named entity.
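The span indices work like Python slicing over the token sequence. A rough check with naive whitespace tokenization (Flair's own tokenizer also splits off punctuation, so on the full text the indices can drift slightly):

```python
tokens = ("CONSIDERING that on 29 March 2017 the United Kingdom "
          "of Great Britain and Northern Ireland").split()

print(" ".join(tokens[3:6]))   # 29 March 2017
print(" ".join(tokens[6:15]))  # the United Kingdom of Great Britain and Northern Ireland
```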
NER in DeepPavlov
DeepPavlov is a conversational AI framework that also offers an interesting NER technique. To install it, use the following code:
pip install deeppavlov
DeepPavlov offers NER models trained on four different datasets and classifications: Ontonotes, CoNLL-2003, DSTC2, and VLSP-2016. We will use the ner_ontonotes_bert model. Let's install it:
python -m deeppavlov install ner_ontonotes_bert
You can use the installed model with this code:
from deeppavlov import configs, build_model
ner_model = build_model(configs.ner.ner_ontonotes_bert, download=True)
Now, let's implement NER on this model:
ents = ner_model([text])
print(ents)
# [[['CONSIDERING', 'that', 'on', '29', 'March', '2017', 'the', ...]],
# [['O', 'O', 'O', 'B-DATE', 'I-DATE', 'I-DATE', 'B-GPE', ...]]]
You will get a list of two lists: the first one contains a list of tokens, and the second one contains a list of tags (entity classes) in the same order as the tokens.
The tagging scheme you see in DeepPavlov is the BIO format we covered earlier in this topic. It's the first time we put this format into practice.
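The parallel token and tag lists can be decoded back into whole entities by grouping B-/I- runs. A minimal plain-Python sketch over a shortened, hand-written fragment of such output:

```python
# shortened, hand-written fragment of BIO-tagged output
tokens = ["on", "29", "March", "2017", "the", "United", "Kingdom"]
tags = ["O", "B-DATE", "I-DATE", "I-DATE", "B-GPE", "I-GPE", "I-GPE"]

entities = []
current_tokens, current_type = [], None
for token, tag in zip(tokens, tags):
    if tag.startswith("B-"):
        if current_tokens:  # flush the previous entity, if any
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [token], tag[2:]
    elif tag.startswith("I-") and current_tokens:
        current_tokens.append(token)  # continue the current entity
    else:  # an O tag closes any open entity
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None
if current_tokens:  # flush the last entity
    entities.append((" ".join(current_tokens), current_type))

print(entities)  # [('29 March 2017', 'DATE'), ('the United Kingdom', 'GPE')]
```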
The raw output makes it hard to understand which tag corresponds to which token. So, if you want a clear and illustrative output, use the code below:
for i in range(len(ents[0][0])):
    print(ents[0][0][i], ' : ', ents[1][0][i])
# CONSIDERING : O
# that : O
# on : O
# 29 : B-DATE
# March : I-DATE
# 2017 : I-DATE
# the : B-GPE
# United : I-GPE
# Kingdom : I-GPE
# ...
# which : O
# applies : O
# to : O
# Euratom : B-LAW
Comparing different libraries for NER implementation
We tested our text on five different NLP libraries and saw that each library has its own approach to the NER task; the differing results are the evidence. Consider just the word Euratom, the first of the three times it appears in the text: the DeepPavlov model identified it as GPE, Stanza and Flair as ORG, Spacy as WORK_OF_ART, and NLTK as PERSON.
Here you can compare the libraries we used:
| | Spacy | NLTK | Stanza | Flair | DeepPavlov |
|---|---|---|---|---|---|
| Number of languages with NER available | 22 | 1 | 22 | 4 | 2 |
| Visualization | + | + | - | - | - |
| Ontonotes classification | + | - | + | + | + |
| Has prefixes for entity classes | - | - | - | - | + |
| Number of models available for English | 4 | 1 | 2 | 2 | 6 |
As elaborate as all these NER methods may seem, some issues are yet to be resolved. For instance, the NER modules in SpaCy and Stanza can't resolve semantic ambiguity.
Take, for example, this sentence:
I read London's book
Both libraries will think that London is a geopolitical entity, while here it is part of the author's name. But if we say:
I read Jack London's book
Then both libraries will identify London as a person.
Solving semantic ambiguity is a whole different problem in NLP. It can be addressed with the help of Word Sense Disambiguation (WSD) and Named Entity Linking methods.
Conclusion
To sum up, named entity recognition is one of the core NLP tasks. In this topic, we have learned:
- Main concepts and challenges of named entity recognition;
- Named entity classifications;
- Formats for named entities;
- How to implement NER tasks in five NLP libraries, including NLTK, Stanza, and Spacy.
Of course, there are many other libraries for NER. Take, for instance, Polyglot, which offers 40 languages for NER. It's always a good idea to take a look at their documentation.
Now, let's put everything we have learned into practice!