Stanza is one of many libraries for text processing. It provides a collection of tools for 66 natural languages, including Russian, English, Bulgarian, and Czech. It allows users to split texts into sentences and words, extract base word forms and their morphological features, analyze the syntactic structure of sentences, and train new models.
In this topic, we are going to cover the basic features of Stanza.
Installation
To use Stanza, you need to install it with pip:
pip install stanza
You can import it afterward:
import stanza
It is essential to download a pre-trained Stanza language model before starting any NLP tasks. You can do it using the stanza.download() command. You can either specify a full language name (for instance, "English") or use a short code (for instance, "en"):
stanza.download("en")
You can find more information on models and languages in the Available Models & Languages section of the official Stanza documentation.
Of course, Stanza can't cover all existing human languages, but you can add a new language and train a Stanza package for it. For more information, visit this page, which contains a step-by-step description of adding a new language.
Stanza may require additional packages and programs during installation. The official Stanza installation guide describes different installation options, so you can find those suitable for you. If you still have problems with the library, you can run it in the Google Colab environment.
Stanza vs SpaCy
In previous topics, we discussed the differences between SpaCy and NLTK. This time, we are going to cover Stanza's advantages over SpaCy. Their main features are presented in the table below.
Criteria | SpaCy | Stanza
Number of Supported Languages | 15 | 66
Raw Text Processing | Yes | Yes |
Fully Neural System | No | Yes |
Pretrained Models | Yes | Yes |
State-of-the-art Performance | No | Yes |
Stanza can be a useful alternative to other popular NLP toolkits. It adapts easily to different kinds of texts and delivers state-of-the-art performance compared to existing libraries. If you want to learn more about the differences between Stanza and other NLP libraries, you can read the article Stanza: A Python Natural Language Processing Toolkit for Many Human Languages published by the Association for Computational Linguistics.
NLP in Stanza
Stanza organizes text processing as a pipeline of processors. Processors in Stanza are the procedures used for particular text processing tasks: tokenization, lemmatization, and so on. To work with them, you first need to initialize a pipeline using stanza.Pipeline():
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse,ner,sentiment")
We store the instance of the Pipeline class in the nlp variable, which we are going to use later on. Note that it accepts the language you are going to work with, as well as the names of the desired processors. If you do not need some of the processors for your experiments, you may omit them.
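For example, a lighter pipeline that only tokenizes text and tags parts of speech could look like this (a minimal sketch; the variable name nlp_light is ours):
# A lighter pipeline with only the tokenizer and the POS tagger
nlp_light = stanza.Pipeline(lang="en", processors="tokenize,pos")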
To start working with a text, we first need to create the doc variable by passing our text as an argument to the nlp pipeline:
doc = nlp('There are a lot of vegetables in my kitchen garden! My granny will gather all of them!')
And that is basically it! Our text is now processed with all the specified processors; all we need is to access this information. For example, the text is split into sentences and tokens that can be accessed by iterating over them, as in the following example:
for sentence in doc.sentences:
print(sentence.text)
# There are a lot of vegetables in my kitchen garden!
# My granny will gather all of them!
second_sentence = doc.sentences[1]
for word in second_sentence.words:
print(word.text)
# My
# granny
# will
# gather
# all
# of
# them
# !
Let's have a closer look at doc.sentences.
It produces a list of lists. Each sublist represents one sentence and contains dictionaries with information about each token: its id in the sentence, its morphological features, its headword, and all other information that our pipeline covers. We will explain most of these fields in more detail in the following sections. For now, remember the general structure of doc.sentences.
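To inspect this structure yourself, you can dump the processed document into plain Python objects using Stanza's to_dict() helper (a minimal sketch; we assume the pipeline and doc from the examples above):
# Convert the document into a list of lists of dictionaries
# and look at the first two tokens of the first sentence
from pprint import pprint
pprint(doc.to_dict()[0][:2])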
POS-tagging and Lemmatization
We can use Stanza to obtain lemmas and POS tags for each word in a sentence. Use the word.lemma attribute to obtain lemmas, the word.pos attribute for POS tags, and the word.feats attribute for the morphological features of each word form:
doc = nlp('Call the police! My cat is missing!')
for sentence in doc.sentences:
for word in sentence.words:
print(word.text, '-->', word.lemma, ':', word.feats)
# Call --> call : Mood=Imp|VerbForm=Fin
# the --> the : Definite=Def|PronType=Art
# police --> police : Number=Plur
# ! --> ! : None
# My --> my : Number=Sing|Person=1|Poss=Yes|PronType=Prs
# cat --> cat : Number=Sing
# is --> be : Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
# missing --> missing : Degree=Pos
# ! --> ! : None
There are a lot of different tags that specify the morphological features of words. For instance, PronType=Art refers to an article, and Poss=Yes means that the pronoun is possessive. See the Universal Features section of the official documentation to find more information about the tags.
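The example above prints lemmas and features; you can print the POS tags in the same loop using word.pos (the output below is illustrative of what the English model typically predicts):
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, '-->', word.pos)
# Illustrative output:
# Call --> VERB
# the --> DET
# police --> NOUN
# ! --> PUNCT
# My --> PRON
# cat --> NOUN
# is --> AUX
# missing --> ADJ
# ! --> PUNCT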
Syntactic Parsing
In Stanza, we can also parse the syntactic relations between words in a sentence. Each word in the sentence has a head, and the dependency relation between the words is shown with deprel. The dependency relation can refer to the syntactic properties of a sentence, semantic properties, or their combination.
doc = nlp("The cat is here!")
for sent in doc.sentences:
for word in sent.words:
print(word.text, '<--', sent.words[word.head-1].text if word.head > 0 else "root", ':', word.deprel)
# The <-- cat : det
# cat <-- here : nsubj
# is <-- here : cop
# here <-- root : root
# ! <-- here : punct
The first element in each line is a dependent word, the second element is its head, and the last one is the type of relation. Mind the sent.words[word.head-1].text. If you recall the structure of doc.sentences shown earlier, you can notice that the id of each word starts from 1, while the indexing of elements in Python lists starts with 0. So, we should subtract 1 from each head id to get the right headword for each dependent word.
As for the dependencies, det refers to the relation between a determiner (the dependent word, here the definite article) and the noun it modifies (the head). You can find more information about the dependency types in the Universal Dependency Relations section.
In the example above, if the word root appears in the column with the head values, it means that the analyzed word is treated as the root of the sentence.
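Stanza also has a convenience method that prints the same information as (word, head, relation) triples (a short sketch; the output shown is illustrative):
# Print the dependency triples of the first sentence
doc.sentences[0].print_dependencies()
# ('The', 2, 'det')
# ('cat', 4, 'nsubj')
# ('is', 4, 'cop')
# ('here', 0, 'root')
# ('!', 4, 'punct')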
Named Entity Recognition
In Stanza, we can also perform named entity recognition:
doc = nlp('London is the capital of Great Britain.')
for ent in doc.ents:
print(ent.text, ':', ent.type)
# London : GPE
# Great Britain : GPE
As you can see, Stanza correctly identifies London and Great Britain as geopolitical entities (GPE). We use ent.text to print a named entity itself and ent.type to specify the named entity class. If you want to find out about other entity types, read the Available NER Models section.
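Besides document-level entities, each token carries its own NER tag in a BIOES-style scheme, where O marks tokens outside any entity (a minimal sketch; the exact tags depend on the model):
# Inspect token-level NER tags of the first sentence
for token in doc.sentences[0].tokens:
    print(token.text, ':', token.ner)
# Illustrative output:
# London : S-GPE
# is : O
# the : O
# capital : O
# of : O
# Great : B-GPE
# Britain : E-GPE
# . : O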
Sentiment Analysis
Sentiment analysis is used to classify opinions as negative, neutral, or positive. In Stanza, the result of sentiment analysis is represented by 0, 1, or 2, respectively:
doc = nlp('Yesterday I saw the film. It was awful.')
for sentence in doc.sentences:
print(sentence.sentiment)
# 1
# 0
In the example above, we can see that Stanza classified the first sentence as neutral, while the second one is considered to contain negative information.
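If you prefer readable labels to numeric codes, you can map them yourself (a small sketch based on the 0/1/2 convention described above):
# Map numeric sentiment codes to human-readable labels
labels = {0: 'negative', 1: 'neutral', 2: 'positive'}
for sentence in doc.sentences:
    print(sentence.text, '-->', labels[sentence.sentiment])
# Yesterday I saw the film. --> neutral
# It was awful. --> negative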
Conclusion
In this topic, we have explained how to work with Stanza. So far, we have learned:
how to install Stanza and download language models;
the common and distinct features of Stanza and SpaCy;
how to implement the basic procedures of NLP.
If you want to learn more about Stanza, read the documentation on the official site.
You can find more on this topic in Mastering Stemming and Lemmatization on Hyperskill Blog.