
In this topic, we are going to learn more about text normalization, one of the steps of text preprocessing. Let's imagine that we have some text and want to count all instances of the verb "play". Sounds easy, right? What about word forms like "played", "plays", or "playing"? They are all forms of a single verb. We can count them manually if our text is short, but with big data, it is just not possible. This is where text normalization (or word normalization) steps in. The main idea is to reduce different forms of one word to a single form. With such an algorithm, all forms like "plays", "playing", or "played" will be changed to "play".

There are two approaches to text normalization: stemming and lemmatization. Both are widely used in information retrieval tasks, search engines, topic modeling, and other NLP applications. In the upcoming sections, we will discuss the differences between the approaches, as well as their implementations in the NLTK library.

Note that before stemming or lemmatization, it is better to tokenize your text and get rid of digits and punctuation marks. Otherwise, most algorithms will recognize "play!" not as "play", but rather as an unknown word, so it will not be processed correctly.
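
As a quick illustration, here is a minimal sketch of such preprocessing with NLTK; the sample sentence is made up, and you may need to download the 'punkt' tokenizer models once with nltk.download('punkt').

import nltk

nltk.download('punkt')  # tokenizer models, needed once

text = "She played 2 games and plays again today!"
tokens = nltk.word_tokenize(text)
# keep only alphabetic tokens, dropping digits and punctuation
words = [token.lower() for token in tokens if token.isalpha()]
print(words)  # ['she', 'played', 'games', 'and', 'plays', 'again', 'today']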

Word Stem vs Lemma

A word stem is the part of an inflected word that carries its lexical meaning. It is the base form shared by all forms of one lexeme, and it might not be a real word. In other words, a word stem is a word form with no affixes. The term is often used interchangeably with the word root.

In most cases, a word stem doesn't change during inflection, but there are exceptions in different languages, such as irregular verbs in English.

A lemma is a valid word form, sometimes called a dictionary form or a canonical form; it is the form of the word that you see in a dictionary. For example, the words "did", "done", "doing", and "do" have one common lemma: "(to) do".

Adjectives, nouns, and pronouns are generally lemmatized as the nominative singular form.

Verbs in most European languages (French, Russian, Italian) and many others (Persian) use their infinitive forms as lemmas, but be careful, it is not a universal rule. For example, Latin dictionaries display verbs as 1st person singulars in the present tense. Let's check it on a particular Latin verb: "credit" is a 3rd person singular form of the present tense verb (meaning "He believes"), the infinitive is "credere" (meaning "to believe"), but the lemma is "credo", a 1st person singular form of the present tense (meaning "I believe"). This rule is also true for Ancient Greek verbs, though many dictionaries give six principal forms of any verb.

Lemmas are very important for highly inflected languages, like Latin, Russian, and Hungarian. In English, lemmas may be useful for irregular verbs since most stemmers cannot identify stems in forms like "flew".

Stemming in NLTK

A stem is the most important part of the word, and other word parts (affixes) are added to it. For instance, if we take "play" and add the affix "-ed" to it, we get the past form of the verb.

There are many ways to carry out stemming, and the stemmers in NLTK and other NLP libraries are based on a handful of them. The first one is simple suffix-stripping. This algorithm relies on a list of suffixes typical for a given language. If a stemmer detects that "played" is a verb and knows that "-ed" is a common English ending, it can simply strip this ending off. Things get trickier with the verb "increased": the stemmer will detect the "-ed" ending, but after stripping it you'll get "increas". Some approaches don't require the resulting base form to match a real word, and then it's okay; the Snowball stemmer, for example, works this way. Alternatively, a stemmer could be required to output a real word form. In that case, we keep a list of all valid base forms: the stemmer deletes the last letter, checks whether the resulting form matches a word from the list, and repeats this cycle until it gets a real word. In the example above, such a stemmer would delete just one letter from the right and stop at "increase".
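
To make the two variants above more concrete, here is a toy, purely illustrative sketch; it is not how NLTK's stemmers are implemented, and the suffix and word lists are made up.

KNOWN_WORDS = {'play', 'increase'}  # a toy list of valid base forms
SUFFIXES = ['ing', 'ed', 's']       # a toy list of English suffixes

def naive_stem(word):
    # plain suffix-stripping: remove the first matching suffix
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def dictionary_stem(word):
    # delete the last letter and check the list of base forms,
    # repeating the cycle until a known word (or nothing) is left
    stem = word
    while stem and stem not in KNOWN_WORDS:
        stem = stem[:-1]
    return stem or word

print(naive_stem('played'))          # play
print(naive_stem('increased'))       # increas
print(dictionary_stem('increased'))  # increase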

More elaborate stemmers may replace one common ending with another: for example, the noun "friendliness" can be stemmed as "friendly" if you replace the ending "-liness" with "-ly". Most stemmers, though, are not so advanced, so you'll end up with just "friendli". This approach is very close to another, more complex approach to text normalization: lemmatization.

Let's see how to carry it out using NLTK. It has different algorithms for stemming and we will learn how to use them. First, we need to import the library:

import nltk

For the English language, we normally use the Porter stemmer and the Lancaster stemmer. You can find these and some other stemming algorithms in the nltk.stem module.

The Porter stemmer is the earliest and the most popular algorithm for this task; it works only for English. To use it, we need to import the PorterStemmer class from the nltk.stem module and then create an object of this class. After that, we call the stem() method and pass the word as an argument:

from nltk.stem import PorterStemmer


porter = PorterStemmer()
porter.stem('played')   # play
porter.stem('playing')  # play

The Snowball stemmer can be seen as an improvement over the original Porter stemmer as it gives slightly better results. The SnowballStemmer class in NLTK also supports 13 non-English languages such as Spanish, French, Russian, German, Swedish, and others. To use this algorithm, we need to create a new instance of the class and specify the language.

from nltk.stem import SnowballStemmer


snowball = SnowballStemmer('english')
snowball.stem('playing')  # play
snowball.stem('played')   # play
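
If you are curious which languages are supported, the class exposes them as an attribute, and a stemmer for another language is created in the same way; the Russian word below is just an illustration.

from nltk.stem import SnowballStemmer


print(SnowballStemmer.languages)  # a tuple of supported language names

russian = SnowballStemmer('russian')
print(russian.stem('играли'))     # prints the stem of the Russian verb "играли" ("played")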

As we said earlier, the Snowball stemmer works better than Porter. Let's compare the examples below:

snowball.stem('generously')   # generous
porter.stem('generously')     # gener

snowball.stem('dangerously')  # danger
porter.stem('dangerously')    # danger

For "generously", the Porter stemmer removes not only the suffix "-ly" but also "-ous", just as both stemmers do for "dangerously". In this case, that is unnecessary and incorrect, so the Snowball stemmer provides the better result, "generous".

NLTK also has an implementation of the Lancaster (Paice-Husk) stemming algorithm. Using the Lancaster stemmer is much the same as before, but this time we import the LancasterStemmer class from the nltk.stem package:

from nltk.stem import LancasterStemmer


lancaster = LancasterStemmer()
lancaster.stem('played')       # play
lancaster.stem('playing')      # play
lancaster.stem('generously')   # gen
lancaster.stem('dangerously')  # dang

All these stemmers are quite similar, but the original Porter stemmer and the Snowball stemmer provide better results, while the Lancaster stemmer works faster. So, if you are working with really big text data and need to process it in a short time, use the Lancaster stemmer. If you need more accurate results, choose the Snowball or Porter stemmer.
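
If you want to compare them yourself, you can run all three stemmers on the same words; the expected results in the comment come from the examples above.

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer


stemmers = {
    'Porter': PorterStemmer(),
    'Snowball': SnowballStemmer('english'),
    'Lancaster': LancasterStemmer(),
}
for word in ['played', 'generously', 'dangerously']:
    for name, stemmer in stemmers.items():
        print(name, word, '-->', stemmer.stem(word))

# 'played' becomes 'play' for all three; 'generously' becomes
# 'gener' (Porter), 'generous' (Snowball), and 'gen' (Lancaster);
# 'dangerously' becomes 'danger' (Porter, Snowball) and 'dang' (Lancaster)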

Let's see how the Snowball stemmer works with different parts of speech. First, let's check adjectives.

adjs = [
    "effective",
    "dangerous",
    "careful",
    "monetary",
    "kind",
    "supportive",
    "rarer",
    "rarest",
]
for a in adjs:
    print(a, " --> ", snowball.stem(a))


#  effective  -->  effect
#  dangerous  -->  danger
#  careful  -->  care
#  monetary  -->  monetari
#  kind  -->  kind
#  supportive  -->  support
#  rarer  -->  rarer
#  rarest  -->  rarest

So, the Snowball stemmer reduces most adjectives to their base (often a noun), while comparatives, superlatives, and short adjectives like "kind" are left unchanged; note that "monetary" was turned into the non-word "monetari". Now let's check nouns:

nouns = [
    "wall",
    "handcraftsman",
    "reservoir",
    "airport",
    "foundation",
    "trichotillomania",
    "jewelry",
    "Frenchman",
    "chopper",
    "supercars",
    "men",
]
for a in nouns:
    print(a, " --> ", snowball.stem(a))

#  wall  -->  wall
#  handcraftsman  -->  handcraftsman
#  reservoir  -->  reservoir
#  airport  -->  airport
#  foundation  -->  foundat
#  trichotillomania  -->  trichotillomania
#  jewelry  -->  jewelri
#  Frenchman  -->  frenchman
#  chopper  -->  chopper
#  supercars  -->  supercar
#  men  -->  men

Most nouns are left untouched. The exceptions are the words with the endings "-tion", "-s", and "-y" ("foundation", "supercars", "jewelry"), and only the removal of "-s" was actually relevant. Now, let's check verbs:

verbs = ['driven', 'swallowed', 'chewing', 'got', 'are', 'blew', 'saw']
for a in verbs:
    print(a, ' --> ', snowball.stem(a))

#  driven  -->  driven
#  swallowed  -->  swallow
#  chewing  -->  chew
#  got  -->  got
#  are  -->  are
#  blew  -->  blew
#  saw  -->  saw

Most verbs are left unchanged too. The Snowball stemmer transformed only "swallowed" and "chewing", two obvious examples with the "-ed" and "-ing" endings.

You can learn more about the non-English stemmers available in NLTK on nltk.org.

Lemmatization in NLTK

Now let's talk about lemmatization. Even though it may seem similar to stemming at first sight, these algorithms work differently. Stemmers just remove affixes, while lemmatizers act more like people: they analyze the word, its context, and its part of speech, and then give the answer. The result is always a real word in its dictionary form, called a lemma. In general, lemmatizers rely on dictionaries (or corpora) when looking for lemmas.

To use the lemmatizer from the NLTK library, you need to make sure that you have access to WordNet; you can get it by running nltk.download('wordnet').
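
In code, that is a one-time download; depending on your NLTK version, the omw-1.4 resource may also be required by the lemmatizer.

import nltk

nltk.download('wordnet')   # the WordNet data used by the lemmatizer
nltk.download('omw-1.4')   # may also be needed in newer NLTK versions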

WordNet is a large lexical database of the English language, which the NLTK lemmatizer uses to look up lemmas. There is only one lemmatization algorithm in NLTK.

We need to import the WordNetLemmatizer class from the nltk.stem module and create an instance of the class. We use the method lemmatize() that takes a word we want to lemmatize as an argument.

from nltk.stem import WordNetLemmatizer


lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('playing')  # playing

As you can see, the word remained unchanged after lemmatization. As mentioned above, lemmatizers need to know the context or the part of speech of the word. The default part of speech here is a noun, and, as a noun, the word 'playing' is its own lemma. In the small chart below, you can find the part-of-speech tags used in WordNet:

Part of speech    Tag
Noun              n
Verb              v
Adjective         a
Adverb            r

So, we just need to assign the tag that corresponds to the part of speech for our word.

lemmatizer.lemmatize('playing', pos='v')  # play
lemmatizer.lemmatize('plays')             # play

When we lemmatize a whole text, we cannot manually tag all the words, so we need to define a function that will assign a part-of-speech tag to each word.

Note that the part-of-speech tags must be the same as in WordNet! If the tags you get from POS tagging do not correspond to the WordNet ones, you will need to convert them. For example, the default NLTK tagger (nltk.pos_tag()) uses Penn Treebank tags such as NN, which stands for "Noun, singular or mass", and VB, which stands for "Verb, base form". If you want to lemmatize your text, you need to create a function that converts these tags to the WordNet ones:

def get_wordnet_tags(pos):
    # convert Penn Treebank tags to WordNet tags
    if pos == 'NN':
        return 'n'
    elif pos == 'VB':
        return 'v'
    # and so on
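
A more complete version of this helper can map tags by their first letter, since all Penn Treebank verb tags start with V, adjectives with J, and adverbs with R. Below is a sketch of how the pieces fit together; the sample sentence is made up, and you will also need the NLTK tokenizer and tagger data (for example, nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')).

from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer


def get_wordnet_tags(pos):
    # map Penn Treebank tags to WordNet tags by their first letter
    if pos.startswith('J'):
        return 'a'
    elif pos.startswith('V'):
        return 'v'
    elif pos.startswith('R'):
        return 'r'
    return 'n'  # default to noun, just like the lemmatizer itself

lemmatizer = WordNetLemmatizer()
for token, tag in pos_tag(word_tokenize('She was playing with the mice')):
    print(token, '-->', lemmatizer.lemmatize(token, pos=get_wordnet_tags(tag)))

# expected: 'was' --> 'be', 'playing' --> 'play', 'mice' --> 'mouse';
# the other tokens should stay unchanged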

Now, let's check how WordNetLemmatizer processes different parts of speech:

words = [
    "effective",
    "dangerous",
    "careful",
    "monetary",
    "kind",
    "supportive",
    "rarer",
    "rarest",
]
for a in words:
    print(a, " --> ", lemmatizer.lemmatize(a, pos="a"))


#  effective  -->  effective
#  dangerous  -->  dangerous
#  careful  -->  careful
#  monetary  -->  monetary
#  kind  -->  kind
#  supportive  -->  supportive
#  rarer  -->  rare
#  rarest  -->  rare

Comparative and superlative adjectives were lemmatized correctly, and all other adjectives were left unchanged, as they should have been.

Next are nouns:

words = [
    "wall",
    "handcraftsman",
    "reservoir",
    "airport",
    "foundation",
    "trichotillomania",
    "jewelry",
    "Frenchman",
    "chopper",
    "supercars",
    "men",
]
for a in words:
    print(a, " --> ", lemmatizer.lemmatize(a, pos="n"))


#  wall  -->  wall
#  handcraftsman  -->  handcraftsman
#  reservoir  -->  reservoir
#  airport  -->  airport
#  foundation  -->  foundation
#  trichotillomania  -->  trichotillomania
#  jewelry  -->  jewelry
#  Frenchman  -->  Frenchman
#  chopper  -->  chopper
#  supercars  -->  supercars
#  men  -->  men

All nouns are left unchanged, although the plurals should have been reduced to their singular forms. It is also worth mentioning that the lemmatizer does not convert words to lowercase (see "Frenchman"), while the Snowball stemmer lowercased this noun. Finally, let's check lemmatization for verbs:

words = ['driven', 'swallowed', 'chewing', 'got', 'are', 'blew', 'saw']
for a in words:
    print(a, ' --> ', lemmatizer.lemmatize(a, pos='v'))


#  driven  -->  drive
#  swallowed  -->  swallow
#  chewing  -->  chew
#  got  -->  get
#  are  -->  be
#  blew  -->  blow
#  saw  -->  saw

The lemmatizer couldn't process "saw" correctly, but it still works better than the Snowball stemmer. Just keep in mind that the algorithm may make mistakes with irregular verbs.

Stemming vs Lemmatization

What should you choose? The answer to this question mainly depends on the task and the language you are dealing with. There is no universal stemmer or lemmatizer for all languages — each language is unique and has specific rules. So you need to use different algorithms with different languages.

For some languages, both stemming and lemmatization give good results, but for others, it is better to opt for lemmatization. For instance, languages like Russian, Latin, Finnish, or Turkish have grammatical cases, meaning that words take different affixes depending on their role in a sentence. Here is an example from Latin: "rēx respondit" can be translated as "the king replied", and "rēgis fīlia" as "the daughter of the king". Both "rēx" and "rēgis" are forms of the noun "rēx", which means "king". Cutting off the affixes would give us two different forms, so it is better to apply lemmatization here. Of course, it is possible to use stemming for such languages, but the list of rules about which affixes should be removed in which cases is going to be pretty long and complex.

Also, if you need to get valid words after text normalization, go for lemmatization. Sometimes, different forms of one word look completely different, and we just cannot write rules for them. For instance, in English, there are irregular verbs (be, am, is, are), irregular plural nouns (goose/geese, mouse/mice), and irregular comparative and superlative adjectives (bad/worse/the worst). Lemmatizers will detect such cases and give the correct word as a result, while stemmers will not. Finally, do not forget about the resources. Lemmatizers usually scan a big dictionary or rely on corpora to find lemmas, which can take a lot of time. If you need to normalize text faster, stemming is the right choice.
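
For instance, a quick check with the NLTK tools from above illustrates this on suppletive and irregular forms; the exact stemmer output may vary, but it will not recover the dictionary forms, while the WordNet lemmatizer should.

from nltk.stem import SnowballStemmer, WordNetLemmatizer


snowball = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

for word, pos in [('geese', 'n'), ('mice', 'n'), ('worse', 'a')]:
    print(word, snowball.stem(word), lemmatizer.lemmatize(word, pos=pos))

# the lemmatizer should return 'goose', 'mouse', and 'bad',
# while the stemmer cannot recover these dictionary forms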

Let's sum up the main points of using stemming and lemmatization.

Stemming:

  Pros:

  • works fast (good for big data)
  • gives good results for some languages (English)
  • does not require much memory

  Cons:

  • may give a stem that is not a real word

Lemmatization:

  Pros:

  • gives a valid word as a result
  • recognizes cases of suppletion

  Cons:

  • takes longer to process

Lemmatization in Spacy

Spacy provides only a lemmatizer. To use it, load the English language model (if you don't have it installed, you can get it by running python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load('en_core_web_sm')

Now, we can lemmatize a text. Let's try lemmatizing a list of adjectives. Mind that the input should be a raw string.

text = nlp('effective dangerous careful monetary kind supportive rarer rarest')

for word in text:
    print(word.text, ' --> ', word.lemma_)


#  effective  -->  effective
#  dangerous  -->  dangerous
#  careful  -->  careful
#  monetary  -->  monetary
#  kind  -->  kind
#  supportive  -->  supportive
#  rarer  -->  rarer
#  rarest  -->  rarest

As we see, the Spacy lemmatizer cannot process comparative and superlative adjectives. Now let's check nouns:

text = nlp('wall handcraftsman reservoir airport foundation trichotillomania jewelry Frenchman chopper supercars men')

for word in text:
    print(word.text, ' --> ', word.lemma_)


#  wall  -->  wall
#  handcraftsman  -->  handcraftsman
#  reservoir  -->  reservoir
#  airport  -->  airport
#  foundation  -->  foundation
#  trichotillomania  -->  trichotillomania
#  jewelry  -->  jewelry
#  Frenchman  -->  Frenchman
#  chopper  -->  chopper
#  supercars  -->  supercar
#  men  -->  man

Here the lemmatizer is impeccable: it managed to process both "men" and "supercars". Finally, let's see how Spacy lemmatizes verbs:

text = nlp('driven swallowed chewing got are blew saw')

for word in text:
    print(word.text, ' --> ', word.lemma_)


#  driven  -->  drive
#  swallowed  -->  swallow
#  chewing  -->  chewing
#  got  -->  get
#  are  -->  be
#  blew  -->  blow
#  saw  -->  see

Although Spacy couldn't lemmatize "chewing", this is still the best result for verbs so far.

Other implementations

NLTK is not the only library with implementations of text normalization algorithms for English; Spacy, which we used above, is one example.

Conclusion

In this topic, we have learned about text preprocessing and the role of text normalization in it, the difference between two main approaches (stemming and lemmatization), and how to implement some algorithms using NLTK. Let's recap:

  • Text normalization is an important step of text preprocessing. It reduces various word forms to one single form;

  • There are two approaches to text normalization: stemming removes affixes according to some rules and keeps the stem, while lemmatization analyzes the word and returns its lemma with the help of a dictionary;

  • Both stemming and lemmatization have their advantages and disadvantages. Stemming works faster than lemmatization but the latter is usually more precise and always returns a real word.

You can find more on this topic in Mastering Stemming and Lemmatization on Hyperskill Blog.
