
Lemmatization


Text normalization is a crucial step in preprocessing data to enhance the performance of machine learning models. One effective method of text normalization is lemmatization.

Lemmatization use cases

Lemmatization involves reducing words to their base or root form, which is known as a lemma. A lemma represents a valid word form and is often referred to as a dictionary form or canonical form. When you look up a word in the dictionary, the lemma is the form you typically find. For instance, the words "did," "done," "doing," and "do" all share the common lemma "(to) do."

Lemmatization plays a significant role in highly inflected languages like Latin, Russian, and Hungarian. Even in English, lemmas are useful for handling irregular verbs and different forms of adjectives. For example, the lemma for "went" is "go," and the lemma for "better" is "good."

Lemmatization in NLTK

To use the lemmatizer from the NLTK library, ensure that you have access to WordNet by executing nltk.download('wordnet'). WordNet is a comprehensive lexical database of the English language, which the NLTK lemmatizer relies on to look up lemmas. NLTK ships a single lemmatizer, the WordNetLemmatizer.

To begin, import the WordNetLemmatizer class from the nltk.stem module and create an instance of the class. Then, employ the lemmatize() method, providing the word you wish to lemmatize as an argument.

from nltk.stem import WordNetLemmatizer


lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('playing')  # playing

In the example above, the word remains unchanged after lemmatization. Typically, lemmatizers require knowledge of the word's context or part of speech. In this case, the default part of speech is a noun, and as such, the word 'playing' is already its own lemma. Refer to the chart below for the part-of-speech tags available in WordNet:

Part of speech   Tag
Noun             n
Verb             v
Adjective        a
Adverb           r

To lemmatize a word correctly, pass the tag that matches its part of speech as the pos argument:

lemmatizer.lemmatize('playing', pos='v')  # play
lemmatizer.lemmatize('plays')             # play

When lemmatizing a text, manually tagging every word is not feasible. Therefore, it is necessary to define a function that automatically assigns a part-of-speech tag to each word.

It's important to note that the part-of-speech tags must be consistent with those used in WordNet. If the tags obtained from POS-tagging do not align with WordNet's tags, a conversion is necessary. For instance, the default NLTK tagger (nltk.pos_tag()) uses Penn Treebank tags such as "NN" for "noun, singular or mass" and "VB" for "verb, base form". To lemmatize your text, you'll need a function that converts these tags to the WordNet tags:

def get_wordnet_tags(pos):
    if pos == 'NN':
        return 'n'
    elif pos == 'VB':
        return 'v'
    # and so on
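Since all Penn Treebank verb tags begin with "V", adjective tags with "J", and adverb tags with "R", this conversion is usually written as a prefix check rather than an exhaustive if/elif chain. A sketch (the helper name penn_to_wordnet is illustrative):

```python
def penn_to_wordnet(tag):
    """Convert a Penn Treebank POS tag to a WordNet POS tag."""
    if tag.startswith('J'):
        return 'a'  # JJ, JJR, JJS -> adjective
    if tag.startswith('V'):
        return 'v'  # VB, VBD, VBG, VBN, VBP, VBZ -> verb
    if tag.startswith('R'):
        return 'r'  # RB, RBR, RBS -> adverb
    return 'n'      # everything else defaults to noun


print(penn_to_wordnet('VBG'))  # v
print(penn_to_wordnet('NNS'))  # n
```

Defaulting to the noun tag mirrors the lemmatizer's own default and keeps the function total, so it never fails on an unexpected tag.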

Now, let's examine how WordNetLemmatizer handles various parts of speech:

words = ['effective', 'dangerous', 'careful', 'monetary', 'kind', 'supportive', 'rarer', 'rarest']
for word in words:
    print(word, ' --> ', lemmatizer.lemmatize(word, pos='a'))


#  effective  -->  effective
#  dangerous  -->  dangerous
#  careful  -->  careful
#  monetary  -->  monetary
#  kind  -->  kind
#  supportive  -->  supportive
#  rarer  -->  rare
#  rarest  -->  rare

The comparative and superlative forms were correctly reduced to "rare," while the other adjectives are already in their base form, so they remained unchanged, which is the expected behavior.

Lemmatization in spaCy

spaCy performs lemmatization as part of its standard processing pipeline rather than through a separate lemmatizer class. To use it, you first need to download the corresponding language model and then load it:

import spacy

# Requires the model to be downloaded once beforehand:
# python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

Now, you can proceed to lemmatize a text. Let's try the same list of adjectives; note that spaCy expects raw text rather than a list of tokens, since it performs tokenization itself.

text = nlp('effective dangerous careful monetary kind supportive rarer rarest')

for word in text:
    print(word.text, ' --> ', word.lemma_)


#  effective  -->  effective
#  dangerous  -->  dangerous
#  careful  -->  careful
#  monetary  -->  monetary
#  kind  -->  kind
#  supportive  -->  supportive
#  rarer  -->  rarer
#  rarest  -->  rarest

Lemmatization pros and cons

Although lemmatization is a highly beneficial technique for text normalization, it does have drawbacks. The main one is that it is slower than stemming. Furthermore, lemmatization can struggle with ambiguous words that have multiple possible lemmas.

Additionally, the development of accurate and efficient lemmatization algorithms for every language is challenging due to the linguistic rules and complexities that vary across different languages. However, lemmatization offers the advantage of obtaining valid words as results and recognizing cases of suppletion in irregular verbs.

Conclusion

This topic covered text preprocessing and the significance of text normalization. It helped you explore the concept of lemmatization and its implementation using the NLTK and spaCy libraries. Lemmatization is especially valuable in highly inflected languages and in cases involving irregular verbs or adjectives. It transforms words to their base form (lemma) and offers several advantages, including enhanced text analysis, improved data quality, a reduced vocabulary size, and better language comprehension.

You can find more on this topic in Mastering Stemming and Lemmatization on Hyperskill Blog.
