
Mastering Stemming and Lemmatization

For centuries, humans have dreamed of creating a thinking being. This desire can be traced back to ancient stories like the Greek myth of Talos and the Jewish legend of the Golem. In the modern age of computers, this ambition has resurfaced in the form of attempts to develop a computer program that can communicate with people naturally and conversationally. From the iconic Eliza program, through the various chatbots we interact with daily for customer support, to last year’s ChatGPT sensation (sometimes called the killer of search engine optimization), the creators of these systems strive to make them pass the Turing test with flying colors. This means the program should be so convincing that a person interacting with it cannot tell whether they are chatting with another human or with a computer program. To achieve this, all such conversational programs must use algorithms and data structures – including the stemmers discussed in this article – developed within a field of knowledge called NLP: Natural Language Processing.

Natural Language Processing

Definition

Natural Language Processing, or NLP, is an interdisciplinary field of science focusing on the interaction between computers and human language. NLP involves the development of algorithms and models that enable computers to understand, interpret, and generate human language in a meaningful and helpful way. The goal is to allow machines to comprehend and respond to human language much as humans do and to perform tasks such as language translation, sentiment analysis, speech recognition, information retrieval, question answering, text summarization, and more.

To achieve this goal, NLP encompasses various techniques and methodologies drawn from linguistics and computer science, including statistical and machine learning approaches, deep learning, natural language understanding (NLU), natural language generation (NLG), and computational linguistics.  

Depending on the task and requirements, the steps may vary, but in typical language applications – including stemming – we can encounter an NLP processing pipeline consisting of the following stages:

Typical NLP pipeline

Each of these stages can be divided into individual steps. For example, for the cleaning and preprocessing stage, we can distinguish the following (a short sketch of the first few steps follows the list):

  • Removing irrelevant characters: HTML tags, emojis, potential random characters, etc.
  • Unicode variant characters and ligatures normalization
  • Case normalization (e.g., lowercasing)
  • Sentence segmentation
  • Word tokenization
  • Part of Speech (POS) Tagging
  • Stemming/Lemmatization
  • Stop words identification
  • Named Entity Recognition (NER)
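As a rough illustration of the first few cleaning and preprocessing steps, here is a minimal sketch (the regex-based tag removal and the sample input are our own simplifications, not taken from the article’s source text):

import re
import unicodedata

def clean_text(raw: str) -> str:
    # A minimal sketch of the first cleaning/preprocessing steps listed above
    text = re.sub(r'<[^>]+>', ' ', raw)         # remove HTML tags (naive approach)
    text = unicodedata.normalize('NFKC', text)  # normalize Unicode variant characters and ligatures
    text = text.lower()                         # case normalization
    return re.sub(r'\s+', ' ', text).strip()    # collapse leftover whitespace

print(clean_text('<p>ﬁnding   ﬂowers</p>'))     # -> 'finding flowers'

The tokenization, stop word, and stemming/lemmatization steps are covered with dedicated NLTK tools later in this article.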

In the rest of this article, we will focus on the Stemming and Lemmatization phase.

Stemming and Lemmatization

In essence, the idea behind both operations is the same:

  • to reduce words to their base (canonical/dictionary) forms,
  • to reduce the vocabulary and simplify text processing.

To better understand this concept and distinguish between the two processes, let’s begin by introducing specific linguistic terms (which should not be confused with their usage in other fields of study) as defined in computational linguistics.

  • Inflection is the process of word formation to express grammatical information
  • Stem is the uninflected part of a word – the part of the word that never changes,
    even when the word is morphologically inflected
  • Affix is a morpheme attached to a word stem to form a new word or word form.
    Affixes added at the beginning are called prefixes. In the middle—infixes,
    and at the end suffixes (sometimes the prefixes and suffixes are both named adfixes, in contrast to infixes)
  • Lexeme is a single word that can have various inflected forms. These forms are still considered variations of the same underlying word, and together they convey a unit of lexical meaning. Inflectional forms connect related words to this basic abstract unit of meaning.
  • Lemma is the canonical/dictionary or citation form of a set of word forms

 The stem is usually not distinct from the lemma in languages with very little inflection. Still, word stems may rarely or never occur independently in languages with more complicated rules.

For example, in English, the stem of a group of words: fishing, fished, and fisher is fish, which is a separate word too, but for the group of words: argue, argued, argues, arguing, and argus, the Porter algorithm finds the stem -argu-.

As we can see, both these stems are obtained by simple, some could say, rote removal of the -ing, -ed, -er, -e, -es, or -s suffixes.

On the other hand, a lemma refers to the particular form chosen by convention to represent the lexeme, and the process of choosing that word is at least partly arbitrary.

For example, for nouns, the singular, non-possessive, shorter form is often chosen (e.g., mouse rather than mice), and for verbs – the infinitive.

After some theory, let’s get our hands dirty and move on to practical exercises.

New PyCharm Project for NLP Tasks

The NLP tasks we’ll do will be largely interactive – a bit more interactive than regular Python projects in the PyCharm IDE. Jupyter Notebooks work great as an environment for such interactive tasks, and of course, PyCharm provides support for them. At the stage of creating a new project, the easiest way to get them is to create a new scientific project instead of a regular Python project: in the New Project window, choose the Scientific project type from the list on the left, specify the name and location of this new project (StemmAndLemma, in this case), and choose the management system used to create the project’s virtual environment. The Conda environment would be the best choice for real scientific projects because it comes with many packages useful for data processing and visualization ready out of the box. Still, most of them would be redundant here, so we can prepare our environment using the Virtualenv manager.

PyCharm new project for NLP

After creating a new scientific project, in the PyCharm Project window on the left (accessible, for example, via the Alt-1 keyboard shortcut), we’ll see the directory structure adapted to such tasks – with the pre-created data, models, and notebooks folders.

To create a new notebook, right-click the notebooks folder, choose New 🡪 Jupyter Notebook from the context menu, and in the next window, enter the name for this new notebook. After performing all these steps, you’ll see a new *.ipynb file in the notebooks folder:

 


This article won’t explain Jupyter Notebooks in-depth, but you can find more information in the Further Reading section. Simply put, a notebook contains cells that allow you to include markdown text or code written in various programming languages such as Python, R, or Julia. You can run the code and display its output. Plus, you can easily share your notebooks with others.


Let’s get back to our task. If you encounter a warning that Jupyter is not installed, click on the Install Jupyter button. Once it’s done, let’s input the text we will work on into the notebook’s cell. For instance:
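(The screenshot of that cell is not reproduced here. It defines a variable – called source_txt_EN in the later snippets – holding the passage. The exact wording in the screenshot may differ slightly; the text below is the classic McDevitte/Bohn translation, whose tokens match the outputs shown later in this article.)

source_txt_EN = (
    "All Gaul is divided into three parts, one of which the Belgae inhabit, "
    "the Aquitani another, those who in their own language are called Celts, "
    "in our Gauls, the third. All these differ from each other in language, "
    "customs and laws. The river Garonne separates the Gauls from the Aquitani; "
    "the Marne and the Seine separate them from the Belgae. Of all these, "
    "the Belgae are the bravest, because they are furthest from the civilization "
    "and refinement of [our] Province, and merchants least frequently resort to "
    "them, and import those things which tend to effeminate the mind; and they "
    "are the nearest to the Germans, who dwell beyond the Rhine, with whom they "
    "are continually waging war; for which reason the Helvetii also surpass the "
    "rest of the Gauls in valor, as they contend with the Germans in almost daily "
    "battles, when they either repel them from their own territories, or "
    "themselves wage war on their frontiers."
)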


This text is the English translation of the beginning of Julius Caesar’s book The Gallic Wars, written during the Roman invasion of Gaul in the 1st century BC.

Because Stemmers and Lemmatizers generally work on the words, not entire texts, we must first split this source text into individual words.

The exact workflow depends on the task we have to perform. For example, if we need to know the most common words used to start and end sentences, we have to split the text into sentences first, not words.

“What’s the problem?” you may think – just use the Python split() function. Not exactly…
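As a quick check (assuming the source_txt_EN variable sketched above holds our passage), the naive splitting looks like this:

raw_words = source_txt_EN.split()   # naive whitespace-based splitting
print(raw_words)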

This function returns a list of somewhat contaminated words – with punctuation marks glued to them, like here:

[ …, 'inhabit,', …, 'another,', …, 'Celts,', …, 'Gauls,', …, 'third.', …, 'language,',
  …, 'laws.', …, 'Aquitani;', …, 'Belgae.', …, 'these,', …, 'bravest,', …, '[our]',
  'Province,', …, 'them,', …, 'mind;', …, 'Germans,', …, 'Rhine,', …, 'war;', …, 
  'valor,', …, 'battles,', …, 'territories,', …, 'frontiers.' ]

So… we have to use something else. And the answer is tokenization.

Text Tokenization

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. As already mentioned above, we can choose between two main variants:

  • Sentence tokenization, and
  • Word tokenization

Both can be done with the help of the functions provided by the NLTK – Natural Language Toolkit – package, the de facto standard in the field of NLP. We need to have this package installed in our virtual environment to access its tools. If it is not present yet, probably the easiest way to add it is to write the import in one of the cells of our Jupyter Notebook opened in the PyCharm IDE:

import nltk

The name of the package will be underlined in red, and the message No module named ‘nltk’ will be displayed as an error description along with the solution proposed:

 


Install package nltk – click that suggestion or press Alt-Shift-Enter.

To use the available tokenizers, we can choose one of two ways:

  • From the nltk.tokenize package import the class of the selected tokenizer – e.g., NLTKWordTokenizer or RegexpTokenizer, if necessary configure it and/or adapt it to your needs, and then use its “tokenize” function, or
  • Use one of the preconfigured wrapper functions:
    sent_tokenize, word_tokenize, or wordpunct_tokenize
    The first splits the text into sentences and the others into words.
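As a quick illustration of both approaches (a sketch of our own; the regular expression pattern and the sample sentences are just examples):

from nltk.tokenize import RegexpTokenizer, sent_tokenize, word_tokenize

# First approach: import and configure a tokenizer class yourself
regexp_tok = RegexpTokenizer(r'\w+')   # keep only alphanumeric "words", drop punctuation
print(regexp_tok.tokenize("All Gaul is divided into three parts."))

# Second approach: use the preconfigured wrapper functions
# (they may require downloading the punkt resource first, e.g., nltk.download('punkt'))
print(sent_tokenize("All Gaul is divided into three parts. The Belgae inhabit one of them."))
print(word_tokenize("All Gaul is divided into three parts."))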

We’ll use the second approach, as it is much simpler. But which of the functions should we use?

The word_tokenize and wordpunct_tokenize functions differ in computational speed, accuracy, and the tokenizer class used under the hood:

  • The wordpunct_tokenize function under the hood uses the RegexpTokenizer,
    which is fast but doesn’t delve into various linguistic intricacies,
  • The word_tokenize function uses NLTK’s recommended word tokenizer, an improved TreebankWordTokenizer, and PunktSentenceTokenizer for the specified language. It results in better accuracy but requires more time to complete the task.

To compare their results, we can use set operations:

wordpunct_tokens = set(nltk.tokenize.wordpunct_tokenize(source_txt_EN))
 word_tokens = set(nltk.tokenize.word_tokenize(source_txt_EN))
 word_tokens_with_lang = set(nltk.tokenize.word_tokenize(source_txt_EN, 'english'))

 print(wordpunct_tokens == word_tokens)
 print(word_tokens_with_lang == word_tokens)

And how can we check the execution speed? Jupyter Notebook offers two “magic commands” that measure the execution time of one line of code or of a whole cell – %timeit and %%timeit, respectively.

To measure the time of execution of some command, it is enough to precede it in the notebook cell with the “%timeit” magic command – like here:
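(The screenshot of that cell is not reproduced here; the timed cell looks roughly like this:)

%timeit set(nltk.tokenize.wordpunct_tokenize(source_txt_EN))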


As we can see below the cell with the timed command, the measured execution time is an average over some number of runs (7 by default), each consisting of a batch of many loop iterations (10,000 by default).

The wordpunct_tokenize function (or the RegexpTokenizer under the hood) is relatively fast, so taking the measurement does not last very long – about 4.5 seconds – but with more time-consuming commands, it may take forever just to measure their execution time. To avoid that, we can customize the %timeit magic command with its parameters:

  • -r – the number of runs, and
  • -n – the number of loop iterations per run

In our case, the values -r 5 and -n 2000 seem reasonable.

So, let’s try to measure the execution time of both versions of the “word_tokenize” function:
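(Again, the screenshots are not reproduced here; the timed cells look roughly like this:)

%timeit -r 5 -n 2000 set(nltk.tokenize.word_tokenize(source_txt_EN))
%timeit -r 5 -n 2000 set(nltk.tokenize.word_tokenize(source_txt_EN, 'english'))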


As we can see, the word_tokenize function (with the TreebankWordTokenizer and PunktSentenceTokenizer classes under the hood) is about fifteen times slower than the wordpunct_tokenize function. In our naive case, it doesn’t matter, but it can make a big difference with a lot of source text to process. Therefore, it is worth being aware of this distinction.

OK, but what do we get after word tokenization?

We expect the uncontaminated separate words, but let’s confront our expectations with reality (it’s a good practice to check the results of the processing steps; it helps to catch possible errors at the earliest possible stage).

To obtain a clearer perspective, we can group the words by their length and sort each group alphabetically:

words_by_len = {}
for w in word_tokens:
    words_by_len.setdefault(len(w), []).append(w)
print("Len  Num  Words")
for k in sorted(words_by_len.keys()):
    v = words_by_len[k]
    print(f"{k:02} : {len(v):02} : {sorted(v, key=lambda x: x.lower())}")

After running this code, we get the following:

Great! In contrast to the naïve text splitting, the words are not contaminated now.

Do we need all these words in further processing steps?

From the point of view of these further processing steps, is it better to leave these words as they are, or should we transform them all to lowercase versions?

Stop Words Removal

For example, all the words of length 1 are separate punctuation marks. We don’t need them anymore.

Many short words, including those with only 2 or 3 letters, are commonly called stop words and do not carry significant meaning. They can be easily removed from the text.

To utilize the stop words approach, we must import the stop word list from the NLTK corpus before its first use (if the corpus is missing, nltk.download('stopwords') will fetch it). Please note that this list is highly dependent on the language used.

from nltk.corpus import stopwords
stopwords_EN = stopwords.words('english')

Of course, we can expand this list with our custom stop words or replace it completely with others.

We can get the list of languages with predefined stop words available in the nltk package using the command print(stopwords.fileids()), and we can review and install additional corpora (along with other additional nltk resources) with the command nltk.download().
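In a notebook cell, that could look like this:

print(stopwords.fileids())   # languages with predefined stop word lists
# nltk.download()            # opens the interactive NLTK downloader for additional resources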

Right now, the predefined list of stop words is sufficient for our needs…

Let’s apply that knowledge and clean up the list of words to be processed a bit:

word_tokens_cleaned = [
    w for w in word_tokens if len(w) > 1 and w.lower() not in stopwords_EN
]

As a result, in place of the 98 original word tokens, we get a shortened list of 60 words
(here grouped by length with the help of code similar to the one used before):


Great, we’ve removed all the meaningless filler words, and only the meaningful ones remain.

We can finally apply the stemming and lemmatization operations to our list of words.

Stemming

As previously discussed in this article, stemming refers to the technique of reducing inflected or derived words to their common root or base form, which may not necessarily be identical to the morphological root of the word.

The NLTK toolkit offers various ready-to-use stemmers in the nltk.stem module. Some are universal; some are dedicated to specific languages. For example, nltk.stem.rslp.RSLPStemmer is designed for Portuguese, nltk.stem.cistem.Cistem for German, and nltk.stem.isri.ISRIStemmer implements the ISRI stemming algorithm for Arabic.

For the English language, we can take a look at the stemmers:

  • LancasterStemmer
  • RegexpStemmer
  • PorterStemmer
  • SnowballStemmer

All these stemmer classes are based on the abstract class StemmerI defined in the nltk/stem/api.py file. This abstract class declares only one abstract method, stem:

@abstractmethod
 def stem(self, token):

The actual stemmers are defined in concrete classes; once such a class is instantiated, it can be used to stem the words or tokens passed to it. Let’s examine each of these stemmers briefly.

The LancasterStemmer

This stemmer is based on the Lancaster (Paice/Husk) stemming algorithm. It is rule-based and uses almost 120 predefined stemming rules stored in the default_rule_tuple constant defined in the nltk/stem/lancaster.py file. You can use them or provide your own custom rules via the optional rule_tuple argument. The second optional argument, strip_prefix_flag (which defaults to False), tells the stemmer whether the prefixes of the stemmed words should be kept or removed. For example, the LancasterStemmer created as:

from nltk.stem import LancasterStemmer

lcst = LancasterStemmer()

Applied to the word kilometer:

print(lcst.stem('kilometer'))

produces the output kilomet – only the suffix is stripped – while created and used as:

lcst2 = LancasterStemmer(strip_prefix_flag=True)
print(lcst2.stem('kilometer'))

gives us only met as the result of stemming.

In general, the LancasterStemmer with the predefined rules is very aggressive when stripping affixes – the stems it produces are shorter than those obtained with other stemmers, but this may lead to errors, as we’ll see in a while.

The RegexpStemmer

The RegexpStemmer requires some configuration before use: you must provide a regular expression describing the parts that are to be removed from the stemmed words. For example, when you define that stemmer as:

from nltk.stem import RegexpStemmer
rgst = RegexpStemmer('^c|ing$|s$|e$|able$')

and use it to stem the words cars, has, and have, like here:

print(rgst.stem('cars'));  print(rgst.stem('has'));  print(rgst.stem('have'))

you get the result:

"ar", "ha", and "hav"

This is because the initial parameter passed to the RegexpStemmer class constructor contains five sections separated by the OR operator (|) of a regular expression. Therefore, any parts of the stemmed words that match these sections will be eliminated.

  • The ^c section matches the letter c located at the beginning of the word,
  • And the …$ sections match the phrases located at the end of stemmed words –
    here the -ing, -s, -e, and -able, respectively

When defining the RegexpStemmer, you can also set the minimum length a word must have for stemming to be applied. For instance:

rgst = RegexpStemmer('^c|ing$|s$|e$|able$', min=4)

If you use this version of RegexpStemmer to stem the exact words as before – cars, has, and have – only the words cars and have will be stemmed, and the word has will be left as it is because its length is less than four characters. So, the result in this case will be:

ar, has, and hav

The RegexpStemmer is the most flexible, but you must have good regular expression knowledge and experience to use it efficiently.

The PorterStemmer and The SnowballStemmer

These two stemmers utilize Martin Porter’s stemming algorithm, which was developed in 1980, well before the creation of the Python language. The original algorithm is now considered frozen and cannot be further modified. Presently, two primary implementations of this algorithm are widely used.

  • The stemmer known as nltk.stem.PorterStemmer includes improvements to the original algorithm proposed by the NLTK package contributors; they are enabled by the default parameter mode=NLTK_EXTENSIONS
  • The stemmer sometimes referred to as Porter2 but better known as the SnowballStemmer, available in the NLTK package as nltk.stem.SnowballStemmer (which requires the mandatory language parameter) or as its localized versions:
     EnglishStemmer, SpanishStemmer, FrenchStemmer, etc.
    The optional ignore_stopwords parameter of the SnowballStemmer and its localized versions does not remove the stop words from the result; when set to True, it only excludes them from stemming, and the stemmer returns them unmodified (its default value is False, which means that stop words are stemmed like any other word). A short example follows this list.
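Here is a small sketch of both stemmers in action (the example word and the expected outputs are our own illustration):

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()                          # NLTK_EXTENSIONS mode by default
snowball = SnowballStemmer('english')             # the language parameter is mandatory
snowball_sw = SnowballStemmer('english', ignore_stopwords=True)

print(porter.stem('having'))       # expected: 'have'
print(snowball.stem('having'))     # expected: 'have'
print(snowball_sw.stem('having'))  # 'having' is an English stop word, so it is returned unchanged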

The Duel of The Stemmers

OK, now that we’ve got acquainted with the main stemmers available in the NLTK package, it’s time to compare them against each other. To the tournament lists, we invite:

  • The famous Stemmer of Lancaster,
  • The Stemmer of The House of Porter, and
  • The EnglishStemmer of The Snowball Kingdom

Unfortunately, The Stemmer of the Regexp Duchy was disqualified due to the overly demanding pre-configuration it requires to run.

First, we instantiate each of the Stemmers classes:

lc_st_f = nltk.LancasterStemmer(strip_prefix_flag=False)
lc_st_t = nltk.LancasterStemmer(strip_prefix_flag=True)
pt_st = nltk.PorterStemmer()
sb_en_st = nltk.stem.snowball.EnglishStemmer()

Then we can define the stemming function, which utilizes the stemmer passed as a parameter:

def stem(stemmer, words):
     return [stemmer.stem(w) for w in words]

We can use it to get the desired outcome. As the second parameter, we can utilize the word_tokens_cleaned list obtained after eliminating the stop words.

lc_stems_f  = stem(lc_st_f,  word_tokens_cleaned)
lc_stems_t  = stem(lc_st_t,  word_tokens_cleaned)
pt_stems    = stem(pt_st,    word_tokens_cleaned)
sb_en_stems = stem(sb_en_st, word_tokens_cleaned)

To present the obtained results in a nice tabular form, we can zip the individual lists into one list of tuples:

all_stems = sorted(
    zip(word_tokens_cleaned, lc_stems_f, lc_stems_t, pt_stems, sb_en_stems),
    key=lambda x: x[0].lower()
)

And then print it, for example, like that:

 print(" Original word : Lanc. F   Lanc. T   : Porter     Snowball")
 print("=" * 60)
 for s in all_stems:
     print(f" {s[0]:13} : {s[1]:9} {s[2]:9} : {s[3]:10} {s[4]}")

As a result, we get the table looking like this:


The first column contains the tokenized words from our source text with the stop words removed; the second column – the stems obtained with the LancasterStemmer with the default value of strip_prefix_flag (False); and the third column – the stems obtained with the same LancasterStemmer but with this optional parameter set to True. The 4th and 5th columns contain the results of the PorterStemmer and the EnglishStemmer from the Snowball stemmers family, respectively.

As we can see, all the stemmers we used lowercase the stems they output. In addition, for our source text, there’s no difference between the two variants of the LancasterStemmer output, and similarly, there’s no difference between the output of the PorterStemmer and the snowball.EnglishStemmer either.

These pairs of stemmers may produce different results, as we could see with the LancasterStemmer and the sample word kilometer. Still, the results within these pairs of stemmers are the same in our case.

In addition, we can observe that for 36 of the processed words (60%), all four stems are identical, and in 24 cases (40%), the stems produced by the Lancaster stemmer differ from those produced by the Porter/Snowball stemmers.

And what about the processing speed? We can use the already known %timeit magic command available in the Jupyter Notebook cells:
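(The timed cells are not reproduced as screenshots here; they look roughly like this, reusing the -r and -n values chosen earlier:)

%timeit -r 5 -n 2000 stem(lc_st_f, word_tokens_cleaned)
%timeit -r 5 -n 2000 stem(lc_st_t, word_tokens_cleaned)
%timeit -r 5 -n 2000 stem(pt_st, word_tokens_cleaned)
%timeit -r 5 -n 2000 stem(sb_en_st, word_tokens_cleaned)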

 


As we can see, all examined stemmers run equally fast (or slow), so the execution time is not the point.

So… which stemmer should we choose?

For this particular text – probably the PorterStemmer or the snowball.EnglishStemmer.

Why?

At least for me, the LancasterStemmer applied to the words taken from our source text seems too aggressive. For example, the following stems, in my opinion, may lead to errors or confusion at later processing stages:

  • cal obtained from the word called may lead to a mix-up with the word calendar,
  • germ taken here from the word Germans may lead to confusion with the word germ/ovule or germ/embryo, or even germs/bacteria, which are completely out of this fairy tale…
  • on as a stem of one may be confused with the stop word on,
  • sep derived from separates may easily be confused with September,
  • val taken from valor may be confused with value.

The corresponding stems produced by the PorterStemmer/EnglishStemmer seem less confusing.

Lemmatization

The WordNetLemmatizer

To avoid misunderstandings at later processing stages, let’s explore lemmatization as a foundation for further processing. Lemmatization provides us with the dictionary form of words, and unlike stemming, the resulting lemma is always a valid word. This is because the process uses a dictionary known as WordNet, which is accessible through the NLTK package. If you haven’t downloaded it yet, run the following command before the first use:

nltk.download('wordnet')

After that, we have to import and instantiate the lemmatizer class; there’s only one in the NLTK package:

from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

And now we can use its lemmatize function, for example, this way:

lemmas = [wnl.lemmatize(w) for w in word_tokens_cleaned]

What we are most interested in is the comparison of the lemmatization results with the stems obtained earlier - whether we will get any improvement in accuracy or maybe speed optimization.

So, let’s compare the lemmas obtained to the stems produced by the snowball.EnglishStemmer, which was more accurate in our case, and filter only positions that differ in these two cases:

stemms_and_lemmas = sorted(
    zip(word_tokens_cleaned, sb_en_stems, lemmas), key=lambda x: x[0].lower()
)
print(" Original word : SnowballSt.: WordNetLemm.")
print("=" * 45)
for t in stemms_and_lemmas:
    if t[1].lower() != t[2].lower():
        print(f" {t[0]:13} : {t[1]:10} : {t[2]}")

As a result, we get the table with 23 rows with stems and lemmas which differ:


At first glance, the following words seem to be lemmatized incorrectly: called, Celts, Gauls, Germans, waging.

Paradoxically, these words seem to be processed better by the SnowballStemmer than by the WordNetLemmatizer.

On the other hand, the lemmas of the words: another, battle, continually, daily, divided, effeminate, Garonne, language, Marne, Province, refinement, Seine, separate, and territory look more human-friendly than their stems because of the difference between the principles of operation between stemmers and lemmatizers.

But are the lemmas better than the corresponding stems?

It depends on the next processing steps. When the dictionary forms are strictly needed, the lemmatizer is an obvious choice over the stemmers. When only the simplified (in any way) versions of the words are required, stemmers may be a good alternative.

And what about the processing speed? We can use the %timeit command again:
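(The timed cell looks roughly like this:)

%timeit -r 5 -n 2000 [wnl.lemmatize(w) for w in word_tokens_cleaned]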


As we can see, the WordNetLemmatizer (at least with our source text) is about 4 times faster than the stemmers. It makes no difference in our case, but for large amounts of text to be processed, it may be a point.

We can see that stemmers lowercase the processed words, while the lemmatizer does not.

Let’s see whether the prior lowercasing of the words to be processed changes anything in its behavior.

There are twelve capitalized words in our source text:

'Aquitani' 'Belgae' 'Celts' 'Garonne' 'Gaul' 'Gauls' 'Germans' 'Helvetii' 'Marne' 'Province' 'Rhine' 'Seine'

Their direct lemmas are:

'Aquitani' 'Belgae' 'Celts' 'Garonne' 'Gaul' 'Gauls' 'Germans' 'Helvetii' 'Marne' 'Province' 'Rhine' 'Seine'

This looks to be exactly the same list. When we apply the lemmatization to the previously lowercased words, we get:

'aquitani' 'belgae' 'celt' 'garonne' 'gaul' 'gaul' 'german' 'helvetii' 'marne' 'province' 'rhine' 'seine'
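Here is a sketch of how these three lists can be produced (assuming the word_tokens_cleaned list and the wnl lemmatizer defined earlier):

capitalized = sorted(w for w in set(word_tokens_cleaned) if w[0].isupper())
print(capitalized)                                      # the capitalized words themselves
print([wnl.lemmatize(w) for w in capitalized])          # their direct lemmas
print([wnl.lemmatize(w.lower()) for w in capitalized])  # lemmas of the lowercased words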

As we can see, the words Celts, Gauls, and Germans were changed to their singular forms.

It’s up to you to decide whether this is desirable from the point of view of the further processing steps or not…

Pre-Lemmatization POS-tagging

The WordNetLemmatizer offers one more possibility to fine-tune the results it produces.

The lemmatize method accepts an optional pos parameter that specifies the part of speech of the word to be lemmatized. The possible values of this parameter are:

  • a for adjectives,
  • n for nouns,
  • r for adverbs,
  • s for satellite adjectives,
  • v for verbs,

with the default value n.

Does it help in any way?

Yes. For example, the lemmatization of the word are gives us different results without and with this optional POS tag:

print(wnl.lemmatize('are'))       # returns 'are'
print(wnl.lemmatize('are', 'v'))  # returns 'be'

Similarly, for the word waging which is present in our source text, we get different results too:

print(wnl.lemmatize('waging'))       # returns 'waging'
print(wnl.lemmatize('waging', 'v'))  # returns 'wag'

But in this case – surprisingly – it leads to confusion – we say to wage war, NOT to wag war.

Of course, it is possible to POS-tag the lemmatized words by hand, but the NLTK package offers various taggers and the pos_tag(…) function, which can be utilized to do this task.

It is beyond the scope of this article, but it is worth mentioning that the POS tags used by the WordNetLemmatizer’s lemmatize method (a, n, r, s, v) differ from the POS tags returned by the nltk.pos_tag(…) function, so when we want to do pre-lemmatization POS tagging with the help of that function, a conversion of the POS tags is required.
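A minimal sketch of such a conversion might look like this (the penn_to_wordnet helper is our own illustration, not part of NLTK; the tagger resource, e.g., averaged_perceptron_tagger, may need to be downloaded first):

from nltk import pos_tag
from nltk.corpus import wordnet

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS tag
    if tag.startswith('J'):
        return wordnet.ADJ    # 'a'
    if tag.startswith('V'):
        return wordnet.VERB   # 'v'
    if tag.startswith('R'):
        return wordnet.ADV    # 'r'
    return wordnet.NOUN       # 'n' – the lemmatizer's default

tagged = pos_tag(word_tokens_cleaned)
lemmas_pos = [wnl.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged]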

In our case, only one word (waging) is affected by the pre-lemmatization POS tagging, and it’s been swapped for the worse…

Anyway, after all these steps, we have five lists of stemmed/lemmatized words:

lc_stems_f, lc_stems_t, pt_stems, sb_en_stems and lemmas.

In practice, these are only three distinct lists:

  • Stems produced by the LancasterStemmer: lc_stems_f and lc_stems_t are identical;
  • Stems produced by the Porter/Snowball Stemmers: pt_stems and sb_en_stems are identical;
  • Lemmas produced by the WordNetLemmatizer.

We can now start doing other NLP tasks: for example, creating a BOW (bag of words) or N-grams, performing TF-IDF vectorization, feature extraction, model building, sentiment analysis, or any other NLP task that is needed.

But that’s a whole other story…

Conclusion

This article discusses the stemming and lemmatization processes, which are parts of the NLP workflow. We’ve spoken about the linguistic theory topics related to these processes and presented some of the Stemmers and the Lemmatizer from the NLTK package – the de facto standard in the NLP area.

We’ve shown how to prepare the source text before the actual stemming and/or lemmatization is possible. We also drew attention to some problems when working with these tools and compared them, considering their accuracy and processing speed.

Further reading

Wikipedia articles devoted to NLP topics:

The Hyperskill Sample NLP-related Topics:

Selected libraries/services used to perform NLP tasks (in alphabetical order):

PyCharm Help Topics:

Other documentation

