Stemming

Most texts include words in different forms. When you build count-based representations such as TF-IDF, you want related forms like "run", "runs", "running", or "runner" to be treated as the same term. One way to achieve this is text normalization. In this topic, you will explore the text normalization technique known as stemming. The goal of text normalization is to reduce the different variations of a word to a single form: words like "plays", "playing", and "played" are transformed into "play", which lets count-based models perform better.
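
As a quick illustration of why this matters for count-based models, here is a minimal sketch (it assumes NLTK's Porter stemmer, introduced later in this topic, and a small hand-picked list of tokens):

from collections import Counter
from nltk.stem import PorterStemmer

tokens = ['plays', 'playing', 'played', 'play']
porter = PorterStemmer()

# Without normalization, every surface form gets its own count
print(Counter(tokens))
# Counter({'plays': 1, 'playing': 1, 'played': 1, 'play': 1})

# After stemming, all four forms collapse into a single feature
print(Counter(porter.stem(token) for token in tokens))
# Counter({'play': 4})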

Before applying stemming or lemmatization, it is recommended to tokenize your text and remove any digits and punctuation marks.
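
For example, a minimal preprocessing sketch could look like this (it assumes NLTK's word_tokenize, which needs the punkt tokenizer data downloaded first; the sample sentence is arbitrary):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer data; newer NLTK versions may ask for 'punkt_tab' instead

text = "In 2023, she played 3 matches and was running daily!"
tokens = word_tokenize(text.lower())

# Keep only alphabetic tokens, dropping digits and punctuation
words = [token for token in tokens if token.isalpha()]
print(words)  # ['in', 'she', 'played', 'matches', 'and', 'was', 'running', 'daily']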

Stemming use cases

The word stem, also known as the word root, is the core part of an inflected word: it carries the word's lexical meaning. Note that a stem may not be an actual word itself, but rather a word form without any affixes attached to it. Stemming, in turn, is the process of chopping off the last few characters of a word, which can produce stems with inaccurate meanings or spellings.

For example, the words "laughing," "laughs," "laughed," and "laughter" all share the common stem "laugh." Similarly, the words "swimming" and "swim" share the common stem "swim." It's worth noting that a word like "swam" would not have the same stem form as "swim." Additionally, stemming can sometimes result in the same stem for words with different meanings. For instance, both "cars" and "caring" would have the same stem "car."

Ways to stem

There are various ways to carry out stemming. NLTK and other NLP libraries implement a number of stemming algorithms, most of which are based on a few key approaches.

One common approach is suffix stripping, which relies on a list of suffixes specific to a particular language. For example, if we identify the word "played" as a verb and know that "-ed" endings are common in English, we can strip off this suffix. Similarly, for the word "increased", a stemmer will recognize the "-ed" ending and produce the stem "increas". Some algorithms, such as the Snowball stemmer, do not require the resulting base form to be a real word. Alternatively, a stemmer can be designed to output only real word forms. In that case, we can keep a list of all possible base forms: the stemmer removes one letter at a time from the end of the word and checks whether the result matches a word in the list, repeating this process until it obtains a valid form. In the example above, such a stemmer would delete just one letter from the right and stop at "increase".
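
Here is a toy sketch of both variants of suffix stripping described above; the suffix list and the vocabulary are made up purely for illustration, and real stemmers use far larger rule sets:

SUFFIXES = ('ing', 'ed', 's')              # toy suffix list for English
VOCABULARY = {'play', 'increase', 'swim'}  # toy list of valid base forms


def strip_suffix(word):
    # Plain suffix stripping: cut off a known ending and return whatever is left
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word


def dictionary_stem(word):
    # Dictionary-based variant: trim one letter at a time from the right
    # until the remaining form is a known word (or nothing is left)
    candidate = word
    while candidate and candidate not in VOCABULARY:
        candidate = candidate[:-1]
    return candidate or word


print(strip_suffix('increased'))     # increas
print(dictionary_stem('increased'))  # increase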

Another method of stemming is suffix changing. With this approach, a popular ending can be replaced with another. For example, the noun "friendliness" can be stemmed as "friendly" by replacing the ending "-liness" with "-ly". However, it's important to note that most stemmers are not as advanced, so the result may only be "friendli".
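
A similar toy sketch for suffix changing, again with a made-up rule table used only for illustration:

# Toy replacement rules: ending -> substitute
RULES = {'liness': 'ly', 'iveness': 'ive', 'fulness': 'ful'}


def change_suffix(word):
    for ending, replacement in RULES.items():
        if word.endswith(ending):
            return word[:-len(ending)] + replacement
    return word


print(change_suffix('friendliness'))  # friendly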

Stemming in NLTK and gensim

Let's learn how to carry out stemming using NLTK, a powerful toolkit for natural language processing. NLTK offers various algorithms for stemming, including the widely used Porter stemmer and Lancaster stemmer, specifically designed for the English language. These algorithms, along with others, can be found in the nltk.stem module.

The Porter stemmer is one of the earliest and most popular algorithms for stemming words. To use it, import the PorterStemmer class from the nltk.stem module and create an instance of this class. Keep in mind that the Porter stemmer is designed for English words only. Once you have created the object, you can call the .stem() method on a word to obtain its stemmed form.

from nltk.stem import PorterStemmer


porter = PorterStemmer()
porter.stem('played')   # play
porter.stem('playing')  # play

For improved results, consider the Snowball stemmer, an enhancement of the original Porter stemmer. NLTK's SnowballStemmer class also supports 13 non-English languages, such as Spanish, French, Russian, German, Swedish, and more. To use this algorithm, create a new instance of the class and specify the desired language.

from nltk.stem import SnowballStemmer


snowball = SnowballStemmer('english')
snowball.stem('playing')  # play
snowball.stem('played')   # play
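
To see the multilingual support in action, you can list the languages available in your NLTK version and create a stemmer for one of them; Spanish is used here only as an example, and the exact stems depend on each language's rule set:

from nltk.stem import SnowballStemmer

# Languages supported by the Snowball implementation in your NLTK version
print(SnowballStemmer.languages)

# A non-English stemmer is created the same way, just with another language name
spanish = SnowballStemmer('spanish')
print(spanish.stem('corriendo'))  # the Spanish gerund is reduced to its stem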

The Snowball stemmer outperforms the Porter stemmer, as we mentioned earlier. Let's compare the following examples:

snowball.stem('generously')   # generous
porter.stem('generously')     # gener

snowball.stem('dangerously')  # danger
porter.stem('dangerously')    # danger

The Porter stemmer removes both the suffixes "-ly" and "-ous" from the input. For "dangerously" this gives the same result as the Snowball stemmer, but for "generously" it is excessive: the Porter stemmer returns the truncated "gener", while the Snowball stemmer produces the more accurate "generous".

NLTK also offers the Lancaster or Paice-Husk stemming algorithm. To utilize the Lancaster stemmer, you should follow the same import process as before, but this time you import the LancasterStemmer class from the nltk.stem module.

from nltk.stem import LancasterStemmer


lancaster = LancasterStemmer()
lancaster.stem('played')       # play
lancaster.stem('playing')      # play
lancaster.stem('generously')   # gen
lancaster.stem('dangerously')  # dang

All of these stemmers are quite similar. The original Porter stemmer and the Snowball stemmer stand out by providing better results, while the Lancaster stemmer works faster, which matters when you need to process large amounts of text quickly. So, depending on your needs, choose the Lancaster stemmer for speed or the Snowball or Porter stemmers for more accurate results.
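
If you want to check this speed trade-off on your own data, here is a rough sketch using timeit; the word list and the repetition count are arbitrary:

from timeit import timeit

from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

words = ['playing', 'generously', 'dangerously', 'friendliness'] * 1000

for stemmer in (PorterStemmer(), SnowballStemmer('english'), LancasterStemmer()):
    # Time how long each stemmer takes to process the same word list ten times
    seconds = timeit(lambda: [stemmer.stem(word) for word in words], number=10)
    print(type(stemmer).__name__, round(seconds, 3))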

Now, let's explore how the Snowball Stemmer handles different parts-of-speech, starting with adjectives.

adjs = ['effective', 'dangerous', 'careful', 'monetary', 'kind', 'supportive', 'rarer', 'rarest']
for a in adjs:
    print(a, ' --> ', snowball.stem(a))


#  effective  -->  effect
#  dangerous  -->  danger
#  careful  -->  care
#  monetary  -->  monetari
#  kind  -->  kind
#  supportive  -->  support
#  rarer  -->  rarer
#  rarest  -->  rarest

As the output shows, the Snowball stemmer reduces most adjectives to a noun-like base form, except for comparatives, superlatives, and adjectives ending in "-ed".

Additionally, the gensim library can be used for stemming. Import the PorterStemmer class from gensim.parsing.porter, create an instance of it, and call the .stem() method on the desired word to obtain its stem.

from gensim.parsing.porter import PorterStemmer

p = PorterStemmer()
p.stem('apple')        # appl
p.stem('played')       # plai
p.stem('playing')      # plai
p.stem('generously')   # gener
p.stem('dangerously')  # danger

Conclusion

Stemming is a popular normalization technique with its own set of pros and cons. On the downside, it may yield stems that are not actual words and can produce the same stem for words with different meanings. On the upside, it handles large datasets well and processes them quickly, delivers reliable results for certain languages, and requires little memory.

To summarize, stemming is a text normalization technique that reduces words to their base form, improving information retrieval and analysis. Employing stemming as a normalization step can significantly speed up the preprocessing of large amounts of text.

You can find more on this topic in Mastering Stemming and Lemmatization on Hyperskill Blog.
