
Spell Checking: Implementation Overview


Spelling correction often serves as a preprocessing step for machine learning algorithms. Now that we understand the basics of spelling correction, we will look at several Python libraries that perform it quickly and efficiently.

TextBlob

This is not merely a spell checker; it is a library that tackles a variety of fundamental NLP tasks such as part-of-speech tagging, sentiment analysis, translation, and more. You can install TextBlob using pip install textblob. We will use its correct() method to fix our input text.

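from textblob import TextBlob

misspelled_words = ['wter', 'pthon language', 'moom', 'monly', 'acress whose']
corrected_words = []

for w in misspelled_words:
    # Wrap the raw string so that TextBlob's methods become available
    text_blob = TextBlob(w)
    # correct() returns a new TextBlob with each word replaced by its best guess
    text_corrected = text_blob.correct()
    corrected_words.append(text_corrected)

# TextBlob objects are converted to str before joining
print(', '.join(map(str, corrected_words)))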

OUTPUT- water, then language, room, only, across whose

Although TextBlob does correct the misspelled words, it's worth noting that it only offers isolated-term correction. In other words, each correction is made without considering the surrounding words.
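To make the idea of isolated-term correction concrete, here is a minimal sketch in the style of Peter Norvig's well-known spelling corrector. It is not TextBlob's actual implementation: the WORD_COUNTS table and its frequencies are made up for illustration, and only candidates one edit away are considered. Each word is corrected purely by picking the most frequent known word nearby, with no look at its neighbours.

```python
from collections import Counter

# Toy word frequencies (made-up numbers, for illustration only).
WORD_COUNTS = Counter({"water": 50, "room": 40, "moon": 30, "only": 60, "python": 20})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one edit (delete, swap, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)


def correct_word(word):
    """Pick the most frequent known word within one edit; context is ignored."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing known nearby, so leave the word untouched
    return max(candidates, key=WORD_COUNTS.get)


print(correct_word("wter"))  # water
print(correct_word("moom"))  # room
```

With these toy counts, 'moom' becomes 'room' rather than 'moon' simply because 'room' has the higher frequency; the surrounding words never enter the decision.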

One significant advantage of TextBlob is that its Word objects include a spellcheck() method, which returns a list of tuples containing suggested corrections along with their likelihood.

from textblob import Word

w = Word('acress')
# Each tuple is (suggested word, confidence score)
print(w.spellcheck())

OUTPUT-[('across', 0.6851851851851852), ('access', 0.1728395061728395), 
('acres', 0.1111111111111111), ('actress', 0.021604938271604937), ('caress', 0.009259259259259259)]
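Because the suggestions are plain (word, score) tuples, they are easy to post-process. The sketch below reuses the scores printed above as a hard-coded list, so it runs without TextBlob; the confident_suggestions helper and its 0.1 threshold are hypothetical choices for illustration, not part of the library.

```python
# Suggestions copied from the spellcheck() output above.
suggestions = [
    ('across', 0.6851851851851852),
    ('access', 0.1728395061728395),
    ('acres', 0.1111111111111111),
    ('actress', 0.021604938271604937),
    ('caress', 0.009259259259259259),
]


def confident_suggestions(pairs, threshold=0.1):
    """Keep only the suggestions whose score reaches the threshold."""
    return [word for word, score in pairs if score >= threshold]


# The single best suggestion is the tuple with the highest score.
best_word, best_score = max(suggestions, key=lambda pair: pair[1])

print(best_word)                           # across
print(confident_suggestions(suggestions))  # ['across', 'access', 'acres']
```

Keeping several high-scoring candidates instead of only the top one can be useful when a later step, such as a context-aware model, makes the final choice.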

Pyspellchecker

Pyspellchecker is an open-source package for correcting misspelled words and suggesting candidate spellings. Install it with pip install pyspellchecker. First, create a SpellChecker object. Then use its unknown() method to identify misspelled words, correction() to get the most likely fix, and candidates() to list all suggestions for a token. The language parameter selects the dictionary; supported languages include English, Russian, Spanish, and Arabic.

from spellchecker import SpellChecker

spell = SpellChecker(language='en')

# find those words that may be misspelled
misspelled = spell.unknown(['telivision', 'capitel', 'standind'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

OUTPUT-
capital
{'capital', 'capitol'}
standing
{'standing', 'standin'}
television
{'television'}
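When correcting a whole list of tokens, it helps to decide what happens if no suggestion is available. Depending on the installed version, correction() may return None for a token it cannot fix (this is the behaviour in recent pyspellchecker releases, as far as we know). The sketch below keeps the original token in that case. TinyChecker is a hypothetical stand-in that only mimics the correction() method so the example runs without the package.

```python
def correct_tokens(checker, tokens):
    """Correct each token, keeping the original when no correction is offered."""
    corrected = []
    for token in tokens:
        suggestion = checker.correction(token)
        corrected.append(suggestion if suggestion is not None else token)
    return corrected


class TinyChecker:
    """Hypothetical stand-in exposing correction(); not part of pyspellchecker."""

    known = {"telivision": "television", "capitel": "capital"}

    def correction(self, word):
        # Returns None for words without a known fix.
        return self.known.get(word)


print(correct_tokens(TinyChecker(), ["telivision", "capitel", "qzxv"]))
# ['television', 'capital', 'qzxv']
```

With pyspellchecker installed, a SpellChecker(language='en') object can be passed in place of TinyChecker().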

The same methods work for the other supported languages, such as Spanish, German, French, and Portuguese. Here's an example that corrects German words by passing language='de':

spell = SpellChecker(language='de')

misspelled = spell.unknown(['vile', 'Shvester', 'Hare'])

for word in misspelled:
    print(spell.correction(word))
    print(spell.candidates(word))

OUTPUT-
schwester
{'schwester'}
habe
{'hate', 'habe', 'haare', 'bare', 'are', 'hard', 'here', 'hart', 'haue', 'höre', 'care', 'have', 'hure', 'har', 'share', 'ware', 'harte'}
viel
{'viel', 'vive', 'eile', 'viele'}

Jamspell

Jamspell is a spell-checking library that goes beyond dictionary-based corrections: it also suggests context-sensitive corrections. The library is multilingual, and you can train it on your own manually collected datasets.

You can work with Jamspell via a Docker container. Instructions for installing Docker Desktop and working with Jamspell through Python's requests module can be found on GitHub.

import requests

misspelled_words = ['wter', 'pthon language', 'run away form', 'monly', 'acress whose']
corrected_words = []

for w in misspelled_words:
    res = requests.post("http://127.0.0.1:5050", json={"string_correction": w})
    output = res.json()  # output contains the processed string
    corrected_words.append(output)

print(', '.join(corrected_words))

OUTPUT- water, python language, run away form, only, actress whose

In this example, you can observe context-sensitive correction at work. Take the misspelled string 'acress whose'. Although "across" is statistically more probable in isolation, Jamspell suggests "actress" because of the context provided by "whose", a relative pronoun that most often follows a noun referring to a person. However, Jamspell can still miss real-word errors, where the typo is itself a valid word: it leaves the string "run away form" unchanged instead of correcting it to "run away from."
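The difference between the two strategies can be sketched with a toy bigram table. This is only an illustration of the general idea, not Jamspell's actual model, which is more elaborate; all counts below are invented, and the function names are hypothetical.

```python
# Toy counts (invented for illustration; not taken from Jamspell's model).
UNIGRAM_COUNTS = {"across": 900, "actress": 100}
BIGRAM_COUNTS = {("actress", "whose"): 30, ("across", "whose"): 1}


def pick_without_context(candidates):
    """Choose the most frequent candidate, ignoring neighbouring words."""
    return max(candidates, key=UNIGRAM_COUNTS.get)


def pick_with_context(candidates, next_word):
    """Prefer the candidate seen most often before next_word.

    Ties (for example, when no pair was ever seen) fall back to word frequency.
    """
    def score(candidate):
        pair_count = BIGRAM_COUNTS.get((candidate, next_word), 0)
        return (pair_count, UNIGRAM_COUNTS[candidate])

    return max(candidates, key=score)


candidates = ["across", "actress"]
print(pick_without_context(candidates))        # across
print(pick_with_context(candidates, "whose"))  # actress
```

Without context, the more frequent "across" wins; once the following word "whose" is taken into account, the pair count tips the decision towards "actress".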

SpaCy

Another way to perform spelling correction is to use the spaCy library together with the contextualSpellCheck extension. First, install both packages using the following commands:

!pip install spacy
!pip install contextualSpellCheck

Next, we'll add contextualSpellCheck to the main preprocessing pipeline of SpaCy using the .add_to_pipe method.

import spacy
import contextualSpellCheck

nlp = spacy.load('en_core_web_sm')
contextualSpellCheck.add_to_pipe(nlp)

This model is based on the BERT model and attempts to make corrections based on context. To illustrate, we'll add some context to our example sentences. To obtain the corrected text, we first apply our pipeline to each sentence and then access the ._.outcome_spellCheck attribute.

sentences = [
    'wter was running in the sink',
    'pthon programming language',
    'run away form this man ',
    'monly stipend is gived to students',
    'acress whose who left messages to you',
]

for sentence in sentences:
    doc = nlp(sentence)
    print(doc._.outcome_spellCheck)

# water was running in the sink
# Python programming language
# (empty line printed for the third sentence)
# No land is given to students
# Those whose who left messages to you

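Note that the pipeline produced no corrected text for our third example, so the real-word error "form" went unfixed. The first two examples were corrected as intended, and "gived" became "given". The last two outputs, however, show the limits of this approach: "monly stipend" and "acress" were replaced with words that fit the context ("No land", "Those") but are not plausible corrections of the original typos.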

Conclusion

In this article, we've explored several methods for spell correction in Python. Any of these can serve as a preprocessing step to ensure that your data is cleaner for the main machine learning model.
