Spelling correction involves finding and fixing spelling errors and is vital for various NLP applications, including web search, text summarization, and sentiment analysis. This feature, offered by various software tools, is common in word processors, emails, smartphones, blogs, and forums, allowing real-time corrections, overall document scanning for errors, and alerts for misspelled words.
Spelling error types
The types of misspellings define the first step of solving the spelling correction problem. Specifically, the kind of mistake dictates the method for recognizing the error. The first level of error classification is as follows:
- Orthographical or cognitive mistakes arise from cognitive processing errors. For example, if one is unsure of how to spell Britney Spears, they could make a guess and spell it as Britany Spirs. Another instance of cognitive misspelling is the incorrect selection of homophones, words with similar pronunciation or spelling but different meanings. For instance, typing to instead of too.
- Typographical errors, on the other hand, are simply the result of mistyping, whether due to the so-called fat-finger syndrome, fast typing, or keyboard adjacency. For instance, such mistypes as helkjlo or grapohic. Sometimes, we can find or understand the mistake by looking at the keyboard. If one letter is close to another on a keyboard, one can type one letter instead of another.
While misspellings originating from deeper cognitive levels often yield real-world spelling errors, typographical mistakes typically result in non-word errors. This introduces the second classification.
This second distinction is likely even more relevant to the specific NLP problem:
- Non-word: the spelling mistake when the mistyped word cannot be found in the dictionary
- Real-word errors: the misspelled word is an acceptable dictionary term but differs from the intended word — for example, homophone errors
The forenamed classifications are important because they define how spelling errors will be recognized and addressed.
Approaches to spelling correction
Spelling correction can be observed as the task of first finding the incorrect word and then transforming it. This correction process can be divided into isolated-term correction and context-sensitive correction.
Isolated-term correction corrects misspelled words one term at a time without considering the word's context. There are a few popular ways to perform isolated-term correction:
- The dictionary lookup technique: Every term of an input string is checked for its presence in the dictionary before anything else. It is logical to assume that any term not found in the dictionary is an error. At this point, non-word errors come to mind. Discovering a non-word error is easier since it only requires ensuring that the term is not in the dictionary. However, it can be considered as a drawback of this method.
- Edit distance: also known as the Levenshtein distance or the minimum edit distance. It measures the operation counts (insertions, deletions, substitutions) to correct a misspelled word by comparing it to a reference dictionary and seeking potential corrections with minimal distance. The word with the minimum distance can be used as a corrected word.
Consider two strings, A and B, that you want to compare with edit distance:
- Insertion: Add a character to string A.
- Deletion: Remove a character from string A.
- Substitution: Replace a character in string A with a different character.
- K-gram overlap: It compares misspelled words and potential corrections by measuring the similarity of their character sequences based on shared k-grams. The whole process includes the following steps:
-
K-gram Indexing: First, a dictionary of correctly spelled words is indexed using k-grams. For each word in the dictionary, -grams of a fixed length ( characters) are extracted from the word. These k-grams are then stored in a data structure, allowing efficient lookup.
-
Misspelled Word Processing: When a potentially misspelled word is encountered, the same -gram extraction process is applied, generating k-grams from the input word.
-
Comparison: -grams from the misspelled word are compared to the -grams stored in the dictionary index. The algorithm looks for matching or similar k-grams between the misspelled word and words in the dictionary.
-
Candidate Selection: A set of candidate corrections is generated based on the similarity of -grams. These candidates are words from the dictionary that share similar -grams with the misspelled word.
-
The primary limitation of isolated term correction is that it cannot account for homophones or words that may be spelled correctly but are misused within the context of the sentence. For example, the previously mentioned flights form Heathrow won't be corrected by isolated-term correction since all words are spelled correctly, but the phrase is meaningless.
Context-sensitive term correction identifies and corrects spelling errors by considering the context in which the misspelled term appears to make more accurate corrections. In context-sensitive term correction, the algorithm analyzes the neighboring words, grammar, and semantic context to determine the most likely correct spelling for the misspelled term. This approach is instrumental when dealing with homophones and cases where the misspelled term is valid but not the intended word in the given context.
Context-sensitive term correction algorithms can be split into two categories:
- Statistical methods estimate the likelihood of a particular word given its context and suggest corrections based on the most probable word sequence. An example of an algorithm in this section is the noisy channel model, which perceives a misspelled word as a correctly spelled one that has been corrupted through transmission via a corrupt communication channel.
- Machine learning methods mostly use word embeddings, where each word is associated with a numerical vector. Language models such as Word2Vec or ELMo could be used to correct spelling.
Conclusion
In NLP applications, spelling correction is critical in eliminating errors that can negatively affect natural language processing accuracy.
This topic covers two main types of spelling errors: orthographic/cognitive and typographical. It also discusses two classifications of spelling errors: non-word and real-word. We've also presented an overview of spelling correction approaches, such as isolated-term and context-sensitive correction. Techniques used in isolated-term correction include dictionary lookup, edit distance, and k-gram overlap. While effective for correcting single-word spelling errors, context-sensitive correction is necessary for correcting contextual spelling errors.