
Levenshtein distance

Previously, you learned the basics of the edit distance (also known as the Levenshtein distance), which is the minimal number of operations needed to change one string into another. This method was created in 1965 by Vladimir Levenshtein, a Russian mathematician. In this topic, we will take another look at the algorithm, write an implementation, and, finally, discuss libraries that calculate the distance for you.

Quick recap

The Levenshtein distance determines the minimal number of operations to transform one string into another. We have three operations: inserting a character, removing a character, and substituting a character.

The distance is calculated with the help of an array called the distance matrix. It is a table of $n$ rows and $m$ columns in which each cell holds the Levenshtein distance between a prefix of one string and a prefix of the other. When it's completed, it can look something like this:

[Figure: the distance matrix of the words "cats" and "map"]

How do you fill a distance matrix? The steps are the following:

Create a distance matrix; the number of rows and columns is equal to the number of characters in the source and target strings, respectively;
Add an extra row and an extra column that hold the indices 0, 1, 2, and so on; each index is the cost of building the corresponding prefix from an empty string;
Fill the cells with the scores using the following formula:
After filling up the matrix, take the score in the lower-right cell. This one will be the Levenshtein distance between the strings.
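
The formula itself is not reproduced in this text. A standard way to write the recurrence, consistent with the Python code later in this topic, is sketched below; the notation $\operatorname{lev}_{a,b}(i, j)$ for the score in row $i$ and column $j$ is an assumption, and characters are counted from 1:

$$
\operatorname{lev}_{a,b}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0, \\
\min
\begin{cases}
\operatorname{lev}_{a,b}(i-1, j) + 1 \\
\operatorname{lev}_{a,b}(i, j-1) + 1 \\
\operatorname{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
$$

Here $1_{(a_i \neq b_j)}$ equals 0 when the two characters match and 1 when they differ. The three options correspond to removing a character, inserting a character, and substituting a character.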

In this formula, $a$ and $b$ stand for the first and the second words correspondingly; $i$ and $j$ are the indexes of their characters, so $a_i$ refers to the character in the $a$ string at the $i$ -th position.

This was just a quick recap if you don't remember all the details. You can find more in the topic on Edit distance.

Now that you've refreshed your memory, let's try to implement this algorithm in Python!

Python implementation

Here is one of the ways to calculate the Levenshtein distance in Python with the help of the NumPy library. Let's write the function to calculate the distance step by step.

Create a function that takes two strings: string1 and string2. Two variables will store the dimensions of our matrix. We add 1 to both values because we need one extra row and column for indexes:

import numpy as np


def lev_dist(string1, string2):
    # add an extra row and an extra column to a matrix
    size_1 = len(string1) + 1
    size_2 = len(string2) + 1

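Fill up the matrix with zeros. We can achieve it with np.zeros(). Note that np.zeros() creates floating-point values by default, so the matrix will be printed as 0., 1., 2., and so on. Don't forget that the extra row and column should be filled with indexes: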

    # fill up the matrix with zeros
    dist_matrix = np.zeros((size_1, size_2))
    # fill up the first (extra) column and row
    for i in range(size_1):
        dist_matrix[i, 0] = i
    for y in range(size_2):
        dist_matrix[0, y] = y

What should we do with the formula described in the previous section? We can use a nested for loop to iterate over all pairs of indexes in both strings and apply the formula to each cell. Note that both ranges start at 1, as the extra row and column are already filled.

    # iterate over all the indices excluding the indices of the extra row and column
    for i in range(1, size_1):
        for y in range(1, size_2):
            # use the formula
            dist_matrix[i, y] = min(
                    dist_matrix[i-1, y] + 1,
                    dist_matrix[i, y-1] + 1,
                    # don't add 1 if the characters are equal
                    dist_matrix[i-1, y-1] if string1[i-1] == string2[y-1] else dist_matrix[i-1, y-1] + 1)

    return dist_matrix

Recall that each cell takes the minimal of three candidate scores; this is what the min() function does in the loop. Here is the complete function:

import numpy as np


def lev_dist(string1, string2):
    # add an extra row and an extra column to a matrix
    size_1 = len(string1) + 1
    size_2 = len(string2) + 1
    # fill up the matrix with zeros
    dist_matrix = np.zeros((size_1, size_2))
    # fill up the first (extra) column and row
    for i in range(size_1):
        dist_matrix[i, 0] = i
    for y in range(size_2):
        dist_matrix[0, y] = y
    # iterate over all the indexes excluding the indexes of the extra row and column
    for i in range(1, size_1):
        for y in range(1, size_2):
            # use the formula
            dist_matrix[i, y] = min(
                    dist_matrix[i-1, y] + 1,
                    dist_matrix[i, y-1] + 1,
                    # don't add 1 if the characters are equal
                    dist_matrix[i-1, y-1] if string1[i-1] == string2[y-1] else dist_matrix[i-1, y-1] + 1)

    return dist_matrix

Now, let's check how the function works. Let's look at the following pairs: tomato and potato, fruit and ruin:

check1 = lev_dist('tomato', 'potato')
print(check1)
# [[0. 1. 2. 3. 4. 5. 6.]
#  [1. 1. 2. 2. 3. 4. 5.]
#  [2. 2. 1. 2. 3. 4. 4.]
#  [3. 3. 2. 2. 3. 4. 5.]
#  [4. 4. 3. 3. 2. 3. 4.]
#  [5. 5. 4. 3. 3. 2. 3.]
#  [6. 6. 5. 4. 4. 3. 2.]]

check2 = lev_dist('fruit', 'ruin')
print(check2)
# [[0. 1. 2. 3. 4.]
#  [1. 1. 2. 3. 4.]
#  [2. 1. 2. 3. 4.]
#  [3. 2. 1. 2. 3.]
#  [4. 3. 2. 1. 2.]
#  [5. 4. 3. 2. 2.]]

Our function builds the distance matrix, and the lower-right cell of each matrix holds the Levenshtein distance: 2 for tomato and potato, and 2 for fruit and ruin.
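
The function above returns the whole matrix. When only the final number is needed, it is enough to keep two rows at a time. The following is a minimal sketch in pure Python; the name lev_dist_value is hypothetical and not part of any library:

```python
def lev_dist_value(string1, string2):
    """Return the Levenshtein distance as an int, keeping only two rows."""
    # the previous row starts as the extra first row: 0, 1, 2, ...
    previous = list(range(len(string2) + 1))
    for i, char1 in enumerate(string1, start=1):
        # the first cell of each row belongs to the extra first column
        current = [i]
        for y, char2 in enumerate(string2, start=1):
            substitution_cost = 0 if char1 == char2 else 1
            current.append(min(
                previous[y] + 1,       # remove a character
                current[y - 1] + 1,    # insert a character
                previous[y - 1] + substitution_cost))
        previous = current
    # the last cell of the last row is the lower-right cell of the matrix
    return previous[-1]


print(lev_dist_value('tomato', 'potato'))  # 2
print(lev_dist_value('fruit', 'ruin'))     # 2
```

This version uses memory proportional to the length of the second string instead of the full table, at the price of not being able to inspect the matrix afterwards.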

Additional Python libraries

Of course, you can create your own implementation of the Levenshtein distance, but it's not necessary, as there are libraries that can do it for you. Let's consider two of them.

The first one is NLTK. It has the nltk.edit_distance() function that takes two strings and returns the edit distance. Let's check how it works:

import nltk

string1 = 'tomato'
string2 = 'potato'
print(nltk.edit_distance(string1, string2))  # 2

string3 = 'fruit'
string4 = 'frustrated'
print(nltk.edit_distance(string3, string4))  # 6

Another library is called Levenshtein; it compares strings directly. You can find it on GitHub. First, we need to install it with pip:

pip install python-Levenshtein

Then we need to import it:

from Levenshtein import distance

Now, we are ready to calculate the distance!

print(distance('winner', 'twins'))  # 4

The result is correct: we insert t at the beginning of the source string, replace the second n with s, and remove e and r in the source string. That makes four operations in total.
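
A list of operations like the one above can also be recovered from the distance matrix by walking back from the lower-right cell. The sketch below is an illustration in pure Python, not a function of the Levenshtein library; the name edit_operations is hypothetical. Several optimal sequences can exist, so the one it finds may differ from the one described above while having the same length:

```python
def edit_operations(source, target):
    """Return one optimal list of edits that turns source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    # build the distance matrix as a list of lists
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for y in range(cols):
        dist[0][y] = y
    for i in range(1, rows):
        for y in range(1, cols):
            cost = 0 if source[i - 1] == target[y - 1] else 1
            dist[i][y] = min(dist[i - 1][y] + 1,
                             dist[i][y - 1] + 1,
                             dist[i - 1][y - 1] + cost)

    # walk back from the lower-right cell to the upper-left one
    operations = []
    i, y = len(source), len(target)
    while i > 0 or y > 0:
        if i > 0 and y > 0:
            cost = 0 if source[i - 1] == target[y - 1] else 1
            if dist[i][y] == dist[i - 1][y - 1] + cost:
                # diagonal step: a match (free) or a substitution
                if cost:
                    operations.append(f'replace {source[i - 1]} with {target[y - 1]}')
                i, y = i - 1, y - 1
                continue
        if i > 0 and dist[i][y] == dist[i - 1][y] + 1:
            # step up: a character of the source is removed
            operations.append(f'remove {source[i - 1]}')
            i -= 1
        else:
            # step left: a character of the target is inserted
            operations.append(f'insert {target[y - 1]}')
            y -= 1
    # the steps were collected from the end, so reverse them
    return operations[::-1]


print(edit_operations('winner', 'twins'))
# ['insert t', 'remove n', 'remove e', 'replace r with s']
```

The number of returned operations always equals the Levenshtein distance, because every step that is not a match lowers the score by exactly one.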

This library has many interesting functions; you can read more about them in the documentation.

How can it help?

The Levenshtein distance has many applications in NLP. First of all, it's great for the correction of spelling mistakes.

Levenshtein distance helps find the closest word from the given vocabulary, so it can be used both in text editors and in search engines to correct users' errors;
It is used in optical character recognition (OCR). Sometimes a handwritten word can be recognized wrong, and the Levenshtein distance can suggest possible corrections;
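It is also helpful in machine translation, for example, in evaluation metrics that count the edits needed to turn a system output into a reference translation;
Some errors can also occur during the conversion of audio into text. The Levenshtein distance can be used in post-editing the recognized text.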
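
The spelling correction idea from the list above can be sketched in a few lines: compute the distance from the misspelled word to every vocabulary entry and keep the closest ones. This is only an illustration under simple assumptions: the function names and the toy vocabulary are hypothetical, the vocabulary is assumed to be non-empty, and real spell checkers also use word frequencies and faster search structures.

```python
def lev_dist_value(string1, string2):
    """Levenshtein distance computed row by row (pure Python)."""
    previous = list(range(len(string2) + 1))
    for i, char1 in enumerate(string1, start=1):
        current = [i]
        for y, char2 in enumerate(string2, start=1):
            cost = 0 if char1 == char2 else 1
            current.append(min(previous[y] + 1,
                               current[y - 1] + 1,
                               previous[y - 1] + cost))
        previous = current
    return previous[-1]


def closest_words(word, vocabulary):
    """Return all vocabulary entries with the smallest distance to word."""
    distances = {candidate: lev_dist_value(word, candidate)
                 for candidate in vocabulary}
    best = min(distances.values())
    # keep every candidate that ties for the best score
    return [candidate for candidate, score in distances.items() if score == best]


vocabulary = ['tomato', 'potato', 'fruit', 'ruin']  # hypothetical toy vocabulary
print(closest_words('tomatto', vocabulary))  # ['tomato']
print(closest_words('fruin', vocabulary))    # ['fruit', 'ruin']
```

The second call shows that ties are possible: fruin is one edit away from both fruit and ruin, so the distance alone cannot choose between them.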

Conclusion

The Levenshtein distance is a useful method to measure how similar two strings are: the smaller the distance, the closer the strings. So far, we have learned:

How to implement the Levenshtein distance in Python;
How the NLTK and Levenshtein libraries can help with calculating the distance;
How the distance is utilized in NLP.

Now, let's review everything you have learned by doing some practical tasks.
