Previously, you learned the basics of the edit distance (also known as the Levenshtein distance), which is the minimal number of operations needed to change one string into another. It was introduced in 1965 by Vladimir Levenshtein, a Russian mathematician. In this topic, we will take another look at the algorithm, write an implementation, and, finally, discuss libraries that calculate the distance for you.
Quick recap
The Levenshtein distance determines the minimal number of operations to transform one string into another. We have three operations: inserting a character, removing a character, and substituting a character.
The distance is calculated with the help of a two-dimensional array called the distance matrix. Each cell of this table holds the Levenshtein distance between a prefix of the source string and a prefix of the target string. When it's completed, it can look something like this:
How do you fill a distance matrix? The steps are the following:
Create a distance matrix (we've done it); it has one row for each character of the source string and one column for each character of the target string;
Add an extra row and column that will hold the characters' indices;
Fill the cells with the scores using the following formula:
After filling up the matrix, take the score in the lower-right cell. This one will be the Levenshtein distance between the strings.
In this formula, $a$ and $b$ stand for the first and the second words respectively; $i$ and $j$ are the indexes of their characters, so $a_i$ refers to the character in the string $a$ at the $i$-th position.
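For reference, the recurrence behind the steps above is usually written as shown here. The notation is an assumption, since the text does not fix the symbols: $a$ and $b$ are the two strings, $i$ and $j$ are 1-based positions in them, and $\operatorname{lev}_{a,b}(i, j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$.

```latex
\operatorname{lev}_{a,b}(i, j) =
\begin{cases}
  \max(i, j) & \text{if } \min(i, j) = 0, \\
  \min
  \begin{cases}
    \operatorname{lev}_{a,b}(i - 1, j) + 1 \\
    \operatorname{lev}_{a,b}(i, j - 1) + 1 \\
    \operatorname{lev}_{a,b}(i - 1, j - 1) + 1_{(a_i \neq b_j)}
  \end{cases} & \text{otherwise.}
\end{cases}
```

The first case fills the extra row and column with indices. The second case takes the minimum over a deletion, an insertion, and a substitution, where the indicator $1_{(a_i \neq b_j)}$ is 0 when the two characters match and 1 otherwise. The distance between the full strings is $\operatorname{lev}_{a,b}(|a|, |b|)$, the lower-right cell of the matrix.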
This was just a quick recap if you don't remember all the details. You can find more in the topic on Edit distance.
Now that you've refreshed your memory, let's try to implement this algorithm in Python!
Python implementation
Here is one of the ways to calculate the Levenshtein distance in Python with the help of the NumPy library. Let's write the function to calculate the distance step by step.
Create a function that takes two strings: string1 and string2. Two variables will store the dimensions of our matrix. We add 1 to both values because we need one extra row and column for indexes:

    import numpy as np

    def lev_dist(string1, string2):
        # add an extra row and an extra column to a matrix
        size_1 = len(string1) + 1
        size_2 = len(string2) + 1

Fill up the matrix with zeros. We can achieve it with np.zeros(). Don't forget that the extra row and column should be filled with indexes:

        # fill up the matrix with zeros
        dist_matrix = np.zeros((size_1, size_2))
        # fill up the first (extra) column and row
        for i in range(size_1):
            dist_matrix[i, 0] = i
        for y in range(size_2):
            dist_matrix[0, y] = y

What should we do with the formula described in the previous section? We can use a nested for loop to go through all possible pairs of indexes in the two strings and apply the formula to each pair. Note that we start the ranges at 1, as the extra first row and column are already filled:

        # go through all the indices excluding the indices of the extra row and column
        for i in range(1, size_1):
            for y in range(1, size_2):
                # use the formula
                dist_matrix[i, y] = min(
                    dist_matrix[i-1, y] + 1,
                    dist_matrix[i, y-1] + 1,
                    # don't add 1 if the characters are equal
                    dist_matrix[i-1, y-1] if string1[i-1] == string2[y-1] else dist_matrix[i-1, y-1] + 1)
        return dist_matrix

Do you remember that we must choose the minimal score out of the three obtained results? We can use the min() function for this purpose. So, let's have a look at the code we have:

    import numpy as np

    def lev_dist(string1, string2):
        # add an extra row and an extra column to the matrix
        size_1 = len(string1) + 1
        size_2 = len(string2) + 1
        # fill up the matrix with zeros
        dist_matrix = np.zeros((size_1, size_2))
        # fill up the first (extra) column and row
        for i in range(size_1):
            dist_matrix[i, 0] = i
        for y in range(size_2):
            dist_matrix[0, y] = y
        # go through all the indexes excluding the indexes of the extra row and column
        for i in range(1, size_1):
            for y in range(1, size_2):
                # use the formula
                dist_matrix[i, y] = min(
                    dist_matrix[i-1, y] + 1,
                    dist_matrix[i, y-1] + 1,
                    # don't add 1 if the characters are equal
                    dist_matrix[i-1, y-1] if string1[i-1] == string2[y-1] else dist_matrix[i-1, y-1] + 1)
        return dist_matrix

Now, let's check how the function works. Let's look at the following pairs: tomato and potato, fruit and ruin:

    check1 = lev_dist('tomato', 'potato')
    print(check1)
    # [[0. 1. 2. 3. 4. 5. 6.]
    #  [1. 1. 2. 2. 3. 4. 5.]
    #  [2. 2. 1. 2. 3. 4. 4.]
    #  [3. 3. 2. 2. 3. 4. 5.]
    #  [4. 4. 3. 3. 2. 3. 4.]
    #  [5. 5. 4. 3. 3. 2. 3.]
    #  [6. 6. 5. 4. 4. 3. 2.]]

    check2 = lev_dist('fruit', 'ruin')
    print(check2)
    # [[0. 1. 2. 3. 4.]
    #  [1. 1. 2. 3. 4.]
    #  [2. 1. 2. 3. 4.]
    #  [3. 2. 1. 2. 3.]
    #  [4. 3. 2. 1. 2.]
    #  [5. 4. 3. 2. 2.]]

Our function created the distance matrix for each pair, with rows corresponding to the first string and columns to the second. The lower-right cell holds the Levenshtein distance, which is 2 for both pairs. The values are floats because np.zeros() creates a float array by default.
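When only the final score is needed, the full matrix does not have to be stored. The following is a minimal sketch, not part of the implementation above, that keeps just two rows in plain Python lists and returns an integer. The name levenshtein is a hypothetical helper chosen for this example.

```python
def levenshtein(source, target):
    """Return the Levenshtein distance, keeping only two matrix rows in memory."""
    # previous_row plays the role of the extra first row: 0, 1, 2, ...
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        # the first cell of every row comes from the extra first column
        current_row = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row
    # the last cell of the last row is the lower-right cell of the matrix
    return previous_row[-1]


print(levenshtein('tomato', 'potato'))  # 2
print(levenshtein('fruit', 'ruin'))     # 2
```

Each cell depends only on the row above it and the cell to its left, which is why two rows are enough. The trade-off is that the intermediate matrix is no longer available for inspection.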
Additional Python libraries
Of course, you can create your implementation of the Levenshtein distance, but it's not necessary, as there are libraries that can help you with this. Let's consider two of them.
The first one is NLTK. It has the nltk.edit_distance() function that takes two strings and returns the edit distance. Let's check how it works:

    import nltk

    string1 = 'tomato'
    string2 = 'potato'
    print(nltk.edit_distance(string1, string2))  # 2

    string3 = 'fruit'
    string4 = 'frustrated'
    print(nltk.edit_distance(string3, string4))  # 6

Another library is called Levenshtein; it compares strings directly. You can find it on GitHub. First, we need to install it with pip:

    pip install python-Levenshtein

Then we need to import it:

    from Levenshtein import distance

Now, we are ready to calculate the distance!

    print(distance('winner', 'twins'))  # 4

The result is correct: we need to add t at the beginning of the source string. Then we need to replace the second n with s and remove e and r in the source string.
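A sequence of edits like the one described for winner and twins can also be recovered programmatically by walking back through the distance matrix from the lower-right cell. The sketch below is a plain-Python illustration under that assumption; edit_operations is a hypothetical helper written for this example, not a function of the Levenshtein library.

```python
def edit_operations(source, target):
    """List one minimal sequence of edits that turns source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    # build the full distance matrix with plain lists
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)

    # walk back from the lower-right cell to the upper-left one
    operations = []
    i, j = len(source), len(target)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if source[i - 1] == target[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                # diagonal step: a match (no edit) or a substitution
                if cost:
                    operations.append(('replace', source[i - 1], target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            # step up: a character of the source is deleted
            operations.append(('delete', source[i - 1], None))
            i -= 1
        else:
            # step left: a character of the target is inserted
            operations.append(('insert', None, target[j - 1]))
            j -= 1
    operations.reverse()
    return operations


operations = edit_operations('winner', 'twins')
print(len(operations))  # 4
for operation in operations:
    print(operation)
```

Several minimal edit sequences may exist for the same pair of strings. This sketch prefers the diagonal step when it is available, so it returns only one of them.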
This library has many interesting functions; you can read more about them in the documentation.
How can it help?
The Levenshtein distance has many applications in NLP. First of all, it's great for the correction of spelling mistakes.
Levenshtein distance helps find the closest word from the given vocabulary, so it can be used both in text editors and in search engines to correct users' errors;
It is used in optical character recognition (OCR). Sometimes a handwritten word is recognized incorrectly, and the Levenshtein distance can suggest possible corrections;
It is also helpful in machine translation, for example, when a system's output is compared with a reference translation by counting the edits that separate them;
Some errors can also occur during the conversion of audio into text. The Levenshtein distance can be used in post-editing.
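The spelling-correction use case from the list above can be sketched in a few lines: compute the distance from a misspelled word to every vocabulary entry and keep the closest one. This is a minimal illustration only. The helper names levenshtein and closest_word and the toy vocabulary are assumptions made for this example, and a real spell checker would use a much larger vocabulary and a faster search.

```python
def levenshtein(source, target):
    """Return the Levenshtein distance between two strings."""
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current_row = [i]
        for j, target_char in enumerate(target, start=1):
            cost = 0 if source_char == target_char else 1
            current_row.append(min(previous_row[j] + 1,          # deletion
                                   current_row[j - 1] + 1,       # insertion
                                   previous_row[j - 1] + cost))  # substitution
        previous_row = current_row
    return previous_row[-1]


def closest_word(word, vocabulary):
    """Return the vocabulary entry with the smallest distance to word."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))


# a hypothetical toy vocabulary, used only for this illustration
vocabulary = ['tomato', 'potato', 'fruit', 'winner']
print(closest_word('tomatto', vocabulary))  # tomato
print(closest_word('friut', vocabulary))    # fruit
```

When two entries are equally close, min() returns the one that appears first in the vocabulary, so a practical corrector would usually add a tie-breaker such as word frequency.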
Conclusion
The Levenshtein distance is a useful way to measure how similar two strings are: the smaller the distance, the closer the strings. So far, we have learned:
How to implement the Levenshtein distance in Python;
How the NLTK and Levenshtein libraries can help with calculating the distance;
How the distance is utilized in NLP.
Now, let's review everything you have learned by doing some practical tasks.