
Levenshtein distance


When working with strings or sequences, it is often crucial to determine their similarity. Whether you're dealing with natural language processing, genetic analysis, or pattern matching, having a metric that quantifies the minimum number of operations required to convert one string to another is essential.

Measuring the similarity between strings plays a fundamental role in many applications and computational problems. To address this need, we turn to the concept of edit distance, also called string distance. You can think of it as a dissimilarity score between two strings: the lower the edit distance, the more similar the strings are.

There are different types of edit distance, each having different types of operations to transform one string into another. In this topic, you'll be learning about the Levenshtein distance.

Levenshtein distance: a measure of string similarity

The Levenshtein distance allows three types of transformation operations: insertion, deletion, and substitution. It is one of the most widely used measures of string similarity. By quantifying the minimum number of these operations required to transform one string into another, it provides a valuable metric for assessing similarity.

Spell checker

In many scenarios, we need to assess the similarity between two strings based on their structural differences. For example, when implementing a spell-checking algorithm, we want to identify misspelled words by comparing them to a dictionary. Similarly, in DNA sequence analysis, understanding the genetic similarities and differences between two sequences is vital. Furthermore, fuzzy string matching, which finds approximate matches for a given pattern, relies on quantifying the similarity between strings.

Operations

The Levenshtein distance function $d_{a,b}(n, m)$ determines the minimal number of operations required to transform string $a$ into string $b$, whose lengths are $n$ and $m$, respectively. The available operations are the following:

  • Insertion of a single symbol;

Insertion of p

  • Deletion of a single symbol;

Deletion of r

  • Substitution of a single symbol.

Substitution of r with c

For example, the word "rat" can be transformed into "arm" with three operations:

Sequence of operations above

As you can see above, we get "arm" from "rat" by sequentially applying the operations of deletion, insertion, and substitution.

Note that the operations are applied from left to right. So first, we consider all the operations that we can apply to the first character, then we take a look at the second one, and continue applying possible operations to the letters one by one.
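As a small illustration, the three operations can be written as Python string slices. The sketch below shows one possible left-to-right sequence that turns "rat" into "arm"; the exact positions are an assumption and may differ from the ones shown in the figure.

```python
word = "rat"
steps = [word]

# Deletion: remove 'r' at index 0 -> "at"
word = word[1:]
steps.append(word)

# Insertion: put 'r' at index 1 -> "art"
word = word[:1] + "r" + word[1:]
steps.append(word)

# Substitution: replace 't' at index 2 with 'm' -> "arm"
word = word[:2] + "m" + word[3:]
steps.append(word)

print(" -> ".join(steps))  # rat -> at -> art -> arm
```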

There are three main approaches for finding the Levenshtein distance between two strings $a$ and $b$: regular recursion, top-down dynamic programming, and bottom-up dynamic programming. Let's first look at the intuition behind them.

Intuition

To start off, let's lay down some fundamental guidelines:

  1. The Levenshtein distance between two identical strings is zero. For example, if $a$ = "cat" and $b$ = "cat", then there is no need to insert, delete, or substitute any characters. Therefore, the Levenshtein distance is $0$.

  2. Operations (insert/delete/substitute) are performed only when a character in $a$ does not match the corresponding character in $b$. For instance, if $a$ = "cat" and $b$ = "cab", then there is a mismatch at the third position, because the character 't' in $a$ is different from the character 'b' in $b$. In such cases, we have three options:

    • Option 1: Delete the character 't' from $a$ (and then later insert the character 'b' into $a$).

    • Option 2: Insert the character 'b' into $a$ (and then later delete the character 't' from $a$).

    • Option 3: Substitute the character 't' in $a$ with the character 'b'.

  3. The objective is to choose the option that requires the fewest operations to transform $a$ into $b$. Let's examine each option individually.

Let's take a moment to transform $a$ = "cat" into $b$ = "cab" using each of the above options. Looking at the image below, we can observe that Option 1 and Option 2 both require $2$ operations to transform $a$ into $b$, whereas Option 3 requires only $1$ operation.

Hence, in this case, we select Option 3, which yields a Levenshtein distance of 1.

Options for Levenshtein distance at each step

From the figure above, which shows the 3 possible options for transforming string $a$ into string $b$, we can conclude that:

Levenshtein distance to transform word "cat" to word "cab" = 
min(number of operations required after deleting 't' from word "cat", 
    number of operations required after inserting 'b' into word "cat",
    number of operations required after substituting 't' in word "cat" with 'b') + 1

Note that we add $1$ to account for the current operation (deletion/insertion/substitution).
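The three options for "cat" and "cab" can also be checked with a short Python sketch that applies each sequence of edits and counts the operations. The helper `apply_edits` and the tuple encoding of an edit are illustrative assumptions introduced here, not part of any library.

```python
def apply_edits(word, edits):
    """Apply (operation, index, char) edits to word, one after another."""
    for operation, index, char in edits:
        if operation == "delete":
            word = word[:index] + word[index + 1:]
        elif operation == "insert":
            word = word[:index] + char + word[index:]
        elif operation == "substitute":
            word = word[:index] + char + word[index + 1:]
        else:
            raise ValueError(f"unknown operation: {operation}")
    return word


a, b = "cat", "cab"

# Hypothetical encoding of the three options described in the text.
options = {
    "Option 1": [("delete", 2, None), ("insert", 2, "b")],
    "Option 2": [("insert", 3, "b"), ("delete", 2, None)],
    "Option 3": [("substitute", 2, "b")],
}

# Each option must really produce b; its cost is the number of edits used.
costs = {}
for name, edits in options.items():
    assert apply_edits(a, edits) == b
    costs[name] = len(edits)

best = min(costs, key=costs.get)
print(costs)  # {'Option 1': 2, 'Option 2': 2, 'Option 3': 1}
print(best)   # Option 3
```

The cheapest option here is the substitution, which matches the distance of 1 found above.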

Whenever there is a mismatch between two characters, we need to evaluate all possible operations and select the most favorable option. And we need to do this for all the character mismatches. This implies that the above relation is recursive in nature.

Recursion is a natural approach to problems that involve exploring all possible solutions to find the optimal one. With recursion, we can break the problem down into smaller subproblems until we reach a point where the solution is obvious (i.e., the base case).

Next, let's delve into the recursive approach for determining the Levenshtein distance. To begin with, let's first derive the recurrence relation for each possible operation in the Levenshtein distance calculation.

Recursive approach

Let's assume that we are given two input strings, $a$ and $b$, where $i, j$ are the character indices of these words. We can represent $a$ and $b$ as $a = a_1 \dots a_i \dots a_n$ and $b = b_1 \dots b_j \dots b_m$. Our goal is to find the Levenshtein distance required to transform $a$ into $b$. Remember that the Levenshtein distance is the minimum number of operations (deletion/insertion/substitution) needed to make the strings equal.

To solve this problem, we can use the recursive approach. We'll define a recursive function called minDistance(), which calculates the Levenshtein distance for prefixes of $a$ and $b$. The function takes two indices, $i$ and $j$, to track the current characters being compared.

  1. Character comparison: For each comparison of characters at indices $i$ and $j$, we have two possibilities:

    • If the characters match ($a_i = b_j$), we move to the next index without performing any operation.

    • If the characters don't match ($a_i \neq b_j$), we need to consider the three operations: deletion, insertion, and substitution.

  2. Recurrence relations: The recurrence relation for each possible operation is as follows:

    • Substitution: minDistance(a,b,i−1,j−1) + 1

    • Insertion: minDistance(a,b,i,j−1) + 1

    • Deletion: minDistance(a,b,i−1,j) + 1

    Let's derive it using an example with actual strings. Say, $a$ = "bat" and $b$ = "bed".

    Recurrence relation for each possible operation

  3. Overall recurrence relation: The minimum Levenshtein distance will be the minimum of the above three operations. We can write this in the form of pseudocode as:

    if a[i] == b[j]:
        return minDistance(a, b, i - 1, j - 1)
    else:
        return min(
                  minDistance(a, b, i - 1, j - 1),  // substitution
                  minDistance(a, b, i, j - 1),      // insertion
                  minDistance(a, b, i - 1, j)       // deletion
               ) + 1

  4. Base cases: We define the base cases for the recursive function:

    • If $a$ is empty, the edit distance is the number of characters in $b$.

    • If $b$ is empty, the edit distance is the number of characters in $a$.
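The recurrence relations and base cases above can be collected into one compact definition. As a sketch, let $d_{a,b}(i, j)$ denote the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$ (the same value that minDistance(a, b, i, j) returns):

$$
d_{a,b}(i, j) =
\begin{cases}
j & \text{if } i = 0, \\
i & \text{if } j = 0, \\
d_{a,b}(i-1, j-1) & \text{if } a_i = b_j, \\
1 + \min\bigl(d_{a,b}(i-1, j),\ d_{a,b}(i, j-1),\ d_{a,b}(i-1, j-1)\bigr) & \text{otherwise.}
\end{cases}
$$

The Levenshtein distance between the full strings is then $d_{a,b}(n, m)$, where $n$ and $m$ are the lengths of $a$ and $b$.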

By recursively applying these principles and considering all possible combinations, we can find the Levenshtein distance between $a$ and $b$, as shown in the following pseudocode:

// Get the lengths of strings `a` and `b`
n = length(a)
m = length(b)

// Calling recursive method to calculate Levenshtein distance between strings `a` and `b`
levenshteinDistance = minDistance(a, b, n, m)
print(levenshteinDistance)

function minDistance(a, b, n, m):    
    // Base case: if `a` is empty, return the length of `b`
    if n == 0:
        return m
    
    // Base case: if `b` is empty, return the length of `a`
    if m == 0:
        return n
    
    // If the characters at the current indices are equal, no operation is needed
    if a[n - 1] == b[m - 1]:
        // Recursively call minDistance() method with updated indices
        return minDistance(a, b, n - 1, m - 1)
    else:
        // Try all three operations and choose their minimum
        // Insertion: decrement m
        insertOperation = minDistance(a, b, n, m - 1)
        
        // Deletion: decrement n
        deleteOperation = minDistance(a, b, n - 1, m)
        
        // Substitution: decrement both n and m
        replaceOperation = minDistance(a, b, n - 1, m - 1)
        
        // Return the minimum of the three operations plus 1
        return min(insertOperation, min(deleteOperation, replaceOperation)) + 1
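For readers who want to run the algorithm, here is a minimal Python sketch of the pseudocode above. The snake_case names and the `levenshtein` wrapper are choices made for this example; the logic follows the pseudocode step by step.

```python
def min_distance(a: str, b: str, n: int, m: int) -> int:
    """Distance between the first n characters of a and the first m of b."""
    # Base case: the prefix of a is empty, so insert the m characters of b
    if n == 0:
        return m
    # Base case: the prefix of b is empty, so delete the n characters of a
    if m == 0:
        return n

    # Last characters match: no operation needed, shrink both prefixes
    if a[n - 1] == b[m - 1]:
        return min_distance(a, b, n - 1, m - 1)

    # Last characters differ: try all three operations and keep the cheapest
    insert_cost = min_distance(a, b, n, m - 1)
    delete_cost = min_distance(a, b, n - 1, m)
    replace_cost = min_distance(a, b, n - 1, m - 1)
    return 1 + min(insert_cost, delete_cost, replace_cost)


def levenshtein(a: str, b: str) -> int:
    """Convenience wrapper (assumed name) that starts from the full strings."""
    return min_distance(a, b, len(a), len(b))


print(levenshtein("bat", "bed"))  # 2
print(levenshtein("cat", "cab"))  # 1
print(levenshtein("rat", "arm"))  # 3
```

This version is only practical for short strings, because it recomputes the same subproblems many times.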

Now, let's take a look at the following recursion tree to get a deeper understanding of this recursive approach to Levenshtein distance. Here, $a$ = "bat" and $b$ = "bed".

Recursive algorithm

While the recursive approach is a straightforward method for calculating the Levenshtein distance and works well, it is pretty exhaustive and has some drawbacks that should be considered. One major drawback is its inefficiency for long strings due to overlapping subproblems.

The recursive approach involves breaking down the problem into smaller subproblems and solving them recursively. However, this can lead to redundant calculations, as the same subproblems are often encountered multiple times. As the length of the strings increases, the number of recursive calls and redundant computations grows exponentially, resulting in a significant increase in computation time.
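To make the overlap concrete, the Python sketch below counts how often each subproblem $(n, m)$ is visited by the plain recursion. The wrapper name `count_calls` and the use of `collections.Counter` are choices made for this illustration; the strings "abc" and "xyz" are picked because they share no characters, which is the worst case.

```python
from collections import Counter


def count_calls(a: str, b: str):
    """Run the plain recursion and record how often each (n, m) is visited."""
    calls = Counter()

    def min_distance(n: int, m: int) -> int:
        calls[(n, m)] += 1
        if n == 0:
            return m
        if m == 0:
            return n
        if a[n - 1] == b[m - 1]:
            return min_distance(n - 1, m - 1)
        return 1 + min(
            min_distance(n, m - 1),      # insertion
            min_distance(n - 1, m),      # deletion
            min_distance(n - 1, m - 1),  # substitution
        )

    distance = min_distance(len(a), len(b))
    return distance, calls


distance, calls = count_calls("abc", "xyz")
total_calls = sum(calls.values())
distinct_subproblems = len(calls)

print(distance)              # 3
print(total_calls)           # 94
print(distinct_subproblems)  # 16
```

Even for two three-letter strings, the recursion makes 94 calls to solve only 16 distinct subproblems, so most of the work is repeated.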

The inefficiencies caused by redundant calculations can be removed using dynamic programming. We'll look at optimized implementations of the Levenshtein distance using the top-down and bottom-up dynamic programming approaches in upcoming topics.

Complexity analysis

Let $n$ be the length of string $a$ and $m$ be the length of string $b$. We define $M$ as the minimum of $n$ and $m$, i.e., $M = \min(n, m)$.

  • Time complexity: The algorithm has exponential time complexity. When the characters at the current indices do not match, we recursively explore three possibilities. In the worst-case scenario, where none of the characters match, every call branches three ways until one of the strings becomes empty, which takes at least $M$ steps, so the algorithm makes at least $3^M$ calls. Since each call shortens at least one of the strings, no branch is longer than $n + m$ steps, which gives an upper bound of $O(3^{n + m})$.

  • Space complexity: The space complexity is determined by the recursion depth, which is equivalent to the depth of the recursion tree. The recursive process terminates when either string $a$ or string $b$ becomes empty, and each call shortens at least one of them by one character, so the depth of the recursion tree is at most $n + m$. Consequently, the overall space complexity is $O(n + m)$.

The space complexity does not consider the space used by the input strings and the space used for function call parameters, as they are considered part of the input and output.

Conclusion

The Levenshtein distance provides a valuable measure of similarity between strings or sequences. By quantifying the minimum number of editing operations required to transform one string into another, it serves as a fundamental tool in various computational problems. Here are some main points we have covered in this topic:

  • Levenshtein distance as a measure of editing operations and its significance in spell checking, DNA sequence analysis, and fuzzy string matching.

  • The three main approaches for calculating the Levenshtein distance: the regular recursive approach, the top-down dynamic programming approach, and the bottom-up dynamic programming approach.

  • Derivation and analysis of both the time and space complexity.
