
Levenshtein distance: another approach


Imagine a world where we can effortlessly identify misspelled words, understand genetic similarities, or find approximate matches for patterns within a vast dataset. These are just a few real-life problems where the Levenshtein distance comes to the rescue. Remember that Levenshtein distance employs three fundamental transformation operations: insertion, deletion, and substitution, to determine the minimum edit distance between two strings.

We have already been introduced to the Levenshtein distance, a widely used measure of string similarity, through the naive recursive approach. While the naive recursive approach provides an intuitive and straightforward way to calculate the Levenshtein distance, it has a major drawback: it is inefficient for long strings because it repeatedly recomputes overlapping subproblems.

To overcome these challenges and optimize the Levenshtein distance calculation, we turn to dynamic programming. In this topic, we focus on the top-down dynamic programming approach, which efficiently eliminates redundant calculations caused by overlapping subproblems. It breaks down the problem into smaller subproblems, solves them recursively, and saves the solution of subproblems for future use.

Intuition

Let's first examine the intuition behind the top-down dynamic programming approach for finding the Levenshtein distance between two input strings: a and b.

Initially, a naive recursive approach might seem like an intuitive, straightforward method for calculating the Levenshtein distance, and it does work for small inputs. However, as the length of the input strings increases, the number of recursive calls and redundant computations grows exponentially. This inefficiency leads to significant computational overhead and memory usage, making the approach impractical for long strings.

Take a look at the following recursion tree of the naive recursive approach for calculating the Levenshtein distance, which we observed in a previous topic. Here, a = "bat" and b = "bed".

Recursion tree showing redundant calculations

The recursion tree highlights repetitive calculations using the same color. There are multiple evaluations of the same substring combinations of strings a and b, leading to unnecessary redundant computations.

However, we can optimize these redundant recursive calls and avoid such inefficiencies. To achieve this optimization, we can implement a technique called memoization. Memoization enables us to store the results of previously computed sub-problems, allowing us to reuse these results and avoid redundant calculations.

By caching, or memoizing, the results of each sub-problem, the algorithm becomes more efficient. Before computing the result for a particular sub-problem, it checks whether the result is already present in the cache. If so, it skips the computation and retrieves the pre-computed result directly from the cache, eliminating the need for repeated evaluations.

This concept of memoization introduces us to a widely used programming paradigm: dynamic programming. In this topic, we utilize the top-down dynamic programming approach, also known as memoization, to efficiently solve the problem of Levenshtein distance. Let's take a closer look at this approach and its implementation.
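Before applying memoization to the Levenshtein distance, here is a minimal sketch of the cache-check pattern itself, using Fibonacci numbers as a hypothetical stand-in problem (the Fibonacci example is not from this topic; it simply shows the "check cache, compute once, store" cycle in isolation):

```python
# Minimal memoization sketch (hypothetical example, not the Levenshtein
# implementation): naive Fibonacci recomputes the same subproblems
# exponentially many times; a small cache removes the repeats.
def fib(n, memo=None):
    if memo is None:
        memo = {}
    if n in memo:        # result already computed? reuse it
        return memo[n]
    if n < 2:            # base cases
        return n
    memo[n] = fib(n - 1, memo) + fib(n - 2, memo)  # compute once, cache
    return memo[n]

print(fib(30))  # 832040
```

With the cache, each value of `n` is computed exactly once, so the call count grows linearly instead of exponentially. The same idea applies to the Levenshtein recursion below, except that the cache is keyed by two indices rather than one.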

Top-down dynamic programming approach

Let's consider two input strings, denoted as a and b, with character indices i and j, respectively. We can represent a as a = a_1...a_i...a_n and b as b = b_1...b_j...b_m. Our goal is to determine the Levenshtein distance required to transform a into b. This distance represents the minimum number of operations (deletion, insertion, or substitution) needed to make the two strings equal.

To accomplish this, we define a recursive function called minDistance(). This function calculates the Levenshtein distance for substrings of a and b, using two indices, i and j, to track the characters currently being compared.

The top-down dynamic programming approach is simply an extension of the naive recursive approach. In fact, it closely resembles the naive recursive approach, with only a few minor adjustments. This approach involves incorporating caching within the recursive function calls. To store the result of each sub-problem, we employ a data structure known as a memoization table or cache. By using this cache, we optimize the calculations by storing and reusing previously computed results.

By incorporating this caching technique, we significantly reduce redundant recursive calls. This makes the top-down dynamic programming approach more efficient and capable of handling larger inputs for calculating the Levenshtein distance between two strings.

Now, let's delve into the implementation steps of the top-down dynamic programming approach.

  1. Caching mechanism:

    • In a recursive call with specific parameters (e.g. minDistance(a, b, i, j)), we store the result of the sub-problem with string a ending at index i and string b ending at index j in the cache.

    • Before computing the result of a sub-problem, we check if it already exists in the cache (e.g. memo[i][j]). If the result is cached, we directly retrieve it, avoiding redundant computations.

    • After computing the result of a sub-problem, we store the result in the cache for future use. So, when encountering the same sub-problem in subsequent recursive calls, we can quickly retrieve its result from the cache (e.g. memo[i][j]).

    Cache array (or memoization table)

    cache array for top-down DP approach

    The cache often takes the form of a multi-dimensional array, and its dimensions are determined by the number of variable parameters passed to the recursive function. For instance, in the case of minDistance(a, b, i, j), we pass a total of 2 variable parameters, namely i and j. Consequently, we utilize a 2-dimensional array as the cache. However, if we had only 1 variable parameter, we would use a 1-dimensional array as the cache.

    This insight usually emerges from solving numerous problems involving memoization. Nevertheless, it's also quite understandable from a logical standpoint: we want to store the results for each combination of i and j, as these indices represent the current characters being compared in strings a and b during the recursive process.

  2. Base cases: We define the base cases for the recursive function:

    • If a is empty, the edit distance equals the number of characters in b.

    • If b is empty, the edit distance equals the number of characters in a.

  3. Character comparison: When comparing characters at indices i and j, two possibilities arise:

    • If the characters match (a_i == b_j), we move to the next index without performing any operation.

    • If the characters don't match (a_i ≠ b_j), we need to consider the three operations: deletion, insertion, and substitution.

  4. Recurrence relations: The recurrence relation for each possible operation is as follows:

    Let's derive it using an example with actual strings. Say, a = "bat" and b = "bed".

    Recurrence relation for each possible operation

    • Substitution: minDistance(a,b,i−1,j−1) + 1

    • Insertion: minDistance(a,b,i,j−1) + 1

    • Deletion: minDistance(a,b,i−1,j) + 1

  5. Overall recurrence relation: The minimum Levenshtein distance is the minimum of the three operations mentioned above. We can write this in the form of pseudocode as:

    if a[i] == b[j]:
        return minDistance(a, b, i - 1, j - 1)
    
    if a[i] != b[j]:
        return min(
                  minDistance(a, b, i - 1, j - 1),
                  minDistance(a, b, i, j - 1),
                  minDistance(a, b, i - 1, j)
               ) + 1
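As a sketch, the recurrence and the base cases from step 2 can be written as plain recursion in Python (this version has no caching yet; the memoized version follows in the pseudocode below, and the indices i and j count prefix lengths, so position i corresponds to character a[i - 1]):

```python
# Plain-recursion sketch of the Levenshtein recurrence (no memoization).
# i and j are the lengths of the prefixes of `a` and `b` being compared.
def min_distance_recursive(a, b, i, j):
    if i == 0:                    # `a` exhausted: insert the rest of `b`
        return j
    if j == 0:                    # `b` exhausted: delete the rest of `a`
        return i
    if a[i - 1] == b[j - 1]:      # characters match: no operation needed
        return min_distance_recursive(a, b, i - 1, j - 1)
    return min(
        min_distance_recursive(a, b, i - 1, j - 1),  # substitution
        min_distance_recursive(a, b, i, j - 1),      # insertion
        min_distance_recursive(a, b, i - 1, j),      # deletion
    ) + 1

print(min_distance_recursive("bat", "bed", 3, 3))  # 2
```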

By recursively applying these principles and considering all possible combinations, we can find the Levenshtein distance between a and b as shown using the following pseudocode:

// Example usage:
a = "bat"
b = "bed"
levenshteinDistance = minDistance(a, b)
print(levenshteinDistance)

function minDistance(a, b):
    n = length(a) // Get the length of string `a`
    m = length(b) // Get the length of string `b`
    
    // Create a cache (memoization table) of size (n + 1) x (m + 1), with all entries initialized to `null`
    memo = new Integer[n + 1][m + 1]

    // Calling recursive method to calculate Levenshtein distance between strings `a` and `b`
    return calculateDistance(a, b, n, m, memo)

function calculateDistance(a, b, n, m, memo):    
    // Check if the result is already memoized
    if memo[n][m] is not null:
        return memo[n][m]

    // Base case: if `a` is empty, return the length of `b`
    if n == 0:
        return m
    
    // Base case: if `b` is empty, return the length of `a`
    if m == 0:
        return n
    
    // If the characters at the current indices are equal, no operation is needed
    if a[n - 1] == b[m - 1]:
        // Recursively call calculateDistance() method with updated indices
        memo[n][m] = calculateDistance(a, b, n - 1, m - 1, memo)
        return memo[n][m]
    else:
        // Try all three operations and choose their minimum
        // Insertion: decrement m
        insertOperation = calculateDistance(a, b, n, m - 1, memo)
        
        // Deletion: decrement n
        deleteOperation = calculateDistance(a, b, n - 1, m, memo)
        
        // Substitution: decrement both n and m
        replaceOperation = calculateDistance(a, b, n - 1, m - 1, memo)
        
        // Return the minimum of the three operations plus 1
        memo[n][m] = min(insertOperation, min(deleteOperation, replaceOperation)) + 1
        return memo[n][m]
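The pseudocode above translates almost line for line into runnable Python (a sketch; the function names mirror the pseudocode, and the cache is the (n + 1) × (m + 1) table of `None` values described in step 1):

```python
# Top-down (memoized) Levenshtein distance, mirroring the pseudocode above.
def min_distance(a, b):
    n, m = len(a), len(b)
    # (n + 1) x (m + 1) cache; `None` marks "not computed yet"
    memo = [[None] * (m + 1) for _ in range(n + 1)]
    return calculate_distance(a, b, n, m, memo)

def calculate_distance(a, b, n, m, memo):
    if memo[n][m] is not None:      # reuse a cached sub-result
        return memo[n][m]
    if n == 0:                      # `a` exhausted: insert the rest of `b`
        return m
    if m == 0:                      # `b` exhausted: delete the rest of `a`
        return n
    if a[n - 1] == b[m - 1]:        # characters match: no operation
        memo[n][m] = calculate_distance(a, b, n - 1, m - 1, memo)
    else:
        insert_op = calculate_distance(a, b, n, m - 1, memo)
        delete_op = calculate_distance(a, b, n - 1, m, memo)
        replace_op = calculate_distance(a, b, n - 1, m - 1, memo)
        memo[n][m] = min(insert_op, delete_op, replace_op) + 1
    return memo[n][m]

print(min_distance("bat", "bed"))  # 2
```

Note that the cache lookup can safely run before the base cases: cells in row 0 and column 0 are never written, so they stay `None` and fall through to the base-case returns.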

Now, let's turn our attention to the recursion tree of the top-down dynamic programming approach for calculating the Levenshtein distance. Here, a = "bat" and b = "bed".

You'll notice that the algorithm efficiently avoids repetitive calculations (highlighted using the same color). Whenever the algorithm encounters a previously computed subproblem for a particular substring combination of strings a and b, it retrieves the result directly from the cache, eliminating redundant computations. Observe that no further recursive calls are made from a previously computed subproblem. This optimization ensures that each computation is performed only once and its result is stored for future reference. The outcome? A highly efficient Levenshtein distance calculation.

Recursion tree for top-down dynamic programming approach

Complexity analysis

Consider n as the length of string a and m as the length of string b.

  • Time complexity: The time complexity of the memoization approach is O(n⋅m) because the result for every combination of indices into strings a and b is computed only once and then stored in the memoization table. This optimization eliminates redundant calculations by reusing previously computed results, significantly reducing the number of recursive calls.

  • Space complexity: The space complexity is O(n⋅m) due to the additional 2D array memo, of size (n + 1) × (m + 1). This memoization table is used to cache the results of previously computed sub-problems. Consequently, the algorithm consumes additional memory to store and retrieve these cached results, but it ensures efficient reuse of the computed values.

The space complexity does not consider the space used by the input strings and the space used for function call parameters, as they are considered part of the input and output.
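The reduction in recursive calls can be made concrete with a small experiment (a sketch, not part of the original text): run the same recursion with and without the memoization table and count the calls. The string pair "sunday"/"saturday" is an assumed example; any pair works.

```python
# Count recursive calls with and without memoization on the same input.
def naive_calls(a, b, n, m, counter):
    counter[0] += 1                      # one more recursive call
    if n == 0:
        return m
    if m == 0:
        return n
    if a[n - 1] == b[m - 1]:
        return naive_calls(a, b, n - 1, m - 1, counter)
    return min(naive_calls(a, b, n - 1, m - 1, counter),   # substitution
               naive_calls(a, b, n, m - 1, counter),       # insertion
               naive_calls(a, b, n - 1, m, counter)) + 1   # deletion

def memo_calls(a, b, n, m, memo, counter):
    counter[0] += 1                      # one more recursive call
    if memo[n][m] is not None:           # cached: no further recursion
        return memo[n][m]
    if n == 0:
        return m
    if m == 0:
        return n
    if a[n - 1] == b[m - 1]:
        memo[n][m] = memo_calls(a, b, n - 1, m - 1, memo, counter)
    else:
        memo[n][m] = min(memo_calls(a, b, n - 1, m - 1, memo, counter),
                         memo_calls(a, b, n, m - 1, memo, counter),
                         memo_calls(a, b, n - 1, m, memo, counter)) + 1
    return memo[n][m]

a, b = "sunday", "saturday"
naive_counter, memo_counter = [0], [0]
d1 = naive_calls(a, b, len(a), len(b), naive_counter)
memo = [[None] * (len(b) + 1) for _ in range(len(a) + 1)]
d2 = memo_calls(a, b, len(a), len(b), memo, memo_counter)
print(d1, d2)                             # both versions agree on the distance
print(naive_counter[0], memo_counter[0])  # memoized version makes far fewer calls
```

The memoized call count stays within a small constant factor of (n + 1)·(m + 1), since each cell triggers at most three child calls, while the naive count grows exponentially with the string lengths.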

Conclusion

Let's recap the key points we've explored throughout this topic:

  • Levenshtein distance quantifies string similarity by measuring the minimum editing operations needed to transform one string into another.
  • The naive recursive approach for calculating the Levenshtein distance is a very intuitive and straightforward method, but it is inefficient for long strings due to overlapping subproblems.
  • Using the top-down dynamic programming approach, also known as memoization, we can optimize the Levenshtein distance calculation by avoiding redundant calculations.
  • Using the top-down dynamic programming approach, the time and space complexity can be reduced to O(n⋅m).