
Wagner-Fischer algorithm


In various real-world situations, gauging the similarity between strings is crucial. Consider identifying misspelled words, discerning genetic similarities, or matching patterns in datasets. These are just a few real-life problems where the Levenshtein distance steps in as a reliable solution. Remember that the Levenshtein distance uses three fundamental transformation operations (insertion, deletion, and substitution) to determine the minimum edit distance between two strings.

In the previous topics, we computed the Levenshtein distance using the naive recursive approach. While the naive recursive approach is intuitive and straightforward, it proves inefficient for longer strings due to overlapping subproblems. To eliminate the redundant calculations caused by overlapping subproblems, we turned to dynamic programming.

Transitioning to dynamic programming, we explored the Levenshtein distance using the top-down approach, which employed memoization (caching) to eliminate redundancy. With this foundation, let's explore the Levenshtein distance using the bottom-up dynamic programming approach, also known as the Wagner-Fischer algorithm. It is an even more efficient solution due to its iterative nature.

Intuition

The Levenshtein distance between two input strings a and b is the minimum number of edits needed to transform a into b. The bottom-up dynamic programming approach for Levenshtein distance works by iteratively building up the solution to the problem from the smallest subproblems to the largest.

The smallest subproblems are the cases where one or both of the input strings are empty. The solutions to these subproblems are known since the Levenshtein distance between an empty string and any other string is simply the length of the other string.

The larger subproblems are built by extending each string one character at a time, and the results of smaller subproblems are used to compute the results of larger ones. For this purpose, we must store the result of every sub-problem in an array: since we need the history of previous results (i.e., smaller subproblems) to compute the current result (i.e., a larger subproblem), we use an array as the dynamic programming table.

Here, we use a 2D array for this, say dp. But why a 2D array, you might ask? It's because the 2D array is indexed by the lengths of the two input strings a and b. The first dimension of the array represents the length of the first input string a, and the second dimension represents the length of the second input string b. The value at a particular row and column index represents the Levenshtein distance between the substrings of the two input strings that are represented by the row and column indices.

For example, say a = "bat" and b = "bed". The 2D array dp will look like the following:

2d dp array

Once we fill this 2D array dp, the Levenshtein distance between the two input strings a and b can be found by simply looking at the value of dp[len(a)][len(b)]. Here, len(a) is the length of string a and len(b) is the length of string b.

Please note that we'll use 0-based indexing for arrays in this topic unless specified otherwise. The numbers next to words outside the 2D array indicate the row and column indices.
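To make the table concrete, here is a short Python sketch (the language choice is just for illustration; variable names mirror the text) that allocates the dp table for a = "bat" and b = "bed" and fills in the base cases:

```python
a, b = "bat", "bed"
n, m = len(a), len(b)

# Allocate an (n + 1) x (m + 1) table filled with zeros.
dp = [[0] * (m + 1) for _ in range(n + 1)]

# Base cases: transforming a prefix to or from the empty string
# costs as many edits as the prefix is long.
for i in range(n + 1):
    dp[i][0] = i  # delete all i characters of a
for j in range(m + 1):
    dp[0][j] = j  # insert all j characters of b

for row in dp:
    print(row)
```

The first row reads [0, 1, 2, 3] and the first column reads 0, 1, 2, 3 from top to bottom, matching the figure above; the remaining cells are filled in the next section.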

Algorithm steps

Now let's take a look at the algorithm steps for the bottom-up dynamic programming approach for calculating the Levenshtein distance between two input strings, a and b. As a running example, let a = "bat" and b = "bed".

  1. Initialize variables: Let n denote the length of string a (i.e., len(a)) and m denote the length of string b (i.e., len(b)).

  2. Create a 2D array: Build a 2D array named dp with dimensions (n + 1) × (m + 1). This array will store the results of the Levenshtein distance for each sub-problem.

  3. Calculate base cases: Begin by computing the Levenshtein distance for the smallest sub-problems and populate the dp array accordingly.

    • If string a is empty (i.e., n = 0), the Levenshtein distance is equal to the length of string b (i.e., m).
    • If string b is empty (i.e., m = 0), the Levenshtein distance is equal to the length of string a (i.e., n).

    These initial computations establish the values for the first row and first column of the dp array, which form the basis for the subsequent computations, as shown in the figure below.

    dp array with base cases

  4. Iterative calculation: Use a nested loop to iterate over each combination of characters in string a and string b. The Levenshtein distance is calculated as follows:

    • Maintain two pointers, i and j, for strings a and b, respectively. Let i denote the current index of string a, and j denote the current index of string b. It's essential to note that in this step, both i and j start at index 1 of the dp array.
    • The dp[i][j] entry in the array stores the Levenshtein distance between the first i characters of string a and the first j characters of string b.
    • Handle matching characters: If the characters at the current indices i and j in string a and string b are the same, then the Levenshtein distance will be the same as the result of the sub-problem with string a ending at index (i - 1) and string b ending at index (j - 1).

      case with matching characters

    • Handle different characters: If the characters at the current indices i and j in string a and string b are different, then the Levenshtein distance is one plus the minimum of the distances calculated for the following three operations:

      case with different characters

      i. Inserting a character at index i in string a: The result can be obtained from dp[i][j - 1].

      case when inserting a character

      ii. Deleting the character at index i in string a: The result can be obtained from dp[i - 1][j].

      case when deleting a character

      iii. Substituting a character at index i in string a: The result can be obtained from dp[i - 1][j - 1].

      case when substituting a character

      dp array with 3 available choices together

In this approach, the dp array stores the results of each subproblem, gradually building up solutions to larger subproblems and eventually yielding the overall Levenshtein distance in its lower-right cell. The final state of the dp array for the above example input strings a = "bat" and b = "bed" is shown below:

final filled 2D dp array

Pseudocode

Let's now delve into the pseudocode representation of the bottom-up dynamic programming approach for calculating the Levenshtein distance between two strings a and b.

By systematically filling the dp array, this approach iteratively computes the Levenshtein distance for various sub-problems. The resulting dp[n][m] value represents the Levenshtein distance between the given strings a and b.

// Example usage:
a = "bat"
b = "bed"
levenshteinDistance = minDistance(a, b)
print(levenshteinDistance)

function minDistance(a, b):
    n = length(a) // Length of string `a`
    m = length(b) // Length of string `b`
    
    // Create a 2D array `dp` to store subproblem results
    dp = array[n + 1][m + 1]

    // Initialize base cases
    for i from 0 to n:
        dp[i][0] = i
    for j from 0 to m:
        dp[0][j] = j
    
    // Compute the Levenshtein distance using bottom-up DP approach (iterative)
    for i from 1 to n:
        for j from 1 to m:            
            if a[i - 1] == b[j - 1]:
                // When the characters match
                dp[i][j] = dp[i - 1][j - 1]

            else:
                // When the characters don't match,
                // choose the minimum of three operations:
                // 1. Insertion: dp[i][j - 1]
                // 2. Deletion: dp[i - 1][j]
                // 3. Substitution: dp[i - 1][j - 1]
                dp[i][j] = min(
                    dp[i][j - 1],    // Insertion
                    dp[i - 1][j],    // Deletion
                    dp[i - 1][j - 1] // Substitution
                ) + 1

    // Return the Levenshtein distance between the strings `a` and `b`
    return dp[n][m]
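The pseudocode above translates almost line for line into Python; here is a runnable sketch of the same bottom-up approach (the function name simply mirrors the pseudocode):

```python
def min_distance(a: str, b: str) -> int:
    """Levenshtein distance via bottom-up DP (Wagner-Fischer)."""
    n, m = len(a), len(b)

    # dp[i][j] = distance between the first i chars of a
    # and the first j chars of b.
    dp = [[0] * (m + 1) for _ in range(n + 1)]

    # Base cases: first column and first row.
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                # Matching characters: no edit needed.
                dp[i][j] = dp[i - 1][j - 1]
            else:
                # Minimum of insertion, deletion, substitution, plus one.
                dp[i][j] = 1 + min(dp[i][j - 1],      # insertion
                                   dp[i - 1][j],      # deletion
                                   dp[i - 1][j - 1])  # substitution
    return dp[n][m]

print(min_distance("bat", "bed"))  # 2: substitute 'a'->'e' and 't'->'d'
```

Note that the loop indices run over the dp table (1 to n, 1 to m), so the characters compared are a[i - 1] and b[j - 1] in the 0-indexed strings.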

Complexity analysis

Let n be the length of string a (the source string), and m be the length of string b (the destination string).

  • Time Complexity: The core of the algorithm consists of two nested loops. The outer loop iterates n times, representing the rows of the dynamic programming table, and the inner loop iterates m times, representing its columns. Within each iteration of the inner loop, only constant-time operations are performed. Hence, the overall time complexity is O(n · m), as each cell of the dynamic programming table (i.e., the 2D array dp) is computed in constant time.
  • Space Complexity: The space complexity is determined by the need to create a 2D array, the dp array, for storing the intermediate results of subproblems. The dimensions of this array are (n + 1) rows and (m + 1) columns, corresponding to the lengths of the two input strings a and b. Therefore, the space required for the dp array is O(n · m).

The bottom-up dynamic programming approach's efficiency comes from its ability to eliminate redundancy by calculating and reusing values in a tabulated manner. This optimization allows it to efficiently handle larger input strings by effectively managing the computation and storage of subproblem solutions.

Space optimization

The bottom-up dynamic programming approach, also known as tabulation, offers an efficient solution for calculating the Levenshtein distance. However, there is an opportunity to optimize the algorithm's space complexity. The original implementation utilizes a 2D array to store subproblem results, consuming O(n · m) space. Through careful analysis, we can significantly reduce this space requirement while preserving the algorithm's efficacy.

The core idea behind the space optimization lies in recognizing that during each iteration of the algorithm, we only need to retain and update the results of the previous row in the dynamic programming table. This insight enables the use of a 1D array to maintain the values of the previous row; if we also ensure that the shorter string indexes the columns (swapping the strings when necessary), each row has only min(n, m) + 1 entries, resulting in a space complexity of O(min(n, m)).

Here's the pseudocode of the space-optimized algorithm:

// Example usage:
a = "bat"
b = "bed"
levenshteinDistance = minDistance(a, b)
print(levenshteinDistance)

// Achieving linear space complexity by using a 1D array (`prevRow`) to store the previous row.
// Space complexity: O(k), where k = min(length(a), length(b))
function minDistance(a, b):
    n = length(a)
    m = length(b)

    // If `a` is shorter than `b`, swap the strings so that the row arrays
    // have length min(n, m) + 1; this is what gives the algorithm its
    // O(min(n, m)) space bound.
    if n < m:
        // Recursively call the function with the strings swapped
        return minDistance(b, a)

    // Initialize the previous row with values [0, 1, 2, ..., m]
    prevRow = array[m + 1]
    for j = 0 to m:
        prevRow[j] = j

    // Loop through each row from 1 to `n` (inclusive)
    for i = 1 to n:
        // Initialize a new row array
        row = array[m + 1]
        row[0] = i // Initialize the first column
        
        // Loop through each column from 1 to `m` (inclusive)
        for j = 1 to m:
            // Calculate the costs of the three possible operations
            insertionCost = 1 + row[j - 1] // Insertion: dp[i][j - 1]
            deletionCost = 1 + prevRow[j] // Deletion: dp[i - 1][j]
            substitutionCost = (0 if a[i - 1] == b[j - 1] else 1) + prevRow[j - 1] // Substitution: dp[i - 1][j - 1]
            
            // Calculate the minimum of the three possible operations
            val = min(insertionCost, deletionCost, substitutionCost)
            
            // Update the current cell with the minimum cost
            row[j] = val
            
        // Update the previous row with the current row
        prevRow = row

    // Return the bottom-right value of prevRow, representing minimum edit distance
    return prevRow[m]
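For completeness, here is a Python sketch of the space-optimized version (again, the language and names are just an illustration of the pseudocode above):

```python
def min_distance(a: str, b: str) -> int:
    """Levenshtein distance using O(min(n, m)) extra space."""
    # Keep the shorter string as `b` so each row has min(n, m) + 1 entries.
    if len(a) < len(b):
        a, b = b, a
    n, m = len(a), len(b)

    # prev_row[j] holds the values of the previous row of the full table.
    prev_row = list(range(m + 1))

    for i in range(1, n + 1):
        row = [i] + [0] * m  # first column of row i is i
        for j in range(1, m + 1):
            insertion = 1 + row[j - 1]       # dp[i][j - 1]
            deletion = 1 + prev_row[j]       # dp[i - 1][j]
            substitution = prev_row[j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            row[j] = min(insertion, deletion, substitution)
        # The current row becomes the previous row for the next iteration.
        prev_row = row

    return prev_row[m]

print(min_distance("bat", "bed"))  # 2
```

Because only two rows exist at any moment, the memory footprint stays linear in the shorter string, while the result is identical to the full-table version.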

By implementing this space optimization, we attain a memory-efficient solution that upholds the accuracy of Levenshtein distance calculation. The algorithm's space complexity is significantly reduced, rendering it suitable for scenarios where optimal memory utilization is a priority.

Conclusion

In this topic, we've delved into the bottom-up dynamic programming approach for calculating the Levenshtein distance. Let's recap the key points we've explored:

  • Levenshtein distance serves as a vital measure of string similarity, quantifying the minimum number of edit operations required to transform one string into another.
  • The naive recursive approach, while intuitive, becomes inefficient for longer strings due to the exponential growth of redundant calculations arising from overlapping subproblems.
  • Dynamic programming offers an efficient solution by breaking down the problem into smaller subproblems and reusing their solutions.
  • The bottom-up dynamic programming approach starts from simple subproblems and iteratively builds solutions for larger ones, effectively eliminating redundant calculations.
  • By using a 1D array to store only the previous row of the dynamic programming table, the bottom-up approach achieves linear space complexity.
  • Through our exploration, we've seen that the time complexity of the bottom-up approach is O(n · m), where n and m represent the lengths of the input strings; its space complexity is O(n · m) with the full table, or O(min(n, m)) with the rolling-row optimization.

By grasping the concepts of this bottom-up approach, we've gained a crucial tool for efficiently tackling the Levenshtein distance problem. This method equips us to quantify string similarity accurately, benefiting applications ranging from spell-checking to genetic analysis and beyond.
