
Wagner-Fischer algorithm


In various real-world situations, gauging the similarity between strings is crucial. Consider identifying misspelled words, discerning genetic similarities, or matching patterns in datasets. These are just a few real-life problems where the Levenshtein distance steps in as a reliable solution. Remember that the Levenshtein distance uses three fundamental transformation operations (insertion, deletion, and substitution) to determine the minimum edit distance between two strings.

In the previous topics, we computed the Levenshtein distance using the naive recursive approach. While the naive recursive approach is intuitive and straightforward, it proves inefficient for longer strings due to overlapping subproblems. To eliminate the redundant calculations caused by overlapping subproblems, we turned to dynamic programming.

Transitioning to dynamic programming, we explored the Levenshtein distance using the top-down approach, which employed memoization (caching) to eliminate redundancy. With this foundation, let's explore the Levenshtein distance using the bottom-up dynamic programming approach, also known as the Wagner-Fischer algorithm. It is an even more efficient solution due to its iterative nature.

Intuition

The Levenshtein distance between two input strings a and b is the minimum number of edits needed to transform a into b. The bottom-up dynamic programming approach for Levenshtein distance works by iteratively building up the solution to the problem from the smallest subproblems to the largest.

The smallest subproblems are the cases where one or both of the input strings are empty. The solutions to these subproblems are known since the Levenshtein distance between an empty string and any other string is simply the length of the other string.

The larger subproblems are built by extending each string one character at a time, and the results of smaller subproblems are used to compute the results of larger ones. For this purpose, we must store the result of every sub-problem in an array: since we need the history of previous results (i.e., smaller subproblems) to compute the current result (i.e., a larger subproblem), we use an array as the dynamic programming table.

Here, we use a 2D array for this, say dp. But why a 2D array, you might ask? It's because the 2D array is indexed by the lengths of the two input strings a and b. The first dimension of the array represents the length of the first input string a, and the second dimension represents the length of the second input string b. The value at a particular row and column index represents the Levenshtein distance between the substrings of the two input strings that are represented by the row and column indices.

For example, say a = "bat" and b = "bed". The 2D array dp will look like the following:

2d dp array

Once we fill this 2D array dp, the Levenshtein distance between the two input strings a and b can be found by simply looking at the value of dp[len(a)][len(b)]. Here, len(a) is the length of string a and len(b) is the length of string b.

Please note that we'll use 0-based indexing for arrays in this topic unless specified otherwise. The numbers next to words outside the 2D array indicate the row and column indices.
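To make the table concrete, here is a short Python sketch (the language choice is just for illustration; variable names mirror the text) that allocates the dp table for a = "bat" and b = "bed" and fills in the base cases:

```python
a, b = "bat", "bed"
n, m = len(a), len(b)

# Allocate an (n + 1) x (m + 1) table filled with zeros.
dp = [[0] * (m + 1) for _ in range(n + 1)]

# Base cases: transforming a prefix to or from the empty string
# costs as many edits as the prefix is long.
for i in range(n + 1):
    dp[i][0] = i  # delete all i characters of a
for j in range(m + 1):
    dp[0][j] = j  # insert all j characters of b

for row in dp:
    print(row)
```

The first row reads [0, 1, 2, 3] and the first column reads 0, 1, 2, 3 from top to bottom, matching the figure above; the remaining cells are filled in the next section.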

Algorithm steps

Now let's take a look at the algorithm steps for the bottom-up dynamic programming approach for calculating the Levenshtein distance between two input strings, a and b. As a running example, let a = "bat" and b = "bed".

  1. Initialize variables: Let n denote the length of string a (i.e., len(a)) and m denote the length of string b (i.e., len(b)).

  2. Create a 2D array: Build a 2D array named dp with dimensions (n + 1) × (m + 1). This array will store the results of the Levenshtein distance for each sub-problem.

  3. Calculate base cases: Begin by computing the Levenshtein distance for the smallest sub-problems and populate the dp array accordingly.

    • If string a is empty (i.e., n = 0), the Levenshtein distance is equal to the length of string b (i.e., m).
    • If string b is empty (i.e., m = 0), the Levenshtein distance is equal to the length of string a (i.e., n).

    These initial computations establish the values for the first row and first column of the dp array, which form the basis for the subsequent computations, as shown in the figure below.

    dp array with base cases

  4. Iterative calculation: Use a nested loop to iterate over each combination of characters in string a and string b. The Levenshtein distance is calculated as follows:

    • Maintain two pointers, i and j, for strings a and b, respectively. Let i denote the current index of string a, and j denote the current index of string b. It's essential to note that in this step, both i and j start at index 1 of the dp array.
    • The dp[i][j] entry in the array stores the Levenshtein distance between the first i characters of string a and the first j characters of string b.
    • Handle matching characters: If the characters at the current indices i and j in string a and string b are the same, then the Levenshtein distance will be the same as the result of the sub-problem with string a ending at index (i - 1) and string b ending at index (j - 1).

      case with matching characters

    • Handle different characters: If the characters at the current indices i and j in string a and string b are different, then the Levenshtein distance is one plus the minimum of the distances calculated for the following three operations:

      case with different characters

      i. Inserting a character at index i in string a: The result can be obtained from dp[i][j - 1].

      case when inserting a character

      ii. Deleting the character at index i in string a: The result can be obtained from dp[i - 1][j].

      case when deleting a character

      iii. Substituting a character at index i in string a: The result can be obtained from dp[i - 1][j - 1].

      case when substituting a character

      dp array with 3 available choices together

In this approach, the dp array stores the results of each subproblem, gradually building up solutions to larger subproblems and eventually yielding the overall Levenshtein distance in its lower-right cell. The final state of the dp array for the above example input strings a = "bat" and b = "bed" is shown below:

final filled 2D dp array

Pseudocode

Let's now delve into the pseudocode representation of the bottom-up dynamic programming approach for calculating the Levenshtein distance between two strings a and b.

By systematically filling the dp array, this approach iteratively computes the Levenshtein distance for various sub-problems. The resulting dp[n][m] value represents the Levenshtein distance between the given strings a and b.

// Example usage:
a = "bat"
b = "bed"
levenshteinDistance = minDistance(a, b)
print(levenshteinDistance)

function minDistance(a, b):
    n = length(a) // Length of string `a`
    m = length(b) // Length of string `b`
    
    // Create a 2D array `dp` to store subproblem results
    dp = array[n + 1][m + 1]

    // Initialize base cases
    for i from 0 to n:
        dp[i][0] = i
    for j from 0 to m:
        dp[0][j] = j
    
    // Compute the Levenshtein distance using bottom-up DP approach (iterative)
    for i from 1 to n:
        for j from 1 to m:            
            if a[i - 1] == b[j - 1]:
                // When the characters match
                dp[i][j] = dp[i - 1][j - 1]

            else:
                // When the characters don't match,
                // choose the minimum of three operations:
                // 1. Insertion: dp[i][j - 1]
                // 2. Deletion: dp[i - 1][j]
                // 3. Substitution: dp[i - 1][j - 1]
                dp[i][j] = min(
                    dp[i][j - 1],    // Insertion
                    dp[i - 1][j],    // Deletion
                    dp[i - 1][j - 1] // Substitution
                ) + 1

    // Return the Levenshtein distance between the strings `a` and `b`
    return dp[n][m]
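The pseudocode above translates almost line for line into Python; here is a runnable sketch of the same bottom-up approach (the function name simply mirrors the pseudocode):

```python
def min_distance(a: str, b: str) -> int:
    """Levenshtein distance via bottom-up DP (Wagner-Fischer)."""
    n, m = len(a), len(b)

    # dp[i][j] = distance between the first i chars of a
    # and the first j chars of b.
    dp = [[0] * (m + 1) for _ in range(n + 1)]

    # Base cases: first column and first row.
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                # Matching characters: no edit needed.
                dp[i][j] = dp[i - 1][j - 1]
            else:
                # Minimum of insertion, deletion, substitution, plus one.
                dp[i][j] = 1 + min(dp[i][j - 1],      # insertion
                                   dp[i - 1][j],      # deletion
                                   dp[i - 1][j - 1])  # substitution
    return dp[n][m]

print(min_distance("bat", "bed"))  # 2: substitute 'a'->'e' and 't'->'d'
```

Note that the loop indices run over the dp table (1 to n, 1 to m), so the characters compared are a[i - 1] and b[j - 1] in the 0-indexed strings.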

Complexity analysis

Let n be the length of string a (the source string), and m be the length of string b (the destination string).

  • Time Complexity: The core of the algorithm consists of two nested loops. The outer loop iterates n times, representing the rows of the dynamic programming table, and the inner loop iterates m times, representing its columns. Within each iteration of the inner loop, only constant-time operations are performed. Hence, the overall time complexity is O(n · m), as each cell of the dynamic programming table (i.e., the 2D array dp) is computed in constant time.
  • Space Complexity: The space complexity is determined by the need to create a 2D array, the dp array, for storing the intermediate results of subproblems. The dimensions of this array are (n + 1) rows and (m + 1) columns, corresponding to the lengths of the two input strings a and b. Therefore, the space required for the dp array is O(n · m).

The bottom-up dynamic programming approach's efficiency comes from its ability to eliminate redundancy by calculating and reusing values in a tabulated manner. This optimization allows it to efficiently handle larger input strings by effectively managing the computation and storage of subproblem solutions.

Space optimization

The bottom-up dynamic programming approach, also known as tabulation, offers an efficient solution for calculating the Levenshtein distance. However, there is an opportunity to optimize the algorithm's space complexity. The original implementation utilizes a 2D array to store subproblem results, consuming O(n · m) space. Through careful analysis, we can significantly reduce this space requirement while preserving the algorithm's efficacy.

The core idea behind the space optimization lies in recognizing that during each iteration of the algorithm, we only need to retain and update the results of the previous row in the dynamic programming table. This insight enables the use of a 1D array to maintain the values of the previous row; if we also ensure that the shorter string indexes the columns (swapping the strings when necessary), each row has only min(n, m) + 1 entries, resulting in a space complexity of O(min(n, m)).

Here's the pseudocode of the space-optimized algorithm:

// Example usage:
a = "bat"
b = "bed"
levenshteinDistance = minDistance(a, b)
print(levenshteinDistance)

// Achieving linear space complexity by using a 1D array (`prevRow`) to store the previous row.
// Space complexity: O(k), where k = min(length(a), length(b))
function minDistance(a, b):
    n = length(a)
    m = length(b)

    // If `a` is shorter than `b`, swap the strings so that the row arrays
    // have length min(n, m) + 1; this is what gives the algorithm its
    // O(min(n, m)) space bound.
    if n < m:
        // Recursively call the function with the strings swapped
        return minDistance(b, a)

    // Initialize the previous row with values [0, 1, 2, ..., m]
    prevRow = array[m + 1]
    for j = 0 to m:
        prevRow[j] = j

    // Loop through each row from 1 to `n` (inclusive)
    for i = 1 to n:
        // Initialize a new row array
        row = array[m + 1]
        row[0] = i // Initialize the first column
        
        // Loop through each column from 1 to `m` (inclusive)
        for j = 1 to m:
            // Calculate the costs of the three possible operations
            insertionCost = 1 + row[j - 1] // Insertion: dp[i][j - 1]
            deletionCost = 1 + prevRow[j] // Deletion: dp[i - 1][j]
            substitutionCost = (0 if a[i - 1] == b[j - 1] else 1) + prevRow[j - 1] // Substitution: dp[i - 1][j - 1]
            
            // Calculate the minimum of the three possible operations
            val = min(insertionCost, deletionCost, substitutionCost)
            
            // Update the current cell with the minimum cost
            row[j] = val
            
        // Update the previous row with the current row
        prevRow = row

    // Return the bottom-right value of prevRow, representing minimum edit distance
    return prevRow[m]
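For completeness, here is a Python sketch of the space-optimized version (again, the language and names are just an illustration of the pseudocode above):

```python
def min_distance(a: str, b: str) -> int:
    """Levenshtein distance using O(min(n, m)) extra space."""
    # Keep the shorter string as `b` so each row has min(n, m) + 1 entries.
    if len(a) < len(b):
        a, b = b, a
    n, m = len(a), len(b)

    # prev_row[j] holds the values of the previous row of the full table.
    prev_row = list(range(m + 1))

    for i in range(1, n + 1):
        row = [i] + [0] * m  # first column of row i is i
        for j in range(1, m + 1):
            insertion = 1 + row[j - 1]       # dp[i][j - 1]
            deletion = 1 + prev_row[j]       # dp[i - 1][j]
            substitution = prev_row[j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            row[j] = min(insertion, deletion, substitution)
        # The current row becomes the previous row for the next iteration.
        prev_row = row

    return prev_row[m]

print(min_distance("bat", "bed"))  # 2
```

Because only two rows exist at any moment, the memory footprint stays linear in the shorter string, while the result is identical to the full-table version.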

By implementing this space optimization, we attain a memory-efficient solution that upholds the accuracy of Levenshtein distance calculation. The algorithm's space complexity is significantly reduced, rendering it suitable for scenarios where optimal memory utilization is a priority.

Conclusion

In this topic, we've delved into the bottom-up dynamic programming approach for calculating the Levenshtein distance. Let's recap the key points we've explored:

  • Levenshtein distance serves as a vital measure of string similarity, quantifying the minimum number of edit operations required to transform one string into another.
  • The naive recursive approach, while intuitive, becomes inefficient for longer strings due to the exponential growth of redundant calculations arising from overlapping subproblems.
  • Dynamic programming offers an efficient solution by breaking down the problem into smaller subproblems and reusing their solutions.
  • The bottom-up dynamic programming approach starts from simple subproblems and iteratively builds solutions for larger ones, effectively eliminating redundant calculations.
  • By using a 1D array to store only the previous row of the dynamic programming table, the bottom-up approach achieves linear space complexity.
  • Through our exploration, we've seen that the time complexity of the bottom-up approach is O(n · m), where n and m represent the lengths of the input strings; its space complexity is O(n · m) with the full table, or O(min(n, m)) with the rolling-row optimization.

By grasping the concepts of this bottom-up approach, we've gained a crucial tool for efficiently tackling the Levenshtein distance problem. This method equips us to quantify string similarity accurately, benefiting applications ranging from spell-checking to genetic analysis and beyond.
