Computer scienceAlgorithms and Data StructuresAlgorithmsString algorithmsString similarity

Damerau–Levenshtein distance

9 minutes read

When it comes to security, things get very serious. Sometimes mistakes are made for a reason. Imagine there is a fraudster among your employees. To withdraw money from your company's account, she or he creates a fake bank account and a fake "State services" vendor. However, you usually work with "Estate services" vendor! This traitor will ask the company to send checks to the real and the fake vendors. The Damerau–Levenshtein algorithm will detect the transposed and dropped letter and attract the attention of fraud experts.

Despite the complicated name, the Damerau–Levenshtein distance is not harder than those we've already considered. In fact, it is the Levenshtein distance, but with one additional function, which we will talk about later.

The fourth operation

The definition of the Damerau–Levenshtein distance also includes transpositions among its possible operations. Thus, it has:

insertions;
deletions;
substitutions;
transpositions.

Let's recall what a transposition is. Basically, it is an exchange of two elements of an ordered list, with all other elements staying at the same places. If you transpose something, you change the order. In other words, transposition is a permutation of two elements. For example, swapping 2 and 3 in the list [1, 2, 3, 4, 5, 6] is a transposition, after which the list would look like [1, 3, 2, 4, 5, 6].

Swapping in the list

When working with the Damerau–Levenshtein distance, we need to keep transpositions between neighbor characters in mind as the fourth operation.

We can use the concept of the Damerau–Levenshtein distance not only with numbers: for words, the Damerau-Levenshtein distance is the minimum number of operations that we need to transform one word into another. These operations can include inserts, deletions, substitutions of one character, or a transposition of two adjacent characters.

Let's take a look at an example. We have two strings: CA and ABC. Using the Damerau–Levenshtein algorithm:

Damerau–Levenshtein algorithm: first step

The Damerau–Levenshtein algorithm provides transformation within 2 steps, but with the Levenshtein distance, we would have less optimal transformation with 3 steps:

Damerau–Levenshtein algorithm: second step

Formal definition

The Damerau–Levenshtein distance is defined as the recursive function $\displaystyle d_{a,b}(n, m)$ of two strings $a, b$ with the lengths $n, m$ , where $i, j$ are the indices of characters of these words, respectively:

$a=a_1...a_i...a_n \newline b=b_1...b_j...b_m \newline \newline d_{a,b}(i, j) = min \begin{cases} 0 & \quad \text{if } i=j=0\\ d_{a,b}(i-1,j)+1 & \quad \text{if } i>0 \\ d_{a,b}(i,j-1)+1 & \quad \text{if } j>0 \\ d_{a,b}(i-1,j-1) & \quad \text{if } i,j>0 \text{ and } a_i=b_j \\ d_{a,b}(i-1,j-1) +1& \quad \text{if } i,j>0 \text{ and } a_i \neq b_j \\ d_{a,b}(i-2,j-2)+1 & \quad \text{if } i,j>0 \text{ and } a_i = b_{j-1} \text{ and } a_{i-1}=b_j \\ \end{cases}$

For a better understanding, have a look at an illustration of the algorithm:

Illustration of the algorithm

To calculate the Damerau–Levenshtein distance, we need to fill the distance matrix with values. The idea is the following:

Define a distance matrix, where the number of rows and columns is equal to the length of the provided strings.
Add an extra row and column that will hold characters' indices to count.
Calculate the values of the function $d_{a, b}(i,j)$ , which depend on the row $i$ and the column $j$ , using the formula.
Choose the minimum among all possible $d_{a, b}(i,j)$ values.
Fill up the cell at the intersection of the column and the row with the found minimum value.
After filling up the matrix, the score in the lower-right cell will be the answer.

Let's consider how this function would work on a real example: let our words be "Saturday" and "Sunday". You will see that there is no problem with strings of different lengths. This one will have 8 columns and 6 rows.

You can see that the numbering in the distance matrix starts with one, but the formula includes zeros. Keep in mind that we count characters in the words from one, but $i$ and $j$ can be zeros.

The first 2×2 matrix (let's call it a square) to be considered is $\begin{pmatrix} 0 & 1\\ 1 & ?\\ \end{pmatrix}$ :

the distance matrix

Here, i and j are ones, while we consider the first row and the first column, so there are three options:

1. i > 0. This is a case of deletion:

$d_{a, b}(i-1, j) + 1 = d_{saturday, sunday}(1 - 1, 1) + 1 = d_{saturday, sunday}(0, 1) + 1 = 0+1 =1$

2. j > 0. A case of insertion:

$d_{a, b}(i, j-1) + 1 = d_{saturday, sunday}(1, 1-1) + 1 = d_{saturday, sunday}(1, 0) + 1 = [\text{case of } i>0]= (d_{saturday, sunday}(1-1, 0) + 1)+1= (d_{saturday, sunday}(0, 0) + 1)+1=(0+1)+1=2$

3. i, j > 0 and $a_i = b_j$ :

$d_{a, b}(i-1, j-1) = d_{saturday, sunday}(1-1, 1-1) = d_{saturday, sunday}(0, 0)=\color{red}0$

Now, we need to choose the minimum among the values we have calculated. Among $1, 2$ , and $0$ the last one is the smallest.

Here we have faced the matching of the respective symbols, so the distance between them is zero.

matching of the respective symbols

In continuation, we repeat the same process: pick a 2x2 square with three filled cells and calculate the value for the bottom-right cell. For the square $\begin{pmatrix} 1 & 2\\ 0 & ?\\ \end{pmatrix}$ it will be 1:

Firstly, i = 2 and j = 1, because we consider the cell in the second column and in the first row. For these indices, characters in the words are 'a' and 's', respectively.

1. i > 0:

$d_{a, b}(i-1, j) + 1 = d_{saturday, sunday}(2 - 1, 1) + 1 = d_{saturday, sunday}(1, 1) + 1 = 0+1 =\color{red}1$

2. j > 0: $d_{a, b}(i, j-1) + 1 = d_{saturday, sunday}(2, 1-1) + 1 = d_{saturday, sunday}(2, 0) + 1 = [\text{case of } i>0]= (d_{saturday, sunday}(2-1, 0) + 1)+1= 0+1+1=2$

3. i, j > 0 and $a_i \neq b_j$ :

$d_{a, b}(i-1, j-1) + 1 = d_{saturday, sunday}(2-1, 1-1) + 1 = d_{saturday, sunday}(1, 0) + 1 = [\text{case of } i>0]= (d_{saturday, sunday}(0, 0) + 1)+1= 0+1=\color{red}1$

You can notice that here both deletion and mismatching of the symbols give us the same answer. It means that it doesn't matter which operation we use: both of them fit the situation.

filling the distance matrix

filling the distance matrix, part two

With a few more iterations, we will obtain a full matrix. Now let's calculate the last step. We need to consider the square $\begin{pmatrix} 3 & 4\\ 4 & ?\\ \end{pmatrix}$ .

Full matrix

At first, we have i = 8 and j = 6, and for them, current characters are 'y' for both words. For this case, three options are possible:

1. i > 0:

$d_{a, b}(i-1, j) + 1 = d_{saturday, sunday}(8 - 1, 6) + 1 = d_{saturday, sunday}(7, 6) + 1 = 4 + 1 = 5$

2. j > 0:

$d_{a, b}(i, j-1) + 1 = d_{saturday, sunday}(8, 6-1) + 1 = d_{saturday, sunday}(8, 5) + 1 = 4+1=5$

3. i, j > 0 and $a_i=b_j$ :

$d_{a, b}(i-1, j-1) = d_{saturday, sunday}(8-1, 6-1) = d_{saturday, sunday}(7, 5) = \color{red}3$

The answer is in the blue cell: the Damerau-Levenshtein distance between "Saturday" and "Sunday" is 3.

Pseudocode implementation

Now let's implement the function to calculate Damerau-Levenshtein distance.

1. Firstly, we need to check if one of the strings is empty. For instance, if we have " " (an empty string) and "cat", we need 3 steps to convert " " to "cat" by insertion of "c", "a" and "t".

function damerau_levenshtein(a, b):
    if length(a) = 0 then:
        return length(b)
    if length(b) = 0 then:
        return length(a)

2. For the next step, we need to build a distance matrix and add an extra row and column that will hold characters' indices to count:

    for i = 0 to length(a):
        d[i, 0] = i
    for j = 1 to length(b):
        d[0, j] = j

On this step, our matrix looks like this:

On this step, our matrix looks like thi

We have simplified our task: there is no necessity to calculate the first column and row. They always consist of integer numbers starting with zero and ending with the lengths of a row or a column.

3. The current step is to calculate the remaining part of the distance matrix. With the formula for Damerau-Levenshtein distance, we will have the following realization:

    for i = 1 to length(a):
        for j = 1 to length(b):
            if a[i - 1] = b[j - 1] then:
                d[i][j] = d[i - 1][j - 1] // it is a match of respective symbols
            else:
                d[i][j] = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]) + 1 // deletion, insertion or mismatch
                if i > 1 and j > 1 then:
                    if (a[i - 1] = b[j - 2]) and (a[i - 2] = b[j - 1]) then:
                        d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1) // a transposition

You can notice that not all indices correspond to the formula. For instance, in the case of transposition we have: $\text{ if (} i,j>0 \text{) and } (a_i = b_{j-1}) \text{ and } (a_{i-1}=b_j)$ .

However, in the code we have:

if (a[i - 1] = b[j - 2]) and (a[i - 2] and b[j - 1]) then

This is due to the first step we have conducted. As you remember, we have already filled the first column and the first row.

The entire implementation would look as follows:

function damerau_levenshtein(a, b):
    // 1. check for empty strings
    if length(a) = 0 then:
        return length(b)
    if length(b) = 0 then:
        return length(a)

    // 2. create the distance matrix
    for i = 0 to length(a):
        d[i, 0] = i
    for j = 1 to length(b):
        d[0, j] = j

    // 3. calculate distance matrix
    for i = 1 to length(a):
        for j = 1 to length(b):
            if a[i - 1] = b[j - 1] then:
                d[i][j] = d[i - 1][j - 1] // it is a match of respective symbols
            else:
                d[i][j] = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]) + 1 // deletion, insertion or mismatch
                if i > 1 and j > 1 then:
                    if (a[i - 1] = b[j - 2]) and (a[i - 2] = b[j - 1]) then:
                        d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1) // a transposition

    return d[length(a)][length(b)] // this is the lower-right cell

The output of this code will be:

>>> print(damerau_levenshtein("Saturday", "Sunday"))
>>> 3

Time complexity of the algorithm

To reveal the complexity of a given algorithm, we have to consider all the boundary cases. If we take an example where all characters in two strings are different but these strings have the same lengths, we will get a maximum number of steps, depending on the strings' lengths. So, for "cat" and "dog" the distance will be 3.

If we take a look at our pseudocode, especially on the first lines of the main step, we will see:

for i = 1 to length(a):
        for j = 1 to length(b):

Why do we have such construction?

As you remember, we have used an illustration of the matrix. There were as many rows as there were characters in one word, and as many columns as there were characters in another word. Hence, we had to fill in the table with $length(a) \times length(b)$ number of cells. That's why the question of the complexity is not as difficult as it seems: we just need to calculate all the values in the table with $length(a) \times length(b)$ size to get the Damerau-Levenshtein distance. Now, we can say with confidence that the time complexity of this algorithm is $O(m \times n)$ , where $m$ is the length of the first string, and $n$ is the length of the second one.

It is always good practice to implement code by yourself, because it will provide a deeper understanding. However, there are a few libraries (e.g. in Python language you can use NLTK and TextDistance) that allow the usage of metric functions without coding them from scratch.

Conclusion

The Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions, or substitutions of a single character, or transposition of two adjacent characters) required to transform one word into another. It is a bit more sensitive way to calculate the distance between two strings than the Levenshtein algorithm, and it also counts through the distance matrix. The complexity of this algorithm is $O(m \times n)$ , where $m$ is the length of the first string and $n$ is the length of the second one.

The Damerau–Levenshtein distance plays an important role in natural language processing. Also, it is a suitable metric of the variation between two strands of DNA, fraud detection, and even export control

14 learners liked this piece of theory. 2 didn't like it. What about you?

Report a typo