Introduction
Recall that for two strings and , the edit distance is the minimum number of insertions, deletions, or substitutions required to transform into . For now, we learned the algorithm that allows finding only the number of such operations. But sometimes, we want to know the exact sequence of transformations corresponding to the edit distance.
For example, suppose we want to compare a DNA sequence corresponding to some gene with a reference DNA to find mutations that happened in the gene. In this case, it is essential to know the exact sequence of transformations since different mutation scenarios may lead to different outcomes.
Or, assume we are interested in the history of some language and want to understand how one word originated from another. To do this, we can try to compare the words using the edit distance metric. In this case, the exact sequence of transformations is crucial as well.
So, having the exact order of edit operations is essential in some cases. Let's consider an algorithm that allows finding these operations for two arbitrary strings.
Reconstructing edit operations
For two arbitrary strings, we can reconstruct the sequence of edit operations using a table of intermediate answers. Recall that for and the table looks like this:
To reconstruct the edit operations, we can start from the lower right cell of the table (that contains ) and move to the adjacent (upper, left, or diagonal) cell corresponding to the optimal transformation. We consider the string in the first column as the initial string and the string in the first row as the final result. Therefore, a move to the upper cell corresponds to deletion, a move to the left cell corresponds to insertion, and a diagonal movement corresponds to substitution. We need to continue moving through the table until we reach the upper-left cell. Then, considering all the operations in the reverse order, we will get the sequence of transformations corresponding to the edit distance.
An example
Now let's see how we can reconstruct the edit operations for the example above. We begin from the lower right cell of the table:
In this case, we might get the final value either from the upper or from the diagonal cell. So, let's move to the diagonal cell (this movement corresponds to the substitution of "e" by "n"):
Now we can get the current value only from the cell on the diagonal (this movement corresponds to the substitution of "n" by "w"):
Again, we can get the current value only from the cell on the diagonal. Note that in this case, the corresponding symbols are the same, so the cost of this substitution is .
At this step, we can get the current value only from the cell on the left. This movement corresponds to the insertion of "r" after "b":
Finally, we need to perform a substitution (again with zero cost, since the corresponding symbols are the same):
So, to transform the word bone into the word brown, we need to perform a substitution first, then insert "r" after "b" and finally do three substitutions for the symbols "o", "n", and "e". Skipping the substitutions of the equal characters, we can represent the edit operations in the following way:
bone -> [insert "r" after "b"] ->
brone -> [substitute "n" by "w"] ->
browe -> [substitute "e" by "n"] ->
brown
At some steps, there are several optimal ways to choose an edit operation, meaning there may exist several optimal ways to transform one string into another.
Alignment
A convenient way to represent a sequence of edit operations is by using a so-called alignment. The following alignment corresponds to the transformation sequence above:
The first row in this table corresponds to the initial word (bone), and the second row corresponds to the final word (brown). A column with a gap symbol in the upper cell and a letter in the lower corresponds to insertion. The opposite case corresponds to deletion. The columns having no gaps correspond to a substitution.
For the example above, the second column corresponds to insertion, and the last two columns are substitutions. The first and the third columns are substitutions having zero cost.
Since there may exist several ways to transform one string into another, the strings may have several alignments as well. An alignment corresponding to the optimal sequence of edit operations we will further call an optimal alignment.
Complexity analysis
Assuming that for two strings and a table of intermediate answers is calculated, the sequence of edit operations can be reconstructed in , since on each step we process at least one symbol from one of the strings, and the total number of symbols is .
In this case, we need to store the whole table of intermediate answers to reconstruct the operations, so the algorithm requires additional memory. However, it can be reduced to . Since the algorithm is rather complicated to be fully explained here, those who are interested may take a look at a corresponding Wikipedia article.