Computer scienceAlgorithms and Data StructuresAlgorithmsString algorithmsString similarity

Hamming distance

4 minutes read

What is the difference between the words "dog" and "log"? Or "glow" and "blow"?

If you have ever played Hangman, a game where you have to guess a word (the game project is available on this platform), then you know this feeling of almost guessing a word, while only a single letter remains. You try several similar words and finally win or lose, depending on the amount of lives left. This topic, however, is not about a puzzle game, but rather than string similarity. What does this mean?

Sometimes to solve a problem we need to identify whether two strings are similar or not. For example, assume we want to compare two DNA sequences to understand how different two organisms are. Or we want to find a word in a document, both exact and approximate matches. To solve these problems, we need to come up with a metric for identifying the similarity between two strings and implement an algorithm for calculating the metric.

Currently, there exist several metrics, yet in this topic, we are going to cover one of the simplest metrics called the Hamming distance.

Description of the algorithm

The Hamming distance is the number of positions in which two strings of equal length are different. The approach is pretty straightforward: we take two strings and compare them letter by letter. If the letters are not the same, then we increment the hamming distance by one, otherwise the metric remains the same. This results in the minimal number of substitutions required to transform one string into the other.

Example

Let us start with a trivial example by comparing two short strings: "dog" and "log".

First example.

As we can see, "log" has a hamming distance of 1, because only one letter requires a substitution in order to get the original string "dog."

Now we can move to a more interesting example.

Second example.

By looking at this example we can say that these strings are similar, however, the letters are positioned differently. By calculating the hamming distance the order of letters is strict, which is why we get a minimal distance of 6.

The same can be done with sequences of bits.

Third example.

The hamming distance in this example is 3 because we need 3 substitutions to get the initial bit sequence.

Pseudocode

The approach is simple: first, we check, whether the strings are of equal length because hamming distance can only work with strings of equal length. We set our initial distance to zero we compare letter by letter in both strings and when there is a distinction, the distance is increased by 1.

Check, whether the strings are of equal length;
Set the initial distance to zero;
Loop through the indices 0 to the word's length (doesn't matter which one, because of equal length);
Increase (increment) the distance by 1, if the letters with the same indices differ in both strings.

function hammingDistance(string_1, string_2):
    if len(string_1) != len(string_2):
        raise Error("Input strings must have equal length")
    
    distance = 0
    
    for i in [0, len(string_1) - 1]:
        if string_1[i] != string_2[i]:
            distance = distance + 1
    
    return distance

Complexity and other features

In the previous segment, we looked at the pseudocode which featured only one for loop. This means that the algorithm's time complexity is equal to $O(n)$ , where $n$ is the length of strings. The algorithm requires no additional memory, making it very fast and efficient.

For instance, in information theory, hamming distance is used to measure the similarity of two binary codes, and in telecommunications to count the number of different bits in two binary words as an estimate of an error (some other examples may be found in a Wikipedia article). This leads us to the limitation of the algorithm, which is the inability to detect and resolve burst errors. This is a sequence of corrupted bits resulting in multiple-bit errors, usually occurring in serial data transmission.

Conclusion

We took a look at the Hamming distance and here's what we learned from this topic:

Hamming distance is the amount of substitutions in order to turn one string into the other;
Hamming distance has a complexity of $O(n)$ ;
Hamming distance follows 2 important criteria: same length and strict order;
Hamming distance is used for bit correction;
Hamming distance can only correct single-bit errors.

4 learners liked this piece of theory. 0 didn't like it. What about you?

Report a typo