Computer scienceAlgorithms and Data StructuresAlgorithmsString algorithmsSubstring search algorithms

Knuth-Morris-Pratt algorithm

12 minutes read

The Knuth-Morris-Pratt (KMP) algorithm is an efficient approach to the substring searching problem. It is widely used in text editors, search engines, and data processing applications for tasks like finding keywords, searching for specific patterns, or filtering text. In this topic, we will learn this algorithm and see how it works by example. After completing the topic, you will be able to implement the algorithm and use it to efficiently solve various problems connected to substring searching.

Algorithm description

The KMP algorithm is similar to the naive substring searching approach: at each step, it compares a pattern with a substring of a text trying to find a complete match. The difference between these two algorithms is how they process the case of mismatched symbols. In the naive algorithm, we simply shift a pattern by one symbol and start a new iteration. In some cases, this may lead to many unnecessary comparisons. To process this case more efficiently, the KMP algorithm calculates the prefix function for the pattern and then uses it to shift the pattern more optimally.

Let's see in more detail how this step works for $s = BACBAD$ and $t = BACBACBAD$ . First, we calculate the prefix function $p$ for the pattern:

string and pattern

Then, we start searching for the pattern in the text:

First matching

At the first iteration, there is a mismatch in the last pair of symbols: $t[6] \ne s[6]$ . We can see that $s[1,5] = BACBA$ is a substring of the pattern that matches the beginning of the text. The length of this substring is $5$ and from the prefix function of $s$ , we see that the length of its longest border is $p[5] = 2$ . Now, please recall the definition of the border: It is the longest proper prefix that matches the suffix. Hence, $p[5] = 2$ tells us that the first and the last two characters of $s[1,5]$ are the same.

Border of substring

Hence, we may shift the pattern by $5-2 = 3$ symbols and continue the comparison with the $6$ th symbol of the text and the $3$ rd symbol of the pattern knowing that the previous symbols already match:

Second matching

The yellow symbols on the figure above correspond to the current position in the text and in the pattern. So, using the prefix function, we may find an optimal shift for the pattern avoiding comparisons that won't result in matching. This allows us to significantly reduce the total number of symbol comparisons.

For arbitrary pattern $s$ and text $t$ , the KMP algorithm can be formulated as follows:

Calculate the prefix function $p$ for the pattern $s$ .
Set the first substring of the text with length $|s|$ as current.
Compare the pattern with the current substring of the text.
- If all symbols match, save the index of the found occurrence.
- Otherwise, calculate the length $L$ of the matched substring. If $L \neq 0$ , shift the pattern by $L-p[L-1]$ symbols, else shift the pattern by $1$ symbol.
Continue step $3$ until all substrings of the text are processed. Then, return the found occurrences.

An example

Let's apply the Knuth-Morris-Pratt algorithm for a pattern $s = ABCABD$ in a text $t = ABCABCDABCABD$ . First, we need to calculate the prefix function $p$ for the pattern:

Prefix Function

Then, we start comparing the pattern with the substrings of the text:

First matching

At the first iteration, there is a mismatch in the last pair of symbols: $t[6] \ne s[6]$ . Since $p[5] = 2$ , we shift the pattern by $5-2 = 3$ symbols and start a new iteration:

Second matching

Now, there is a mismatch again: $t[7] \ne s[4]$ . Since $p[3] = 0$ , we shift the pattern by $3-0 = 3$ symbols and start a new iteration:

No match

Alas! Not a single match. For such a scenario, we need to shift the pattern by 1 symbol and start a new iteration:

Final match

Hurrah! All symbols of the pattern match with the current substring of the text, so an occurrence is found. Since all symbols of the text are processed, the algorithm is finished.

If you want to see a visualization of the KMP algorithm, you may take a look at this site.

Pseudocode

Here is the pseudocode of our KMP algorithm. In this pseudocode, we didn't show the implementation of compute_prefix_function(). Please refer to this topic on calculating prefix function for the pseudocode of this function.

function KMP(text, pattern):
    // Preprocessing: compute the prefix function
    prefixFunction = compute_prefix_function(pattern)

    // Initialize pointers for text and pattern
    textIndex = 1
    patternIndex = 1

    // Search the pattern in the text
    while (textIndex <= length(text)):
        if (text[textIndex] == pattern[patternIndex]):
            // If the current character of text and pattern matches, move to next character
            textIndex = textIndex + 1
            patternIndex = patternIndex + 1
        
        if (patternIndex == length(pattern) + 1):
            // If entire pattern is matched
            print ("Pattern found at index " + (textIndex - patternIndex + 1))
            // Shift the pattern for next possible match
            if (patternIndex > 1):
                patternIndex = prefixFunction[patternIndex - 1] + 1
            
        else if (textIndex <= length(text) and text[textIndex] != pattern[patternIndex]):
            // Mismatch after some matches
            if (patternIndex != 1):
                // Don't match the characters again which we have already matched
                patternIndex = prefixFunction[patternIndex - 1] + 1
            else:
                // If no match, move to next character in text
                textIndex = textIndex + 1

We encourage you to take a piece of paper and apply the pseudocode to the example above. One crucial thing to understand is that in the illustration we have shifted the pattern by $L-p[L-1]$ , which is equivalent to updating the $patternIndex$ using: $patternIndex = prefixFunction[patternIndex - 1] + 1$ . Please try it using the example and you will understand it in no time.

Complexity analysis

Assume we want to find a pattern $s$ in a text $t$ . First, we calculate the prefix function for $s$ , which requires $O(|s|)$ operations. After that, we start searching $s$ in $t$ performing a symbol-by-symbol comparison of the pattern with substrings of $t$ . Let's consider two scenarios:

If all the corresponding symbols match, we move to the next pair of symbols. The total count of such comparisons can't exceed $|t|$ , as this represents the maximum number of comparisons required when all characters from the text and the pattern are identical.
If we have a mismatch, we use the calculated prefix function to find an optimal shift for the pattern. In each case like that, the number of times we can shift the pattern is limited by the number of symbols that matched before. As a consequence of this optimal shift, the text pointer never goes back and re-examines the characters in the text. Therefore, the total number of such operations is no more than $|t|$ as well.

Thus, the total running time is $O(|s| + |t|)$ .

The algorithm requires $O(|s|)$ additional memory to store the prefix function for $s$ .

Conclusion

The KMP algorithm works in two main steps: preprocessing and searching.

In preprocessing, it calculates the pattern's prefix function to avoid needless comparisons during the search.
The search involves scanning the text, matching characters with the pattern, and moving pointers accordingly. If characters don't match, the pattern pointer is moved based on the prefix function, skipping over matched prefixes, while the text pointer stays put. The process continues until the entire text is scanned.

The benefit of the KMP algorithm is that it never re-examines text characters that have been involved in a comparison, skipping over them as much as possible. This makes it much more efficient than the naive method, especially for long texts and patterns.

7 learners liked this piece of theory. 1 didn't like it. What about you?

Report a typo