The Knuth-Morris-Pratt (KMP) algorithm is an efficient approach to the substring searching problem. It is widely used in text editors, search engines, and data processing applications for tasks like finding keywords, searching for specific patterns, or filtering text. In this topic, we will learn this algorithm and see how it works by example. After completing the topic, you will be able to implement the algorithm and use it to efficiently solve various problems connected to substring searching.
Algorithm description
The KMP algorithm is similar to the naive substring searching approach: at each step, it compares a pattern with a substring of a text trying to find a complete match. The difference between these two algorithms is how they process the case of mismatched symbols. In the naive algorithm, we simply shift a pattern by one symbol and start a new iteration. In some cases, this may lead to many unnecessary comparisons. To process this case more efficiently, the KMP algorithm calculates the prefix function for the pattern and then uses it to shift the pattern more optimally.
Let's see in more detail how this step works for and . First, we calculate the prefix function for the pattern:
Then, we start searching for the pattern in the text:
At the first iteration, there is a mismatch in the last pair of symbols: . We can see that is a substring of the pattern that matches the beginning of the text. The length of this substring is and from the prefix function of , we see that the length of its longest border is . Now, please recall the definition of the border: It is the longest proper prefix that matches the suffix. Hence, tells us that the first and the last two characters of are the same.
Hence, we may shift the pattern by symbols and continue the comparison with the th symbol of the text and the rd symbol of the pattern knowing that the previous symbols already match:
The yellow symbols on the figure above correspond to the current position in the text and in the pattern. So, using the prefix function, we may find an optimal shift for the pattern avoiding comparisons that won't result in matching. This allows us to significantly reduce the total number of symbol comparisons.
For arbitrary pattern and text , the KMP algorithm can be formulated as follows:
Calculate the prefix function for the pattern .
Set the first substring of the text with length as current.
Compare the pattern with the current substring of the text.
If all symbols match, save the index of the found occurrence.
Otherwise, calculate the length of the matched substring. If , shift the pattern by symbols, else shift the pattern by symbol.
Continue step until all substrings of the text are processed. Then, return the found occurrences.
An example
Let's apply the Knuth-Morris-Pratt algorithm for a pattern in a text . First, we need to calculate the prefix function for the pattern:
Then, we start comparing the pattern with the substrings of the text:
At the first iteration, there is a mismatch in the last pair of symbols: . Since , we shift the pattern by symbols and start a new iteration:
Now, there is a mismatch again: . Since , we shift the pattern by symbols and start a new iteration:
Alas! Not a single match. For such a scenario, we need to shift the pattern by 1 symbol and start a new iteration:
Hurrah! All symbols of the pattern match with the current substring of the text, so an occurrence is found. Since all symbols of the text are processed, the algorithm is finished.
If you want to see a visualization of the KMP algorithm, you may take a look at this site.
Pseudocode
Here is the pseudocode of our KMP algorithm. In this pseudocode, we didn't show the implementation of compute_prefix_function(). Please refer to this topic on calculating prefix function for the pseudocode of this function.
function KMP(text, pattern):
// Preprocessing: compute the prefix function
prefixFunction = compute_prefix_function(pattern)
// Initialize pointers for text and pattern
textIndex = 1
patternIndex = 1
// Search the pattern in the text
while (textIndex <= length(text)):
if (text[textIndex] == pattern[patternIndex]):
// If the current character of text and pattern matches, move to next character
textIndex = textIndex + 1
patternIndex = patternIndex + 1
if (patternIndex == length(pattern) + 1):
// If entire pattern is matched
print ("Pattern found at index " + (textIndex - patternIndex + 1))
// Shift the pattern for next possible match
if (patternIndex > 1):
patternIndex = prefixFunction[patternIndex - 1] + 1
else if (textIndex <= length(text) and text[textIndex] != pattern[patternIndex]):
// Mismatch after some matches
if (patternIndex != 1):
// Don't match the characters again which we have already matched
patternIndex = prefixFunction[patternIndex - 1] + 1
else:
// If no match, move to next character in text
textIndex = textIndex + 1
We encourage you to take a piece of paper and apply the pseudocode to the example above. One crucial thing to understand is that in the illustration we have shifted the pattern by , which is equivalent to updating the using: . Please try it using the example and you will understand it in no time.
Complexity analysis
Assume we want to find a pattern in a text . First, we calculate the prefix function for , which requires operations. After that, we start searching in performing a symbol-by-symbol comparison of the pattern with substrings of . Let's consider two scenarios:
If all the corresponding symbols match, we move to the next pair of symbols. The total count of such comparisons can't exceed , as this represents the maximum number of comparisons required when all characters from the text and the pattern are identical.
If we have a mismatch, we use the calculated prefix function to find an optimal shift for the pattern. In each case like that, the number of times we can shift the pattern is limited by the number of symbols that matched before. As a consequence of this optimal shift, the text pointer never goes back and re-examines the characters in the text. Therefore, the total number of such operations is no more than as well.
Thus, the total running time is .
The algorithm requires additional memory to store the prefix function for .
Conclusion
The KMP algorithm works in two main steps: preprocessing and searching.
In preprocessing, it calculates the pattern's prefix function to avoid needless comparisons during the search.
The search involves scanning the text, matching characters with the pattern, and moving pointers accordingly. If characters don't match, the pattern pointer is moved based on the prefix function, skipping over matched prefixes, while the text pointer stays put. The process continues until the entire text is scanned.
The benefit of the KMP algorithm is that it never re-examines text characters that have been involved in a comparison, skipping over them as much as possible. This makes it much more efficient than the naive method, especially for long texts and patterns.