Seek not greatness, but seek truth and you will find both.
Horace Mann
Introduction
Strings are a fundamental data structure in programming, and knowing how to handle them is essential. Substring search, the process of finding a specific pattern within a larger text, is a task that programmers perform very often. Whether you're searching for occurrences of a word in a text editor or seeking a particular phrase on a web page, substring search algorithms can help you efficiently locate and extract the desired substrings.
In this topic, we will explore a variety of substring search algorithms that are invaluable for software developers. Each algorithm possesses unique characteristics and strengths, providing different approaches to solving substring search problems. By understanding these algorithms, you will gain powerful tools for efficiently finding patterns within texts.
Algorithm descriptions and pseudocode
In this section, we will delve into the core principles and implementations of various substring search algorithms. We will explore each algorithm's approach to efficiently searching for patterns within texts, providing you with a comprehensive understanding of their inner workings. By examining the code snippets and explanations, you will gain practical knowledge of how to apply these algorithms in real-life scenarios. Let's begin our exploration of the different substring search algorithms and their implementations. The following notation is used: m is the pattern length, n is the text length, and k is the number of times the pattern occurs in the text.
Naïve substring search. This algorithm is a simple and intuitive but inefficient approach to substring search. It involves sequentially comparing the pattern with all possible substrings of the text. While straightforward to implement, naïve search has a time complexity of O(n·m), which makes it the slowest of the substring search algorithms discussed here.
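To complement the pseudocode in this section, here is a minimal runnable Python sketch of the naïve approach. The function name and the slice-based inner comparison are our choices for brevity, not part of any standard API.

```python
def naive_substring_search(text, pattern):
    """Return the 0-indexed position of the first occurrence, or -1."""
    n, m = len(text), len(pattern)
    # try every possible alignment of the pattern against the text
    for i in range(n - m + 1):
        # compare the pattern with the current window of the text
        if text[i:i + m] == pattern:
            return i
    return -1
```

For example, `naive_substring_search("hello world", "world")` returns 6, and a pattern that never occurs yields -1.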
Pseudocode:
function naive_substring_search(text, pattern):
    n = len(text)
    m = len(pattern)
    for i in [0, n - m]:
        j = 0
        while j < m:
            if (text[i + j] != pattern[j]) then break
            j += 1
        if (j == m) then return i
    return -1

Prefix function search. This algorithm is closely related to the Z-algorithm. It builds an array that stores, for each position of a string, the length of the longest proper prefix of the string that is also a suffix ending at that position. Computing this array for the combined string pattern + '$' + text, where '$' is a separator character that occurs in neither string, lets us read the matches off directly: wherever an array value equals the pattern length, an occurrence of the pattern ends. The prefix function algorithm has a time complexity of O(n + m).
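Here is a runnable Python sketch of the prefix-function approach described above (0-indexed, unlike the 1-indexed pseudocode). The `"\x00"` separator is our assumption; any character guaranteed to be absent from both strings works.

```python
def compute_prefix_function(s):
    # p[i] = length of the longest proper prefix of s[:i+1]
    # that is also a suffix of s[:i+1]
    p = [0] * len(s)
    for i in range(1, len(s)):
        j = p[i - 1]
        # fall back through shorter borders until the next character matches
        while j > 0 and s[i] != s[j]:
            j = p[j - 1]
        if s[i] == s[j]:
            j += 1
        p[i] = j
    return p

def prefix_function_search(text, pattern):
    # "\x00" is an assumed separator absent from both strings
    combined = pattern + "\x00" + text
    p = compute_prefix_function(combined)
    m = len(pattern)
    for i in range(m + 1, len(combined)):
        if p[i] == m:
            return i - 2 * m  # 0-indexed start of the first match
    return -1
```

For instance, `prefix_function_search("xxabcy", "abc")` returns 2.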
Pseudocode:
function compute_prefix_function(string):
    k = len(string)
    p = array[1, k]
    p[1] = 0
    for i in [2, k]:
        j = p[i - 1]
        while (j > 0) and (string[j + 1] != string[i]):
            j = p[j]
        if (string[j + 1] == string[i]) then p[i] = j + 1
        else p[i] = 0
    return p

function prefix_function_search(text, pattern):
    n = len(text)
    m = len(pattern)
    combined_string = pattern + '$' + text
    prefix_array = compute_prefix_function(combined_string)
    for i in [m + 2, n + m + 1]:
        //an occurrence of the pattern ends at position i
        if (prefix_array[i] == m) then return i - 2 * m //1-indexed start of the match in the text
    return -1

Knuth-Morris-Pratt algorithm. The Knuth-Morris-Pratt (KMP) algorithm improves upon the naïve substring search by utilizing the prefix function. On a mismatch, the algorithm shifts the pattern exactly as far as the prefix information allows, so characters that have already been matched are never compared again. Its runtime complexity is O(n + m).
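The KMP idea can be sketched in runnable, 0-indexed Python as follows; the failure table is the same prefix function, and this version reports every occurrence rather than only the first (a design choice of ours).

```python
def kmp_search(text, pattern):
    """Return the 0-indexed start positions of every occurrence."""
    m = len(pattern)
    if m == 0:
        return []
    # failure table: fail[i] = length of the longest proper border of pattern[:i+1]
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # scan the text, falling back through the table on mismatches
    matches = []
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            matches.append(i - m + 1)
            k = fail[k - 1]  # continue searching for overlapping matches
    return matches
```

Note how `kmp_search("abababa", "aba")` finds the overlapping occurrences at 0, 2, and 4 in a single pass.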
Pseudocode:
function KMP(text, pattern):
    //preprocessing: compute the prefix function using the code above
    prefix_function = compute_prefix_function(pattern)
    //the number of pattern characters currently matched
    matched = 0
    //search for the pattern in the text
    for text_index in [1, len(text)]:
        //on a mismatch, fall back through the prefix function instead of
        //comparing the already matched characters again
        while (matched > 0) and (text[text_index] != pattern[matched + 1]):
            matched = prefix_function[matched]
        if (text[text_index] == pattern[matched + 1]) then
            matched += 1
        if (matched == len(pattern)) then
            //the entire pattern is matched
            print("Pattern found at index " + (text_index - matched + 1))
            //shift the pattern for the next possible match
            matched = prefix_function[matched]

Rabin-Karp algorithm utilizes hashing to efficiently search for substrings. It computes hash values for the pattern and for every window of the text of the same length, using a rolling hash and a sliding window technique, and compares characters directly only when the hash values match. For a more detailed description and further explanations, you can read our special topic on this algorithm. Time complexity: O(n + m) on average.
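A runnable Python sketch of Rabin-Karp is below. The base and modulus are our illustrative choices (a large prime modulus keeps spurious hash collisions rare); the direct string comparison on a hash hit guarantees no false positives.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return the 0-indexed start positions of every occurrence."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # weight of the leading character in the rolling hash: base^(m-1) mod mod
    h = pow(base, m - 1, mod)
    # polynomial hashes of the pattern and the first text window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # verify character by character only on a hash hit
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # roll the hash: drop text[i], append text[i + m]
            t_hash = ((t_hash - ord(text[i]) * h) * base + ord(text[i + m])) % mod
    return matches
```

For example, `rabin_karp("abracadabra", "abra")` returns `[0, 7]`.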
Pseudocode:
d = 95 //the number of characters in the input alphabet (we assumed basic latin ASCII printable characters)
q = 37 //some prime number for polynomial hashing

function rabin_karp(text, pattern, q):
    m = len(pattern)
    n = len(text)
    pattern_hash = 0 //hash value for the pattern
    text_hash = 0 //hash value for the current text window
    h = 1
    //calculating the auxiliary term for the rolling hash:
    //the value of h will be "pow(d, m - 1) % q"
    for i in [1, m - 1]:
        h = (h * d) % q
    //polynomial hashing for the pattern and the first window of the text
    for i in [1, m]:
        //ord() is the function mapping characters to their corresponding ASCII codes (integers)
        pattern_hash = (d * pattern_hash + ord(pattern[i])) % q
        text_hash = (d * text_hash + ord(text[i])) % q
    //slide the pattern over the text one position at a time
    for i in [1, n - m + 1]:
        //check the hash values of the current text window and the pattern;
        //only if the hash values match, compare the characters one by one
        if (pattern_hash == text_hash) then
            j = 0
            while (j < m) and (text[i + j] == pattern[j + 1]):
                j += 1
            //pattern[1 .. m] equals text[i .. i + m - 1]
            if (j == m) then print("Pattern found at index " + i)
        //calculate the hash value of the next window of the text:
        //remove the leading character and add the trailing one, using the h-term
        if (i <= n - m) then
            text_hash = (d * (text_hash - ord(text[i]) * h) + ord(text[i + m])) % q
            //text_hash might become negative; convert it to a positive value
            if (text_hash < 0) then text_hash = text_hash + q

Boyer-Moore algorithm is an efficient substring search algorithm that utilizes two main heuristics: the "bad character rule" and the "good suffix rule". These rules make it possible to skip comparisons by analyzing mismatches and taking advantage of previously matched characters. The Boyer-Moore algorithm has a time complexity of O(n + m) in the average case.
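As a compact runnable illustration, here is a simplified Boyer-Moore in Python that applies only the bad-character rule (this simplification is known as Boyer-Moore-Horspool; the good suffix rule from the pseudocode below is omitted for brevity).

```python
def boyer_moore_horspool(text, pattern):
    """Return the 0-indexed start positions of every occurrence."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # shift table: distance from each character's last occurrence
    # (excluding the final position) to the end of the pattern
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    matches = []
    i = 0
    while i <= n - m:
        # compare the current window against the pattern
        if text[i:i + m] == pattern:
            matches.append(i)
        # shift by the table entry of the window's last character,
        # or by the full pattern length if that character never occurs
        i += shift.get(text[i + m - 1], m)
    return matches
```

On `boyer_moore_horspool("abracadabra", "abra")` the window jumps several characters at a time and still finds both occurrences, `[0, 7]`.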
function build_bad_character_table(pattern):
    m = len(pattern)
    bad_character_table = {} //an empty dictionary mapping characters to integers
    //store the last occurrence index of every character of the pattern
    for i in [0, m - 1]:
        bad_character_table[pattern[i]] = i
    return bad_character_table

function build_good_suffix_table(pattern):
    m = len(pattern)
    //suffixes[i] is the length of the longest substring ending at
    //position i that is also a suffix of the whole pattern
    suffixes = [0] * m
    suffixes[m - 1] = m
    f = 0
    g = m - 1
    for i in [m - 2, 0]: //from high to low
        if (i > g) and (suffixes[i + m - 1 - f] < i - g) then
            suffixes[i] = suffixes[i + m - 1 - f]
        else
            if (i < g) then g = i
            f = i
            while (g >= 0) and (pattern[g] == pattern[g + m - 1 - f]):
                g -= 1
            suffixes[i] = f - g
    //by default, shift by the full pattern length
    good_suffix_table = [m] * m
    //case 1: the matched suffix also occurs as a prefix of the pattern
    j = 0
    for i in [m - 1, 0]: //from high to low
        if (suffixes[i] == i + 1) then
            while (j < m - 1 - i):
                if (good_suffix_table[j] == m) then
                    good_suffix_table[j] = m - 1 - i
                j += 1
    //case 2: the matched suffix occurs again inside the pattern
    for i in [0, m - 2]:
        good_suffix_table[m - 1 - suffixes[i]] = m - 1 - i
    return good_suffix_table

function boyer_moore(text, pattern):
    n = len(text)
    m = len(pattern)
    //an empty pattern matches at position 0
    if (m == 0) then return 0
    //preprocessing the pattern
    bad_char_table = build_bad_character_table(pattern)
    good_suffix_table = build_good_suffix_table(pattern)
    shift = 0 //offset of the pattern within the text
    while (shift <= n - m):
        //compare the pattern with the text from right to left
        j = m - 1
        while (j >= 0) and (pattern[j] == text[shift + j]):
            j -= 1
        if (j < 0) then
            //all characters matched
            print("Pattern found at index " + shift) //0-indexed
            shift = shift + good_suffix_table[0]
        else
            //bad character rule: align the mismatched text character
            //with its last occurrence in the pattern
            if text[shift + j] in bad_char_table:
                bad_char_shift = j - bad_char_table[text[shift + j]]
            else:
                //the character does not occur in the pattern: skip past it
                bad_char_shift = j + 1
            //good suffix rule: look up the shift for a mismatch at position j
            good_suffix_shift = good_suffix_table[j]
            //advance by the larger of the two shifts, at least one position
            shift = shift + max(1, bad_char_shift, good_suffix_shift)

Aho-Corasick algorithm is a string-matching algorithm that efficiently searches for multiple patterns simultaneously. It constructs a finite-state machine called the "Aho-Corasick automaton" to perform substring search efficiently.
This algorithm has a runtime complexity of O(n + m + k), where m here is the total length of all the patterns. To find out more, explore our introductory and advanced topics on it.
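A minimal Python sketch of the Aho-Corasick automaton is given below: a trie of the patterns, failure links built with a breadth-first traversal, and a single left-to-right scan of the text. The list-of-dicts representation is our implementation choice.

```python
from collections import deque

def aho_corasick(text, patterns):
    """Return (start_index, pattern) pairs for every match of every pattern."""
    # build the trie: goto[node] maps a character to a child node
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(p)
    # BFS from the root's children to set failure links;
    # each node also inherits the outputs of its failure target
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]
    # scan the text once, following failure links on mismatches
    found = []
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for p in out[node]:
            found.append((i - len(p) + 1, p))
    return found
```

On the classic example, `aho_corasick("ushers", ["he", "she", "his", "hers"])` reports "she" at 1 and both "he" and "hers" at 2, all in one pass over the text.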
When to use each algorithm
By understanding the strengths and applications of each algorithm, you can make informed decisions about which one to use based on your specific requirements. Different algorithms serve different purposes and are more suitable for specific scenarios. Here are some guidelines on when to use each algorithm:
Naïve Substring Search. As you have already noticed, it is generally inefficient for most applications, except, perhaps, educational and teaching purposes. When writing real, hands-on code, please refrain from using it.
Prefix Function is particularly useful when dealing with patterns that contain repeated substrings. It helps efficiently identify and utilize these repetitions, making it a valuable choice in such cases.
Knuth-Morris-Pratt Algorithm. The KMP algorithm is generally quite efficient and is therefore widely used in real-life implementations of search engines.
Rabin-Karp Algorithm. If your application requires searching for several patterns, the Rabin-Karp algorithm is a great choice. It excels in scenarios where you need to find occurrences of multiple patterns in the same text, since the rolling hash of each text window can be compared against the hash of every pattern, providing an efficient solution.
Boyer-Moore Algorithm: When dealing with large texts and complex patterns, the Boyer-Moore algorithm is a reliable option. It employs a powerful strategy that skips unnecessary comparisons, resulting in faster searches and making it suitable for such challenging scenarios.
Aho-Corasick Algorithm. Searching for a large number of patterns within a single pass is where the Aho-Corasick algorithm shines. It efficiently handles multiple pattern searches simultaneously, making it an excellent choice in scenarios where you need to find numerous patterns efficiently.
Comparison of algorithms
Here, we are going to compare the runtime complexities of the aforementioned substring searches, highlighting their average-case and worst-case scenarios. Understanding these complexities will provide insights into the efficiency and performance of each algorithm. Have a look at this concise compilation, which compares all of them at once. For convenience, we use the same notation as above: n is the length of the text string, m is the length of the pattern, and k is the number of pattern occurrences in the text.
| Algorithm | Average case | Worst case | When to use |
| --- | --- | --- | --- |
| Naïve Substring Search | O(n·m) | O(n·m) | ONLY educational purposes |
| Prefix Function | O(n + m) | O(n + m) | Patterns with repeated parts |
| Knuth-Morris-Pratt | O(n + m) | O(n + m) | Generally efficient and good |
| Rabin-Karp | O(n + m) | O(n·m) | Multiple pattern searches |
| Boyer-Moore | O(n + m) | O(n·m) | Huge texts and complex patterns |
| Aho-Corasick | O(n + m + k) | O(n + m + k) | Large number of patterns in a single pass of the text string |
Conclusion
In this topic, we explored various substring search algorithms that are essential for software developers. We started with the Naïve Substring Search algorithm, which provides a simple but inefficient approach. Then, we delved into more advanced algorithms like the Prefix Function Search, Knuth-Morris-Pratt (KMP) algorithm, Rabin-Karp algorithm, Boyer-Moore algorithm, and Aho-Corasick algorithm. Each algorithm offers unique advantages and is suitable for different scenarios.
By mastering these algorithms, you now have a powerful set of tools to efficiently search for substrings in texts. Whether you're working on a text editor, search engine, or any application that involves pattern matching, understanding these algorithms will significantly improve your ability to solve substring search problems effectively.
The key to successfully using these algorithms lies in understanding their underlying principles and knowing when to apply each one. Keep practicing and applying these concepts in real-world scenarios to strengthen your knowledge and problem-solving skills, for example, in the forthcoming task steps. See you there!