Seek not greatness, but seek truth and you will find both.
Horace Mann
Introduction
Strings are a fundamental data structure in programming, and knowing how to handle them is essential. Substring search, the process of finding a specific pattern within a larger text, is a task that programmers perform very often. Whether you're searching for occurrences of a word in a text editor or seeking a particular phrase on a web page, substring search algorithms can help you efficiently locate and extract the desired substrings.
In this topic, we will explore a variety of substring search algorithms that are invaluable for software developers. Each algorithm possesses unique characteristics and strengths, providing different approaches to solving substring search problems. By understanding these algorithms, you will gain powerful tools for efficiently finding patterns within texts.
Algorithm descriptions and pseudocode
In this section, we will delve into the core principles and implementations of various substring search algorithms. We will explore each algorithm's approach to efficiently searching for patterns within texts, providing you with a comprehensive understanding of their inner workings. By examining the code snippets and explanations, you will gain practical knowledge of how to apply these algorithms in real-life scenarios. Let's begin our exploration of the different substring search algorithms and their implementations. The following notation is used: m is the pattern length, n is the text length, and k is the number of times the pattern occurs in the text.
Naïve substring search. This algorithm is a simple and intuitive but inefficient approach to substring search. It involves sequentially comparing the pattern with all possible substrings of the text. While straightforward to implement, naïve search has a time complexity of O(n·m), which makes it the slowest of the substring search algorithms discussed here.
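To complement the pseudocode in this section, here is a minimal runnable Python sketch of the naïve approach. The function name and the slice-based inner comparison are our choices for brevity, not part of any standard API.

```python
def naive_substring_search(text, pattern):
    """Return the 0-indexed position of the first occurrence, or -1."""
    n, m = len(text), len(pattern)
    # try every possible alignment of the pattern against the text
    for i in range(n - m + 1):
        # compare the pattern with the current window of the text
        if text[i:i + m] == pattern:
            return i
    return -1
```

For example, `naive_substring_search("hello world", "world")` returns 6, and a pattern that never occurs yields -1.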
Pseudocode:
function naive_substring_search(text, pattern):
    n = len(text)
    m = len(pattern)
    for i in [0, n - m]:
        j = 0
        while j < m:
            if (text[i + j] != pattern[j]) then break
            j += 1
        if (j == m) then return i
    return -1

Prefix function search. This algorithm is closely related to the Z-algorithm. It builds an array that stores, for each position of a string, the length of the longest proper prefix of the string that is also a suffix ending at that position. Computing this array for the combined string pattern + '$' + text, where '$' is a separator character that occurs in neither string, lets us read the matches off directly: wherever an array value equals the pattern length, an occurrence of the pattern ends. The prefix function algorithm has a time complexity of O(n + m).
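Here is a runnable Python sketch of the prefix-function approach described above (0-indexed, unlike the 1-indexed pseudocode). The `"\x00"` separator is our assumption; any character guaranteed to be absent from both strings works.

```python
def compute_prefix_function(s):
    # p[i] = length of the longest proper prefix of s[:i+1]
    # that is also a suffix of s[:i+1]
    p = [0] * len(s)
    for i in range(1, len(s)):
        j = p[i - 1]
        # fall back through shorter borders until the next character matches
        while j > 0 and s[i] != s[j]:
            j = p[j - 1]
        if s[i] == s[j]:
            j += 1
        p[i] = j
    return p

def prefix_function_search(text, pattern):
    # "\x00" is an assumed separator absent from both strings
    combined = pattern + "\x00" + text
    p = compute_prefix_function(combined)
    m = len(pattern)
    for i in range(m + 1, len(combined)):
        if p[i] == m:
            return i - 2 * m  # 0-indexed start of the first match
    return -1
```

For instance, `prefix_function_search("xxabcy", "abc")` returns 2.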
Pseudocode:
function compute_prefix_function(string):
    k = len(string)
    p = array[1, k]
    p[1] = 0
    for i in [2, k]:
        j = p[i - 1]
        while (j > 0) and (string[j + 1] != string[i]):
            j = p[j]
        if (string[j + 1] == string[i]) then p[i] = j + 1
        else p[i] = 0
    return p

function prefix_function_search(text, pattern):
    n = len(text)
    m = len(pattern)
    combined_string = pattern + '$' + text
    prefix_array = compute_prefix_function(combined_string)
    for i in [m + 2, n + m + 1]:
        //an occurrence of the pattern ends at position i
        if (prefix_array[i] == m) then return i - 2 * m //1-indexed start of the match in the text
    return -1

Knuth-Morris-Pratt algorithm. The Knuth-Morris-Pratt (KMP) algorithm improves upon the naïve substring search by utilizing the prefix function. On a mismatch, the algorithm shifts the pattern exactly as far as the prefix information allows, so characters that have already been matched are never compared again. Its runtime complexity is O(n + m).
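The KMP idea can be sketched in runnable, 0-indexed Python as follows; the failure table is the same prefix function, and this version reports every occurrence rather than only the first (a design choice of ours).

```python
def kmp_search(text, pattern):
    """Return the 0-indexed start positions of every occurrence."""
    m = len(pattern)
    if m == 0:
        return []
    # failure table: fail[i] = length of the longest proper border of pattern[:i+1]
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # scan the text, falling back through the table on mismatches
    matches = []
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            matches.append(i - m + 1)
            k = fail[k - 1]  # continue searching for overlapping matches
    return matches
```

Note how `kmp_search("abababa", "aba")` finds the overlapping occurrences at 0, 2, and 4 in a single pass.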
Pseudocode:
function KMP(text, pattern):
    //preprocessing: compute the prefix function using the code above
    prefix_function = compute_prefix_function(pattern)
    //the number of pattern characters currently matched
    matched = 0
    //search for the pattern in the text
    for text_index in [1, len(text)]:
        //on a mismatch, fall back through the prefix function instead of
        //comparing the already matched characters again
        while (matched > 0) and (text[text_index] != pattern[matched + 1]):
            matched = prefix_function[matched]
        if (text[text_index] == pattern[matched + 1]) then
            matched += 1
        if (matched == len(pattern)) then
            //the entire pattern is matched
            print("Pattern found at index " + (text_index - matched + 1))
            //shift the pattern for the next possible match
            matched = prefix_function[matched]

Rabin-Karp algorithm utilizes hashing to efficiently search for substrings. It computes hash values for the pattern and for every window of the text of the same length, using a rolling hash and a sliding window technique, and compares characters directly only when the hash values match. For a more detailed description and further explanations, you can read our special topic on this algorithm. Time complexity: O(n + m) on average.
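A runnable Python sketch of Rabin-Karp is below. The base and modulus are our illustrative choices (a large prime modulus keeps spurious hash collisions rare); the direct string comparison on a hash hit guarantees no false positives.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return the 0-indexed start positions of every occurrence."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # weight of the leading character in the rolling hash: base^(m-1) mod mod
    h = pow(base, m - 1, mod)
    # polynomial hashes of the pattern and the first text window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # verify character by character only on a hash hit
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # roll the hash: drop text[i], append text[i + m]
            t_hash = ((t_hash - ord(text[i]) * h) * base + ord(text[i + m])) % mod
    return matches
```

For example, `rabin_karp("abracadabra", "abra")` returns `[0, 7]`.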
Pseudocode:
d = 95 //the number of characters in the input alphabet (we assumed basic latin ASCII printable characters)
q = 37 //some prime number for polynomial hashing

function rabin_karp(text, pattern, q):
    m = len(pattern)
    n = len(text)
    pattern_hash = 0 //hash value for the pattern
    text_hash = 0 //hash value for the current text window
    h = 1
    //calculating the auxiliary term for the rolling hash:
    //the value of h will be "pow(d, m - 1) % q"
    for i in [1, m - 1]:
        h = (h * d) % q
    //polynomial hashing for the pattern and the first window of the text
    for i in [1, m]:
        //ord() is the function mapping characters to their corresponding ASCII codes (integers)
        pattern_hash = (d * pattern_hash + ord(pattern[i])) % q
        text_hash = (d * text_hash + ord(text[i])) % q
    //slide the pattern over the text one position at a time
    for i in [1, n - m + 1]:
        //check the hash values of the current text window and the pattern;
        //only if the hash values match, compare the characters one by one
        if (pattern_hash == text_hash) then
            j = 0
            while (j < m) and (text[i + j] == pattern[j + 1]):
                j += 1
            //pattern[1 .. m] equals text[i .. i + m - 1]
            if (j == m) then print("Pattern found at index " + i)
        //calculate the hash value of the next window of the text:
        //remove the leading character and add the trailing one, using the h-term
        if (i <= n - m) then
            text_hash = (d * (text_hash - ord(text[i]) * h) + ord(text[i + m])) % q
            //text_hash might become negative; convert it to a positive value
            if (text_hash < 0) then text_hash = text_hash + q

Boyer-Moore algorithm is an efficient substring search algorithm that utilizes two main heuristics: the "bad character rule" and the "good suffix rule". These rules make it possible to skip comparisons by analyzing mismatches and taking advantage of previously matched characters. The Boyer-Moore algorithm has a time complexity of O(n + m) in the average case.
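As a compact runnable illustration, here is a simplified Boyer-Moore in Python that applies only the bad-character rule (this simplification is known as Boyer-Moore-Horspool; the good suffix rule from the pseudocode below is omitted for brevity).

```python
def boyer_moore_horspool(text, pattern):
    """Return the 0-indexed start positions of every occurrence."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # shift table: distance from each character's last occurrence
    # (excluding the final position) to the end of the pattern
    shift = {c: m - i - 1 for i, c in enumerate(pattern[:-1])}
    matches = []
    i = 0
    while i <= n - m:
        # compare the current window against the pattern
        if text[i:i + m] == pattern:
            matches.append(i)
        # shift by the table entry of the window's last character,
        # or by the full pattern length if that character never occurs
        i += shift.get(text[i + m - 1], m)
    return matches
```

On `boyer_moore_horspool("abracadabra", "abra")` the window jumps several characters at a time and still finds both occurrences, `[0, 7]`.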
function build_bad_character_table(pattern):
    m = len(pattern)
    bad_character_table = {} //an empty dictionary mapping characters to integers
    //store the last occurrence index of every character of the pattern
    for i in [0, m - 1]:
        bad_character_table[pattern[i]] = i
    return bad_character_table

function build_good_suffix_table(pattern):
    m = len(pattern)
    //suffixes[i] is the length of the longest substring ending at
    //position i that is also a suffix of the whole pattern
    suffixes = [0] * m
    suffixes[m - 1] = m
    f = 0
    g = m - 1
    for i in [m - 2, 0]: //from high to low
        if (i > g) and (suffixes[i + m - 1 - f] < i - g) then
            suffixes[i] = suffixes[i + m - 1 - f]
        else
            if (i < g) then g = i
            f = i
            while (g >= 0) and (pattern[g] == pattern[g + m - 1 - f]):
                g -= 1
            suffixes[i] = f - g
    //by default, shift by the full pattern length
    good_suffix_table = [m] * m
    //case 1: the matched suffix also occurs as a prefix of the pattern
    j = 0
    for i in [m - 1, 0]: //from high to low
        if (suffixes[i] == i + 1) then
            while (j < m - 1 - i):
                if (good_suffix_table[j] == m) then
                    good_suffix_table[j] = m - 1 - i
                j += 1
    //case 2: the matched suffix occurs again inside the pattern
    for i in [0, m - 2]:
        good_suffix_table[m - 1 - suffixes[i]] = m - 1 - i
    return good_suffix_table

function boyer_moore(text, pattern):
    n = len(text)
    m = len(pattern)
    //an empty pattern matches at position 0
    if (m == 0) then return 0
    //preprocessing the pattern
    bad_char_table = build_bad_character_table(pattern)
    good_suffix_table = build_good_suffix_table(pattern)
    shift = 0 //offset of the pattern within the text
    while (shift <= n - m):
        //compare the pattern with the text from right to left
        j = m - 1
        while (j >= 0) and (pattern[j] == text[shift + j]):
            j -= 1
        if (j < 0) then
            //all characters matched
            print("Pattern found at index " + shift) //0-indexed
            shift = shift + good_suffix_table[0]
        else
            //bad character rule: align the mismatched text character
            //with its last occurrence in the pattern
            if text[shift + j] in bad_char_table:
                bad_char_shift = j - bad_char_table[text[shift + j]]
            else:
                //the character does not occur in the pattern: skip past it
                bad_char_shift = j + 1
            //good suffix rule: look up the shift for a mismatch at position j
            good_suffix_shift = good_suffix_table[j]
            //advance by the larger of the two shifts, at least one position
            shift = shift + max(1, bad_char_shift, good_suffix_shift)

Aho-Corasick algorithm is a string-matching algorithm that efficiently searches for multiple patterns simultaneously. It constructs a finite-state machine called the "Aho-Corasick automaton" to perform substring search efficiently.
This algorithm has a runtime complexity of O(n + m + k), where m here is the total length of all the patterns. To find out more, explore our introductory and advanced topics on it.
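A minimal Python sketch of the Aho-Corasick automaton is given below: a trie of the patterns, failure links built with a breadth-first traversal, and a single left-to-right scan of the text. The list-of-dicts representation is our implementation choice.

```python
from collections import deque

def aho_corasick(text, patterns):
    """Return (start_index, pattern) pairs for every match of every pattern."""
    # build the trie: goto[node] maps a character to a child node
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(p)
    # BFS from the root's children to set failure links;
    # each node also inherits the outputs of its failure target
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]
    # scan the text once, following failure links on mismatches
    found = []
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for p in out[node]:
            found.append((i - len(p) + 1, p))
    return found
```

On the classic example, `aho_corasick("ushers", ["he", "she", "his", "hers"])` reports "she" at 1 and both "he" and "hers" at 2, all in one pass over the text.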
When to use each algorithm
By understanding the strengths and applications of each algorithm, you can make informed decisions about which one to use based on your specific requirements. Different algorithms serve different purposes and are more suitable for specific scenarios. Here are some guidelines on when to use each algorithm:
Naïve Substring Search. As you have already noticed, it is generally inefficient for most applications, except, perhaps, educational and teaching purposes. When writing real, hands-on code, please refrain from using it.
Prefix Function is particularly useful when dealing with patterns that contain repeated substrings. It helps efficiently identify and utilize these repetitions, making it a valuable choice in such cases.
Knuth-Morris-Pratt Algorithm. The KMP algorithm is generally quite efficient and is therefore widely used in real-life implementations of search engines.
Rabin-Karp Algorithm. If your application requires searching for several patterns, the Rabin-Karp algorithm is a great choice. It excels in scenarios where you need to find occurrences of multiple patterns in the same text, since the rolling hash of each text window can be compared against the hash of every pattern, providing an efficient solution.
Boyer-Moore Algorithm: When dealing with large texts and complex patterns, the Boyer-Moore algorithm is a reliable option. It employs a powerful strategy that skips unnecessary comparisons, resulting in faster searches and making it suitable for such challenging scenarios.
Aho-Corasick Algorithm. Searching for a large number of patterns within a single pass is where the Aho-Corasick algorithm shines. It efficiently handles multiple pattern searches simultaneously, making it an excellent choice in scenarios where you need to find numerous patterns efficiently.
Comparison of algorithms
Here, we are going to compare the runtime complexities of the aforementioned substring searches, highlighting their average-case and worst-case scenarios. Understanding these complexities will provide insights into the efficiency and performance of each algorithm. Have a look at this concise compilation, which compares all of them at once. For convenience, we use the same notation as above: n is the length of the text string, m is the length of the pattern, and k is the number of pattern occurrences in the text.
| Algorithm | Average case | Worst case | When to use |
| --- | --- | --- | --- |
| Naïve Substring Search | O(n·m) | O(n·m) | ONLY educational purposes |
| Prefix Function | O(n + m) | O(n + m) | Patterns with repeated parts |
| Knuth-Morris-Pratt | O(n + m) | O(n + m) | Generally efficient and good |
| Rabin-Karp | O(n + m) | O(n·m) | Multiple pattern searches |
| Boyer-Moore | O(n + m) | O(n·m) | Huge texts and complex patterns |
| Aho-Corasick | O(n + m + k) | O(n + m + k) | Large number of patterns in a single pass of the text string |
Conclusion
In this topic, we explored various substring search algorithms that are essential for software developers. We started with the Naïve Substring Search algorithm, which provides a simple but inefficient approach. Then, we delved into more advanced algorithms like the Prefix Function Search, Knuth-Morris-Pratt (KMP) algorithm, Rabin-Karp algorithm, Boyer-Moore algorithm, and Aho-Corasick algorithm. Each algorithm offers unique advantages and is suitable for different scenarios.
By mastering these algorithms, you now have a powerful set of tools to efficiently search for substrings in texts. Whether you're working on a text editor, search engine, or any application that involves pattern matching, understanding these algorithms will significantly improve your ability to solve substring search problems effectively.
The key to successfully using these algorithms lies in understanding their underlying principles and knowing when to apply each one. Keep practicing and applying these concepts in real-world scenarios to strengthen your knowledge and problem-solving skills, for example, in the forthcoming task steps. See you there!