Computer scienceAlgorithms and Data StructuresAlgorithmsString algorithmsSubstring search algorithms

Boyer-Moore algorithm

4 minutes read

Recap and Integration of Two Rules

The combination of the Good Suffix and Bad Character rules helps the Boyer-Moore algorithm to excel. The Bad Character rule lets the algorithm shift the pattern from right to left across the text when a mismatch occurs. If the mismatched text character doesn't exist in the pattern, the pattern shifts completely over the mismatched character. If the mismatched character is in the pattern, the pattern moves until the mismatched character in the text lines up with the same character in the pattern.

On the other hand, the Good Suffix rule is activated when a character match is found; that is, a suffix in the pattern matches a suffix in the text. The algorithm moves the pattern to align with the matched good suffix in the text. If no match is found, the pattern shifts completely over the matched part.

The final shift for the pattern is the maximum shift offered by the Bad Character and Good Suffix rules. This blending of rules significantly optimizes the Boyer-Moore algorithm's efficiency in string searching tasks.

An Example

Let's demonstrate the Boyer Moore algorithm using a basic example. Assume we have a text ABABDABACCABABCABABABABDABACCABABCABAB and we're looking for the pattern ABABCABABABABCABAB. Before searching for the pattern in the text, we need to preprocess the pattern. Preprocessing involves creating the bad character table and the good suffix table.

Bad character table for the pattern ABABCABABABABCABAB

Bad Character

Shift size

A

1

B

0

C

4

*

9

Good suffix table for the pattern ABABCABABABABCABAB

Suffix length

Shift size

0

1

1

10

2

4

3

10

4

9

5

10

6

11

7

12

8

13

Now we're ready to start the pattern search in the text. Let's get started.

Starting from the right, align the pattern with the text. We use a text pointer and a pattern pointer to track the matching characters.

Boyer-Moore 1

We start comparing characters from the pattern's right with the corresponding characters in the text. Here, we find a mismatch between CC in the text and BB in the pattern. At this point, we refer to the two tables above. The bad character is CC, and the length of the matched suffix or good suffix is 00. The bad character table suggests shifting the text pointer by 44 while the good suffix table advises a shift by 11. We choose the maximum shift size hence, we follow the bad character rule here and shift the text pointer by 44 characters. Next, we reset the pattern pointer and start matching again.

Boyer-Moore 2

Now, AA is the bad character, and the length of the good suffix is 00. Both tables suggest shifting the text pointer by 11 character. Let's do so and start matching again from the right.

Boyer-Moore 3

This time CC is the bad character again and CABABCABAB is the good suffix whose length is 55. The bad character rule asks us to shift the text pointer by 44 characters and the good suffix rule wants a shift by 1010 characters. We pay attention to the good suffix rule since this is the maximum one and start matching again.

Boyer-Moore 4

Finally, we have found the pattern!

This way, by integrating the Good Suffix and Bad Character rules, the Boyer-Moore algorithm efficiently locates the pattern in the text.

Pseudocode

Here's the pseudocode for the algorithm. Please refer to the prior topics on the bad character rule and the good suffix rule to see the implementation of the preprocessing functions: Make_Bad_Char_Table()Make\_Bad\_Char\_Table(), Create_Good_Suffix_Table()Create\_Good\_Suffix\_Table().

function Boyer_Moore(text, pattern):
    n = length(text)
    m = length(pattern)
    
    //If the pattern is empty return 0
    if m == 0 then
        return 0
    
    //Preprocessing the pattern
    bad_char_table = Make_Bad_Char_Table(pattern)
    good_suffix_table = Create_Good_Suffix_Table(pattern)
    
    text_pointer = m //Search from the end of the pattern

    while (text_pointer <= n):
        pattern_pointer = m

        //If current characters match and there are characters left for matching in the pattern
        //shift both pointers to the left by one character
        while (pattern_pointer > 0) and (text[text_pointer] == pattern[pattern_pointer]):
            text_pointer = text_pointer - 1
            pattern_pointer = pattern_pointer - 1
        
        //there are no characters left for matching in the pattern
        if pattern_pointer == 0:
            print("Pattern found at ", text_pointer + 1)  // Match found
            text_pointer = text_pointer + m + 1 //Shift the pointer for further occurrences
            continue

        //Determining shift size
        //Bad character rule
        if text[text_pointer] in bad_char_table:
            bad_char_shift = bad_char_table[text[text_pointer]]
        //If not present, the shift size is the length of the pattern
        else:
            bad_char_shift = m
        //Good suffix rule       
        suffix_length = m - pattern_pointer //Length of the matched suffix
        good_suffix_shift = good_suffix_table[suffix_length] //Look up for the shift size
        
        //Update the text pointer
        text_pointer = text_pointer + max(bad_char_shift, good_suffix_shift)

Complexity

Boyer-Moore algorithm's time complexity is interesting as it doesn't abide by the usual rules. In the best-case situation, where every character comparison results in a mismatch, the algorithm operates in O(nm)O(\frac{n}{m}) time, where nn is the text's length and mm is the pattern's length. This is because after every mismatch, the algorithm can skip mm characters, resulting in nm\frac{n}{m} comparisons.

In the worst-case, the time complexity can be O(mn)O(mn), which happens when all pattern characters are identical and the pattern appears at the end of the text. However, such a situation is unlikely to occur in practical cases. On average, the Boyer-Moore algorithm performs well, especially for long patterns and extensive texts, making it one of the most competent string searching algorithms.

Conclusion

In conclusion, the Boyer-Moore algorithm is an effective tool for string searching tasks. Its distinct approach of scanning the pattern from right to left and incorporating the Good Suffix and Bad Character rules allows it to bypass many needless comparisons, making it highly efficient, particularly for lengthy patterns and large texts.

The only challenging part for the algorithm is to preprocess the pattern to create the two necessary tables. However, as long as you are looking for the same pattern in different texts, you can create these tables once and reuse them as needed. In practice, once the pattern has been preprocessed, the Boyer-Moore algorithm usually outperforms many other pattern-matching algorithms, especially with large texts.

How did you like the theory?
Report a typo