Computer scienceAlgorithms and Data StructuresAlgorithmsString algorithmsSubstring search algorithms

Boyer-Moore: Bad character rule

9 minutes read

Large-scale text processing has become an essential component of many applications in the increasingly digital world. The ability to quickly comb through enormous amounts of text to locate certain strings of letters is a critical component of search engines like Google, text editors like Microsoft Word, and even coding platforms like GitHub. This is the problem at the heart of their operation: How can we efficiently locate a specific string within a larger body of text?

The Boyer-Moore algorithm is a very effective solution to this issue. In the Boyer-Moore algorithm, we always start matching the pattern from the last character of the pattern and move backward. Now the only question we always ask in any pattern-matching algorithm is "What to do when there is a mismatch?". Professor Boyer and Professor Moore have two answers to the question and we will learn the first one.

Bad character heuristic

In instances where a character from the text doesn't align with a character from the pattern, we refer to the non-aligned character from the text as the bad character. For instance, consider the pattern $P = ABAC$ and the text $S = ABECFABAABAC$ . For the purpose of this discussion, we maintain two pointers that indicate the present matching index for both $S$ and $P$ . As shown in the following image, $E$ is identified as a bad character.

Bad character

Each time we come across a bad character, we attempt to locate it within the pattern. In this scenario, the pattern $P$ does not contain the bad character $E$ . As a result, we can move the pattern beyond this bad character and begin the matching process anew from the end.

Bad character 2

Let's turn our attention to the text pointer for a moment. Its position has moved from index $3$ to index $7$ , meaning it has shifted to the right by four characters, equivalent to the length of the pattern. It's important to note this result: when we encounter a bad character that isn't found in the pattern, we move the text pointer to the right by a distance equivalent to the pattern's length, and we reset the pattern pointer to its endpoint.

In this instance, $B$ is identified as the bad character. Notably, $B$ is included in our pattern $P$ . In such a situation, we will adjust the pattern such that the two $B$ 's coincide, and then restart the matching process from the end.

Bad character 3

Let's revisit the text pointer. It has moved to the right by two characters. This shift corresponds to the distance of $B$ from the right end of the pattern, which is two. Here's our second key point: if the bad character is included in the pattern, we shift the text pointer to the right by a distance equivalent to the character's distance from the pattern's right end, and then we reset the pattern pointer.

We've encountered another bad character, $A$ this time. Interestingly, our pattern contains two $A$ 's. The question arises, how many characters should we shift now? The solution is to adjust the pattern so that the rightmost $A$ in the pattern aligns with the bad character. This brings us to our final bad character rule: if the bad character appears multiple times in the pattern, we shift the text pointer to the right by a distance equivalent to the rightmost character's distance from the pattern's right end, and then we reset the pattern pointer.

We continue these processes till we find the pattern in the text or reach the end of the text.

Bad character 4

Bad character - matched

In short, we have three bad character rules:

If we encounter a bad character that is not present in the pattern, we move the text pointer to the right by the length of the pattern and reset the pattern pointer to its endpoint.
If the bad character is part of the pattern, we shift the text pointer to the right by the distance between the character and the right end of the pattern and then reset the pattern pointer.
If there are multiple occurrences of the bad character in the pattern, we shift the text pointer to the right by the distance between the rightmost occurrence of that character and the right end of the pattern and then reset the pattern pointer.

Bad character table

To expediently calculate the required shift of the text pointer, we construct a lookup table. This table comprises all unique characters present in the pattern and the corresponding shift needed. This table is referred to as the bad character table. Let's take a look at the bad character table for the pattern $ABAC$ .

Bad character	Shift size
A	1
B	2
C	0
*	4

Here $*$ denotes all characters that are not present in the pattern.

Pseudocode

Here is the pseudocode to preprocess the pattern, i.e. creating the bad character table. The formula to calculate the step size for a bad character that is present in the pattern is: $m - i$ where $m$ is the length of the pattern and $i$ the index of the character in the pattern.

function Make_Bad_Char_Table(pattern):
    table = empty_dictionary
    m = length(pattern)
    for i = 1 to m:
        table[pattern[i]] = m - i
 
    return table

The preprocessing function doesn't calculate the step size if the bad character is absent in the pattern. We can handle this during pattern searching. Here is the pseudocode to search for a pattern in a given text using the bad character rule.

function BM_bad_char_rule(text, pattern):
    n = length(text)
    m = length(pattern)
    
    //If the pattern is empty return 0
    if m == 0 then
        return 0
    
    table_bad_char = Make_Bad_Char_Table(pattern)
    
    text_pointer = m //Search from the end of the pattern

    while (text_pointer <= n):
        pattern_pointer = m

        //If current characters match and there are characters left for matching in the pattern
        //shift both pointers to the left by one character
        while (pattern_pointer > 0) and (text[text_pointer] == pattern[pattern_pointer]):
            text_pointer = text_pointer - 1
            pattern_pointer = pattern_pointer - 1
        
        //there are no characters left for matching in the pattern
        if pattern_pointer == 0:
            print("Pattern found at ", text_pointer + 1)  // Match found
            text_pointer = text_pointer + m + 1 //Shift the pointer for further occurrences
            continue
        
        //Determining shift size
        //If the bad character is present in the bad character table
        if text[i] in table_bad_char:
            bad_char_shift = table_bad_char[text[text_pointer]]
        //If not present, the shift size is the length of the pattern
        else:
            bad_char_shift = m

        //Update the text pointer
        text_pointer = text_pointer + bad_char_shift

Complexity analysis

The bad character rule has the best-case time complexity of $O(\frac{n}{m})$ and a worst-case time complexity of $O(nm)$ , where $n$ is the length of the text and $m$ is the length of the pattern. The best-case scenario occurs when all characters in the pattern and text strings are different. In this case, the algorithm can skip $m$ comparisons in each step. The worst case scenario occurs when each character of both pattern and text is the same.

In addition to that, the preprocessing step requires $O(m)$ time. However, this time is negligible because we can create the bad character table once and use that multiple times as long as we search for the same pattern in different texts.

Limitations

Will the bad character always find the pattern? The answer is no. Consider the text, $S = CABACAABAC$ , and the pattern, $P = AABAC$ . Let's start matching:

Bad character - limitation

If you follow the rules above, it will shift the text pointer to the right by zero characters. Certainly, this will produce the wrong output.

This algorithm will fail in two cases: the bad character rule produces zero shifts, and the second case is when the bad character is found on the right of the pattern pointer. Can you modify the pseudocode to handle such situations? Please comment it down.

We will learn a different heuristic of the Boyer-Moore algorithm, the good suffix rule, to overcome these limitations in a later topic.

Conclusion

The mismatched character in the text is the bad character. Whenever we find a bad character, we do these:

If the bad character is not in the pattern, we shift the text pointer to the right by an amount equal to the length of the pattern.
If the bad character is in the pattern, we shift the text pointer to the right by an amount equal to the distance of the rightmost character from the end of the pattern.

The algorithm has a time complexity of $O(\frac{n}{m})$ in the best case and $O(nm)$ in the worst case. In practice, the worst case is very rare. However, the algorithm is not a complete one since this will fail to find the pattern sometimes. We will learn to overcome this in a later topic by analyzing the second rule: the good suffix rule.

How did you like the theory?

Report a typo