Recap and Integration of Two Rules
The combination of the Good Suffix and Bad Character rules helps the Boyer-Moore algorithm to excel. The Bad Character rule lets the algorithm shift the pattern from right to left across the text when a mismatch occurs. If the mismatched text character doesn't exist in the pattern, the pattern shifts completely over the mismatched character. If the mismatched character is in the pattern, the pattern moves until the mismatched character in the text lines up with the same character in the pattern.
On the other hand, the Good Suffix rule is activated when a character match is found; that is, a suffix in the pattern matches a suffix in the text. The algorithm moves the pattern to align with the matched good suffix in the text. If no match is found, the pattern shifts completely over the matched part.
The final shift for the pattern is the maximum shift offered by the Bad Character and Good Suffix rules. This blending of rules significantly optimizes the Boyer-Moore algorithm's efficiency in string searching tasks.
An Example
Let's demonstrate the Boyer Moore algorithm using a basic example. Assume we have a text and we're looking for the pattern . Before searching for the pattern in the text, we need to preprocess the pattern. Preprocessing involves creating the bad character table and the good suffix table.
Bad character table for the pattern
Bad Character | Shift size |
|---|---|
A | 1 |
B | 0 |
C | 4 |
* | 9 |
Good suffix table for the pattern
Suffix length | Shift size |
|---|---|
0 | 1 |
1 | 10 |
2 | 4 |
3 | 10 |
4 | 9 |
5 | 10 |
6 | 11 |
7 | 12 |
8 | 13 |
Now we're ready to start the pattern search in the text. Let's get started.
Starting from the right, align the pattern with the text. We use a text pointer and a pattern pointer to track the matching characters.
We start comparing characters from the pattern's right with the corresponding characters in the text. Here, we find a mismatch between in the text and in the pattern. At this point, we refer to the two tables above. The bad character is , and the length of the matched suffix or good suffix is . The bad character table suggests shifting the text pointer by while the good suffix table advises a shift by . We choose the maximum shift size hence, we follow the bad character rule here and shift the text pointer by characters. Next, we reset the pattern pointer and start matching again.
Now, is the bad character, and the length of the good suffix is . Both tables suggest shifting the text pointer by character. Let's do so and start matching again from the right.
This time is the bad character again and is the good suffix whose length is . The bad character rule asks us to shift the text pointer by characters and the good suffix rule wants a shift by characters. We pay attention to the good suffix rule since this is the maximum one and start matching again.
Finally, we have found the pattern!
This way, by integrating the Good Suffix and Bad Character rules, the Boyer-Moore algorithm efficiently locates the pattern in the text.
Pseudocode
Here's the pseudocode for the algorithm. Please refer to the prior topics on the bad character rule and the good suffix rule to see the implementation of the preprocessing functions: , .
function Boyer_Moore(text, pattern):
n = length(text)
m = length(pattern)
//If the pattern is empty return 0
if m == 0 then
return 0
//Preprocessing the pattern
bad_char_table = Make_Bad_Char_Table(pattern)
good_suffix_table = Create_Good_Suffix_Table(pattern)
text_pointer = m //Search from the end of the pattern
while (text_pointer <= n):
pattern_pointer = m
//If current characters match and there are characters left for matching in the pattern
//shift both pointers to the left by one character
while (pattern_pointer > 0) and (text[text_pointer] == pattern[pattern_pointer]):
text_pointer = text_pointer - 1
pattern_pointer = pattern_pointer - 1
//there are no characters left for matching in the pattern
if pattern_pointer == 0:
print("Pattern found at ", text_pointer + 1) // Match found
text_pointer = text_pointer + m + 1 //Shift the pointer for further occurrences
continue
//Determining shift size
//Bad character rule
if text[text_pointer] in bad_char_table:
bad_char_shift = bad_char_table[text[text_pointer]]
//If not present, the shift size is the length of the pattern
else:
bad_char_shift = m
//Good suffix rule
suffix_length = m - pattern_pointer //Length of the matched suffix
good_suffix_shift = good_suffix_table[suffix_length] //Look up for the shift size
//Update the text pointer
text_pointer = text_pointer + max(bad_char_shift, good_suffix_shift)Complexity
Boyer-Moore algorithm's time complexity is interesting as it doesn't abide by the usual rules. In the best-case situation, where every character comparison results in a mismatch, the algorithm operates in time, where is the text's length and is the pattern's length. This is because after every mismatch, the algorithm can skip characters, resulting in comparisons.
In the worst-case, the time complexity can be , which happens when all pattern characters are identical and the pattern appears at the end of the text. However, such a situation is unlikely to occur in practical cases. On average, the Boyer-Moore algorithm performs well, especially for long patterns and extensive texts, making it one of the most competent string searching algorithms.
Conclusion
In conclusion, the Boyer-Moore algorithm is an effective tool for string searching tasks. Its distinct approach of scanning the pattern from right to left and incorporating the Good Suffix and Bad Character rules allows it to bypass many needless comparisons, making it highly efficient, particularly for lengthy patterns and large texts.
The only challenging part for the algorithm is to preprocess the pattern to create the two necessary tables. However, as long as you are looking for the same pattern in different texts, you can create these tables once and reuse them as needed. In practice, once the pattern has been preprocessed, the Boyer-Moore algorithm usually outperforms many other pattern-matching algorithms, especially with large texts.