The Boyer-Moore Good Suffix Rule is a fundamental concept in the field of string searching and pattern matching. By exploiting information about the suffix of the pattern being matched, the Good Suffix Rule enables the algorithm to skip over large portions of text, significantly reducing the number of character comparisons required during the search process. This ingenious technique has become a cornerstone of many efficient string-searching algorithms and continues to be a valuable asset in computer science and text processing.
Good suffix heuristic
Continuing from the Boyer-Moore: Bad character rule, we will learn about the second answer to the question: "What to do when there is a mismatch?"
As before, let's align the pattern with the text and initiate matching from the pattern's right end. We keep two pointers: text pointer and pattern pointer to indicate the characters we are currently comparing.
Let's assume we encounter a mismatch after a few matching characters. We'll refer to this matching substring as . Next, we'll search for this substring within , excluding the matched suffix. We find three such instances:
We will adjust the pattern so that aligns with one of the three substrings. But which one should we choose? Initially, we'll choose the rightmost one, i.e., substring number . However, the character to its left is , which is identical to the character to the left of the suffix . This shift would put us back in the situation we are currently in, making it an ineffective shift. So we discard it and select substring number instead. The character to its left is , which differs from . Thus, shifting this to align with will result in a beneficial shift.
Now, let's have a look at the shift of the pointers:
The text pointer has been moved to the right by characters. This shift corresponds to the distance between the substrings we aim to align, which is , added to the length of the suffix , which is .
You might be curious about what happens if there is no within outside of the suffix itself. In such a scenario, we look for the largest suffix of that matches with a prefix of , and adjust the pattern to align them. Here's an example to illustrate this:
Here is the illustration to the shift of the text pointer.
The text pointer has moved to the right by characters. This shift equals the distance between the chosen substrings , which is , plus the length of the suffix , which is .
Finally, what should we do if both cases fail? In such a scenario, we simply shift the pattern to the right by the length of the pattern. In this case, the pointer will move to the right by a distance equivalent to the sum of the pattern's length and the length of the suffix . Here is an illustrated example:
One final note to keep in mind is that by default, we shift by one character if the mismatch occurs during the initial matching, that is, when is empty.
Here we provide the summary of the good suffix rules:
When a matching suffix t of the pattern is found elsewhere in the pattern, align them if their left characters are different.
If t is not available in the pattern, look for the longest suffix of t that matches the prefix of the pattern and align them.
If both of the cases fail, shift the whole pattern.
Shift by 1 character if |t|=0.
Good Suffix Table
Just like the bad character table, you can create a lookup table for potential suffixes of the pattern, which we call the good suffix table. This is where we list all suffixes of the pattern (excluding the pattern itself, as this indicates a perfect match) and the corresponding shift sizes for the text pointer. Let's look at the good suffix table for the pattern .
Good Suffix Table for BCACBCBC | ||
Suffix, t | length(t) | Shift size |
|---|---|---|
"" | ||
Creating a good suffix table
To create a good suffix table we prepare a helper table which will help us to compute the correct shift size. To illustrate let's take BCACBCBC as an example pattern. The helper table for the pattern will look like this:
helper_table | |
Shift | Matched size |
|---|---|
1 | 0 |
2 | 3 |
3 | 0 |
4 | 1 |
5 | 0 |
6 | 2 |
7 | 0 |
8 | 0 |
What does this table represent? Let's take the shift as an example. The helper table says that if we compare the pattern with itself by shifting one to the right by characters, we will find matched characters. Here is the illustration:
Here are two more illustrations for shift = and shift = :
Here is the pseudocode to prepare the helper table:
function create_helper_table(pattern):
m = Length(pattern)
helper_table = {} // Create an empty dictionary
for shift = 1 down to m - 1:
pointer_1 = m // Pointer for the pattern
pointer_2 = m - shift // Pointer for the shifted pattern
while pointer_2 >= 1:
// If the characters match, we move the pointers to the left
if pattern[pointer_1] == pattern[pointer_2]:
pointer_1 = pointer_1 - 1
pointer_2 = pointer_2 - 1
if pointer_2 == 0: //If all characters of the shifted pattern match
helper_table[shift] = m - shift
break
else:
helper_table[shift] = m - shift - pointer_2
break
helper_table[m] = 0 // There is no match if we shift the whole pattern
return helper_tableLet us tell you another representation of the helper table. The helper table says that we will find the suffix of with length , i.e , distance away from the suffix. In the helper table, the matched size corresponds to the suffix length and the shift corresponds to the distance. Here is the illustration:
Now, let's build the good suffix table for . We will denote the length of the pattern by and the length of the suffix by . Initially, we assume that none of the suffixes are found in the pattern. So we populate the using . Next, if , we update the good suffix table for the corresponding using the formula: where both the distance and can be found from the helper table's shift and matched size column respectively.
Please look at the helper table again for the last time. The matched size for shift is . Here, , which is equal to . Whenever this happens, we can see that the suffix matches with a prefix of the pattern:
We can use this observation to update the shift sizes for all suffix with . For such suffixes, if it is not found elsewhere in the pattern, we are certain that the suffix has a suffix that matches a prefix of the pattern. The formula for the shift size is: where both the distance and can be found from the helper table as before.
Finally, here is the complete pseudocode to prepare the good suffix table:
function Create_Good_Suffix_Table(pattern):
m = length(pattern)
table = {} // Create an empty dictionary
table[0] = 1 //Default for |t| = 0
helper_table = create_helper_table(pattern)
//Initially fill the table using m+|t|
for i = 1 to m-1:
table[i] = i + m
//Update the table if the suffix t is available elsewhere
for i = m downto 1:
if helper_table[i] > 0:
table[helper_table[i]] = i + helper_table[i]
//Update the table if there is any suffix that matches the prefix
for i = m downto 1:
if helper_table[i] + i == m:
for j = helper_table[i] + 1 to m-1:
table[j] = min(table[j], j + i)
return tablePseudocode
Now we are prepared with all the tools and concepts to write the Boyer-Moore pattern-matching algorithm using the good suffix rule. Here is the pseudocode:
function BM_good_suffix_rule(text, pattern):
n = length(text)
m = length(pattern)
//If the pattern is empty return 0
if m == 0 then
return 0
good_suffix_table = Create_Good_Suffix_Table(pattern)
text_pointer = m //Search from the end of the pattern
while (text_pointer <= n):
pattern_pointer = m
//If current characters match and there are characters left for matching in the pattern
//shift both pointers to the left by one character
while (pattern_pointer > 0) and (text[text_pointer] == pattern[pattern_pointer]):
text_pointer = text_pointer - 1
pattern_pointer = pattern_pointer - 1
//there are no characters left for matching in the pattern
if pattern_pointer == 0:
print("Pattern found at ", text_pointer + 1) // Match found
text_pointer = text_pointer + m + 1 //Shift the pointer for further occurrences
continue
suffix_length = m - pattern_pointer //Length of the matched suffix
good_suffix_shift = good_suffix_table[suffix_length] //Look up for the shift size
//Update the text pointer
text_pointer = text_pointer + good_suffix_shiftComplexity
The complexity of the Boyer-Moore Good Suffix Rule mainly depends on the characteristics of the input text and pattern being searched. In the best-case scenario, when the pattern occurs only once in the text and the good suffix heuristic allows for maximal shifts, the Boyer-Moore algorithm can achieve a linear time complexity of , where is the length of the text, and is the length of the pattern. However, in the worst-case scenario where the pattern and text share many overlapping suffixes, the algorithm can approach a quadratic time complexity of . Nevertheless, in practical applications, the Boyer-Moore algorithm generally outperforms other string searching algorithms due to its efficient handling of mismatches and its ability to skip large portions of text.
Conclusion
In summary, the Boyer-Moore Good suffix rule involves two steps:
Preprocess the pattern to create a good suffix table.
During the search, calculate shifts using the table and apply them when there's a mismatch between the pattern and text.
The Boyer-Moore Good Suffix Rule is a critical component of the Boyer-Moore string searching algorithm, offering a systematic and efficient approach to locating patterns within text. By precomputing a suffix table and strategically shifting the pattern during searches, this rule optimizes the algorithm's performance, making it one of the fastest and most widely used string searching techniques in practical applications. In a later topic we will combine this rule with the bad character rule to develop the complete Boyer-Moore algorithm.