Computer scienceAlgorithms and Data StructuresAlgorithmsString algorithmsSubstring search algorithms

Boyer-Moore: Good suffix rule

8 minutes read

The Boyer-Moore Good Suffix Rule is a fundamental concept in the field of string searching and pattern matching. By exploiting information about the suffix of the pattern being matched, the Good Suffix Rule enables the algorithm to skip over large portions of text, significantly reducing the number of character comparisons required during the search process. This ingenious technique has become a cornerstone of many efficient string-searching algorithms and continues to be a valuable asset in computer science and text processing.

Good suffix heuristic

Continuing from the Boyer-Moore: Bad character rule, we will learn about the second answer to the question: "What to do when there is a mismatch?"

As before, let's align the pattern P=ABCABDABDABP = ABCABDABDAB with the text S=ABECFAABCABDABDABCS = ABECFAABCABDABDABC and initiate matching from the pattern's right end. We keep two pointers: text pointer and pattern pointer to indicate the characters we are currently comparing.

Let's assume we encounter a mismatch after a few matching characters. We'll refer to this matching substring as tt. Next, we'll search for this substring tt within PP, excluding the matched suffix. We find three such instances:

Good suffix 1

Good suffix 2

We will adjust the pattern so that tt aligns with one of the three substrings. But which one should we choose? Initially, we'll choose the rightmost one, i.e., substring number 33. However, the character to its left is DD, which is identical to the character to the left of the suffix tt. This shift would put us back in the situation we are currently in, making it an ineffective shift. So we discard it and select substring number 22 instead. The character to its left is CC, which differs from DD. Thus, shifting this ABAB to align with tt will result in a beneficial shift.

Good suffix 3Good suffix 4

Now, let's have a look at the shift of the pointers:

Good suffix 5

The text pointer has been moved to the right by 88 characters. This shift corresponds to the distance between the substrings we aim to align, which is 66, added to the length of the suffix tt, which is 22.

You might be curious about what happens if there is no tt within PP outside of the suffix itself. In such a scenario, we look for the largest suffix of tt that matches with a prefix of PP, and adjust the pattern to align them. Here's an example to illustrate this:

Good suffix 6Good suffix 8

Here is the illustration to the shift of the text pointer.

Good suffix 9

The text pointer has moved to the right by 88 characters. This shift equals the distance between the chosen substrings DBDB, which is 55, plus the length of the suffix tt, which is 33.

Finally, what should we do if both cases fail? In such a scenario, we simply shift the pattern to the right by the length of the pattern. In this case, the pointer will move to the right by a distance equivalent to the sum of the pattern's length and the length of the suffix tt. Here is an illustrated example:

One final note to keep in mind is that by default, we shift by one character if the mismatch occurs during the initial matching, that is, when tt is empty.

Here we provide the summary of the good suffix rules:

  1. When a matching suffix t of the pattern is found elsewhere in the pattern, align them if their left characters are different.

  2. If t is not available in the pattern, look for the longest suffix of t that matches the prefix of the pattern and align them.

  3. If both of the cases fail, shift the whole pattern.

  4. Shift by 1 character if |t|=0.

Good Suffix Table

Just like the bad character table, you can create a lookup table for potential suffixes of the pattern, which we call the good suffix table. This is where we list all suffixes of the pattern (excluding the pattern itself, as this indicates a perfect match) and the corresponding shift sizes for the text pointer. Let's look at the good suffix table for the pattern P=BCACBCBCP = BCACBCBC.

Good Suffix Table for BCACBCBC

Suffix, t

length(t)

Shift size

""

00

11

CC

11

55

BCBC

22

88

CBCCBC

33

55

BCBCBCBC

44

1010

CBCBCCBCBC

55

1111

ACBCBCACBCBC

66

1212

CACBCBCCACBCBC

77

1313

Creating a good suffix table

To create a good suffix table we prepare a helper table which will help us to compute the correct shift size. To illustrate let's take BCACBCBC as an example pattern. The helper table for the pattern will look like this:

helper_table

Shift

Matched size

1

0

2

3

3

0

4

1

5

0

6

2

7

0

8

0

What does this table represent? Let's take the shift 22 as an example. The helper table says that if we compare the pattern with itself by shifting one to the right by 22 characters, we will find 33 matched characters. Here is the illustration:

helper table

Here are two more illustrations for shift = 44 and shift = 66:

helper tablehelper table

Here is the pseudocode to prepare the helper table:

function create_helper_table(pattern):
    m = Length(pattern)
    helper_table = {}  // Create an empty dictionary

    for shift = 1 down to m - 1:
        pointer_1 = m // Pointer for the pattern
        pointer_2 = m - shift // Pointer for the shifted pattern

        while pointer_2 >= 1:
            // If the characters match, we move the pointers to the left
            if pattern[pointer_1] == pattern[pointer_2]:
                pointer_1 = pointer_1 - 1
                pointer_2 = pointer_2 - 1
                if pointer_2 == 0: //If all characters of the shifted pattern match
                    helper_table[shift] = m - shift
                    break
            else:
                helper_table[shift] = m - shift - pointer_2
                break
    helper_table[m] = 0 // There is no match if we shift the whole pattern
    
    return helper_table

Let us tell you another representation of the helper table. The helper table says that we will find the suffix tt of BCACBCBCBCACBCBC with length 11, i.e CC, 44 distance away from the suffix. In the helper table, the matched size corresponds to the suffix length and the shift corresponds to the distance. Here is the illustration:

helper table

Now, let's build the good suffix table for BCACBCBCBCACBCBC. We will denote the length of the pattern by mm and the length of the suffix tt by t|t|. Initially, we assume that none of the suffixes are found in the pattern. So we populate the tabletable using m+tm + |t|. Next, if t>0|t| > 0, we update the good suffix table for the corresponding t|t| using the formula: table[t]=distance+ttable[|t|] = distance + |t| where both the distance and t|t| can be found from the helper table's shift and matched size column respectively.

Please look at the helper table again for the last time. The matched size for shift 66 is 22. Here, 6+2=86+2=8, which is equal to mm. Whenever this happens, we can see that the suffix tt matches with a prefix of the pattern:

helper table

We can use this observation to update the shift sizes for all suffix tt with t>2|t|>2. For such suffixes, if it is not found elsewhere in the pattern, we are certain that the suffix has a suffix t=BCt' = BC that matches a prefix of the pattern. The formula for the shift size is: table[t]=distance+ttable[|t|] = distance + |t'| where both the distance and t|t'| can be found from the helper table as before.

Finally, here is the complete pseudocode to prepare the good suffix table:

function Create_Good_Suffix_Table(pattern):
    m = length(pattern)
    table = {}  // Create an empty dictionary
    table[0] = 1 //Default for |t| = 0
    helper_table = create_helper_table(pattern)

    //Initially fill the table using m+|t|
    for i = 1 to m-1:
        table[i] = i + m

    //Update the table if the suffix t is available elsewhere
    for i = m downto 1:
        if helper_table[i] > 0:
            table[helper_table[i]] = i + helper_table[i]

    //Update the table if there is any suffix that matches the prefix
    for i = m downto 1:
        if helper_table[i] + i == m:
            for j = helper_table[i] + 1 to m-1:
                table[j] = min(table[j], j + i)

    return table

Pseudocode

Now we are prepared with all the tools and concepts to write the Boyer-Moore pattern-matching algorithm using the good suffix rule. Here is the pseudocode:

function BM_good_suffix_rule(text, pattern):
    n = length(text)
    m = length(pattern)
    
    //If the pattern is empty return 0
    if m == 0 then
        return 0
    
    good_suffix_table = Create_Good_Suffix_Table(pattern)
    
    text_pointer = m //Search from the end of the pattern

    while (text_pointer <= n):
        pattern_pointer = m

        //If current characters match and there are characters left for matching in the pattern
        //shift both pointers to the left by one character
        while (pattern_pointer > 0) and (text[text_pointer] == pattern[pattern_pointer]):
            text_pointer = text_pointer - 1
            pattern_pointer = pattern_pointer - 1
        
        //there are no characters left for matching in the pattern
        if pattern_pointer == 0:
            print("Pattern found at ", text_pointer + 1)  // Match found
            text_pointer = text_pointer + m + 1 //Shift the pointer for further occurrences
            continue
       
        suffix_length = m - pattern_pointer //Length of the matched suffix

        good_suffix_shift = good_suffix_table[suffix_length] //Look up for the shift size
        
        //Update the text pointer
        text_pointer = text_pointer + good_suffix_shift

Complexity

The complexity of the Boyer-Moore Good Suffix Rule mainly depends on the characteristics of the input text and pattern being searched. In the best-case scenario, when the pattern occurs only once in the text and the good suffix heuristic allows for maximal shifts, the Boyer-Moore algorithm can achieve a linear time complexity of O(nm)O(\frac{n}{m}), where nn is the length of the text, and mm is the length of the pattern. However, in the worst-case scenario where the pattern and text share many overlapping suffixes, the algorithm can approach a quadratic time complexity of O(nm)O(nm). Nevertheless, in practical applications, the Boyer-Moore algorithm generally outperforms other string searching algorithms due to its efficient handling of mismatches and its ability to skip large portions of text.

Conclusion

In summary, the Boyer-Moore Good suffix rule involves two steps:

  1. Preprocess the pattern to create a good suffix table.

  2. During the search, calculate shifts using the table and apply them when there's a mismatch between the pattern and text.

The Boyer-Moore Good Suffix Rule is a critical component of the Boyer-Moore string searching algorithm, offering a systematic and efficient approach to locating patterns within text. By precomputing a suffix table and strategically shifting the pattern during searches, this rule optimizes the algorithm's performance, making it one of the fastest and most widely used string searching techniques in practical applications. In a later topic we will combine this rule with the bad character rule to develop the complete Boyer-Moore algorithm.

How did you like the theory?
Report a typo