Computer scienceAlgorithms and Data StructuresAlgorithmsString algorithmsSubstring search algorithms

Calculating prefix function

16 minutes read

Are you prepared to take your string algorithm expertise to the next level? That's exactly what this topic is designed to help you do. Our star player is the prefix function! It's like a secret map, showing us how long the longest matching start and end (we call this a border) of each part of a string is. We've already learned the easy-peasy way to find it, but now we're going to learn a super-speedy way!

By using our problem-solving skills, we'll discover some interesting facts about the border length. These facts will help us create a quicker method to find the prefix function. Sounds exciting, doesn't it? So let's jump in and enjoy learning some cool tricks!

A quick recap

So, how did we calculate the prefix functions before? Directly by definition – a method that is known as the naive approach. Namely, for each one of nn prefixes s[1,i]s[1,i] of the initial string, we find the longest border by comparing the prefixes and suffixes of every length to detect the borders, and then we choose the longest amongst them. Overall, there are nn substrings, and for each one, we find the longest border. It takes O(n2)O(n^2), since we compare two strings in linear time for every possible length. Hence the general time complexity is O(n3)O(n^3), which is too long. Here is the pseudocode for this approach:

function compute_prefix_function(s):
    n = length(s)
    p = array[1, n]
    p[1] = 0

    for i in [2, n]:
        p[i] = 0
        for j in [1, i-1]:
            if s[1, j] = s[i-j+1, i]:
                p[i] = j
    return p

In practice, strings can have enormous lengths, so we need to come up with a much more efficient algorithm. But how will we accomplish this?

Limit of border length

Like a detective, take your magnifying glass and observe the borders and some of its interesting properties. These observations will play a crucial role in solving our problem.

First, skip unnecessary comparisons using the fact that p[i+1]p[i]+1p[i+1] \leq p[i] + 1. In other words, if we know the value of p[i]p[i], we can conclude that the value of p[i+1]p[i+1] can't increase by more than one. This means that we have to search for the longest border among the prefix-suffixes of length no more than p[i]+1p[i]+1, avoiding many unnecessary comparisons. Let's illustrate this using an example: MAMMAMIAMAMMAMIA. Say we have calculated p[5]p[5] and want to calculate p[6]p[6]. We do this as follows:

The first observation of the border

Once we've computed p[5]p[5], we're aware that MAMA is the prefix that aligns with the suffix of the substring s[1,5]s[1,5]. Now, for the substring s[1,6]s[1,6], we simply need to verify if the newly added character MM at the end matches with the character following the previous border MAMA. In other words, we want to check if s[6]s[6] is equal to s[p[5]+1]s[p[5] + 1] or s[3]s[3]. If there's a match, our border length will increase by one. By no means, we can obtain more than that.

You might be wondering, what happens if there isn't a match? Do we have to compare all characters at this point?

The simple answer is no, we don't. We illustrate the scenario in the next section.

Border inside border

Let's assume our string is s=MMAMMMAs=MMAMMMA. For the sake of simplicity, we are provided with the first 55 elements of its prefix function: p=[0,1,0,1,2,_ ,_ ]p = [0, 1, 0, 1, 2, \_\ ,\_\ ]. We're trying to figure out what p[6]p[6] is. We know that p[5]=2p[5] = 2 means that the border of s[1,5]s[1,5] is s[1,2]s[1,2] which is also equal to s[4,5]s[4,5]. Please keep that in mind while reading the rest of the section.

When we add a new character MM to make s[1,6]s[1,6], it doesn't match with s[3]s[3]. So, we can't just add 11 to the border length of p[5]p[5] like we did before. This means the new suffix MMMMMM can't be the border.

Observation 2

In this case, the border length can't be more than p[5]p[5] i.e. 22. This gives us a clue about what to do next. We need to look at the first two characters, which are s[1,2]s[1,2], and the last two characters, which are s[5,6]s[5,6].

To find the border of s[1,6]s[1,6], we need to see if there's any suffix of s[5,6]s[5,6] that is also a prefix of s[1,2]s[1,2]. But remember s[1,2]s[1,2] is the border of s[1,5]s[1,5], so s[1,2]s[1,2] = s[4,5]s[4,5]. Hence we can rephrase the highlighted sentence: to find the border of s[1,6]s[1,6], we need to check if there is any suffix of s[5,6]s[5,6] that is also a prefix of s[4,5]s[4,5].

If you think about what a border is, you'll understand from the last line that the border of s[1,6]s[1,6] is actually the same as the border of s[4,6]s[4,6].

Let's take a moment to appreciate what we've achieved so far. We've managed to shrink our border search area from a length of 66 down to just 33! That's a significant improvement in our algorithm.

Now, let's turn our attention to how we can determine the border of s[4,6]s[4,6], or in other words, the border of MMMMMM. To start, let's pose a question. If we tell you the border length of s[4,5]s[4,5] is xx, do you think you could figure out the border length of s[4,6]s[4,6]?

Absolutely, you can! Based on what we've learned before, the border length of s[4,6]s[4,6] will be 11 greater than the border length of s[4,5]s[4,5] if the new character s[6]s[6] matches with s[x+4]s[x+4]. If not, the border of s[4,6]s[4,6] will be the same as the border of s[6x,6]s[6-x, 6].

Hold on a second! s[4,5]s[4,5] is nothing but s[1,2]s[1,2]. And we know the border length of s[1,2]s[1,2] is p[2]=1p[2] = 1. So, the border length of s[1,6]s[1,6], which is also the border length of s[4,6]s[4,6], is p[2]+1p[2] + 1 if s[6]s[6] matches with s[2]s[2]. It's a match! So, we can confidently say that p[6]=p[2]+1p[6] = p[2] + 1, which equals 1+11 + 1, and that gives us 22.

Now pause for a moment and take a deep breath! You've now grasped the essential insights that form the foundation of our efficient algorithm for calculating the prefix function. With these two key observations under your belt, we now encourage you to confirm that p[7]=0p[7] = 0.

Efficient algorithm

The insights we've gathered thus far enable us to construct an accurate and efficient algorithm for computing the prefix function of a specified string. The procedure and corresponding pseudocode are outlined below. We recommend grabbing a sheet of paper and a pencil. Practice these steps using strings from the previous examples. You'll understand them quickly with a bit of hands-on exploration.

  1. Set p[1]=0p[1]=0.

  2. For every prefix s[1,i]s[1,i] of the string ss calculate the value of p[i]p[i] as follows: define j=p[i1]j=p[i-1].

  3. Compare the symbols s[j+1]s[j+1] and s[i]s[i]. If they match, set p[i]=j+1p[i] = j +1.

  4. Otherwise, take j=p[j]j=p[j] and repeat step 3. If we reach j=0j=0 and there is no match, we conclude that p[i]=0p[i] = 0.

function compute_prefix_function(s):
    n = length(s)
    p = array[1, n]
    p[1] = 0

    for i in [2, n]:
        j = p[i - 1]

        while (j > 0) and (s[j + 1] ≠ s[i]):
            j = p[j]
        
        if s[j + 1] = s[i] then
            p[i] = j + 1
        else
            p[i] = 0
    return p

Let us see how the algorithm works using an example. We are going to compute the prefix function for the string s=ACCABACCACs=ACCABACCAC. For obvious reasons, we are not going to explain every iteration of the algorithm. Assume that we've already calculated the values of the prefix function for i<7i<7 and got the result [0,0,0,1,0,1][0,0,0,1,0,1] (try it!). Let's calculate p[7]p[7] based on the algorithm. Define j=p[6]=1j=p[6] = 1. Check if s[2]=s[7]s[2] = s[7]. In our case, s[2]=Cs[2] = C and s[7]=Cs[7] = C — it is a match, hence we assert that p[7]=p[6]+1=2p[7] = p[6] + 1 = 2. Just like our first observation!

Example of efficient algorithm

Similarly, we get that p[8]=3p[8] = 3 and p[9]=4p[9] = 4. For the sake of better understanding, we will explain in detail the computation of p[10]p[10].

  • Define j=p[9]=4j=p[9] = 4.

  • Check if s[5]=s[10]s[5] = s[10]. We have s[5]=Bs[5] = B and s[10]=Cs[10] = C.

  • They are not the same, hence take j=p[j]j=p[j], i.e. j=p[4]=1j = p[4] = 1.

Why this step? Remember our second observation: Border(s[1,10])Border(s[1,10]) = Border(s[6,10])Border(s[6,10]). And to find the Border(s[6,10])Border(s[6,10]) we need to know the Border(s[6,9])Border(s[6,9]) which is the same as the Border(s[1,4])Border(s[1,4]) i.e. p[4]p[4])

  • Check again if s[1+1]=s[10]s[1+1] = s[10]. We obtain C=CC = C, a match. Hence we set p[10]=p[4]+1=2p[10] = p[4] + 1 = 2.

All the values have been found, therefore, the prefix function of the string s=ACCABACCACs=ACCABACCAC is [0,0,0,1,0,1,2,3,4,2][0,0,0,1,0,1, 2, 3, 4, 2].

Complexity analysis

After many efforts, we have come up with an effective algorithm. However, one question still remains: what is the time complexity of our new algorithm? The answer should cheer you up: the efficient algorithm works on O(n)O(n), which is far better than O(n3)O(n^3) of the naive approach. Anyway, why O(n)O(n) and not O(n2)O(n^2), if we have nn iterations and during each iteration we reduce jj until we get a match?

The answer is tricky: the total time of the algorithm depends on the total number of times we decrease the value of jj. However, during every iteration, we know that its value can't increase by more than one. Since we start from j=0j=0, the maximum value of jj can be n1n-1. This means that during all the iterations, we can't decrease jj more than nn times, therefore, the time complexity of our algorithm is O(n)O(n).

Why prefix function

At first, the prefix function was introduced in order to speed up the process of substring search. Indeed, it provides us with sufficient information on the structure of the string, so that we can avoid repetitions and unnecessary comparisons. This gives us one of the most effective substring search algorithms as we will soon see.

However, except for the classical application, there are dozens of problems that we can solve by using the prefix function smartly. Let's shortly mention some of them:

  1. Counting the number of occurrences of every prefix. Given a string ss, we find the number of occurrences of each prefix s[1,i]s[1,i] in it. Moreover, we can count the number of occurrences of these prefixes in another string ww. This task is essential in bioinformatics while examining long sequences of DNA, for example.

  2. Counting the number of different substrings in a string. The name is self-explanatory. An efficient algorithm for substring search is known as the Knuth-Morris-Pratt algorithm, which we will learn in a later topic. Such a task is fundamental for the following problem.

  3. Compressing a text. Given a string ss, we want to find a shorter string ww, such that ss is a concatenation of copies of ww. Text compression is crucial: think of a long text file. If we store the information as it is, it would take an immense amount of space; however, after compressing it, we end up with a file of several kilobytes.

Unfortunately, the algorithms solving these problems are out of the scope of our platform. Nevertheless, you will find some Easter eggs in the problem section.

Conclusion

Prefix function is a tool with many applications in string algorithms. It provides us with essential information on the repeating parts of the string, and not only about that. You can calculate the prefix function of a string of a length nn in O(n)O(n) time using these two key observations:

  • The border length will increase by one if the new character matches the character after the previous prefix.

  • Otherwise, the new border can be found inside the previous border.

In the following topic, we will see one of the most effective substring search algorithms, which uses the prefix function as a helping tool to avoid repetitions and unnecessary comparisons. For now, try to take in this theory and practice in the exercise section. Good job!

6 learners liked this piece of theory. 4 didn't like it. What about you?
Report a typo