When we search inside a string, there are certain questions that come to mind. Does a string have a pattern? And if it has some recurrent parts, how many of them are there? We need to know how to answer them because they can direct our actions: for example, if a part of a string repeats itself, we can easily compress this string, and so on.
In this topic, we will introduce a unique structure called prefix function that will allow us to investigate the borders of substrings. Then, we will develop an approach to find the prefix function using its definition.
Border and its length
Recall that a border of a string is a proper prefix of that is also a suffix of that string. It is worth noting that a string usually has more than one border and it is completely normal for borders to intersect themselves. Let's escape from abstraction by taking a look at a couple of examples:
An array of border lengths
We are ready to define our main tool now, the prefix function. First of all, let's grasp the main idea:
A prefix function is an array specific to a string. So, let's consider a string first: .
Our computations will yield an array of numbers, the length of which corresponds to the length of the string. Specifically, for our scenario, the prefix function will produce an array with a length of .
Moving forward, we'll examine particular substrings from our provided string. For instance, and so forth are all substrings of . Nonetheless, our focus will be solely on those substrings that initiate from the start of our provided string, in other words, all the prefixes. This includes and .
Now we will examine the borders of those substrings one by one.
Starting with the first substring , doesn't have a border, as it lacks a proper prefix. By definition, a proper suffix or prefix cannot equal the string itself.
Next, the substring also doesn't have a border because it doesn't have a proper prefix that is also a suffix.
The substring does have a border, as serves as both the prefix and suffix. Consequently, it has a border length of .
The substring also has a border length of .
has as its border, resulting in a border length of .
has a border length of .
doesn't have a border.
Finally, also lacks a border.
The prefix function for is the array that consists of the lengths of the borders for those substrings, that is, .
Definition of the prefix function
If the example above is clear now, let's move on to the official definition. Given a string of length . One can define the prefix function of as an array of length , where is the length of the longest border of the substring . It is obvious that , since a string of length can't have nonempty proper prefixes.
In order to get a better understanding of this notion, let's calculate the prefix function of some strings. Take the string and another string . First of all, we will examine the prefixes of those two strings and find the longest borders for each one of them:
Following the definition, after some observations (see the picture on the left), we conclude that the prefix function of the first string is . For example, , because the longest border of the substring is , whose length is
String , and it is a bit trickier! Keep in mind that a border is a proper prefix-suffix and it might intersect itself. Here, we conclude that as in the picture on the right.
It might seem contradictory that the prefix function is an array and not a function. Well, the definition of an array is a simple and correct way to get an idea of it. Formally, the prefix function is a function from the set (the of the prefixes ) to (the possible lengths of the borders). However, this won't play any role later on; it is simply to recognize that the word function in the prefix function makes actual sense and is not just put randomly there.
An algorithm from the definition
So, how did we calculate the prefix functions above? Directly by definition – a method that is known as the naive approach. Here is the pseudocode for this approach:
function compute_prefix_function(s)
n = length(s)
p = array[1, n]
p[1] = 0
for i in [2, n]:
p[i] = 0
for j in [1, i-1]:
if s[1, j] = s[i-j+1, i]:
p[i]=j
return pHere for each one of prefixes of the initial string, we find the longest border by comparing the prefixes and suffixes of every length to detect the borders, and then we choose the longest amongst them. Every steps are according to our official definition of the prefix function. Now, let's analyze its performance.
Complexity
There are substrings in our string . For each one, it takes time to find the longest border since we compare two strings in linear time for every possible length. Hence the general time complexity is i.e. which is too long.
In practice, strings can have enormous lengths, so we need to come up with a much more efficient algorithm. In order to accomplish this, we will utilize the characteristics of borders in a future topic.
Conclusion
In this topic, we introduced a new structure of string called the prefix function and developed a naive algorithm to find it. Here are the steps of our algorithm:
Create an empty array of length .
For each find the length of the largest border of the prefix .
Understanding the prefix function is crucial for developing advanced string algorithms. Therefore, it is important to properly comprehend this topic. Later on, we will explore a more efficient algorithm with linear time complexity to determine the prefix function of a string.