Computer scienceAlgorithms and Data StructuresAlgorithmsString algorithmsSubstring search algorithms

Rolling hash

5 minutes read

We are already familiar with several string hashing functions, including linear and polynomial hash functions. This topic describes not a distinct function, but rather a property of hash functions that is applicable to both linear and polynomial hashing. Clever usage of this property is a cornerstone to building efficient string algorithms: substring search, text comparison, data compression, and many more.

The logic behind rolling hash

Let's consider the substring search task for a moment: we need to tell if a string $s_1$ is a substring of another string $s_2$ . Somehow, we need to check if any of the substrings of $s_2$ matches $s_1$ . We have previously learned an ingenious trick: we can compare not substrings, but their hash values. However, calculating hash values can be expensive in terms of time efficiency. This is when the rolling hash property comes in handy and allows us to efficiently solve some substring processing problems.

But what is the idea behind rolling hash? In three words, eliminating repetitive calculations. Imagine you are searching for $DABRA$ in $ABRACADABRA$ . Say, you have calculated the hash value of $ABRAC$ , compared it with the hash value of $DABRA$ , and concluded that it is a mismatch. Then, it is logical to compute the hash value of $BRACA$ .

However, we see that $BRAC$ is common for $ABRAC$ and $BRACA$ , so we perform calculations twice on this substring instead of remembering the value from the previous calculations. This is exactly what rolling hash does: it stores the result for $BRAC$ , computes the value for the new $A$ in the end of $BRACA$ , and subtracts the already calculated value of the $A$ in the beginning of $ABRAC$ . Now, the name rolling hash should make more sense: it is just like rolling:

Rolling like a hash

Linear rolling hash

The idea described above corresponds to the linear rolling hash property. Let's see an example: consider a string $s = ABBCCB$ . Let's use the linear hash function to calculate a hash for a prefix of $s$ of length $4$ :

$h_L(ABBC) = s_0 + s_1 + s_2 + s_3 = 1 + 2 + 2 + 3 = 8.$

Now, assume we need to calculate a hash value for the next substring of $s$ of length $4$ . This can be done as follows:

$h_L(BBCC) = s_1 + s_2 + s_3 + s_4 = 2 + 2 + 3 + 3 = 10.$

Note that the second sum can also be obtained if we subtract $s_0$ from the first sum and then add $s_4$ :

$h_L(\green{BBC}\blue{C}) = h_L(\red{A}\green{BBC})-\red{s_0} + \blue{s_4} = 8-1 + 3 = 10.$

Thus, if we know a hash value

h_L(s_i...s_j)

, a hash value for the next substring can be calculated in

O(1)

as follows:

$h_L(\green{s_{i+1}...s_j}\blue{s_{j+1}}) = h_L(\red{s_{i}}\green{s_{i+1}...s_{j}})-\red{s_i} + \blue{s_{j + 1}}.$

So, instead of $j-i$ additions, we perform only 2 elementary operations: one addition and one subtraction. This property of a hash function is called a rolling hash.

Polynomial rolling hash

The same property holds to the polynomial hash function: the only difference is that we need to start the calculations from the last substring (suffix) of $s$ for the property to work. We will explain that in a bit, but first, let's get back to the example from the section above.

Consider a string $s = ABBCCB$ . Let's use the polynomial hash function to calculate a hash for a suffix of $s$ of length $4$ :

$\begin{align*} h_P(BCCB) = &\left(s_2 \cdot 3^{0} + s_3 \cdot 3^{1} + s_4 \cdot 3^{2} + s_5 \cdot 3^{3} \right) \ mod \ 11 \\ = &\left(2 \cdot 3^{0} + 3 \cdot 3^{1} + 3 \cdot 3^{2} + 2 \cdot 3^{3} \right) \ mod \ 11 \\ = &\ 92 \ mod \ 11 = 4. \end{align*}$

The next substring of $s$ from the end has the following hash value: $\begin{align*} h_P(BBCC) = &\left(s_1 \cdot 3^{0} + s_2 \cdot 3^{1} + s_3 \cdot 3^{2} + s_4 \cdot 3^{3} \right) \ mod \ 11 \\ = &\left(2 \cdot 3^{0} + 2 \cdot 3^{1} + 3 \cdot 3^{2} + 3 \cdot 3^{3} \right) \ mod \ 11 \\ = &\ 116 \ mod \ 11 = 6. \end{align*}$

Using the hash value for the first substring, the second one can be obtained as follows:

$\begin{align*} h_P(\red{B}\green{BCC}) = &\left( \left( h_P(\green{BCC}\blue{B})-\blue{s_5} \cdot 3^{3} \right) \cdot 3 + \red{s_1} \right) \ mod \ 11 \\ = &\left( \left(4-2 \cdot 27\right) \cdot 3 + 2 \right) \ mod \ 11 \\ = &\ -148 \ mod \ 11 = 6. \end{align*}$

So, if we know the hash value

h_P(s_{i+1}s_{i+2}...s_{j})

, a hash value for a neighboring substring can be calculated as follows:

$h_P(\red{s_{i}}\green{s_{i+1}...s_{j-1}}) = \left(\left( h_P(\green{s_{i+1}...s_{j-1}}\blue{s_j})-\blue{s_j} \cdot a^{\blue{j}-\green{i-1}} \right) \cdot a + \red{s_i} \right) \ mod \ m.$

It's finally time to explain this scary formula, if you are wondering, why it looks like that.

Why, unlike linear rolling hash, here we have to start calculations from suffixes, and not prefixes? — This right-to-left approach is required for the implementation of the Rabin-Karp algorithm, as we will see in the following topic. The left-to-right approach is also possible.
Where does $a^{j-i-1}$ come from? — Notice that $j-i-1$ is the length of the string $s_is_{i+1}\dots s_{j-1}$ , so this is all about shifting. Indeed:
- When $j-i$ is 1 (meaning there's only one character between $s_i$ and $v$ ), $a^{j-i-1}$ equals $a^0$ , which is 1. So, the character at $s_j$ has its original value.
- As you move further to the left, the weight of each character decreases exponentially. The character at $s_{j-1}$ has a weight of $a$ , the one before that has a weight of $a^2$ , and so on.
Why do we multiply by $a$ ? — Similarly, this multiplication simulates shifting the rolling hash one position to the right in the string.

Conclusion

So, here is a short summary of what we discussed in this topic:

Rolling hash is a property applicable to both linear and polynomial hash functions in string processing.
The key idea is to eliminate redundant hash calculations by updating the hash value incrementally as we move through the string.
In linear rolling hash, calculations start from the prefix of the string and efficiently update the hash as the window moves from left to right.
Polynomial rolling hash, on the other hand, uses a right-to-left approach, starting from the suffix of the string.
Rolling hash is a fundamental concept in string processing, optimizing hash value computation for various substring-related algorithms and applications. It enables efficient solutions to substring search, text comparison, data compression, and more.

4 learners liked this piece of theory. 0 didn't like it. What about you?

Report a typo

Rolling hash

The logic behind rolling hash

Linear rolling hash

Polynomial rolling hash

Conclusion

Related topics