Computer scienceAlgorithms and Data StructuresAlgorithmsString algorithmsSubstring search algorithms

Rolling hash

5 minutes read

We are already familiar with several string hashing functions, including linear and polynomial hash functions. This topic describes not a distinct function, but rather a property of hash functions that is applicable to both linear and polynomial hashing. Clever usage of this property is a cornerstone to building efficient string algorithms: substring search, text comparison, data compression, and many more.

The logic behind rolling hash

Let's consider the substring search task for a moment: we need to tell if a string s1s_1 is a substring of another string s2s_2. Somehow, we need to check if any of the substrings of s2s_2 matches s1s_1. We have previously learned an ingenious trick: we can compare not substrings, but their hash values. However, calculating hash values can be expensive in terms of time efficiency. This is when the rolling hash property comes in handy and allows us to efficiently solve some substring processing problems.

But what is the idea behind rolling hash? In three words, eliminating repetitive calculations. Imagine you are searching for DABRADABRA in ABRACADABRAABRACADABRA. Say, you have calculated the hash value of ABRACABRAC, compared it with the hash value of DABRADABRA, and concluded that it is a mismatch. Then, it is logical to compute the hash value of BRACABRACA.

However, we see that BRACBRAC is common for ABRACABRAC and BRACABRACA, so we perform calculations twice on this substring instead of remembering the value from the previous calculations. This is exactly what rolling hash does: it stores the result for BRACBRAC, computes the value for the new AA in the end of BRACABRACA, and subtracts the already calculated value of the AA in the beginning of ABRACABRAC. Now, the name rolling hash should make more sense: it is just like rolling:

Rolling like a hash

Linear rolling hash

The idea described above corresponds to the linear rolling hash property. Let's see an example: consider a string s=ABBCCBs = ABBCCB. Let's use the linear hash function to calculate a hash for a prefix of ss of length 44:

hL(ABBC)=s0+s1+s2+s3=1+2+2+3=8. h_L(ABBC) = s_0 + s_1 + s_2 + s_3 = 1 + 2 + 2 + 3 = 8.

Now, assume we need to calculate a hash value for the next substring of ss of length 44. This can be done as follows:

hL(BBCC)=s1+s2+s3+s4=2+2+3+3=10.h_L(BBCC) = s_1 + s_2 + s_3 + s_4 = 2 + 2 + 3 + 3 = 10.

Note that the second sum can also be obtained if we subtract s0s_0 from the first sum and then add s4s_4:

hL(BBCC)=hL(ABBC)s0+s4=81+3=10.h_L(\green{BBC}\blue{C}) = h_L(\red{A}\green{BBC})-\red{s_0} + \blue{s_4} = 8-1 + 3 = 10.

Thus, if we know a hash value hL(si...sj)h_L(s_i...s_j), a hash value for the next substring can be calculated in O(1)O(1) as follows:

hL(si+1...sjsj+1)=hL(sisi+1...sj)si+sj+1.h_L(\green{s_{i+1}...s_j}\blue{s_{j+1}}) = h_L(\red{s_{i}}\green{s_{i+1}...s_{j}})-\red{s_i} + \blue{s_{j + 1}}.

So, instead of jij-i additions, we perform only 2 elementary operations: one addition and one subtraction. This property of a hash function is called a rolling hash.

Polynomial rolling hash

The same property holds to the polynomial hash function: the only difference is that we need to start the calculations from the last substring (suffix) of ss for the property to work. We will explain that in a bit, but first, let's get back to the example from the section above.

Consider a string s=ABBCCBs = ABBCCB. Let's use the polynomial hash function to calculate a hash for a suffix of ss of length 44:

hP(BCCB)=(s230+s331+s432+s533) mod 11=(230+331+332+233) mod 11= 92 mod 11=4.\begin{align*} h_P(BCCB) = &\left(s_2 \cdot 3^{0} + s_3 \cdot 3^{1} + s_4 \cdot 3^{2} + s_5 \cdot 3^{3} \right) \ mod \ 11 \\ = &\left(2 \cdot 3^{0} + 3 \cdot 3^{1} + 3 \cdot 3^{2} + 2 \cdot 3^{3} \right) \ mod \ 11 \\ = &\ 92 \ mod \ 11 = 4. \end{align*}

The next substring of ss from the end has the following hash value: hP(BBCC)=(s130+s231+s332+s433) mod 11=(230+231+332+333) mod 11= 116 mod 11=6.\begin{align*} h_P(BBCC) = &\left(s_1 \cdot 3^{0} + s_2 \cdot 3^{1} + s_3 \cdot 3^{2} + s_4 \cdot 3^{3} \right) \ mod \ 11 \\ = &\left(2 \cdot 3^{0} + 2 \cdot 3^{1} + 3 \cdot 3^{2} + 3 \cdot 3^{3} \right) \ mod \ 11 \\ = &\ 116 \ mod \ 11 = 6. \end{align*}

Using the hash value for the first substring, the second one can be obtained as follows:

hP(BBCC)=((hP(BCCB)s533)3+s1) mod 11=((4227)3+2) mod 11= 148 mod 11=6.\begin{align*} h_P(\red{B}\green{BCC}) = &\left( \left( h_P(\green{BCC}\blue{B})-\blue{s_5} \cdot 3^{3} \right) \cdot 3 + \red{s_1} \right) \ mod \ 11 \\ = &\left( \left(4-2 \cdot 27\right) \cdot 3 + 2 \right) \ mod \ 11 \\ = &\ -148 \ mod \ 11 = 6. \end{align*}

So, if we know the hash value hP(si+1si+2...sj)h_P(s_{i+1}s_{i+2}...s_{j}), a hash value for a neighboring substring can be calculated as follows:

hP(sisi+1...sj1)=((hP(si+1...sj1sj)sjaji1)a+si) mod m.h_P(\red{s_{i}}\green{s_{i+1}...s_{j-1}}) = \left(\left( h_P(\green{s_{i+1}...s_{j-1}}\blue{s_j})-\blue{s_j} \cdot a^{\blue{j}-\green{i-1}} \right) \cdot a + \red{s_i} \right) \ mod \ m.

It's finally time to explain this scary formula, if you are wondering, why it looks like that.
  • Why, unlike linear rolling hash, here we have to start calculations from suffixes, and not prefixes? — This right-to-left approach is required for the implementation of the Rabin-Karp algorithm, as we will see in the following topic. The left-to-right approach is also possible.
  • Where does aji1a^{j-i-1} come from? — Notice that ji1j-i-1 is the length of the string sisi+1sj1s_is_{i+1}\dots s_{j-1}, so this is all about shifting. Indeed:
    • When jij-i is 1 (meaning there's only one character between sis_i and vv), aji1a^{j-i-1} equals a0a^0, which is 1. So, the character at sjs_j has its original value.
    • As you move further to the left, the weight of each character decreases exponentially. The character at sj1s_{j-1} has a weight of aa, the one before that has a weight of a2a^2, and so on.

  • Why do we multiply by aa? — Similarly, this multiplication simulates shifting the rolling hash one position to the right in the string.

Conclusion

So, here is a short summary of what we discussed in this topic:

  • Rolling hash is a property applicable to both linear and polynomial hash functions in string processing.

  • The key idea is to eliminate redundant hash calculations by updating the hash value incrementally as we move through the string.

  • In linear rolling hash, calculations start from the prefix of the string and efficiently update the hash as the window moves from left to right.

  • Polynomial rolling hash, on the other hand, uses a right-to-left approach, starting from the suffix of the string.

  • Rolling hash is a fundamental concept in string processing, optimizing hash value computation for various substring-related algorithms and applications. It enables efficient solutions to substring search, text comparison, data compression, and more.

4 learners liked this piece of theory. 0 didn't like it. What about you?
Report a typo