Computer scienceAlgorithms and Data StructuresAlgorithmsString algorithmsSubstring search algorithms

Rabin-Karp algorithm

8 minutes read

For a pattern pp and a text tt, the simplest substring searching algorithm performs a symbol-by-symbol comparison of the pattern with each substring of the text, which results in
O(pt)O(|p| \cdot |t|) running time. If you work with large texts and performance is crucial, this might be too slow. To find a substring faster, we need to use more efficient approaches than the naive direct comparison.

One of the algorithms that help search a substring faster is the Rabin-Karp algorithm. It uses string hashing for a faster comparison thus significantly reducing the total running time compared with the naive approach.

Algorithm description

The main idea of the Rabin-Karp algorithm is to use string hashing and compare strings' hash values instead of a direct comparison. To compute the substrings' hash values fast, the algorithm uses the polynomial hash functions and their rolling hash property. In more detail, the algorithm can be formulated as follows:

  1. Calculate the hash value for a pattern.
  2. Moving along the text (from right to left), calculate the hash value for the current substring of the text using the rolling hash property.
  3. If hash values for the pattern and the current substring are not equal, move to the next substring. Otherwise, perform a symbol-by-symbol comparison of the strings. If the strings are indeed equal, add the current index to a list of occurrences.
  4. Repeat steps 2-3 until all substrings of the text are processed. Then, return a list of all occurrences.

An example

Consider how the algorithm works for a pattern ACDCACDC and a text BACDCCBABACDCCBA. For simplicity, we will use the following constants for the polynomial hash function: a=3a = 3 and m=11m = 11. First, we need to calculate the hash for the pattern:

hP(ACDC)=(130+331+432+333) mod 11=127 mod 11=6.h_P(ACDC) = \left(1 \cdot 3^{0} + 3 \cdot 3^{1} + 4 \cdot 3^{2} + 3 \cdot 3^{3} \right) \ mod \ 11 = 127 \ mod \ 11 = 6.

The next step is to calculate the hash value for the first suffix of the text:

hP(CCBA)=(330+331+232+133) mod 11=57 mod 11=2h_P(CCBA) = \left(3 \cdot 3^{0} + 3 \cdot 3^{1} + 2 \cdot 3^{2} + 1 \cdot 3^{3} \right) \ mod \ 11 = 57 \ mod \ 11 = 2.

Since hP(CCBA)hP(ACDC)h_P(CCBA) \ne h_P(ACDC), we know that the strings are not equal as well. The next step is to calculate the hash value for the next substring of the text. Using the rolling hash property, you can do that as follows:

hP(DCCB)=((hP(CCBA)133)3+4) mod 11=((227)3+4) mod 11=6.h_P(DCCB) = \left(\left(h_P(CCBA)-1 \cdot 3^{3}\right) \cdot 3 + 4 \right) \ mod \ 11 = \left((2-27) \cdot 3 + 4\right) \ mod \ 11 = 6.

The hash values of the pattern and the current substring are equal, so we need to check that the strings are indeed equal. Since DCCBACDCDCCB \ne ACDC, the attempt is not successful, so we move to the next substring:

hP(CDCC)=((hP(DCCB)233)3+3) mod 11=((654)3+3) mod 11=2.h_P(CDCC) = \left(\left(h_P(DCCB)-2 \cdot 3^{3}\right) \cdot 3 + 3 \right) \ mod \ 11 = \left((6-54) \cdot 3 + 3\right) \ mod \ 11 = 2.

We can see that this hash value is not equal to the hash value of the pattern, thus the strings are not equal. The next substring to consider is ACDCACDC:

hP(ACDC)=((hP(CDCC)333)3+1) mod 11=((281)3+1) mod 11=6.h_P(ACDC) = \left(\left(h_P(CDCC)-3 \cdot 3^{3}\right) \cdot 3 + 1 \right) \ mod \ 11 = \left((2-81) \cdot 3 + 1\right) \ mod \ 11 = 6.

The hash values as well as substring match, hence we have found an occurrence. The last substring to process is BACDBACD:

hP(BACD)=((hP(ACDC)333)3+2) mod 11=((681)3+2) mod 11=8.h_P(BACD) = \left(\left(h_P(ACDC)-3 \cdot 3^{3}\right) \cdot 3 + 2 \right) \ mod \ 11 = \left((6-81) \cdot 3 + 2\right) \ mod \ 11 = 8.

Since hP(BACD)hP(ACDC)h_P(BACD) \ne h_P(ACDC), the strings are not equal. All the substrings of the text are processed and the algorithm is finished.

Complexity analysis

Let's estimate the running time of the algorithm for a pattern pp and a text tt (assuming that tp|t| \ge |p|).

The first step of the algorithm is to calculate the hash value for the pattern and then for the first suffix of the text. This requires 2p2 \cdot |p| operations. After that, we calculate the hash value for the remaining tp|t|-|p| substrings of the text using the rolling hash property. Each of these calculations can be done in O(1)O(1). When the hash values of the pattern and the current substring are equal, we perform a symbol-by-symbol comparison which requires p|p| operations in the worst case. So, the overall running time of the algorithm can be estimated as

2p+(tp+occp)=O(t+occp),2 \cdot |p| + (|t|-|p| + occ \cdot |p|) = O(|t| + occ \cdot |p|),

where occocc is the number of times the pattern occurs in the text.

Note that in some cases, the algorithm may work in O(tp)O(|t| \cdot |p|). For example, for p=AAAp = \textrm{AAA} and t=AAAAAAAAt = \textrm{AAAAAAAA}, the algorithm will perform a symbol-by-symbol comparison of the pattern with each substring of the text, which results in the worst-case running time.

Also, the same situation may happen if there are a lot of collisions between the pattern and substrings of the text. However, for the polynomial hash function, the probability of a collision is 1m\approx \frac{1}{m}, which is low for a big mm.

To sum up, although in some cases the Rabin-Karp algorithm may work in O(tp)O(|t| \cdot |p|), in reality, such cases are quite rare, which makes this algorithm a very good and efficient alternative to the naive substring searching approach.

72 learners liked this piece of theory. 7 didn't like it. What about you?
Report a typo