String hashing

So far, we have familiarised ourselves with hashing techniques, hash functions, and unwanted collisions. If you recall from the previous topics, the hash functions we have seen take integers as an argument, which is rather limiting: this way, the only thing we can hash is numbers. What about strings? They are everywhere, so we need a method for hashing them as well.

Hash function and hash value

String hashing is a technique that consists of two main steps: transforming the string into a number, and then applying a hash function to it. An important advantage of string hashing is that, once the hash values are known, it makes it possible to compare two strings in $O(1)$, since we simply need to compare the strings' hash values: different hash values guarantee different strings, while equal hash values mean that the strings are equal with high probability. This property is used to efficiently solve various string problems, as we will see below. So let's get started!
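The comparison idea can be sketched even before we define a hash function of our own. The snippet below is a minimal illustration, assuming Python's built-in `hash` as a stand-in hash function and a hypothetical helper name `same_string`; it is not the method developed in this topic.

```python
def same_string(a: str, b: str, hash_a: int, hash_b: int) -> bool:
    """Compare two strings using hash values computed in advance.

    hash_a and hash_b are assumed to be the precomputed hashes of a and b.
    """
    if hash_a != hash_b:
        # Different hash values: the strings are certainly different.
        return False
    # Equal hash values: a collision is still possible, so confirm directly.
    return a == b


# Hashes are computed once and then reused for many cheap comparisons.
words = ["ABAC", "CBAA", "ABAC"]
hashes = [hash(w) for w in words]  # built-in hash used only as a stand-in

print(same_string(words[0], words[2], hashes[0], hashes[2]))
print(same_string(words[0], words[1], hashes[0], hashes[1]))
```

The integer comparison is the cheap part; the direct comparison is only reached when the hash values coincide.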

Building blocks

Usually, a hash for a string is calculated as follows: each symbol of the string is associated with a number, then a hash value is computed as a sum of these numbers with some coefficients. There are several ways to associate a symbol with a number, called its code: for example, one could use the ASCII code of each symbol. In this topic, we will use a simpler rule: $A$ corresponds to $1$ , $B$ corresponds to $2, ..., Z$ corresponds to $26$ . That is, each symbol is associated with its order number in the alphabet. The table below illustrates this approach:

Letter ( $s_i$ )	A	B	C	...	Y	Z
Numeric value ( $c_i$ )	1	2	3	...	25	26

This means that we have established a rule for converting letters to numbers. What is left to do is to choose a rule for combining these numbers. This rule is exactly the hash function that is going to be applied to the string.
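The conversion rule from the table can be sketched in a few lines. This is only an illustration: the function name `letter_code` is hypothetical, and the sketch assumes the input is a single uppercase Latin letter.

```python
def letter_code(symbol: str) -> int:
    """Return the position of an uppercase letter in the alphabet (A -> 1, ..., Z -> 26)."""
    if len(symbol) != 1 or not "A" <= symbol <= "Z":
        raise ValueError("expected a single uppercase letter from A to Z")
    # ord() gives the character's numeric code; shifting by ord("A") makes A map to 1.
    return ord(symbol) - ord("A") + 1


print([letter_code(ch) for ch in "ABAC"])  # [1, 2, 1, 3]
```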

Linear hashing

If you think for a moment, the first rule that comes to your mind is simply adding those numbers. Formally speaking, for a string $s = s_0s_1...s_{n-1}$ , a linear hash function $h_L$ is defined as a sum of the symbols' associated values $c_i$ :

$h_{L}(s) = c_0 + c_1 + ... + c_{n-1}.$

For example, a hash value for $s = ABAC$ is $h_{L}(ABAC) = 1 + 2 + 1 + 3 = 7$ , since $A$ is the first letter of the alphabet, so $c_0 = c_2 = 1$ , $B$ is the second, so $c_1 = 2$ , and $C$ is the third, so $c_3 = 3$ . Finally, we add these numbers, since we are using a linear hash function.

As expected, simplicity comes with drawbacks. A disadvantage of the linear hash function is that a hash value does not depend on the order of symbols. This means that if we reorder the symbols of a string, the hash value for the string won't change. For example, strings $s_1 = ABAC$ and $s_2 = CBAA$ are not equal, but they consist of the same symbols and thus have equal hash values:

$h_{L}(ABAC) = 1 + 2 + 1 + 3 = 7, \quad \quad \quad h_{L}(CBAA) = 3 + 2 + 1 + 1 = 7.$
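The same computation can be written as a short sketch. The function name `linear_hash` is chosen here for illustration, and the code assumes strings made of uppercase letters $A$ to $Z$ only.

```python
def linear_hash(s: str) -> int:
    """Sum of the letter codes (A -> 1, ..., Z -> 26) of an uppercase string."""
    return sum(ord(ch) - ord("A") + 1 for ch in s)


print(linear_hash("ABAC"))  # 1 + 2 + 1 + 3 = 7
print(linear_hash("CBAA"))  # 3 + 2 + 1 + 1 = 7, same value for a different string
```

Since addition ignores order, any rearrangement of the same letters produces the same value.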

As you remember from the previous topics, such situations, when two different strings have equal hash values, are called collisions. An important property of any hash function is how many strings it maps to the same hash value. The smaller the number of such strings, the better the hash function. In this respect, linear hashing is not the best choice, since the limitation described above results in many collisions.

Polynomial hashing

In order to cope with this obstacle and reduce the number of collisions, we need to come up with a different rule for combining the codes of each letter. Let's think (not overthink!) about how we can generalize the linear hash function. A reader with a solid math background would immediately shout: "polynomials!". And that is correct: this method turns out to be far better than the linear one. Formally speaking, for a string $s = s_0s_1...s_{n-1}$ , a polynomial hash function is defined as follows:

$h_{P}(s) = \left(\red{c_0} \cdot a^{0} + \red{c_1} \cdot a^{1} + ... + \red{c_{n-1}} \cdot a^{n-1} \right) \ mod \ m.$

Here, $c_i$ are the codes of each letter $s_i$ , the number $a$ is a constant, usually a prime number approximately equal to the total number of different symbols in the alphabet; $m$ is a constant as well, usually a big prime number. Rightfully, you would ask: why is that so? Let's try to explain it in a few words:

Large values of $a$ produce unnecessarily large hash values, which leads to uneven distribution. On the other hand, a value of $a$ that is too small increases the chance of collisions.
Taking modulo $m$ after calculating the polynomial hash value is a must, in order to avoid large hash values. You should recall this trick from the previous topics.
Why primes? A rigorous answer requires some math knowledge. Informally, however, one can say that prime numbers are good numbers that reduce the chance of collisions.

Let's consider how we can calculate the polynomial hash for $s = ACDC$ . For simplicity, we will use $a = 3$ (and not $a=26$ ) and $m = 11$ :

$h_{P}(ACDC) = (\red{1} \cdot 3^{0} + \red{3} \cdot 3^{1} + \red{4} \cdot 3^{2} + \red{3} \cdot 3^{3}) \ mod \ 11 = (1\cdot 1 + 3 \cdot 3 + 4 \cdot 9 + 3 \cdot 27) \ mod \ 11 = 127 \ mod \ 11 = 6.$
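The $ACDC$ example can be reproduced with a direct translation of the formula. This is a minimal sketch: the name `polynomial_hash` is chosen for illustration, and the code assumes uppercase letters $A$ to $Z$ with the codes from the table above.

```python
def polynomial_hash(s: str, a: int, m: int) -> int:
    """Compute (c_0 * a^0 + c_1 * a^1 + ... + c_{n-1} * a^{n-1}) mod m."""
    total = 0
    for i, ch in enumerate(s):
        code = ord(ch) - ord("A") + 1  # A -> 1, ..., Z -> 26
        total += code * a ** i
    return total % m


print(polynomial_hash("ACDC", a=3, m=11))  # (1 + 9 + 36 + 81) mod 11 = 127 mod 11 = 6
```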

Although the polynomial hash depends on the order of symbols in a string, collisions are still possible. For example, $s_1 = BBAB$ and $s_2 = ABCC$ are different strings with equal hash values:

$h_P(BBAB) = (\red{2} \cdot 3^{0} + \red{2} \cdot 3^{1} + \red{1} \cdot 3^{2} + \red{2} \cdot 3^{3}) \ mod \ 11 = 71 \ mod \ 11 = 5$

$h_P(ABCC) = (\red{1} \cdot 3^{0} + \red{2} \cdot 3^{1} + \red{3} \cdot 3^{2} + \red{3} \cdot 3^{3}) \ mod \ 11 = 115 \ mod \ 11 = 5.$
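A short sketch confirms this collision and shows that it disappears with larger parameters. The values $a = 31$ and $m = 10^9 + 7$ used in the second part are assumptions picked for illustration (both are prime); they do not come from this topic.

```python
def polynomial_hash(s: str, a: int, m: int) -> int:
    """Polynomial hash of an uppercase string with letter codes A -> 1, ..., Z -> 26."""
    return sum((ord(ch) - ord("A") + 1) * a ** i for i, ch in enumerate(s)) % m


# Small parameters from the example: the two strings collide.
print(polynomial_hash("BBAB", 3, 11))  # 71 mod 11 = 5
print(polynomial_hash("ABCC", 3, 11))  # 115 mod 11 = 5

# Illustrative larger parameters (an assumption): the hash values now differ.
A, M = 31, 10**9 + 7
print(polynomial_hash("BBAB", A, M) != polynomial_hash("ABCC", A, M))  # True
```

A larger $m$ does not rule collisions out; it only makes them much less likely.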

However, the probability of a collision between two different strings for the polynomial hash function is estimated to be approximately $\frac{1}{m}$ , which is quite low for a big $m.$ Thus, the polynomial hash function is a relatively good choice for string hashing, even though it requires more work per symbol than the linear one (a multiplication and a modulo operation on top of the addition).
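One common way to keep the computation cheap is to maintain the current power of $a$ modulo $m$ while scanning the string, so that each symbol costs a constant number of operations and no large intermediate numbers appear. The sketch below assumes uppercase letters $A$ to $Z$; the function name is hypothetical.

```python
def polynomial_hash_iterative(s: str, a: int, m: int) -> int:
    """Polynomial hash computed in one pass, reducing modulo m at every step."""
    h = 0
    power = 1  # holds a^i mod m for the current position i
    for ch in s:
        code = ord(ch) - ord("A") + 1  # A -> 1, ..., Z -> 26
        h = (h + code * power) % m
        power = (power * a) % m
    return h


print(polynomial_hash_iterative("ACDC", 3, 11))  # 6
print(polynomial_hash_iterative("BBAB", 3, 11))  # 5
```

Reducing modulo $m$ at every step gives the same result as reducing once at the end, because addition and multiplication are compatible with the modulo operation.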

String hashing in practice

It is worth noting that programming languages usually do not use linear hashing or the exact polynomial hashing described above for their built-in methods, although some come close: Java's String.hashCode, for instance, is a polynomial hash with $a = 31$ . The hash functions we mentioned above serve educational purposes and act as building blocks for more advanced approaches and techniques, as we will see in the following topic. Various algorithms are used for string hashing in practice. Some of the most popular ones include:

DJB2: The DJB2 hash function is simple and efficient. It starts from the initial value 5381 and iterates through each character in the input string, multiplying the current hash value by 33 and adding the numeric code of the character. This process continues until the entire string is hashed. DJB2 is known for its speed and reasonable distribution properties.
MurmurHash: MurmurHash is a family of hash functions designed for speed and good distribution. These functions are widely used in hash tables and other data structures. The MurmurHash algorithm applies a series of bitwise and arithmetic operations to the input string's bytes, producing a hash code.
SHA-256: The SHA-256 (Secure Hash Algorithm 256-bit) is a cryptographic hash function. While it's primarily designed for security, it's also used for hashing strings in non-security contexts due to its excellent distribution properties. SHA-256 produces a fixed-size (256-bit) hash code, which, combined with its algorithmic properties, contributes to its high resistance to collisions.
CityHash: CityHash is a hash function developed by Google. It is designed for performance and is particularly efficient with short strings. CityHash utilizes a combination of hashing techniques, including multiplicative and bitwise operations, to produce hash codes.
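To give a flavour of the simplest function in this list, here is a minimal sketch of DJB2. Two choices are assumptions made for this illustration: the string is hashed byte by byte in its UTF-8 encoding, and the result is truncated to 32 bits, whereas the width in a real implementation depends on the integer type used.

```python
def djb2(s: str) -> int:
    """DJB2 sketch: start at 5381, then hash = hash * 33 + byte for every byte."""
    h = 5381
    for byte in s.encode("utf-8"):  # assumption: hash the UTF-8 bytes of the string
        h = (h * 33 + byte) & 0xFFFFFFFF  # assumption: keep only the low 32 bits
    return h


print(djb2(""))   # 5381, the initial value
print(djb2("a"))  # 5381 * 33 + 97 = 177670
```

Structurally, this is close to the polynomial hash from the previous section, with 33 in the role of $a$ and the 32-bit truncation in the role of the modulus.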

As much as we would like to discuss these functions in detail, they are beyond the scope of this topic, so we invite you to read more about them on your own.

Conclusion

Here is a short summary of what we have discussed in this topic:

String hashing is a way to represent a string as a number.
It is useful for some string processing algorithms since hash values can be compared in $O(1)$ .
There are several ways to hash a string, linear and polynomial hashing being among them. The latter is a better choice for string hashing since it has fewer collisions.
In programming languages, more complicated functions are usually used, including DJB2, MurmurHash, SHA-256, and CityHash.
