Computer scienceAlgorithms and Data StructuresAlgorithmsPrinciples and techniquesHashing

Simplistic hash functions

6 minutes read

Enough with the abstraction, right? In the previous topic, we've studied the features of hash functions, but we haven't seen one in action yet. The hash functions employed in everyday programming usually take one type of value and return integers or strings. They can be rather involved, so in this section, we'll describe some simple and educational examples that still demonstrate how hash functions operate.

To employ hash functions, we need to understand the notation first. Typically, hash functions are denoted by hh. The hash value for a specific item xx is denoted by h(x)h(x). Let's look at some of the simplest ways to build hash functions for standard data types!

Integers

How can we hash integers? One approach is to use the identity function, in other words h(x)=x,h(x) = x, or the modulo operation: h(x)=x % ph(x) = x \ \%\ p, for some number pp. The modulo operation x % px\ \%\ p gives the remainder when dividing xx by pp. The first function appears perfect: it offers maximum speed since we don't calculate anything, and it's clearly deterministic and uniform with no collisions. However, its drawback is the wide range of hash values, which is where the modulo variation helps. With modulo, the possible hash values will only be 0,1,,p10, 1, \dots, p-1.

Linear hash

Let's understand these formulas clearly by seeing how they work with a random number, like 1010. The identity function always returns the input number, so we get h(10)=10h(10)=10. For the modulo operation, if we choose p=7p = 7, we get h(10)=10 % 7=3h(10) = 10\ \%\ 7 = 3. If we choose p=10p = 10, we get h(10)=10 % 10=0h(10) = 10\ \%\ 10 = 0.

Integer arrays

What about arrays? Suppose we have the array [v1,v2,...,vn][v_1, v_2,...,v_n] and you want to calculate its hash value. The answer is simple: pick a number pp and use the formula hn=v1pn1+v2pn2+...+vnh_n= v_1*p^{n-1} + v_2*p^{n-2}+...+v_n. Directly calculating it this way would result in a relatively large time complexity, but we can get around this by defining h0=0h_0 = 0 and then using the following operations: hi+1=hip +vi+1h_{i+1}=h_i*p\ +v_{i+1}. The hash value of the array will be hnh_n. Once again, to avoid wide sets of hash values, we can take the values modulo qq, for some number qq. That is, we can apply the formula hi+1=(hip +vi+1)  mod  qh_{i+1}=(h_i*p\ +v_{i+1})\; mod\;q.

Integer arrays

To clarify, let's calculate the hash value for the array [1,2,3,4][1, 2, 3, 4] using p=5p = 5:

  • h1=h0p + v1=05 + 1=1h_1 = h_0*p\ +\ v_1 = 0*5\ +\ 1=1

  • h2=h1p + v2=15 + 2=7h_2 = h_1*p\ +\ v_2 = 1*5\ +\ 2=7

  • h3=h2p + v3=75 + 3=38h_3 = h_2*p\ +\ v_3 = 7*5\ +\ 3=38

  • h4=h3p + v4=385 + 4=194h_4 = h_3*p\ +\ v_4 = 38*5\ +\ 4=194

So we arrive at a hash value of 194194.

In conclusion, this formula considers every element of the array, even though in different proportions. What happens if we just take the regular sum of all elements? Then all arrays containing the same set of elements, possibly arranged in a different order, will have the same hash value. However, they are not identical! This trick enables us to differentiate between such cases.

Strings

Is it true that only integers are hashable? The answer is no! In terms of strings or string arrays, everything remains equivalent, except we first convert the strings to integers. This can be attained in several ways: using the Unicode table, other coding systems, or even hash functions that accept strings and output integers. For example, let's hash the string "Hash" using ASCII code.

String

  1. Firstly, we need to convert characters to integers. One way to do it is to assign to every character its corresponding HEX ASCII code. In our case, H48,a61,s73,h68.H \to 48, \quad a \to 61, \quad s \to 73, \quad h \to 68.

  2. Notice that now we don't have to hash a string anymore, but a sequence of integers, namely [48,61,73,68][48, 61, 73, 68]. How cool is that?

  3. The rest is clear, we hash this array of integers as explained in the previous section using, for instance, p=5p=5 and q=103q=103.

  4. We end up with h("Hash")=27h(\text{"Hash"}) = 27 (try it, maybe the author is mistaken...).

Similarly, we can hash arrays of arrays: first, calculate the hash of the inner arrays, and then the hash of the array of the newly established hashes. We can go even further and calculate hash values of arrays of strings, arrays of arrays of arrays of strings, and whatever is hashable, no matter how complicated it might look.

Conclusion

In this topic, we have finally studied concrete hash functions:

  • We've learned that hash functions are typically denoted by hh, and the hash value for a specific item xx is denoted by h(x)h(x).

  • We discussed how to hash integers using the identity function or the modulo operation.

  • We explored how to hash integer arrays, using a formula that considers every element of the array.

  • We established that not only integers are hashable, but strings and string arrays can also be hashed by first converting them to integers.

  • Finally, we discovered that we can hash arrays of arrays, and even more complex structures, once we understand the basic principles.

The beauty of hash functions lies in their simplicity, yet they play a crucial role in many areas of computer science, from data retrieval to cryptography. Understanding them is a key step in becoming a proficient programmer.

18 learners liked this piece of theory. 4 didn't like it. What about you?
Report a typo