Hashing is an intriguing concept that has a big impact on how we manage data efficiently. It is like a magic trick that simplifies complex information into neat compartments, making it easier to find what we need quickly. However, collisions might happen when different pieces of information accidentally end up in the same compartment, causing some trouble when we try to retrieve our data. In this article, we'll explore the challenges of collisions and discover clever strategies to handle them effectively.
Collisions are not cool
In the realm of hash functions, collisions occur when different inputs produce the same hash value. It's like having two different keys that unlock the same door. Collisions are unintended clashes that arise due to the finite range of hash values and the potentially infinite set of input data. They can happen in various applications that utilize hash functions, including hash tables, data encryption, checksums, and digital signatures.
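To make this concrete, here is a minimal sketch in Python, assuming a toy hash function that simply takes an integer key modulo the number of buckets (not any particular real-world hash):

```python
# A toy hash function: map an integer key to one of 4 buckets.
def toy_hash(key, num_buckets=4):
    return key % num_buckets

# Two different keys end up in the same bucket: 5 and 13 both map to bucket 1, a collision.
print(toy_hash(5))   # 1
print(toy_hash(13))  # 1
```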
Understandably, collisions present several drawbacks and challenges:
- Hash table performance: As we already know, collisions can increase the time required for operations that rely on hash tables, such as searching, indexing, and data retrieval. This slows things down, which is a real problem since speed is the main reason for using a hash table in the first place.
- Loss of uniqueness: Collisions occur when different inputs produce the same hash value, making it difficult to differentiate them solely based on their hash values. This can lead to data integrity issues, as distinct inputs may be mistakenly identified as identical.
- Security implications: Collisions can be exploited by malicious actors to bypass security measures. This creates vulnerabilities in systems that rely on hash functions for authentication or data protection.
This is exactly why our main goal is to minimize the number of collisions as much as possible. Since collisions in cryptographic hashes are largely dealt with by choosing the right function, it makes sense to focus primarily on hash tables in the rest of this topic.
A good hash function
Quite often, the values we want to add to a hash table are not random. It may happen that, due to their properties, some hash values appear more frequently than others. Let's take a look at an example.
We want to store prime numbers larger than 2 in a hash table using the hash function h(x) = x mod 4. If the numbers to store were random, the hash values would be distributed uniformly. Let's see what happens when we add 5, 7, 11, 13, 19, and 23:
| Bucket | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Values | {} | {5, 13} | {} | {7, 11, 19, 23} |
Those numbers don't seem uniformly distributed! As all prime numbers larger than 2 are odd, we will never have anything in buckets 0 and 2. Let's see what happens if we pick a different hash function, like h(x) = x mod 5:
| Bucket | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Values | {5} | {11} | {7} | {13, 23} | {19} |
Much better! Choosing a good hash function is essential for a fast hash table, but hash functions almost always produce collisions. Moreover, finding a good hash function is often not easy, and there is no standard way to find one.
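If you want to reproduce the two tables above, here is a small sketch (the helper name distribute is just for illustration):

```python
primes = [5, 7, 11, 13, 19, 23]

def distribute(values, num_buckets):
    # Place each value into the bucket given by value mod num_buckets.
    buckets = [[] for _ in range(num_buckets)]
    for v in values:
        buckets[v % num_buckets].append(v)
    return buckets

print(distribute(primes, 4))  # [[], [5, 13], [], [7, 11, 19, 23]]
print(distribute(primes, 5))  # [[5], [11], [7], [13, 23], [19]]
```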
Another way of reducing and handling collisions, even without a perfect hash function, is keeping a maximum load factor. Recall that when the load factor of a hash table gets greater than this maximum, we increase the number of buckets. Obviously, this is far better than adding more elements to existing buckets, which would further increase the number of collisions.
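As a rough sketch, the resizing rule boils down to a single comparison. The threshold of 0.75 below matches the example later in this topic, but the exact value is an implementation choice:

```python
MAX_LOAD_FACTOR = 0.75  # assumed threshold; real implementations pick their own value

def needs_resize(num_elements, num_buckets):
    # Load factor = number of stored elements / number of buckets.
    return num_elements / num_buckets > MAX_LOAD_FACTOR

print(needs_resize(3, 4))  # False: 3/4 = 0.75 is not greater than 0.75
print(needs_resize(4, 4))  # True: 4/4 = 1.0 exceeds the threshold, time to double the buckets
```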
Chaining
Until now, we tried to reduce the number of collisions as much as possible. But it still doesn't guarantee that each element will fall into a different bucket. Now, let's look at some ways to handle collisions when they appear.
The most natural thing to do is to store the buckets as linked lists: this is called chaining. When you add a new element to a bucket, you check all values in the list to see if any of them are equal to the new one. If none of them is equal to the new element, add it to the end of the list. Searching and removing are similar. All operations take linear time in the size of the linked list, so it is still important to have small buckets. Let's look at an example where we ignore the load factor.
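Here is a compact sketch of chaining in Python. It uses ordinary lists as the chains and a simple modulo of the identity hash; a full implementation would also track the load factor and resize, which we ignore here just like in the example:

```python
class ChainedHashTable:
    def __init__(self, num_buckets=4):
        # Each bucket is a chain (a Python list standing in for a linked list).
        self.buckets = [[] for _ in range(num_buckets)]

    def _chain(self, value):
        # Identity hash reduced modulo the number of buckets.
        return self.buckets[value % len(self.buckets)]

    def add(self, value):
        chain = self._chain(value)
        # Walk the whole chain; only append if the value is not already present.
        if value not in chain:
            chain.append(value)

    def contains(self, value):
        return value in self._chain(value)

    def remove(self, value):
        chain = self._chain(value)
        if value in chain:
            chain.remove(value)
```

All three operations scan a single chain, so their cost is linear in that chain's length, which is exactly why we want the buckets to stay small.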
You have a hash table with 4 buckets and insert values in this order using the identity hash. Recall that if the hash value is greater than the maximum index of the buckets (3 in this case), we take the hash value modulo 4 (the number of buckets):
We can see that there are still collisions, right? Let's see what this looks like if we use a maximum load factor of 0.75.
After we insert the 4th element, we will need to double the number of buckets and insert everything again. Why is that? Recall that our hash table has 4 buckets and a maximum load factor equal to 0.75. This means that when the load factor gets greater than 0.75, we have to double the number of buckets. Up to the third element, we are fine, since 3/4 = 0.75 is not greater than 0.75. However, when inserting the fourth element, the load factor becomes 4/4 = 1, which is obviously greater than the maximum load factor 0.75. That is why we double the number of buckets and insert everything again. Keep in mind that for large hash values we no longer take them modulo 4, but modulo 8 instead, since the number of buckets is now 8. We end up with the hash table below (try it!):
Much better, now all buckets have 1 element at most! The chaining method is easy to implement, and it works well. The only disadvantage is that linked lists take quite a bit of extra space and, if the hash function isn't very good, a lot of the buckets will be empty but still take space.
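Putting chaining and the maximum load factor together, here is a hedged sketch of the resizing behavior. The keys 1, 2, 5, 7 are hypothetical (the original example's values aren't listed here); inserting the fourth one pushes the load factor past 0.75, so the table doubles from 4 to 8 buckets and re-inserts everything:

```python
def insert_all(values, max_load_factor=0.75):
    buckets = [[] for _ in range(4)]
    for v in values:
        count = sum(len(chain) for chain in buckets)
        # If adding one more element would push the load factor past the maximum,
        # double the number of buckets and re-insert everything with the new modulus.
        if (count + 1) / len(buckets) > max_load_factor:
            old_values = [x for chain in buckets for x in chain]
            buckets = [[] for _ in range(2 * len(buckets))]
            for x in old_values:
                buckets[x % len(buckets)].append(x)
        buckets[v % len(buckets)].append(v)
    return buckets

# Hypothetical keys: prints [[], [1], [2], [], [], [5], [], [7]], at most one element per bucket.
print(insert_all([1, 2, 5, 7]))
```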
Conclusion
Now we've learned about collisions and how they affect the performance of hash tables. We looked at ways to reduce the number of collisions by picking a good hash function for the values we have and by keeping a maximum load factor of the table. We also looked at ways to handle collisions when implementing hash tables, like chaining, and we are ready to explore more ways in the upcoming topics. In all of those, we saw how important a good hash function is! So, even though in theory hash tables are really fast, it all comes down to the choice of the hash function.