We have already discussed the concept of collisions, their negative effects, as well as some basic ways of handling them: the choice of a good hash function, the usage of a maximum load factor, and chaining. Another idea to fight collisions would be to have buckets that hold only one element, and somehow put extra ones into other empty buckets. So, we can keep a simple array for the hash table without needing other data structures such as linked lists. However, there are some new questions that may arise:
Can we quickly search for an element?
How can we delete the element fast without affecting future operations?
How do we make sure we always have space for a new element?
Linear probing
Linear probing is just the right method for this. First, we need two values that we know will never be inserted in the hash table: one to represent empty and one to represent deleted. For example, if we want to insert only integers strictly greater than 0, we can use to mean empty and to mean deleted. We will see soon why we require a different value for deleted. For now, let's look at the algorithm, describing how the new operations are performed:
Inserting: We compute the hash value and look for the corresponding bucket. If the bucket is empty or deleted we write the value there. Otherwise, we check the bucket to the right and repeat until we find an empty or deleted one. If we get to the end of the array, we go back to the start and continue.
Searching: Same as inserting: we compute the hash value and go to the corresponding bucket. If the value is present there, we are done and the value is found; otherwise we keep searching right. However, in this case, we ignore deleted and stop going to the right only if an empty bucket is met, in which case, we conclude that the value is not present in the hash table.
Deleting: Search for the value we want to delete and put deleted in that bucket.
To make sure there is always space for new elements, we can use a maximum load factor strictly less than , such as the default . We also take care that when resizing the array, we get rid of all the deleted buckets.
Deleted value in linear probing
We have mentioned more than once that a different deleted value is necessary apart from the empty one. Let's look at an example to see why is it so. Let's say we have a hash table with 4 buckets and insert in this order using the identity hash. Here has to go to bucket 1, but this bucket already has an element. Then we try bucket 2, it's also occupied. When we try bucket 3, we see that it's empty, so we can insert our element there. All in all, we get the following filled table (recall that represents empty):
Now we want to delete and then search for . Let's see what happens if we don't use a different value for deleted, but mark deleted elements by empty ():
We search in bucket 1, it's not there, so we go to the next one. Bucket 2 is empty, so we terminate, saying that is not in the hash table. Oops! But we can see quite well the in the last bucket. Now let's see how the usage of a deleted value fixes this issue:
When we get to bucket 2, we can see it's deleted, not empty, so we go further and find in bucket 3. So it worked!
Quadratic probing
Linear probing works well with a good hash function. However, in practice, sometimes we have a few values with close hash values. If we use linear probing, this means that we create some local clusters that make it slow to insert, search, or delete elements with those hash values.
Quadratic probing is a small modification to linear probing that makes sure we don't form those clusters. The operations are done just like in linear probing, except for a small detail: instead of moving one step at a time, here we use the function to tell us which bucket to consider at step . Here, and are two numbers we can choose and is the hash value of the element we want to insert. Note that, if we take , we get the same thing as linear probing. In general, as we want to avoid the problems of linear probing, we take . This is exactly where the name quadratic comes from: we have a quadratic equation for computing the bucket index.
Let's see quadratic probing in action by using a simple example. Suppose we have the following hash table that uses the identity hash:
Notice the cluster at the beginning (). Let's see what happens if we want to insert with linear probing:
The hash is , so we first look in bucket 1. It's full, so we have to go to the right. Moreover, we have to go to the right 4 times, that's half the size of the table! And now, every time we want to search for we have to take those 4 steps. So inefficient.
Let's look at how we would insert in the hash table with quadratic probing using :
First, we check bucket 1. It's full. Then we go to bucket , it's also full. Then we jump to bucket , and it's empty, so we insert there. We skipped 2 whole buckets!
Malfunctions of quadratic probing
So quadratic probing seems to work well. However, let's introduce a more complex search, so we have to make sure everything still works as intended. Let's say we have the following hash table that uses identity hash and :
Let's say we want to search for the value . We first look in bucket 0, and then, if you do the math using the formula for , we will be going to bucket 2, then back to 0, back to 2, and so on. So, what is happening here? Why do we go round in circles? A simple observation shows that:
For even steps , the value of will always be . Indeed, .
Similarly, for odd steps , the value of will always be . Any doubt about this? Here are the calculations:
Recall that the value of is always taken modulo , since is the total number of buckets.
Back to the search of , we see that it will loop forever by checking only the buckets 0 and 2!
The conclusion: not all functions take us through the whole table, and we have to ensure scenarios like the one above do not happen! How? Take a look at the section below.
Picking constants
As we mentioned in the previous section, when it comes to quadratic probing, choosing the right constants is essential to ensure an effective collision resolution strategy. The constants used in quadratic probing determine the sequence of positions to probe when a collision occurs. Here are some considerations for selecting the constants:
Prime Numbers: It is recommended to use prime numbers for the constants in quadratic probing. Prime numbers help to achieve a more even distribution of probed positions, reducing clustering and the likelihood of entering into cycles.
Size of Hash Table: The constants should be chosen in a way that ensures they do not exceed the size of the hash table. This ensures that the probing sequence stays within the bounds of the table and covers all possible positions without going out of range.
Experimentation: It's often beneficial to experiment with different constant values to find the ones that work best for a specific hash table size and data distribution. By evaluating the performance and collision resolution effectiveness for different constant values, you can determine the optimal choice.
Libraries and Resources: Many programming languages and libraries provide default or recommended constant values for quadratic probing. It can be helpful to consult these resources or refer to existing implementations to leverage established best practices.
Conclusion
In conclusion, we have explored collisions and their negative effects, as well as two important collision resolution techniques: linear probing and quadratic probing. Linear probing involves sequentially searching for the next available bucket, while quadratic probing uses a quadratic function to determine the next bucket to probe.
By carefully selecting suitable constants, such as prime numbers, for quadratic probing, we can achieve a more balanced distribution of values and reduce clustering. These techniques allow us to handle collisions effectively and optimize data management. Keep experimenting and implementing these strategies to ensure efficient and reliable data retrieval.