Computer scienceAlgorithms and Data StructuresAlgorithmsPrinciples and techniquesHashing

Collision handling: probing

14 minutes read

We have already discussed the concept of collisions, their negative effects, as well as some basic ways of handling them: the choice of a good hash function, the usage of a maximum load factor, and chaining. Another idea to fight collisions would be to have buckets that hold only one element, and somehow put extra ones into other empty buckets. So, we can keep a simple array for the hash table without needing other data structures such as linked lists. However, there are some new questions that may arise:

  • Can we quickly search for an element?

  • How can we delete the element fast without affecting future operations?

  • How do we make sure we always have space for a new element?

Linear probing

Linear probing is just the right method for this. First, we need two values that we know will never be inserted in the hash table: one to represent empty and one to represent deleted. For example, if we want to insert only integers strictly greater than 0, we can use 00 to mean empty and 1-1 to mean deleted. We will see soon why we require a different value for deleted. For now, let's look at the algorithm, describing how the new operations are performed:

  • Inserting: We compute the hash value hh and look for the corresponding bucket. If the bucket is empty or deleted we write the value there. Otherwise, we check the bucket to the right h+1h+1 and repeat until we find an empty or deleted one. If we get to the end of the array, we go back to the start and continue.

  • Searching: Same as inserting: we compute the hash value and go to the corresponding bucket. If the value is present there, we are done and the value is found; otherwise we keep searching right. However, in this case, we ignore deleted and stop going to the right only if an empty bucket is met, in which case, we conclude that the value is not present in the hash table.

  • Deleting: Search for the value we want to delete and put deleted in that bucket.

To make sure there is always space for new elements, we can use a maximum load factor strictly less than 11, such as the default 0.750.75. We also take care that when resizing the array, we get rid of all the deleted buckets.

Deleted value in linear probing

We have mentioned more than once that a different deleted value is necessary apart from the empty one. Let's look at an example to see why is it so. Let's say we have a hash table with 4 buckets and insert {1,2,5}\{1, 2, 5\} in this order using the identity hash. Here 55 has to go to bucket 1, but this bucket already has an element. Then we try bucket 2, it's also occupied. When we try bucket 3, we see that it's empty, so we can insert our element 55 there. All in all, we get the following filled table (recall that 00 represents empty):

Example with a linear Probing, beginning

Now we want to delete 22and then search for 55. Let's see what happens if we don't use a different value for deleted, but mark deleted elements by empty (00):

Example with a  linear Probing, continued

We search in bucket 1, it's not there, so we go to the next one. Bucket 2 is empty, so we terminate, saying that 55 is not in the hash table. Oops! But we can see quite well the 55 in the last bucket. Now let's see how the usage of a deleted value fixes this issue:

Example with a  linear Probing, completion

When we get to bucket 2, we can see it's deleted, not empty, so we go further and find 55 in bucket 3. So it worked!

Quadratic probing

Linear probing works well with a good hash function. However, in practice, sometimes we have a few values with close hash values. If we use linear probing, this means that we create some local clusters that make it slow to insert, search, or delete elements with those hash values.

Quadratic probing is a small modification to linear probing that makes sure we don't form those clusters. The operations are done just like in linear probing, except for a small detail: instead of moving one step at a time, here we use the function H(i)=h+ai+bi2H(i) = h + a*i + b*i^2 to tell us which bucket to consider at step ii. Here, aa and bb are two numbers we can choose and hh is the hash value of the element we want to insert. Note that, if we take a=1,b=0a = 1, b = 0, we get the same thing as linear probing. In general, as we want to avoid the problems of linear probing, we take b1b \geq 1. This is exactly where the name quadratic comes from: we have a quadratic equation for computing the bucket index.

Let's see quadratic probing in action by using a simple example. Suppose we have the following hash table that uses the identity hash:

Example with a quadratic probing

Notice the cluster at the beginning ({1,2,3,4}\{1, 2, 3, 4\}). Let's see what happens if we want to insert 99 with linear probing:

Insert with a  linear probing

The hash is 99, so we first look in bucket 1. It's full, so we have to go to the right. Moreover, we have to go to the right 4 times, that's half the size of the table! And now, every time we want to search for 99 we have to take those 4 steps. So inefficient.

Let's look at how we would insert 99 in the hash table with quadratic probing using H(i)=h+i2H(i) = h + i^2:

Insert with a   quadratic probing

First, we check bucket 1. It's full. Then we go to bucket 1+12=21 + 1^2 = 2, it's also full. Then we jump to bucket 1+22=51 + 2^2 = 5, and it's empty, so we insert 99 there. We skipped 2 whole buckets!

Malfunctions of quadratic probing

So quadratic probing seems to work well. However, let's introduce a more complex search, so we have to make sure everything still works as intended. Let's say we have the following hash table that uses identity hash and H(i)=h+2i2H(i) = h + 2*i^2:

Another example with a quadratic probing

Let's say we want to search for the value 1616. We first look in bucket 0, and then, if you do the math using the formula for H(i)H(i), we will be going to bucket 2, then back to 0, back to 2, and so on. So, what is happening here? Why do we go round in circles? A simple observation shows that:

  1. For even steps ii, the value of H(i)H(i) will always be 00. Indeed, H(2k)=16+2(2k)2=16+8k20 mod 8H(2k) = 16 + 2*(2k)^2 = 16 + 8k^2 \equiv 0 \ mod\ 8.

  2. Similarly, for odd steps ii, the value of H(i)H(i) will always be 22. Any doubt about this? Here are the calculations: H(2k+1)=16+2(2k+1)2=16+2(4k2+4k+1)=8(2+k2+k)+22 mod 8.H(2k+1) = 16 + 2 * (2k+1)^2 = 16 + 2 * (4k^2 + 4k +1) = 8*(2 + k^2 +k) + 2 \equiv 2 \ mod \ 8.

Recall that the value of H(i)H(i) is always taken modulo 88, since 88 is the total number of buckets.

Back to the search of 1616, we see that it will loop forever by checking only the buckets 0 and 2!

The conclusion: not all functions HH take us through the whole table, and we have to ensure scenarios like the one above do not happen! How? Take a look at the section below.

Picking constants

As we mentioned in the previous section, when it comes to quadratic probing, choosing the right constants is essential to ensure an effective collision resolution strategy. The constants used in quadratic probing determine the sequence of positions to probe when a collision occurs. Here are some considerations for selecting the constants:

  1. Prime Numbers: It is recommended to use prime numbers for the constants in quadratic probing. Prime numbers help to achieve a more even distribution of probed positions, reducing clustering and the likelihood of entering into cycles.

  2. Size of Hash Table: The constants should be chosen in a way that ensures they do not exceed the size of the hash table. This ensures that the probing sequence stays within the bounds of the table and covers all possible positions without going out of range.

  3. Experimentation: It's often beneficial to experiment with different constant values to find the ones that work best for a specific hash table size and data distribution. By evaluating the performance and collision resolution effectiveness for different constant values, you can determine the optimal choice.

  4. Libraries and Resources: Many programming languages and libraries provide default or recommended constant values for quadratic probing. It can be helpful to consult these resources or refer to existing implementations to leverage established best practices.

Conclusion

In conclusion, we have explored collisions and their negative effects, as well as two important collision resolution techniques: linear probing and quadratic probing. Linear probing involves sequentially searching for the next available bucket, while quadratic probing uses a quadratic function to determine the next bucket to probe.

By carefully selecting suitable constants, such as prime numbers, for quadratic probing, we can achieve a more balanced distribution of values and reduce clustering. These techniques allow us to handle collisions effectively and optimize data management. Keep experimenting and implementing these strategies to ensure efficient and reliable data retrieval.

4 learners liked this piece of theory. 0 didn't like it. What about you?
Report a typo