
Hashing


Imagine having a large room full of books, and you need to locate a specific book swiftly. If you create a system where each book has a unique code that indicates its location, you'll be able to find the book faster. In computer science, this process is called hashing. It assigns each data piece a unique code, known as a hash, allowing you to find the data quickly when needed. Because it's frequently used in programming, it's essential to learn about hashing and data structures that use this method!

Understanding hashing and hash functions

So, what exactly is hashing? It's a method that transforms any size of input into a fixed-size output, often a string of text or a number. This output is known as a hash or a hash code. Hashing is used in many areas of computer science, such as data retrieval, cryptography, and data integrity checks. In basic terms, hashing is like generating a unique code for every piece of data.

But how do you create a hash? A special kind of function, known as a hash function, is used for exactly this. These functions accept an input and return a fixed-size string of bytes. For example, consider the function below:

def hash(text):
    # Sum the numeric (Unicode) code of every symbol in the text
    result = 0
    for symbol in text:
        result += ord(symbol)
    return result

This function receives text as an input and calculates the sum of the numeric code of each symbol in it. However, this function carries a significant flaw: it can generate the same hash code for different inputs. For example, think about the values abcd and dcba. They are different strings, but they consist of the same symbols, so the function produces the same result for both. So why is this a problem? Picture searching for a book in a library: if several books share the same code, you'd have to go through all of them until you find the right one, which is time-consuming. The scenario where two different inputs produce the same output is known as a collision. When creating a hash function, you should do all you can to prevent collisions.
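
To see this collision concretely, here is a quick check using the sum-based function above; the sample strings are purely illustrative:

print(hash("abcd"))  # 97 + 98 + 99 + 100 = 394
print(hash("dcba"))  # 100 + 99 + 98 + 97 = 394, the same code as for "abcd"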

All right, we've looked at a flawed hash function. But can we improve it? It turns out there are several characteristics of a good hash function:

  1. The output must be deterministic, meaning the same input will always generate the same output. A hash function can be very simple, as long as this property holds.

  2. The function must be efficient. Hashing isn't usually used in isolation; it's often a fundamental element of a data structure. Consequently, if it's inefficient, the entire structure's performance will suffer.

  3. Collisions should be minimized. As we'll discuss later, the more collisions a hash function produces, the slower data structures based on it operate. That's something you'll want to avoid; a sketch of a function with a better distribution follows this list.
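
One common way to reduce collisions for string keys is a polynomial (rolling) hash, which weights each symbol by its position so that anagrams no longer collide. This is a hedged sketch rather than the one correct answer; the multiplier 31 and the modulus are conventional, illustrative choices:

def polynomial_hash(text, modulus=2**32):
    # Weight each symbol by its position: "abcd" and "dcba" now get different codes
    result = 0
    for symbol in text:
        result = (result * 31 + ord(symbol)) % modulus
    return result

print(polynomial_hash("abcd"))  # differs from the value below
print(polynomial_hash("dcba"))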

Hash tables and collision resolution

You've now learned about hash functions. But what's their practical application? It turns out, one data structure, called a hash table, relies on the use of these functions. A hash table enables the storage of key and value pairs to link the two together. A real-world example of a hash table could be a dictionary: a word in one language pairs with its translation in another, allowing you to find the necessary information easily.

This principle is a cornerstone in computer science. For example, if you're building an app that quizzes users on countries and their capitals, you might opt to create a hash table to store these pairs. It might look like this:
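
For instance, in Python the pairs could be modeled with a built-in dictionary (the exact countries shown are just sample data):

capitals = {
    "France": "Paris",
    "Japan": "Tokyo",
    "Brazil": "Brasilia",
}
print(capitals["Japan"])  # Tokyo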

But how does it work, and how does hashing factor into it? You might be surprised to find out that hash tables use a basic array underneath! So, when a new hash table is created, an array is stored inside it. Each time a new pair gets added to the table, the key's hash code gets calculated. Following this, the value linked to the key is stored in the array at the index corresponding to the key's hash code. If you were to visualize this process, it would look something like this:
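
A minimal sketch of that mapping in Python, using the built-in hash function and an assumed capacity of 8 (both are illustrative choices): the hash code is reduced to a valid array index with the modulo operator.

capacity = 8
buckets = [None] * capacity

def index_for(key):
    # Map the key's hash code into the range of valid array indices
    return hash(key) % capacity

buckets[index_for("Japan")] = "Tokyo"  # store the value in the cell for this key
print(buckets[index_for("Japan")])     # Tokyo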

The array cells are referred to as buckets. Each bucket can hold one key-value pair or, in the case of separate chaining (explained below), a linked list of pairs.

That's it! This arrangement considerably simplifies data retrieval: input the key, compute the hash code, and get back the value stored in the cell associated with that index. However, as you'll recall, sometimes collisions can occur. In this data structure, resolving those collisions is extremely important: if you choose a hash function that often leads to collisions, adding and retrieving data can become more difficult. Even with a high-quality function, the infinite number of potential keys and limited quantity of hash codes can still lead to problems. That's why it's vital to have a method for resolving these collisions.

Several methods of collision resolution are available:

  1. Separate chaining. In this method, each cell in the hash table points to a linked list of records that share the same hash code. When a collision happens, the new item is added to the end of that list. Let's say the pair Turkey–Ankara is added to our hash table, and assume that Turkey has a hash code of 82. There's already one key with that code in our table: Brazil. Therefore, Ankara will be appended after Brasilia, which already sits in the corresponding cell. Now, when you want to access a value using a key, you must walk through the list until you find the matching key. As you can see, this can substantially slow down the performance of the hash table. A sketch of this approach appears right after this list.

  2. Open Addressing. In this method, when a collision occurs, we hunt for another open spot in the array to put the item (a small linear-probing sketch also follows this list). There are a few ways to do this:

    • Linear Probing. If a collision occurs, you check the next spot, then the next, and so on, until you find an open one.

    • Quadratic Probing. This is similar to linear probing, but instead of checking the next spot, it checks a cell a certain distance away, and the distance grows quadratically (1, 4, 9, 16, etc.) until an open slot is located. So, if we wanted to add a key with a hash code of 209 to our table and found that the corresponding cell is already occupied, we would try to place the value in the cell at index 209 + 1² = 210. If that cell is also full, we would try cell 209 + 2² = 213, and so on.

    • Double Hashing. This uses a second hash function to determine the step size when a collision occurs. So, if you tried adding a key with a hash code of 208 to a table and found that its cell is occupied, you would apply a second hash function to this key. Let's say it returns the value 2. You add this step to the original code to get 208 + 2 = 210. If this cell is also occupied, you add the step again: 208 + 2 + 2 = 212, and so on.

  3. Rehashing. When the hash table reaches a certain load factor (calculated as the number of stored elements divided by the size of the table), a new, larger hash table is created, and all the elements are re-inserted into it. This practice helps to decrease collisions but requires additional memory and computational work when rehashing takes place.
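
To make the chaining and rehashing ideas concrete, here is a minimal Python sketch of a hash table that resolves collisions with separate chaining (plain lists stand in for linked lists) and grows once the load factor passes 0.75. The class name, initial capacity, and threshold are illustrative assumptions, not fixed rules:

class ChainedHashTable:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.size = 0
        self.buckets = [[] for _ in range(capacity)]  # one chain per bucket

    def _index(self, key):
        # Map the key's hash code onto a bucket index
        return hash(key) % self.capacity

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:           # key already present: overwrite the value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))           # empty bucket or collision: append to the chain
        self.size += 1
        if self.size / self.capacity > 0.75:  # load factor check
            self._rehash()

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

    def _rehash(self):
        # Create a larger table and re-insert every pair
        old_buckets = self.buckets
        self.capacity *= 2
        self.size = 0
        self.buckets = [[] for _ in range(self.capacity)]
        for bucket in old_buckets:
            for key, value in bucket:
                self.put(key, value)

table = ChainedHashTable()
table.put("Brazil", "Brasilia")
table.put("Turkey", "Ankara")  # may land in the same bucket as "Brazil"
print(table.get("Turkey"))     # Ankara

A lookup only walks the chain of its own bucket, which is why keeping chains short keeps operations close to constant time.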
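
And here is an equally small sketch of open addressing with linear probing. For brevity it assumes the table never fills up completely and omits deletion, which requires extra care in this scheme:

class LinearProbingTable:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = [None] * capacity  # each slot holds one (key, value) pair or None

    def _index(self, key):
        return hash(key) % self.capacity

    def put(self, key, value):
        index = self._index(key)
        while self.slots[index] is not None and self.slots[index][0] != key:
            index = (index + 1) % self.capacity  # collision: try the next slot
        self.slots[index] = (key, value)

    def get(self, key):
        index = self._index(key)
        while self.slots[index] is not None:
            if self.slots[index][0] == key:
                return self.slots[index][1]
            index = (index + 1) % self.capacity  # keep probing until an empty slot
        raise KeyError(key)

table = LinearProbingTable()
table.put("Brazil", "Brasilia")
table.put("Turkey", "Ankara")
print(table.get("Brazil"))  # Brasilia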

It's important to remember that even though there are many ways to resolve collisions, you should try to avoid them as much as possible. Why? When collisions crop up, you're required to carry out extra operations that impact the performance of hash table-based data structures. You'll have to visit several spots to find the necessary key or calculate numerous hashes in cases like separate chaining and open addressing. In rehashing, you need more memory and must invest time in transferring the values from the old hash table to the new one.

Time complexity of hash operations

It's time to discuss the complexity of basic operations on a hash table. This data structure is prized for its efficiency, and you'll soon see why. Let's look at the time complexities of the key operations:

  1. Insertion. Generally, the time complexity for inserting an element into a hash table is O(1), or constant time. That's because the index is computed directly from the key's hash code, and the value is placed straight into that memory location. But in the worst-case scenario, when every key causes a collision, the time complexity can escalate to O(n), with n being the number of keys.

  2. Deletion. Deleting an element from a hash table typically has a time complexity of O(1). Yet again, if all keys lead to collisions, the time complexity can ramp up to O(n).

  3. Retrieval (Search). The time complexity for retrieving a value from a hash table, using its key, is typically O(1). However, if all keys gravitate to the same index in a worst-case scenario, it could climb up to O(n).

A good hash function ensuring a uniform distribution of keys can prevent the worst-case scenario. However, it's vital to recognize its possibility, especially if the keys follow a particular pattern or the hash function is not sophisticated enough to manage the distribution of your specific keys.

Also, these time complexities rest on the assumption that the hash table is maintained at a reasonable load factor. If it becomes overly populated (the load factor gets too high), it might need resizing. That is an O(n) operation. However, since resizing isn't frequent, we spread this cost over the many O(1) insertions, allowing us to consider insertion an O(1) operation on average.
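
As a rough illustration (assuming the capacity doubles on every resize): inserting n elements triggers resizes that copy about n/2 + n/4 + n/8 + ... < n entries in total, so the extra work amounts to less than one copied entry per insertion on average, which is why the amortized cost of insertion stays O(1).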

Hash table-based data structures

You already understand that a hash table's operations are incredibly efficient. As a result, data structures that rely on it are common in computer science. Here are a couple of data structures that use hash tables:

  1. Hash Set. A hash set is a data structure that provides set functionality. It executes operations like insert, delete, and search in average constant time O(1). It uses a hash table as the foundational data structure. Each value is hashed and stored at the corresponding index in the table. Since sets only contain unique elements, the set must address collision scenarios (where two different elements hash to the same index) through strategies like chaining or open addressing.

  2. Hash Map (or Hash Table). A hash map is a data structure that provides map functionality. It enables you to save key-value pairs, utilizing a hash function to disperse these pairs across an array. Like a hash set, it executes insert, delete, and search operations in average constant time O(1). It also needs a strategy to handle collision scenarios.

These are just a couple of ways hash tables are used in data structures. The key advantage of hash tables is their ability to perform constant-time operations on average. However, this greatly depends on the hash function's quality and the collision resolution method.
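
In Python, for instance, the built-in set and dict types are hash-based implementations of exactly these two structures:

countries = {"France", "Japan"}  # hash set: stores unique values
countries.add("Brazil")          # average O(1) insertion
print("Japan" in countries)      # average O(1) membership test -> True

capitals = {"France": "Paris"}   # hash map: stores key-value pairs
capitals["Japan"] = "Tokyo"      # average O(1) insertion
print(capitals["Japan"])         # average O(1) lookup -> Tokyo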

Conclusion

As a summary, let's review the key concepts of hashing:

  1. Hashing is an operation that transforms input of any size into a fixed-size output. This output can be a string of text or a number. Hash functions perform this transformation: they accept the input and yield a fixed-size string of bytes, known as a hash code or simply a hash. For reliability, a hash function must be deterministic, meaning the same input will always yield the same output.

  2. A collision occurs when two different inputs result in the same output. Excessive collisions can significantly degrade the performance of a hash table; thus, it's crucial to devise a proper collision resolution strategy.

  3. Collision resolution can be managed in a few ways: separate chaining, which keeps a list of key-value pairs that share the same hash; open addressing, which locates another open slot for the value; and rehashing, which creates a larger hash table and re-inserts all the values into it.

  4. The primary tasks for a hash table are insertion, deletion, and retrieval (search). On average, their time complexity is O(1) and reaches O(n) in the worst-case scenario.

  5. Hashing is an efficient method employed in many data structures. The most common are hash sets and hash maps.
