Computer scienceProgramming languagesJavaWorking with dataCollectionsCollection Implementations

HashSet internals

16 minutes read

HashSetclass is an implementation of a mathematical set that is used to store unique elements. We will tell you about its effectiveness, introduce you to frequently used methods, and explain its internal structure.

Internal structure

Based on HashMap, the class HashSet is the topic we will now cover in detail. Inside the HashSet class is a private HashMap<E, Object> map field instantiated when calling the HashSet constructor – all elements stored as keys. When you add an element to a HashSet, you add it to an internal HashMap. Moreover, when you iterate over HashSet elements, you are iterating over HashMap's keySet(). Let's analyze the add() method source code seen below:

public boolean add(E e) {
    return map.put(e, PRESENT) == null;
}

The add() method accepts an object, puts a key-value pair to the HashMap, and checks the return value of that operation. If null, there was no pair with such a key in a HashMap, and the add() method returns true. If a key is present, the add() method returns false, meaning the element will not add to the set.

Studying the code above, you might have asked yourself a question. We pass only one object to the add() method, passed to the HashMap as a key, but where did the PRESENT variable come from? The answer is there is another HashSet field: private static final Object PRESENT = new Object(). It is used as a dummy object in the internal HashMap as the value of the entry. This relationship is illustrated below.

HashSet diagram

HashSet and user-defined classes

We will examine how user-defined data types are used in a HashSet to guarantee accurate, dependable, and efficient results. These data types, when created, declare their behavior and exhibit unique characteristics. Unlike wrapper data types, user-defined types usually have multiple fields, giving newly created objects distinguishable attributes of their own, depending on their constructor initialization. For this reason, we must take extra precautions when working with user-defined types and HashSet, as we are no longer just working with a field in a wrapper type.

As we know, the Object class is the root of the class hierarchy; therefore, Object is the superclass (parent) of all derived classes. Class Object has two methods crucial to how HashSet works when dealing with user-defined types. These are equals() and hashcode(), which combine to make object comparisons in harmony. As defined in class Object (parent), these two methods do not always meet the requirement to check whether two or more objects (child) have the same values. The reason is inadequate methods implementation when you invoke Object default methods. The solution is to override the equals and hashCode methods. If you don't override these two methods, you are invoking the default methods of the Object (parent) class, giving you inefficient results or, worse, possibly corrupting HashSet operations.

Let's look at some potential issues that could occur using HashSet. The user-defined class, Person, provided below is used here to illustrate how HashSet relies on the correct overriding of methods equals() and hashCode(). We can consider the content of class Person to be its two fields: String name and Integer id. As we will see shortly, HashSet should rely on the content – name, and age in Person – of elements to check equality. However, the default equals() method does not do this, as it only checks the references of two objects to verify their equality. This means that two objects are equal only if they refer to the same memory location. We want to check equality by comparing the logical content of each element: name and age. The hashCode method needs to be overridden to satisfy the contract — if two objects are equal according to the equals() method, then hashCode should generate the same hash code. The Person class listed below – with no overridden methods – is used in the code examples.

class Person {
    private String name;
    private Integer age;
    
    public Person(String name, Integer age) {
        this.name = name;
        this.age = age;
    }

    // Getter and Setter    

    @Override
    public String toString() {
        return "Person{" +
                "Name= '" + name + "' " +
                "Age= '" + age + "'}";
    }
}

Let's use the program below to add two elements to a HashSet — persons. Here we want to illustrate the significance of correctly implemented equals() and hashCode() methods when using HashSet. Since unique elements are inserted into a HashSet, we must be sure our equality checks are accurate. We don't want to add duplicate elements. Person references p1 and p2, creating two objects with the same logical information. When checking equality, we expect these to be equal since they have the same name and age. Remember that Person currently uses the default equals() and hashCode() methods. Notice that p1 and p2 are logically equal.

import java.util.*;

public class HashSetDemo {
    public static void main(String[] args) {

        Person p1 = new Person("Gosling", 68);
        Person p2 = new Person("Gosling", 68);

        Set<Person> persons = new HashSet<>();
        persons.add(p1);
        persons.add(p2);

        System.out.println("hashCode: " + p1.hashCode() +
                                    " " + p2.hashCode());
        System.out.println("p1.equals(p2): " + p1.equals(p2));        
        System.out.println(persons);  
    }
}

As a result, running this program provides us with the following output – see below. Is this what we want? The answer is no. Let's analyze our current scenario to understand what is going on. The output shows that p1 and p2 are not equal, and neither are their hash codes. So, with the current equality check, they are incorrectly added to the HashSet. Visually comparing the fields' name and age, we see the contents are the same; therefore, only one person should have been added to the set.

hashCode: 312116338 453211571
p1.equals(p2): false
[Person{Name= 'Gosling' Age= '68'}, Person{Name= 'Gosling' Age= '68'}]

Your hash codes may show different values.

The default equals() method is shown below.

public boolean equals(Object obj) {
        return (this == obj);
}

Here the default equals() method is essentially comparing two memory addresses. There is no check for any unique content objects may contain, such as in fields name and age. We need to override the equals() method to remedy this undesirable behavior. Keep in mind that our goal here is to have a dependable HashSet. Let's look at an equals() method containing additional code for more comprehensive checks.

@Override
public boolean equals(Object o) {
    if (this == o) {
        return true;
    }
    if (o == null || getClass() != o.getClass()) {
        return false;
    }
    Person person = (Person) o;
    return name.equals(person.name) && age.equals(person.age);
}

First, as in the default method, the current object instance is compared with the passed object. Next, we check for a null value and also whether the two classes are not equal. Finally, we cast the passed object to Person and return the comparison to determine the object's equality based on the logical fields name and age.

Let's run the HashSetDemo program with the overridden equals() method in class Person to see how it affects the HashSet result. Our new equals() method now correctly compares p1 and p2 as true. However, the hash codes were not affected as they were still checked as not equal. Also, we are still adding two logically identical entries to the HashSet — not what we want.

hashCode: 312116338 453211571
p1.equals(p2): true
[Person{Name= 'Gosling' Age= '68'}, Person{Name= 'Gosling' Age= '68'}]

Your hash codes may show different values.

What we have left to do with the class Person in order to reliably use it with a HashSet — and indirectly HashMap — is override its hashCode method. Keep in mind that if two objects are equal when using the equals() method, then their hash codes — using hashCode method — must generate the same integer value. Always override hashCode() whenever you override equals(). When writing a good hashCode method, we should strive to produce different hash codes for unequal instances. This will improve HashSet's performance.

Let's use the hashCode() method in class Person below to see its effect on HashSetDemo. Here we are using Objects.hash(name, age) to return a hash code. The advantage of this approach is that we can use an arbitrary number of fields to calculate a hash code for an object. In our example, we are using name and age from class Person. This approach has an average performance and can be used in routine applications, but you can write your own or use your IDE to generate one with better performance.

@Override
public int hashCode() {
return Objects.hash(name, age);
}

After adding the overridden hashCode method to the class Person and running HashSetDemo, we get the result below. Notice that this time, the hash codes are the same, and p1 and p2 are equal, and finally, the set only has one entry — as expected. We can now see that to have a reliable result using HashSet, it is crucial to override the equals() and hashCode() methods of user-defined classes that we provided to HashSet.

hashCode: -1985493692 -1985493692
p1.equals(p2): true
[Person{Name= 'Gosling' Age= '68'}]

Your hash codes may show different values.

In the preceding example, we witnessed the need to correctly implement equals() and hashCode methods within a class to be used with the HashSet. These methods were designed to be overridden; they have a general contract that must be kept by the class doing the overriding. Class HashSet depends on accurate equality checks, and its efficiency depends on good hash functions implemented in hashCode.

HashSet time complexity

Remember that HashSet is based on a HashMap backed by a hash table.

The illustration below shows the results of a hash function when interacting with a hashSet — this is considered a good result. Let's add Person objects to a hashSet, each with its unique hash code. Here, we can expect a time complexity of O(1) due to its uniform distribution across the table indexes. This constant-time performance is true for basic HashSet operations — add, remove, contains, and size. If we have no collisions, then element insertion and retrieval remain constant. As we fill the table, collisions increase proportionally to the load — the number of elements divided by the number of buckets. With increasing collisions, we have more table look-ups, which lowers performance towards O(n) – linear-time. To prevent HashSet performance from degrading, rehashing increases the table size and then copies all the elements to this larger table. This occurs when the load reaches a specified threshold – it's load factor. Since Java 8, the collision resolution mechanism changes the structure to a balanced tree, providing a logarithmic-time complexity of O(log(n)). The table below shows that each object has a unique index (bucket) and hash code.

The following example below illustrates the result of a badly implemented hashCode method that always returns int 24. As we can see, all the Person objects end up in the same index (bucket 8) as a linked list. Here we see the worst-case time complexity behavior of O(n). We must implement a good hashing function to achieve the best-case complexity of O(1).

Object with the same hash code

Constructor tuning

HashSet has two advanced constructors, as seen below. Basic constructors are found in the Introduction to HashSet topic.

HashSet(int initialCapacity)
HashSet(int initialCapacity, float loadFactor)

Both constructors here create an empty set with a backing HashMap instance. When using the first, you specify the initialCapacity; the default load factor is (0.75). The second version allows you to set the loadFactor and the initialCapacity. It is usually not necessary to set these configuration parameters as the basic constructors fulfill most requirements you may have. Finding the correct balance can be tricky and should be carefully implemented. Experimenting with these parameters can be a helpful learning tool.

Conclusion

To understand how class HashSet works internally, we must always remember that it depends on class HashMap, which is backed by a hash table. When you add an element to a HashSet, you add it to an internal HashMap. The key to understanding HashSet internals lies in your grasp of class HashMap.

When HashSet has to deal with user-defined classes as extensions of class Object, you have to ensure that the Objects (user-defined) obey certain methods contract, namely, equals() and hashCode(). HashSet depends on the correct overriding of these two methods in order to function properly. As a rule, you must override hashCode() when you override equals(). Unequal hash codes should be assigned to unequal user-defined objects as best as possible.

The time complexity of HashSet basic operations – add, remove, contains, and size – is O(1), with a good hashing function when used with user-defined objects. As collisions increase, the time complexity degrades toward the O(n) spectrum. Java 8 introduced a balanced tree to better resolve collisions. This algorithm provides a much faster time complexity of O(log(n)).

16 learners liked this piece of theory. 0 didn't like it. What about you?

Report a typo

HashSet internals

Internal structure

HashSet and user-defined classes

HashSet time complexity

Constructor tuning

Conclusion

Related topics