
Huffman coding


While installing software on your PC, you might have noticed something strange: the installed software actually takes up more storage than the installation file itself! The installation file needs to contain not only the software but also additional files and programs that help install it. So how is it possible that the installation file takes less space than the installed software?

The answer is very simple. The software contained in the file was compressed to take up less space than it normally would. When we install the software, it gets decompressed and takes up more space.

So how do we compress data? Well, we use algorithms! They compress our data when we need them to and decompress it when we want our original data back.

In this topic, we will learn about a compression algorithm called Huffman Coding. We'll explore why it's used, understand the underlying algorithms, and discover how Huffman coding compresses data.

Lossy compression vs lossless compression

Before we delve further into what Huffman Coding actually is and how it works, we need to understand the difference between Lossy and Lossless Compression.

Imagine you have a picture that you want to make smaller.

If you use an algorithm that compresses the picture in such a way that it doesn't lose any detail, then no data is lost during compression. This type of compression is known as lossless compression. Lossless compression maintains the quality of the picture while still reducing its storage requirements.

However, there is a limit to how much space you can save with lossless compression. If you want to save even more space, you can discard some data during compression. You still have the picture, and you have compressed it to a much greater degree, but you lose some data in the process. This type of compression is referred to as lossy compression. Have you ever noticed that images sent through messaging apps become blocky, and the more times the same image is re-sent, the blockier it gets? This is an example of lossy compression.

Huffman Coding is a popular compression algorithm because it compresses data without any loss.

Working mechanism

So how does Huffman Coding compress data without losing any information?

Let's imagine a multiple-choice test with 50 questions, where A, B, C, and D are the possible answers.

To store the answers, we would need to store a sequence of A/B/C/D characters, where each character requires 1 byte of storage space. So 50 answers would require a total of 50 bytes.

To compress this set of answers, we first need to count the number of times each option is correct. The options that occur more often will get shorter codes when compressed.

Based on the frequency of the correct options, we create a special tree called the Huffman tree. This is a binary tree that contains all the options, where the more frequent options appear near the top and the less frequent options sit at the bottom. This means the code word will be shorter for frequent characters and longer for rare ones. We will discuss how to create a Huffman tree later on.

After creating the tree, we assign 0's and 1's to the branches, moving down the tree level by level. This gives us a sequence of 0's and 1's for each leaf of the tree that represents a unique option. One thing to note: there is no fixed rule for assigning the 0's and 1's, so the same sequence can have multiple different Huffman codes/trees that are all valid.


After the unique sequences are obtained, we can create a table that lists every option and the sequence that represents it. We can refer to this table as our codebook.

Option | Huffman Code
A      | 0
B      | 10
C      | 110
D      | 111

This codebook can now be used to encode or decode our list of options, replacing each option with its code and vice versa.

Do note that this tree and table only work for the sequence they were made for. If the answers change, we need to create a completely new Huffman tree for the new scenario.
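To make the idea concrete, here is a minimal Python sketch of encoding and decoding with the codebook above (the codebook dictionary and the sample answer string are assumptions for illustration). Because Huffman codes are prefix-free, no code is the beginning of another, so the decoder can emit a character as soon as it reads the last bit of a code.

codebook = {"A": "0", "B": "10", "C": "110", "D": "111"}

def encode(answers):
    # Replace every option with its code and concatenate the results.
    return "".join(codebook[answer] for answer in answers)

def decode(bits):
    # Invert the codebook and read bits until a complete code appears.
    reverse = {code: char for char, code in codebook.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:        # a complete code was read
            decoded.append(reverse[current])
            current = ""
    return "".join(decoded)

print(encode("ABACD"))            # 0100110111
print(decode(encode("ABACD")))    # ABACD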

How to build a Huffman tree

Building a Huffman tree seems a bit tricky at first, but let's look at the steps one by one to make it simpler.

Step 1: First, we need to know how often each letter or symbol appears in our message, so we count the frequency of each letter. For example, for our test, "A" is correct 22 times, "B" is correct 15 times, "C" 10 times, and "D" is correct 3 times. Now that we have the frequencies, we can create an individual node for each letter. These nodes are like puzzle pieces that we will use to build our Huffman tree.

Nodes to build Huffman tree
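If you want to try this step yourself, counting frequencies is a one-liner in Python; the answer string below is an assumed stand-in for the test results from our example.

from collections import Counter

# An assumed 50-answer sequence with the frequencies used in this example.
answers = "A" * 22 + "B" * 15 + "C" * 10 + "D" * 3
frequencies = Counter(answers)
print(frequencies)    # Counter({'A': 22, 'B': 15, 'C': 10, 'D': 3})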

Step 2: Next, we compare the frequencies of the nodes. We take the two nodes with the lowest frequencies and combine them into a new parent node. If several nodes share the same frequency, pick any two of them. The frequency of the parent node is the sum of the frequencies of its children. Here, "C" and "D" are combined, as they have the lowest frequencies of 10 and 3, respectively; their combined frequency of 13 becomes the frequency of their parent node.

Building the tree: step 2

Step 3: Now we compare the remaining frequencies in the tree. The combined tree has a frequency of 13 and "B" has a frequency of 15, making them the two nodes with the lowest frequencies, so we can now fuse them. This step is repeated until we get a single tree that contains all the nodes.

Building the tree: step 3

Huffman tree

Step 4: As we build the tree, we assign a "0" to each left branch and a "1" to each right branch. This creates a binary code for each letter. The codes for the more frequent letters will be shorter, and the codes for the less frequent letters will be longer. This is our final Huffman tree.

Step 4

Once we have the Huffman tree, we can use it to encode and decode messages. To encode a message, we replace each letter with its corresponding code from the tree. To decode a message, we start at the root of the tree and follow the 0's and 1's until we reach a letter.
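It's worth checking what this buys us. With the codes from our codebook, the 22 A's take 1 bit each, the 15 B's take 2 bits each, and the 10 C's and 3 D's take 3 bits each, for a total of 22 + 30 + 30 + 9 = 91 bits. That's about 12 bytes instead of the original 50 bytes.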

The non-deterministic nature of the Huffman tree

You already know that there are no rules for assigning the 0's and 1's to the branches of a Huffman tree. So, depending on how the 0's and 1's are placed, the same tree can result in multiple different Huffman codes that are all valid and correct. Let's look at an example to show this.

Same tree as above

Both Huffman trees above are valid even though the placements of the nodes were changed, and while the trees produce different Huffman codes, the compression they provide is still the same.

But we are not done yet: there is another scenario, where multiple nodes have the same frequency! When multiple nodes have the same frequency, the nodes can be placed in any combination, as long as the other steps are followed. This allows even more trees and codes to be valid. For example, let's look at four Huffman trees produced for the sequence "AAAABBBBCCCCDDDD".

Tree 1

Tree 2

Tree 3

Tree 4

Here, all four trees produce different Huffman codes, but each is a valid Huffman tree for the sequence, and they all provide the same rate of compression as well.

Based on the two cases above, we can say that the Huffman algorithm is non-deterministic, which simply means that the output of the algorithm cannot be predicted even though the input is known. This is because, while creating the tree, you have more than one correct choice at several junctures, and since the generic Huffman algorithm doesn't specify a tie-breaking rule, any option can be selected, resulting in multiple valid trees.

Now, while there are no such rules in the generic Huffman algorithm, we can create our own to make the process less ambiguous. Let's make a list of such rules so that all the trees you create for a particular sequence will be the same.

  1. For each pair of branches, the left branch is assigned "0" and the right branch is assigned "1".

  2. All nodes are placed in order of their frequency: nodes with higher frequencies go on the right side, whereas those with lower frequencies go on the left. This also means that for any parent node, the child on the right will always have a frequency at least as high as the child on the left.

  3. When multiple nodes/characters have the same frequency, they keep the order of their first occurrence. For example, in the sequence ABBACDDC, the order of occurrence of the characters is A->B->C->D, so we maintain that order when they have the same frequency. In this case, the nodes for "C" and "D" are made first, then the nodes for "A" and "B", as shown in the figure below.

Another tree

Using these rules, we can consistently produce the same tree for the same sequence. This allows us to easily use and compare Huffman codes without having to keep the other possible permutations in mind.

Please note that the other permutations of the Huffman tree are still completely valid and usable; these rules simply reduce ambiguity in tasks.

Huffman encoding pseudocode

Now let's look at the pseudocode for Huffman encoding. First, you need to build the Huffman tree, as the tree is necessary to obtain the Huffman codes. The algorithm uses a min-priority queue (which is most efficiently implemented as a min-heap) to build this tree.

A min-heap is the correct data structure for this task because its primary function is to quickly find and extract the element with the lowest value. In this algorithm, we use it to repeatedly get the two nodes with the minimum frequencies.

It's important to distinguish between the tool (min-heap) and the result (Huffman tree):

  • min-heap: A data structure where every parent node is less than or equal to its children. We use this to build the tree.

  • Huffman tree: The final tree is not a min-heap. In this tree, a parent node's frequency is the sum of its children's frequencies, so the parent node is greater than its children.

The full process, which is also shown in the pseudocode below, is as follows:

  1. Create a leaf node for each character and its frequency.

  2. Insert all these leaf nodes into the min-heap.

  3. While the heap has more than one node:

    • Extract the two nodes with the lowest frequencies.

    • Combine them into a new internal parent node (whose frequency is the sum of its children).

    • Insert this new parent node back into the min-heap.

The single node left in the heap is the root of the Huffman tree.

// to build the Huffman Tree
function buildHuffmanTree(characters, frequencies) is
    create min_heap      // a min-heap that keeps nodes ordered by frequency
    for (each character, frequency in characters, frequencies) do
        create new_node // create a leaf node for each character with its frequency
        insert new_node into min_heap
    while (min_heap has more than one node) do
        create new_node // create a new internal node
        new_node.left_child = extract minimum from min_heap // remove the node with the lowest frequency and set it as the left child
        new_node.right_child = extract minimum from min_heap // remove the node with the second-lowest frequency and set it as the right child
        new_node.frequency = new_node.left_child.frequency + new_node.right_child.frequency // the parent's frequency is the sum of its children's
        insert new_node into min_heap // insert the new internal node back into min_heap
    return the remaining node in min_heap // the last remaining node is the root of the Huffman Tree
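For a runnable version, here is a minimal Python sketch of the same procedure using the heapq module. The Node class and the tie-breaking counter are assumptions of this sketch (heapq needs some way to order nodes with equal frequencies); they are not part of the generic algorithm.

import heapq
import itertools

class Node:
    def __init__(self, frequency, character=None, left=None, right=None):
        self.frequency = frequency
        self.character = character   # None for internal nodes
        self.left = left
        self.right = right

def build_huffman_tree(characters, frequencies):
    order = itertools.count()   # tie-breaker for nodes with equal frequencies
    heap = [(f, next(order), Node(f, c)) for c, f in zip(characters, frequencies)]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the node with the lowest frequency
        f2, _, right = heapq.heappop(heap)   # the node with the second-lowest frequency
        parent = Node(f1 + f2, left=left, right=right)
        heapq.heappush(heap, (parent.frequency, next(order), parent))
    return heap[0][2]   # the last remaining node is the root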

After generating the tree, you need to generate the Huffman codes from it.

To generate the codes, you need to traverse the tree. For every node, check whether it is a leaf node (i.e., a node that contains a character). If it is a leaf node, store the character and its code in a dictionary. If it is not, traverse both the left and right children of the current node, appending "0" to the code when you move to the left child and "1" when you move to the right child. Repeat this until all the nodes have been traversed and the code for every leaf node is recorded.

// to generate Huffman Codes from the Huffman Tree
function generateHuffmanCodes(huffman_tree_node, current_code) is
    if (huffman_tree_node is a leaf node) then
        character = huffman_tree_node.character
        code = current_code
        store character and code in a dictionary // record the character and its Huffman code
    else
        generateHuffmanCodes(huffman_tree_node.left_child, current_code + "0") // traverse left, appending "0" to the current code
        generateHuffmanCodes(huffman_tree_node.right_child, current_code + "1") // traverse right, appending "1" to the current code
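Continuing the Python sketch from above (it reuses the assumed Node class and build_huffman_tree), code generation is a short recursive walk:

def generate_huffman_codes(node, current_code="", codes=None):
    if codes is None:
        codes = {}
    if node.character is not None:   # leaf node: record its code
        codes[node.character] = current_code
    else:
        generate_huffman_codes(node.left, current_code + "0", codes)
        generate_huffman_codes(node.right, current_code + "1", codes)
    return codes

root = build_huffman_tree(["A", "B", "C", "D"], [22, 15, 10, 3])
print(generate_huffman_codes(root))

The exact bit patterns this prints may differ from the codebook shown earlier (this is the non-determinism discussed above), but the code lengths, and therefore the compression, are the same.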

The complexity of Huffman coding

Let's look at the complexity of the Huffman algorithm in terms of time and space:

Time Complexity: The overall time complexity of the Huffman algorithm is O(n log n), where 'n' represents the number of symbols (characters) in the input data. The main steps contributing to this time complexity are:

  • Building the Frequency Table: In the first step, you need to look at the input data and count the frequency of each symbol (character). This step takes O(n) time because each symbol needs to be processed once.

  • Creating the Huffman Tree: To construct the Huffman tree, you use a priority queue (min-heap) to merge nodes. Extracting the minimum from the priority queue happens 2*(n-1) times, and each extraction costs O(log n); since there are 'n' elements, the total time spent on creating the Huffman tree is O(n log n).

  • Constructing Huffman Codes: Once the Huffman tree is constructed, you need to determine the Huffman codes for each character. This takes O(n) time since each character is visited once.

So, the overall time complexity of the Huffman algorithm is the sum of the above steps, which is O(n + n log n + n) = O(n log n).

Space Complexity: The algorithm needs to store the frequency table and the Huffman tree, both of which hold O(n) elements in the worst case, so the space complexity of the Huffman algorithm is O(n).

Applications of Huffman Coding

Huffman Coding is used in a variety of applications. It is a very efficient and useful lossless method of compressing data. Some of its applications include:

  • File Compression: Huffman coding is widely used in file compression algorithms (like ZIP) to reduce file sizes for storage and faster transmission.

  • Image Compression: Huffman coding is applied in image compression formats (such as JPEG) to reduce the size of images while maintaining acceptable quality.

  • Video Compression: Huffman coding is used in video compression standards (like MPEG) to compress video files, making them easier to store and stream.

  • Text Compression: Huffman coding is effective in compressing text documents, reducing their size for efficient storage and faster transmission.

  • Network Communication: Huffman coding is utilized in network protocols to compress data before transmission, optimizing bandwidth usage and enhancing network efficiency.

  • Data Storage: Huffman coding is employed in data storage systems to maximize storage capacity by compressing data before storing it.

  • Web Page Compression: Huffman coding is applied to compress HTML, CSS, and JavaScript files, making web pages load faster and reducing bandwidth usage.

Conclusion

In conclusion, Huffman coding is a very useful algorithm that helps make files smaller. It uses shorter codes for more common symbols, making storage and transmission faster. From compressing files to encoding images and videos, Huffman coding is used throughout our digital world. By using Huffman coding, we can save space, send data faster, and manage information more effectively. It's like a superpower for making data smaller and more efficient!
