Computer scienceAlgorithms and Data StructuresAlgorithmsString coding algorithms

Data compression and encoding

6 minutes read

With each day technology plays an even more important in our lives. There are specific types of data due to its great variety, which grows all the time. For instance, we have image files, audio files, video files, text files etc. and etc.
We use respective programs for these files and sometimes share them with our friends or family members. This brings us to two important concepts which we are going to take a look at in a second.

What is encoding and what is compression?

These two concepts are related to the representation and manipulation of digital data.

When we speak about encoding we mean converting information from one form to another. This is done for storage, transmission or processing reasons by representing the data in a specific format. A basic example of an encoding is UTF-8 which is nowadays a standard in all the editors and websites. An older encoding, for instance, which is ASCII does not support any other characters than Latin, which will show you as unknown symbols. You may be familiar with that, for instance, if new emojis are being sent to you and you did not update your phone software to meet the latest version.

With data compression we reduce the size of data files by encoding the information in a more efficient way. The goal is to save storage space or make the data transmission faster. It is worth mentioning that compression can be either lossless or lossy. An example of lossless compression are ZIP files or installer packaging. Why is this widely used for archived files? Because this offers the ability to reconstruct the files from the archive, while not sacrificing or losing information. An example of lossy compression are JPEG images. Why is this kind of compression popular? The compression rate that the images achieve and the reduction in initial information is not much visible to the human eye and this can offer more possibilities for both storing and transmission purposes.

Usage of encoding and compression in our everyday life

As mentioned previously before, due to rich amounts of information these concepts are widely used everywhere.
Here is a list of some some of them, where compression and encoding are used on a regular basis.

Digital Communication:
- Data is encoded for transmission purposes over the internet, radio waves, or optical fibers.
  Encoding of visual data, images, audio, and video.
- Data is compressed to reduce the amount of data transferred, to improved bandwidth, and speed up the communication.
  Compression for video streaming, online gaming.
Web Technologies:
- Data is encoded to handle web page's text data in various languages.
  UTF-8, UTF-16, ASCII.
- Data is compressed in order to reduce loading times and size over the internet.
  gzip, Brotli.
Data Storage:
- Data is encoded to ensure better representation in storage or databases.
  Encoding of text, numbers, and other types of data.
- Data is compressed to reduce the required storage space for digital data.
  Formats like ZIP for archives and compression algorithms like gzip are commonly used

Now that we got acquainted with the concepts of encoding and compression, it is time to get familiar with the popular algorithms.

Brief descriptions of the algorithms

The first one is Run Length Encoding or RLE for short.

RLE's purpose is to reduce the size of data by representing consecutive identical elements as a single value followed by the number of this element's occurrences. Let us take a look at a basic example:

Our string: $AAAAAAABBBBBBCCCCCBBBDDDD$ .

Summary RLE's steps:

First, we use our string as input.
Then we compare the first character to our subsequent characters in the string, increasing our count value for each match. If the characters do not match, we save the previous character and its occurrence number.
After that, we compare our new character with the following characters. If they match, we increase the count value; otherwise, we store the character and its occurrence number.
We repeat step up until the end of the string.

The code for RLE looks like this:

def run_length_encoding(data):
    encoded_data = ""
    count = 1

    # Iterate through the input data
    for i in range(1, len(data)):
        if data[i] == data[i - 1]:
            count += 1
        else:
            encoded_data += str(count) + data[i - 1]
            count = 1

    # Add the last run
    encoded_data += str(count) + data[-1]

    return encoded_data

The encoding: $(A,7), (B, 6), (C,5),(B,3),(D,4)$

The basic principle behind this algorithm is to represent the sequences in a shorter format which reduces the storage space. Each entry in the sequence above would take 4 bytes, but instead it is only 20 bytes, because now each sequence representation takes 4 bytes.

Advantages:

Easy to implement and easily understandable.
Effective for long runs of the same repeated data.

Limitations:

Inefficient for non-repetitive data.
Works best with a fixed alphabet.

Page dedicated to the algorithm itself: Run Length Encoding

The second one is Shannon-Fano coding.

This algorithm is a lossless compression algorithm, which means that no information is lost during compression. Its purpose is to compress data by assigning variable-length codes to different symbols, where more frequent symbols get shorter codes.

First of all, we need to build a Shannon-Fano tree, let us take the initial data $AAAABBBCCD$ as an example.

The summary of steps:

We count the frequence or probability of each letter. $\rightarrow A: 0.4, B: 0.3, C: 0.2, D: 0.1$
We divide the table into two parts so that the total probability of both parts is as close to each other as possible.
We repeat step 2 until all symbols are split into individual subgroups.
Assign $0$ to the left side of the tree and $1$ to the right side respectively. We get the finalized Shannon-Fano tree.

The subgroup division: $('A':4), ('BCD': 6) \rightarrow ('A':4), ('B': 3), ('CD': 3) \rightarrow ('A':4), ('B': 3), ('C': 2), ('D': 1)$

Binary representation of frequency: $A : 0, B : 10, C : 110, D : 111$ , because we are going down the binary tree.

More frequent symbols are assigned with shorter codes meanwhile the less frequent get longer codes. The algorithm divides the symbols into groups until each group represents a single symbol to which a binary code is assigned accordingly.

Advantages:

Efficient for data with varying symbol frequencies.
Attempts to minimize the average code length.

Limitations:

Algorithm may not always produce the most efficient codes by trying to reach optimization.
Prior knowledge of symbol frequencies is needed.

Page dedicated to the algorithm itself: Shannon-Fano coding

The third algorithm on our list is the Huffman Coding.

Its purpose is to compress data without losing any information, in other words lossless. In which variable-length codes are assigned to input symbols, with more frequent symbols getting shorter codes.

Suppose we have the following input data: $A \space A \space B \space C \space D$ and we start off by building a Huffman tree:

First, we need to know how often each letter or symbol appears in our message. $\rightarrow A: 2, B: 1, C: 1, D: 1$
Next, we compare the frequencies of the nodes. We take the two nodes with the lowest frequencies and combine them into a new parent node.
Now we compare the remaining frequencies for the tree and repeat until we get single tree containing all the nodes.
Assign $0$ to the left branch and $1$ to the right one.

A min-heap is a binary tree, where data contained in each node is less than (or equal to) the data in that node’s children.

Assigning binary codes: $A : 0, B : 10, C : 110, D : 111$

Output: $0010110111$

Advantages:

Prefix code construction in order to minimize average code length
Straightforwards and efficient decoding.

Limitations:

Prior knowledge of symbol frequencies is assumed.
The variable-length nature of codes may lead to slight inefficiencies in storage.

Page dedicated to the algorithm itself: Huffman coding

The last algorithm is the Lempel-Ziv-Welch compression or LZW for short.

LZW is a universal lossless data compression algorithm, its purpose is to reduce data size by encoding repetitive sequences of symbols with variable-length codes, the algorithm is dictionary-based.

Suppose we have the following input data: $ABABABABA$ and we go through the following steps for the LZW algorithm:

First of all, we start off by initializing a dictionary:
Secondly, we start with an empty buffer and read the first symbol in our sequence, which is $A$ and add it to the buffer.
We proceed to the next symbol $B$ and check whether $AB$ is in the dictionary:
- Since $AB$ is not in our dictionary, we output the code for $A(0)$ and add $AB$ to our dictionary with a new code.
- We reset our buffer to $B$ .
We move to the next symbol which is $A$ and check whether $BA$ exists in our dictionary:
- Since $BA$ is not in our dictionary, we output the code for $B(1)$ and add $BA$ to our dictionary with a new code.
- We reset our buffet to $A$ .
Repeat the previous steps with the same logic until the input data ends.

As a result our encoded data looks like this: $0 \space 1 \space 2 \space 3$ , each code representing a sequence of symbols in our original data.

The principle works the following way: we look for the longest character available in the dictionary which is done at the initial step and then with each iteration we assign new codes to longer characters. Common patterns get replaced by shorter codes and new patters get added to the dictionary, this is done until all data is entered.

Advantages:

Adapts to the input data by dynamically updating the dictionary.
Good for compressing repetitive patterns in data.

Limitations:

Dictionary size may grow rapidly for certain data types.
Output code are fixed in size, which may be inefficient for short patters.

When to use each algorithm, basic use cases:

RLE (Time Complexity: $\mathcal{O}(n)$ ) is effective when:

Long sequences of the same symbol appear in the input data.
The set of possible symbols is small.
It provides good storage efficiency for repeated symbols.
It has low computational complexity, which enables fast compression and decompression.

Shannon-Fano coding (Time Complexity: $\mathcal{O}(n \log n)$ ) is effective when:

It's used for text compression, but Huffman coding is often preferred today.
You can estimate the frequency of different symbols in the input data.

Huffman coding (Time Complexity: $\mathcal{O}(n \log n)$ ) is effective when:

You need optimal prefix codes.
You can estimate the symbol frequencies.
You're compressing files like .zip archives.
Uniform code lengths are beneficial.

LZW (Time Complexity: $\mathcal{O}(n)$ ) compression is effective when:

There are repetitive patterns in the data.
You're dealing with image compression formats such as .gif or .tiff.
The algorithm adapts to dynamic input data.
There is variation in pattern lengths.

Conclusion

Run-Length Encoding is a simple yet effective compression method ideal for data with repeated patterns. It's a smart pick for when you need compression that doesn't use much resources.
Shannon-Fano coding is a variable-length coding algorithm that gives shorter codes to more common symbols. It offers a relatively straightforward way to compress data, but it isn't always the best.
Huffman coding is a widely-used algorithm for lossless data compression, especially good for data with different symbol frequencies. By making optimal prefix codes, it ensures efficient compression and decoding.
LZW compression is a flexible, dictionary-based algorithm that learns from the input data by substituting repetitive patterns with variable-length codes. LZW is well-liked for compressing text and files and works well with many types of data.

2 learners liked this piece of theory. 1 didn't like it. What about you?

Report a typo

Data compression and encoding

What is encoding and what is compression?

Usage of encoding and compression in our everyday life

Brief descriptions of the algorithms

The first one is Run Length Encoding or RLE for short.

The second one is Shannon-Fano coding.

The third algorithm on our list is the Huffman Coding.

The last algorithm is the Lempel-Ziv-Welch compression or LZW for short.

When to use each algorithm, basic use cases:

Conclusion

Related topics