Computer scienceAlgorithms and Data StructuresAlgorithmsString coding algorithmsData compression

Run Length Encoding

10 minutes read

We almost always think of repetitions as a bad habit, and we try to come up with new ways to deal with them. For example, 10000000001000000000 is a huge number, and writing it repeatedly is a difficult task. We can write 10910^9 instead of 10000000001000000000, since it is much easier to handle. Here, 10910^9 tells us that we have to put nine zeroes after one. Power of 1010 is a similar concept to run-length encoding (RLE). If our data has a lot of repetitions, we can compress it quickly by noting how many times each unique piece of data has been repeated.

Idea of RLE

Meet Bob, a game developer who has messed up his sleeping schedule due to late-night programming. To fix his sleeping habits, he has enrolled in a sleep-aid program. Every morning, he answers the question, "Have I slept more than six hours?" to see whether the program is changing his sleeping habits. If the answer is affirmative, he writes down YY; if the answer is negative, he writes down NN. He might forget to record the answer once or twice, in which case he will enter OO. Here is Bob's data, recorded for 20 days: NNNNNNYYYYYYOOYYYYYYNNNNNNYYYYYYOOYYYYYY.

Now consider this: how can we store Bob's data on a digital device with very limited memory? Let us assume that we need 4 bits to store each character. So, we will need 20×4=8020 \times 4 = 80 bits in total to store the string, which is rather memory-consuming. However, by looking at the string, we can tell that every character has a few repetitions a property we can exploit. Now, instead of storing every character separately, we can store the string like this: (N,6)(Y,6)(O,2)(Y,6)(N,6)(Y,6)(O,2)(Y,6). Each pair here indicates the symbol and the number of its occurrence. This way, we will need 8×4=328 \times{} 4 = 32 bits in total to store the string, which is a lot of storage saving. Conceptually, RLE is exactly this. Now, our job is to learn how to convert a large string to symbol and occurrence pairs, i.e., encoding, and how to recover our initial string from the encoded data, i.e., decoding.

Before continuing to the following section, let us clarify one thing. Rather than writing the encoded data as (N,6)(Y,6)(O,2)(Y,6)(N,6)(Y,6)(O,2)(Y,6), we could have inlined the symbol with its repetition number: N6Y6O2Y6N6Y6O2Y6. It looks more concise and some would say prettier. However, writing data in this way is considered a bad habit in programming, since it mixes up literal symbols and numeric values. Moreover, it may raise ambiguity during decoding. Take the following encoded data, for example: N234N234. It is unclear whether the initial string contains two hundred and thirty-four NN-s or just two NN-s and four 33-s. That's why we should avoid this type of representation.

Encoding algorithm

Below is a summary of the steps of RLE:

  1. First, we use our string SS as an input.

  2. Following that, we compare the first character in the string to our subsequent characters, increasing our count value for each match. In the event of a character mismatch, we save the character and its occurrence number, i.e. the count value in the encoded variable, and reset the count value to 11.

  3. After that, we again start comparing our new character with the following characters. If they match, we increase the count value; otherwise, we store the character and its occurrence number.

  4. We get the encoded data by repeating step 3 until we get to the final character of the string.

You can also see an illustration of the steps of RLE in the following flowchart:

illustration of the steps of RLE

When we perform the actions described above, we will see that certain characters do not repeat. In order to keep track of whether or not a character has been repeated, we establish a run-control variable. This is a string of logical values: 00 and 11. Here we store logical 11 in case of repetition, and logical 00 otherwise. The significance of this variable will become apparent in the following section.

Example of encoding

Run-length encoding

Let's practice RLE using the string S=YYYYONNNYYYYYS = YYYYONNNYYYYY. To begin, we assign C1C1 to the first character, YY, and C2C2 to the second character, YY. We raise our counter value from 11 to 22, since they match. After that, we advance C2C2 by one character, which is again YY. As a result, our counter value rises to three. The fourth YY is within the same scenario, and our counter value is now four. When we continue, we find that C2C2 doesn't match our C1C1 as it travels to the fifth character, OO. This indicates that we've finished our first run, which has four YY-s in it. As a result, we add (Y,4)(Y, 4) to our encoded stream. We also append logical 11 to the run-control variable, since this run involves repetitions.

Now we set C1C1 to a new character, OO, and reset our counter value to 11 to prepare our program for the next run. We set C2C2 to the character after OO, i.e. NN. This, however, is a mismatch. There are no repetitions in this run. We could write (O,1)(O, 1) to our encoded stream, but this would cost us 88 bits of storage: 44 bits for OO and another 44 bits for 11. We would use the storage for two characters to encode a single character. Clearly, this contradicts our resource-saving compression strategy. To get around this, we just save OO to the encoded stream. We'll append a one-bit logical 00 to the run-control stream to inform the decoder that this character doesn't repeat.

We follow the same process for the subsequent runs. Finally, here's what our encoded stream looks like: (Y,4)O(N,3)(Y,5)(Y,4)O(N,3)(Y,5), and the run-control variable is 10111011.

Decoding algorithm

Decoding procedures are straightforward. First, we read a bit from the run-control. If it's 11, it indicates that the encoded stream contains a pair of data: a character and the number of times it's repeated, nn. Hence, we write the character to the decoded stream nn times. The 00 value in the run-control, on the other hand, indicates that there are no repetitions. It means that we have the initial character in the encoded stream, and we write it to the decoded stream only once. We can recover our original data from the encoded data by following these methods for every bit in the run-control.

Now, let us decode the string we have encoded in the previous section. We have the following encoded data: (Y,4)O(N,3)(Y,5)(Y,4)O(N,3)(Y,5), as well as its run-control, 10111011. To begin with, we read the first bit from the run-control, which is 11. It means that the first character repeats in the initial string, so we read the corresponding pair of data (Y,4)(Y, 4) and write YYYYYYYY to the decoded stream. The next bit of the run-control is 00. Hence, we only read the next character, OO, and write it to the decoded string. Following the same procedures for each bit of the run-control, the encoded data will convert back to the original string: YYYYONNNYYYYYYYYYONNNYYYYY.

Time complexity

Because we need to read each character to encode it, the encoding time is proportional to the length of the input string. As a result, the encoding time complexity is O(n)O(n), where nn is the input string length. In cases when the data has no repetitions, a worst-case scenario takes place. It results in a negative compression of our data, with the output being greater in size than the input.

The length of the encoded data also influences the decoding time. The time complexity for this operation is O(n)O(n) , where nn is the length of the encoded data.

Conclusion

The run-length encoding method of data compression is relatively simple and effective. RLE may drastically compress large amounts of data with numerous repeating parts. This type of data compression has a wide range of uses. A good example is a bitmap image, where RLE is used to hold all of the pixel information, which is why these pictures are so compact. You can compress any data using RLE by following these steps:

  1. Read the string.

  2. Starting from the first character, compare a character with its subsequent characters and record the number of matches, until there is a mismatch. In case of a mismatch, record the character and its number of occurrences.

  3. Then, starting with the new character, repeat match-counting and mismatch-recording procedures, until reaching the last character of the string.

6 learners liked this piece of theory. 0 didn't like it. What about you?
Report a typo