We almost always think of repetitions as a bad habit, and we try to come up with new ways to deal with them. For example, is a huge number, and writing it repeatedly is a difficult task. We can write instead of , since it is much easier to handle. Here, tells us that we have to put nine zeroes after one. Power of is a similar concept to run-length encoding (RLE). If our data has a lot of repetitions, we can compress it quickly by noting how many times each unique piece of data has been repeated.
Idea of RLE
Meet Bob, a game developer who has messed up his sleeping schedule due to late-night programming. To fix his sleeping habits, he has enrolled in a sleep-aid program. Every morning, he answers the question, "Have I slept more than six hours?" to see whether the program is changing his sleeping habits. If the answer is affirmative, he writes down ; if the answer is negative, he writes down . He might forget to record the answer once or twice, in which case he will enter . Here is Bob's data, recorded for 20 days: .
Now consider this: how can we store Bob's data on a digital device with very limited memory? Let us assume that we need 4 bits to store each character. So, we will need bits in total to store the string, which is rather memory-consuming. However, by looking at the string, we can tell that every character has a few repetitions — a property we can exploit. Now, instead of storing every character separately, we can store the string like this: . Each pair here indicates the symbol and the number of its occurrence. This way, we will need bits in total to store the string, which is a lot of storage saving. Conceptually, RLE is exactly this. Now, our job is to learn how to convert a large string to symbol and occurrence pairs, i.e., encoding, and how to recover our initial string from the encoded data, i.e., decoding.
Before continuing to the following section, let us clarify one thing. Rather than writing the encoded data as , we could have inlined the symbol with its repetition number: . It looks more concise and some would say prettier. However, writing data in this way is considered a bad habit in programming, since it mixes up literal symbols and numeric values. Moreover, it may raise ambiguity during decoding. Take the following encoded data, for example: . It is unclear whether the initial string contains two hundred and thirty-four -s or just two -s and four -s. That's why we should avoid this type of representation.
Encoding algorithm
Below is a summary of the steps of RLE:
First, we use our string as an input.
Following that, we compare the first character in the string to our subsequent characters, increasing our count value for each match. In the event of a character mismatch, we save the character and its occurrence number, i.e. the count value in the encoded variable, and reset the count value to .
After that, we again start comparing our new character with the following characters. If they match, we increase the count value; otherwise, we store the character and its occurrence number.
We get the encoded data by repeating step 3 until we get to the final character of the string.
You can also see an illustration of the steps of RLE in the following flowchart:
When we perform the actions described above, we will see that certain characters do not repeat. In order to keep track of whether or not a character has been repeated, we establish a run-control variable. This is a string of logical values: and . Here we store logical in case of repetition, and logical otherwise. The significance of this variable will become apparent in the following section.
Example of encoding
Let's practice RLE using the string . To begin, we assign to the first character, , and to the second character, . We raise our counter value from to , since they match. After that, we advance by one character, which is again . As a result, our counter value rises to three. The fourth is within the same scenario, and our counter value is now four. When we continue, we find that doesn't match our as it travels to the fifth character, . This indicates that we've finished our first run, which has four -s in it. As a result, we add to our encoded stream. We also append logical to the run-control variable, since this run involves repetitions.
Now we set to a new character, , and reset our counter value to to prepare our program for the next run. We set to the character after , i.e. . This, however, is a mismatch. There are no repetitions in this run. We could write to our encoded stream, but this would cost us bits of storage: bits for and another bits for . We would use the storage for two characters to encode a single character. Clearly, this contradicts our resource-saving compression strategy. To get around this, we just save to the encoded stream. We'll append a one-bit logical to the run-control stream to inform the decoder that this character doesn't repeat.
We follow the same process for the subsequent runs. Finally, here's what our encoded stream looks like: , and the run-control variable is .
Decoding algorithm
Decoding procedures are straightforward. First, we read a bit from the run-control. If it's , it indicates that the encoded stream contains a pair of data: a character and the number of times it's repeated, . Hence, we write the character to the decoded stream times. The value in the run-control, on the other hand, indicates that there are no repetitions. It means that we have the initial character in the encoded stream, and we write it to the decoded stream only once. We can recover our original data from the encoded data by following these methods for every bit in the run-control.
Now, let us decode the string we have encoded in the previous section. We have the following encoded data: , as well as its run-control, . To begin with, we read the first bit from the run-control, which is . It means that the first character repeats in the initial string, so we read the corresponding pair of data and write to the decoded stream. The next bit of the run-control is . Hence, we only read the next character, , and write it to the decoded string. Following the same procedures for each bit of the run-control, the encoded data will convert back to the original string: .
Time complexity
Because we need to read each character to encode it, the encoding time is proportional to the length of the input string. As a result, the encoding time complexity is , where is the input string length. In cases when the data has no repetitions, a worst-case scenario takes place. It results in a negative compression of our data, with the output being greater in size than the input.
The length of the encoded data also influences the decoding time. The time complexity for this operation is , where is the length of the encoded data.
Conclusion
The run-length encoding method of data compression is relatively simple and effective. RLE may drastically compress large amounts of data with numerous repeating parts. This type of data compression has a wide range of uses. A good example is a bitmap image, where RLE is used to hold all of the pixel information, which is why these pictures are so compact. You can compress any data using RLE by following these steps:
Read the string.
Starting from the first character, compare a character with its subsequent characters and record the number of matches, until there is a mismatch. In case of a mismatch, record the character and its number of occurrences.
Then, starting with the new character, repeat match-counting and mismatch-recording procedures, until reaching the last character of the string.