
Shannon–Fano coding


While downloading files, you might have come across zipped files. To access the contents of this type of file, you need to "extract" it first. After extraction, the file grows back to its original size, and you can interact with it and its contents as usual.

So, what happens when you extract a ZIP file?

We can use compression software to compress files and folders. Compression reduces the storage a file requires, which is important because a smaller file is easier to store and faster to transfer. When we want to use the files again, we decompress them and get our original content back. This is how ZIP files work.

There are many algorithms that compress data. In this topic, we will learn more about one such algorithm, called Shannon-Fano coding, and explore how it compresses data and how to implement it.

Working Mechanism

Shannon-Fano coding is a lossless compression algorithm, which means no information is lost during the compression. So, how does Shannon-Fano Coding compress data without losing any information?

Let's imagine a multiple-choice test with 50 questions, and A, B, C, and D as the possible answers.

To store the answers, you would need to store a sequence of A/B/C/D's where each character would require 1 byte of storage space. So 50 answers would require a total of 50 bytes to store.

To compress this set of answers, you first need to count the number of times each option is correct. The option that occurs more often will get a shorter code when compressed. Shannon-Fano coding is a variable-length encoding scheme; this means the codes assigned to symbols can be of varying lengths.

Based on the frequency of occurrence, you now arrange the options in descending order of their probability and use this table to create a Shannon-Fano tree. We will discuss how to create a Shannon-Fano tree later.

The Shannon-Fano tree for this particular problem looks like this:

The path from the top of the tree to each letter is a unique sequence of 0s and 1s; these sequences are the codes for our characters. Now we can create a table that contains the list of all options and the sequence that represents each of them. We can refer to this table as our codebook.

You can now use this codebook to encode or decode the list of options, replacing each option with its code and vice versa.

Note that this tree and table only work for the sequence they were made for. If the answers change, you need to create a completely new Shannon-Fano tree for the new scenario.

Building a Shannon-Fano Tree

Building a Shannon-Fano tree might seem a bit tricky at first, but let's break it down into simple steps.

Step 1: First, you need to know how often each letter or symbol appears in the message, so count the frequency of each letter. For example, in our test, "A" is correct 22 times, "B" 15 times, "C" 10 times, and "D" only 3 times. With these frequencies, you can now create a table that arranges the symbols in descending order based on their occurrence probability.
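As an illustration, here is a minimal Python sketch of this counting step; the answers string below is just one possible sequence with the counts from our example:

from collections import Counter

answers = "A" * 22 + "B" * 15 + "C" * 10 + "D" * 3  # one possible set of 50 answers

# Count how often each option occurs and convert the counts to probabilities
counts = Counter(answers)
total = len(answers)
table = sorted(((symbol, count / total) for symbol, count in counts.items()),
               key=lambda pair: pair[1], reverse=True)

print(table)  # [('A', 0.44), ('B', 0.3), ('C', 0.2), ('D', 0.06)]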

Step 2: Next, you divide the table into two parts so that the total probabilities of the two parts are as close to each other as possible.

Note that you can also work with the raw frequencies of the parts, since they play the same role as probabilities in this scenario.

Step 3: Repeat step 2 for every sub-table formed until each symbol ends up in its own subgroup.

First, do it for the "BCD" subgroup.

Then, do it for the "CD" subgroup.
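With the example frequencies, the three splits work out as follows:

  • Split 1: {A} (22) vs. {B, C, D} (28)

  • Split 2: {B} (15) vs. {C, D} (13)

  • Split 3: {C} (10) vs. {D} (3)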

Step 4: Now, assign 0 to the left half and 1 to the right half of the tree. This gives us our finalized Shannon-Fano tree.
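For our example, this yields the codes A → 0, B → 10, C → 110, and D → 111. Encoding all 50 answers then takes 22×1 + 15×2 + 10×3 + 3×3 = 91 bits, or about 12 bytes, instead of the original 50 bytes.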

Once you have the Shannon-Fano tree, you can use it to encode and decode messages. To encode a message, replace each letter with its corresponding code from the tree. To decode a message, start at the root of the tree and follow the 0's and 1's until you reach a letter.
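For instance, with the codes above, the bit string 010110 decodes as A (0), then B (10), then C (110). Because no code is a prefix of another, there is never any ambiguity about where one code ends and the next begins.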

Shannon-Fano Pseudocode

Now let's look at the pseudocode for Shannon-Fano encoding.

Create a Shannon-Fano Tree

First, you need to understand how to build the Shannon-Fano tree.

The code begins inside the function "buildShannonFanoTree()" and checks whether there is only one symbol in the "Symbols" list. If so, it creates a leaf node for that single symbol using the function "CreateLeafNode()". This is our base case.

function CreateLeafNode(Symbol) is    //Create a leaf node containing the symbol and its probability
  LeafNode = NewNode()
  LeafNode.symbol = Symbol
  LeafNode.probability = Symbol.probability
  return LeafNode

However, if there is more than one symbol, the symbols are sorted in descending order of their probabilities. Next, the code splits them using the function "FindSplitPoint()". This function calculates the split point based on symbol probabilities and returns it. The "splitPoint" divides the symbols into two subsets and is found by accumulating symbol probabilities until the running total exceeds half of the subset's total probability, which keeps the two halves as close to equal as possible. This split point is used to create the high- and low-probability subsets.

function FindSplitPoint(Symbols) is
  halfProbability = (sum of all Symbols' probabilities) / 2    //Half of this subset's total probability
  totalProbability = 0
  splitPoint = 0
  for (i = 0 to length(Symbols) - 1) do
    totalProbability += Symbols[i].probability
    if (totalProbability <= halfProbability) then
      splitPoint = i
    else
      break
  return splitPoint
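To trace this on the full example set, halfProbability is 0.5. At i = 0 the running total is 0.44, which is at most 0.5, so splitPoint becomes 0; at i = 1 the total reaches 0.74 and the loop breaks, so the function returns 0 and the symbols split into {A} and {B, C, D}. On the {B, C, D} subset, halfProbability is 0.28, and B alone (0.30) already exceeds it, giving the split {B} versus {C, D}.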

After splitting the symbols, the code builds child nodes for the split groups using the function "CreateInternalNode()". In this function, an internal node is created to represent the split point. This internal node has two child nodes, which are obtained by recursively calling the "buildShannonFanoTree" function on the two subsets of symbols. The left child of the internal node corresponds to the subset of symbols from index 0 to the "splitPoint", while the right child corresponds to the subset from "splitPoint + 1" to the end. The probability of the internal node is set to the sum of the probabilities of its left and right children.

function CreateInternalNode(LeftSymbols, RightSymbols) is    //Create an internal node representing the split
  InternalNode = NewNode()
  InternalNode.left = buildShannonFanoTree(LeftSymbols)
  InternalNode.right = buildShannonFanoTree(RightSymbols)
  InternalNode.probability = InternalNode.left.probability + InternalNode.right.probability
  return InternalNode

Because "CreateInternalNode()" calls "buildShannonFanoTree()" on both the left and right subsets, each child node is treated as the root of its own subtree, and the tree is constructed recursively in a top-down manner. This process continues until all leaf nodes are created and the tree is complete.

//to build the Shannon-Fano Tree
function buildShannonFanoTree(Symbols) is
  if (length(Symbols) == 1) then
    return CreateLeafNode(Symbols[0])    //Base case: if there's only one symbol, create a leaf node
  else
    Sort Symbols by decreasing probability    //Sort symbols by their probabilities in descending order
    SplitPoint = FindSplitPoint(Symbols)    //Find the split point that divides the symbols into two subsets
    return CreateInternalNode(Symbols[0 to SplitPoint], Symbols[SplitPoint + 1 to end])    //Create a node for the split; this call recursively builds both subtrees
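For reference, here is a minimal Python sketch of the same procedure, assuming the symbols arrive as (character, probability) pairs. The Node class and the snake_case function names are our own choices for this illustration, not part of the pseudocode above.

class Node:
    def __init__(self, symbol=None, probability=0.0, left=None, right=None):
        self.symbol = symbol            # set only for leaf nodes
        self.probability = probability
        self.left = left                # subtree for the high-probability half
        self.right = right              # subtree for the low-probability half

def find_split_point(symbols):
    # Accumulate probabilities while the running total stays at or
    # below half of this subset's total, then stop
    half = sum(p for _, p in symbols) / 2
    total = 0.0
    split_point = 0
    for i, (_, p) in enumerate(symbols):
        total += p
        if total <= half:
            split_point = i
        else:
            break
    return split_point

def build_shannon_fano_tree(symbols):
    # Base case: a single symbol becomes a leaf node
    if len(symbols) == 1:
        symbol, probability = symbols[0]
        return Node(symbol=symbol, probability=probability)
    # Sort by descending probability, split, and recurse on both halves
    symbols = sorted(symbols, key=lambda pair: pair[1], reverse=True)
    split = find_split_point(symbols)
    left = build_shannon_fano_tree(symbols[:split + 1])
    right = build_shannon_fano_tree(symbols[split + 1:])
    return Node(probability=left.probability + right.probability,
                left=left, right=right)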

Create a Shannon-Fano Codebook

After the tree is made, you need to create the codebook.

//to create a Shannon-Fano table from the tree
function generateShannonFanoTable(root) is
  ShannonFanoTable = {}    // Initialize an empty dictionary for the Shannon-Fano table

  function traverse(node, code) is
    if (node is a leaf node) then
      ShannonFanoTable[node.symbol] = code    // Assign the code to the symbol
    else
      traverse(node.left, code + "0")    // Append '0' for the left child
      traverse(node.right, code + "1")   // Append '1' for the right child

  traverse(root, "")    // Start the traversal from the root with an empty code
  return ShannonFanoTable
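Continuing the Python sketch from above, the same traversal could be written as follows; generate_code_table is again a name of our own choosing:

def generate_code_table(node, code="", table=None):
    # Walk the tree, appending '0' for left turns and '1' for right turns
    if table is None:
        table = {}
    if node.symbol is not None:    # Leaf node: record the finished code
        table[node.symbol] = code
    else:
        generate_code_table(node.left, code + "0", table)
        generate_code_table(node.right, code + "1", table)
    return table

tree = build_shannon_fano_tree([("A", 0.44), ("B", 0.30), ("C", 0.20), ("D", 0.06)])
print(generate_code_table(tree))  # {'A': '0', 'B': '10', 'C': '110', 'D': '111'}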

Codebook Implementation

Finally, you can use the codebook to encode and decode the data.

//to implement Shannon-Fano coding using the table
function encodeWithShannonFanoTable(data, ShannonFanoTable) is
  encodedData = ""      // Initialize an empty string for the encoded data
  currentSymbol = ""    // Initialize a string to accumulate symbols

  for (each character in data) do
    currentSymbol += character
    if (currentSymbol exists in ShannonFanoTable) then
      encodedData += ShannonFanoTable[currentSymbol]    // Append the corresponding codeword
      currentSymbol = ""    // Reset the current symbol to an empty string
  return encodedData    // Return only after the whole input is processed
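To round off the Python sketch, here is one way encoding and decoding with a generated table might look. The encode and decode helpers are our own, and decode relies on the fact that Shannon-Fano codes are prefix-free:

def encode(data, table):
    # Replace every symbol with its codeword
    return "".join(table[symbol] for symbol in data)

def decode(bits, table):
    # Invert the table and accumulate bits until a chunk matches a codeword
    inverse = {code: symbol for symbol, code in table.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            decoded.append(inverse[current])
            current = ""
    return "".join(decoded)

codebook = {"A": "0", "B": "10", "C": "110", "D": "111"}
encoded = encode("ABCD", codebook)
print(encoded)                    # 010110111
print(decode(encoded, codebook))  # ABCD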

The Complexity of Shannon-Fano Coding

Let's look at the complexity of Shannon-Fano coding in terms of time and space.

Time Complexity: The overall time complexity of the Shannon-Fano algorithm is O(n log n), where 'n' stands for the number of symbols (characters) in the input data. The main steps contributing to this time complexity are:

  • Calculating Probabilities: Initially, you calculate the probabilities of each symbol in the input data. This step has a time complexity of O(n) because each symbol needs to be processed once.

  • Sorting Symbols: You then sort the symbols in order of descending probability. Sorting 'n' elements usually takes O(n log n) time with efficient sorting algorithms like quicksort or mergesort.

  • Partitioning and assigning codewords: In this step, you recursively partition the sorted list of symbols into two sets, aiming to balance their probabilities, and assign codewords based on these partitions. Each recursive call scans its subset once to find the split point, and as long as the splits stay reasonably balanced, the recursion is about log n levels deep, so this step takes O(n log n) time.

So, the overall time complexity of the Shannon-Fano algorithm is the sum of the above steps, which is O(n + n log n + n log n) = O(n log n).

Space Complexity: The algorithm needs to store the frequency table and the Shannon-Fano tree, both of which hold O(n) entries. So, the space complexity of the Shannon-Fano algorithm is O(n).

Applications of Shannon-Fano Coding

Shannon-Fano coding was one of the early methods developed for data compression and is less efficient than Huffman coding. While Shannon-Fano coding has its own uses, it has been largely replaced by Huffman coding and other more efficient compression techniques. Nonetheless, it is a valuable tool for educational purposes and can sometimes be implemented when specific constraints make it a convenient option. Some applications of Shannon-Fano coding include:

Text Compression: Shannon-Fano coding can be used to compress text data. While it is not as efficient as Huffman coding or more modern techniques like arithmetic coding, it can still provide reasonable compression ratios for certain types of text data.

Legacy Systems: In situations where older systems or hardware constraints make it tough to implement more complex compression algorithms, Shannon-Fano coding can be a simpler alternative.

Educational Purposes: Shannon-Fano coding is often taught in introductory courses on data compression and information theory because it helps students understand the principles of variable-length coding and the trade-offs between code length and symbol frequency.

Image and Audio Compression: Although not commonly used in modern image and audio compression standards like JPEG and MP3, Shannon-Fano coding can still be applied in specific cases or as a part of experimental compression techniques.

Low-Complexity Applications: Shannon-Fano coding has fairly low computational complexity compared to some other compression algorithms, so it might be chosen for applications where computational resources are limited.

Conclusion

To conclude, Shannon-Fano coding is a compression algorithm that allows us to compress data using the frequencies or probabilities of individual symbols. While it has mostly been replaced by more efficient algorithms, it serves its purpose as an educational tool for teaching how compression works and can be used in scenarios where a less complex algorithm is required.
