Huffman Coding | DAA

Huffman coding

Huffman coding is an algorithm for the lossless compression of files based on the frequency of occurrence of a symbol in the file that is being compressed. In any file, certain characters are used more than others. Using binary representation, the number of bits required to represent each character depends upon the number of characters that have to be represented. Using one bit we
can represent two characters, i.e., 0 represents the first character and l represents the second character. Using two bits we can represent four characters, and so on. Unlike ASCII code, which is a fixed-length code using seven bits per character, Huffman compression is a variable-length coding system that assigns smaller codes for more frequently used characters and larger codes
for less frequently used characters in order to reduce the size of files being compressed and transferred.

For example, in a file with the following data: ‘XXXXXXYYYYZZ. The frequency of "X" is 6, the frequency of "Y" is 4, and the frequency of "Z" is 2. If each character is represented using a fixed-length code of two bits, then the number of bits required to store this file would be 24, i.e., (2 x 6) + (2): 4) + (2): 2] = 24. If the above data were compressed using Huffman compression, the more frequently occurring numbers would be represented by smaller bits, such as: X by the code 0 (1 bit),Y by the code 10 (2 bits) and Z by the code 11 (2 bits}, the size of the file becomes 18, i.e., (h 6) + [2 x 4] + (2 x 2) = 18. In this example, more frequently occurring characters are
assigned smaller codes, resulting in a smaller number of bits in the final compressed file.
Huffman compression was named after its discoverer, David Huffman.

To generate Huffman codes, we should create a binary tree of nodes. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has up to n leaf nodes and  n — l. internal nodes. A Huffman tree that omits unused symbols produces the most optimal code lengths. The process essentially begins with the leaf nodes containing the probabilities of the symbol they represent, then a new node whose children are the 2 nodes with smallest probability is created, such that the new node's probability is equal to the sum of the children's probability. With the previous 2 nodes merged into one node and with the new 11ode being now considered, the procedure is repeated until only one node remains, the Huffman tree. The simplest construction algorithm uses a priority queue where the 11ode with lowest probability is given highest priority:


The following example bases on a data source using a set of five different symbols The
symbol's frequencies are:
Symbol                                               Frequency
A                                                          24
B                                                          12
C                                                          10
D                                                         5
E                                                          8
----> total 186 bit (with 3 bit per code word)

Huffman Coding | DAA

Huffman Coding | DAA

Huffman Coding | DAA


A greedy algorithm can construct Huffman code that is optimal prefix codes. A tree corresponding to optimal codes is constructed in a bottom up manner starting from the |C| leaves and |C|-1 merging operations. Use priority queue Q to keep nodes ordered by frequency. Here the priority queue we considered is binary heap.

n = |C|; Q = C1
For(i=1; i<=n-1; i++)
2 = Allocate-Node();
x = Extract-Min(Q);
y = Extract-Min(Q);
left(z) = x; right(z) = y;
f(z) = f(x) + (y);


We can use BuildHeap(C] to create a priority queue that takes O(11) time. Inside the for loop the expensive operations can be done in Oflogn) time. Since operations inside for loop executes for n-1 time total running time of Huffrnan algorithm is O(11logn).

Post a Comment