More on Huffman encoding

Huffman tree generated from the exact frequencies in the sentence “this is an example of a huffman tree”. This is a remake from scratch of http://commons.wikimedia.org/wiki/Image:Huffman_tree.svg using Cairo. (Photo credit: Wikipedia)

Beyond its basic form, Huffman’s algorithm can also be applied to blocks of symbols from a stream, delivering better compression as well as coding (at the cost that individual symbols are no longer immediately decipherable – the decoder has to wait for a complete block).

Once again, with thanks to Information and Coding Theory, here’s a brief explanation.

With Huffman coding we can directly code a binary source S = \{s_1, s_2\}, assigning the codewords s_1 \mapsto 0, s_2 \mapsto 1 .

But by encoding blocks of n symbols at a time we can instead produce a code for the extension S^n, where n is arbitrary, e.g. S^3 = \{s_1s_1s_1, s_1s_1s_2, s_1s_2s_1, s_1s_2s_2, s_2s_1s_1, s_2s_1s_2, s_2s_2s_1, s_2s_2s_2 \}

So let us assume (e.g. we are transmitting a very dark image via our fax machine) that the probability of s_1 is \frac{2}{3} and that of s_2 is \frac{1}{3}.
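Since successive symbols are assumed independent, each block’s probability is just the product of its symbols’ probabilities. A quick Python sketch (the variable names are my own, not from the book) to enumerate the distribution of S^3:

```python
from fractions import Fraction
from itertools import product

p = {"s1": Fraction(2, 3), "s2": Fraction(1, 3)}

# Each three-symbol block has probability equal to the product of its
# symbols' probabilities, since successive symbols are independent.
blocks = {"".join(w): p[w[0]] * p[w[1]] * p[w[2]]
          for w in product(p, repeat=3)}

for word, prob in blocks.items():
    print(word, prob)  # s1s1s1 has probability 8/27, s2s2s2 has 1/27
```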

Then we have the following probability distribution for S^3, and the successive Huffman reductions (at each step the two smallest probabilities are merged into a new ‘joint’ symbol):

S^3 :  \frac{8}{27} , \frac{4}{27} , \frac{4}{27} , \frac{2}{27} , \frac{4}{27} , \frac{2}{27} , \frac{2}{27} , \frac{1}{27}

S^{\prime}: \frac{8}{27}, \frac{4}{27} , \frac{4}{27} , \frac{2}{27} , \frac{4}{27} , \frac{2}{27} , \frac{3}{27}

S^{\prime\prime}: \frac{8}{27}, \frac{4}{27}, \frac{4}{27}, \frac{4}{27}, \frac{3}{27}, \frac{4}{27}

S^{\prime\prime\prime}: \frac{8}{27}, \frac{4}{27}, \frac{4}{27}, \frac{4}{27}, \frac{7}{27}

S^{\prime\prime\prime\prime}: \frac{8}{27}, \frac{4}{27}, \frac{7}{27}, \frac{8}{27}

S^{\prime\prime\prime\prime\prime}: \frac{8}{27}, \frac{8}{27}, \frac{11}{27}

S^{\prime\prime\prime\prime\prime\prime}: \frac{11}{27}, \frac{16}{27}

S^{\prime\prime\prime\prime\prime\prime\prime}: 1

We can now compute the average word length in the Huffman code – by simply adding the probabilities of the newly created ‘joint’ symbols:

\frac{3}{27} + \frac{4}{27} + \frac{7}{27} + \frac{8}{27} + \frac{11}{27} + \frac{16}{27} + 1 = \frac{76}{27}

Which obviously looks like the opposite of compression! But, of course, we are encoding three symbols at a time, so the average word length per symbol becomes \frac{76}{81} \approx 0.938, in other words somewhat less than 1. Reassuringly, it is still above the source entropy H(S) = \frac{2}{3}\log_2\frac{3}{2} + \frac{1}{3}\log_2 3 \approx 0.918, as Shannon’s noiseless coding theorem requires.
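As a sanity check, we can run Huffman’s algorithm over the S^3 probabilities with a small heap-based Python sketch (my own code, not from the book) and read off the average word length, which is exactly the sum of the merged ‘joint’ probabilities:

```python
import heapq
from fractions import Fraction

# The eight block probabilities of S^3 when p(s_1) = 2/3 and p(s_2) = 1/3
probs = ([Fraction(8, 27)] + [Fraction(4, 27)] * 3
         + [Fraction(2, 27)] * 3 + [Fraction(1, 27)])

def huffman_average_length(ps):
    """Average codeword length of a Huffman code for distribution ps.

    Each merge of the two smallest probabilities adds one bit to every
    codeword beneath it, so the average length equals the sum of the
    merged ('joint') probabilities created along the way.
    """
    heap = list(ps)
    heapq.heapify(heap)
    total = Fraction(0)
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, a + b)
        total += a + b
    return total

per_block = huffman_average_length(probs)
print(per_block, per_block / 3)  # average bits per block, then per source symbol
```

Dividing the per-block average by three gives the per-symbol word length; any valid tie-breaking among equal probabilities in the merges yields the same total.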