McMillian’s inequality

Kraft’s inequality states that: There is an instantaneous r-ary code C with word lengths l_1...l_q if and only if :sigma

\sum_{i=1}^q \frac{1}{r^{l_i}} \leq 1

But what of codes that are merely uniquely decodable, and not necessarily instantaneous?

There number is given to us by McMillian’s inequality and the surprising thing is that, despite the looser bound this is:

There is a uniquely decodable r-ary code C with word lengths l_1...l_q if and only if :

\sum_{i=1}^q \frac{1}{r^{l_i}} \leq 1

And here’s a proof (again drawn from Information and Coding Theory):

We have a code C with word lengths l_1 ... l_q and we define:

K = \sum_{i=1}^q \frac{1}{r^{l_i}}.

Additionally we say that x = max(l_1 ... l_q) and y = min(l_1 ... l_q)

Let us now consider K^n or (\sum_{i=1}^q \frac{1}{r^{l_i}})^n where n \geq 1.

This is \sum\sum...\sum \frac{1}{r^{l_i}} and is the sum of a series of products of the form:


(I found this a bit difficult to follow at first, but imagine we had (\frac{1}{A} + \frac{1}{B})(\frac{1}{C} + \frac{1}{D})(\frac{1}{E} + \frac{1}{F}), then we would could expand this as:

(\frac{1}{A}\times\frac{1}{C} + \frac{1}{A}\times\frac{1}{D} + \frac{1}{B}\times\frac{1}{C} + \frac{1}{B}\times\frac{1}{D})(\frac{1}{E} + \frac{1}{F})

\frac{1}{A}\times\frac{1}{C}\times\frac{1}{E} +\frac{1}{A}\times\frac{1}{C}\times\frac{1}{F} + \frac{1}{A}\times\frac{1}{D}\times\frac{1}{E} +\frac{1}{A}\times\frac{1}{D}\times\frac{1}{F} + \frac{1}{B}\times\frac{1}{C} \times\frac{1}{E}+ \frac{1}{B}\times\frac{1}{C}\times\frac{1}{F}+\frac{1}{B}\times\frac{1}{D}\times\frac{1}{E}+\frac{1}{B}\times\frac{1}{D}\times\frac{1}{F}

And we can also write:

\frac{1}{r^{l_1}}\times\frac{1}{r^{l_2}}...\times\frac{1}{r^{l_q}} = \frac{1}{r^j} where j = l_1 + l_2 +...+l_q.

Now yn \leq j \leq xn since y \leq l_i \leq x for all i .

Now, here’s the fancy bit (at least for me!)…

This means we can rewrite K^n as \sum^{xn}_{j=yn}\frac{N_{j,n}}{r^j}

The upper and lower bounds for j follow from the inequalities above (we could sum over a wider range but then the coefficient N_{j,n} would be zero beyond these bounds. And what is this coefficient?

N_{j,n} is the number of ways of writing any given j (or, alternatively, the number of different code word sequences whose total lengths sum to j ).

Let’s take a simple example again – we have a binary code C with code words 0,01,011 (plainly not instantaneous as 01 is a prefix of 011 ).

Here K = \sum_{i=1}^q \frac{1}{r^{l_i}} = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} = \frac{7}{8}

But what if we look at K^2?

Then we have:

\frac{1}{2}\times\frac{1}{2} + \frac{1}{2}\times\frac{1}{4} + \frac{1}{2}\times\frac{1}{8} +\frac{1}{4}\times\frac{1}{2} + \frac{1}{4}\times\frac{1}{4} + \frac{1}{4}\times\frac{1}{8} + \frac{1}{8}\times\frac{1}{2} + \frac{1}{8}\times\frac{1}{4} + \frac{1}{8}\times\frac{1}{8}

Writing this out in the form above we then have – and you can see the coefficients:

\frac{1}{2^2} + \frac{2}{2^3} + \frac{3}{2^4} + \frac{2}{2^5} + \frac{1}{2^6}

The coefficient N_{j,n} can never be greater than r^j (as it can never be greater than the number of possible ways to arrange the code), so \frac{N_{j,n}}{r^j} \leq 1 .

We also know that K^n is a sum of xn + 1 - yn terms, so:

K^n \leq(x - y)n + 1.

As n increases, if K > 1 then the left hand side increases exponentially while the right (where x and y are independent of n) increases only linearly, hence K \leq 1.

I am sure this will be one of the less read postings on the blog (congratulations if you got this far), but if it helps just one person get a better undestanding of McMillan’s inequality then it has been worth it!


Entropy and randomness in communications, a quick introduction

Last year I remarked on how Roger Penrose‘s discussion of entropy in Cycles of Time did more to explain the physical nature of the concept (to me, at least) than many years of formal education.

And now, another book, Information and Coding Theory has given me some real insight into how computers deliver pseudo-randomness and why they look for sources of entropy.

The maths of all this are deceptively simple, but are of fundamental importance in information theory.

First of all, we can think of what sort of information events convey. There is a famous aphorism about the habits of bears, the direction from which the Sun rises and the title of the head of the Catholic Church. No newspaper or TV bulletin would ever bother to tell us, on a daily basis, that the Sun has indeed risen from the East. In effect that event conveys no information because it is completely expected. Toss a coin, though, and we do not know the outcome in advance – the result is of interest because the outcome is not known in advance.

Hence we can posit an information function I(e) which tells us how much information we can derive from event e , or, if we think of symbols in a data stream we can call this I(s_i) – the amount of information we can derive if the next symbol we see is s_i .

From our earlier discussion we can see that the result of this function is dependent on the probability of the event occurring or the symbol being received – i.e. if we knew the Sun was always going to rise in the East we can derive no information from that very thing happening. Hence:

I(s_i) is a function of  p_i where p_i is the probability that the next symbol to be seen is i . As we will only consider ‘memoryless’ events we will also insist that p_i is independent of time or any previously seen symbol. Hence:

I(s_is_j) (i.e. the information we can gain from seeing the symbol s_i followed by the symbol s_j is I(s_i) + I(s_j)

Taking these points together we can define I(s_i) = log\frac{1}{p_i} = -log(p_i)

We can see that I tends to infinity if the probability of an event tends to zero (think of the Sun rising in the West!) and tends to 0 if nothing unexpected happens.

The entropy of a system (communication) H(S) then becomes:

H(S) = \sum\limits_i^q p_iI(s_i) = \sum\limits_i^q p_ilog\frac{1}{p_i} = -\sum\limits_i^q p_i log(p_i)