## Representing one terabit of information

I picked up my copy of Roger Penrose’s Cycles of Time: An Extraordinary New View of the Universe and reread his opening chapter on the nature of entropy (if you struggle with this concept as a student then I recommend the book for this part alone – his exposition is brilliant).

My next thought was to see whether I could model his comments about a layer of red paint sitting on top of blue paint – namely, could I write a program showing how random molecular movements would eventually turn the whole thing purple? It seemed like a nice little programming project to while away a few hours with – and the colours would be lovely too.

Following Penrose’s outline we would look at a ‘box’ (as modelled below) containing 1000 molecules. The box would only be purple if 499, 500 or 501 molecules were red (or blue), otherwise it would be shaded blue or red. And we would have 1000 of these boxes along each axis – in other words $10^9$ boxes and $10^{12}$ molecules – with $5 \times 10^8$ boxes starting as blue (or red).

Then on each iteration we could move, say $10^6$ molecules and see what happens.
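A drastically scaled-down sketch of this idea (in Python, with made-up sizes – a one-dimensional line of 20 boxes of 100 molecules each, and a correspondingly looser definition of ‘purple’ than Penrose’s 499–501 rule) might look like:

```python
import random

random.seed(42)

N_BOXES = 20    # boxes along a single axis (Penrose has 1000 per axis, in 3D)
PER_BOX = 100   # molecules per box (Penrose has 1000)

# red molecule count per box: left half all red, right half all blue
red = [PER_BOX] * (N_BOXES // 2) + [0] * (N_BOXES // 2)

def purple_fraction(red):
    # scaled-down analogue of the 499-501 rule: within 10% of an even mix
    return sum(1 for r in red if 40 <= r <= 60) / len(red)

def step(red, moves=500):
    """Swap one randomly drawn molecule between a random pair of adjacent boxes."""
    for _ in range(moves):
        i = random.randrange(len(red) - 1)
        # whether the molecule drawn from each box happens to be red
        a_red = random.random() < red[i] / PER_BOX
        b_red = random.random() < red[i + 1] / PER_BOX
        if a_red and not b_red:
            red[i] -= 1
            red[i + 1] += 1
        elif b_red and not a_red:
            red[i] += 1
            red[i + 1] -= 1

print(purple_fraction(red))  # 0.0 before mixing
for _ in range(500):
    step(red)
print(purple_fraction(red))  # most boxes now fall in the purple band
```

Even at this toy scale the behaviour Penrose describes shows up: the purple fraction climbs and then just fluctuates around its equilibrium value.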

But while the maths of this is simple, the storage problem is not – even if we just had a bit per molecule it is one terabit of data to store and page in and out on every iteration.

I cannot see how compression would help – initially the representation would be highly compressible as all data would be a long series of 1s followed by a long series of 0s. But as we drove through the iterations and the entropy increased that would break down – that is the whole point, after all.
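To make that concrete, here is a quick sketch (Python, with an off-the-shelf deflate compressor standing in for whatever scheme one might actually use) comparing the compressed size of an unmixed state with a fully mixed one:

```python
import random
import zlib

random.seed(0)

N = 125_000  # bytes, i.e. a million bit-sized 'molecules'

# unmixed state: all the red bits, then all the blue bits -- two long runs
unmixed = b"\xff" * (N // 2) + b"\x00" * (N // 2)

# fully mixed state: the bits are effectively random
mixed = bytes(random.getrandbits(8) for _ in range(N))

print(len(zlib.compress(unmixed)))  # tiny -- the long runs compress away
print(len(zlib.compress(mixed)))    # close to N -- nothing left to squeeze
```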

I could go for a much smaller simulation but that misses demonstrating Penrose’s point – that even with the highly constrained categorisation of what constitutes purple, the mixture turns purple and stays that way.

So, algorithm gurus – tell me how to solve this one?

Update: Redrew the box to reflect the rules of geometry!

## Similarity, difference and compression

I am in York this week, being a student and preparing for the literature review seminar I am due to give on Friday – the first staging post on the PhD route, at which I have to persuade the department I have been serious about reading around my subject.

Today I went to a departmental seminar, presented by Professor Ulrike Hahne of Birkbeck College (and formerly of Cardiff University). She spoke on the nature of “similarity” – as is the nature of these things, it was a quick rattle through a complex subject, and if the summary that follows is inaccurate then I am to blame and not Professor Hahne.

Professor Hahne is a psychologist but she has worked with computer scientists and so her seminar did cut into computer science issues. She began by stating that it was fair to say that all things are equally the same (or different) – in the sense that one can find an infinite number of ways in which two things can be categorised as the same (object A weighs less than 1kg, object B weighs less than 1kg, they both weigh less than 2kg and so on). I am not sure I accept this argument in its entirety – in what way is an object different from itself? But that’s a side issue, because her real point was that similarity and difference are products of human cognition, which I can broadly accept.

So how do we measure similarity and difference? Well, the “simplest” way is to measure the “distance” between two stimuli in the standard geometric way – this is how we measure the difference between colours in a colour space (about which more later), i.e. the root of the sum of the squares of the differences along each dimension. This concept has even been developed into the “universal law of generalisation”. The idea has achieved much but has major deficiencies.
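In code, that geometric measure is just the familiar Euclidean distance – for example, treating colours as points in a naive RGB space:

```python
import math

def colour_distance(c1, c2):
    """Euclidean distance between two points in a colour space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

# pure red vs pure blue in RGB: far apart
print(colour_distance((255, 0, 0), (0, 0, 255)))   # 255 * sqrt(2), about 360.6
# two near-identical reds: close together
print(colour_distance((255, 0, 0), (250, 10, 5)))  # about 12.2
```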

Professor Hahne outlined some of the alternatives before describing her interest in (and defence of) the idea that the key to difference is the number of mental transformations required to change one thing into another – for instance, how different is a square from a triangle? Two transformations are required: first to think of the triangle, and then to replace the square with the triangle, and so on.

In a more sophisticated way, the issue is the Kolmogorov complexity of the transformation. The shorter the program we can write to make the transformation, the more similar the objects are.

This, it strikes me, has an important application in computer science, or at least it could have. To go back to the colour space issue: when I wrote the Perl module Image::Pngslimmer I had to write a lot of code that computed geometrical distances between colour points – a task that Perl is very poor at, as maths is slow there. This was to implement the so-called “median cut” algorithm (pleased to say that the Wikipedia article on the median cut algorithm cites my code as an example, and it wasn’t even me who edited it to that, at least as far as I can remember!), where colours are quantised to those at the centre of “median cut boxes” in the colour space. Perhaps there is a much simpler way to make this transformation and so compress the PNG more quickly?
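For illustration, here is a much-simplified sketch of the median cut idea (not the Image::Pngslimmer code – and for brevity it averages each box rather than taking the true box centre):

```python
def median_cut(pixels, depth=4):
    """Simplified median cut: recursively split the pixel list along the
    channel with the greatest range, then reduce each box to one colour.
    Returns the palette (a list of representative colours)."""
    if depth == 0 or len(pixels) <= 1:
        # collapse this box to a single representative colour (the mean)
        n = len(pixels)
        return [tuple(sum(p[i] for p in pixels) // n for i in range(3))]
    # find the channel (R, G or B) with the widest spread
    spreads = [max(p[i] for p in pixels) - min(p[i] for p in pixels)
               for i in range(3)]
    channel = spreads.index(max(spreads))
    # sort along that channel and split at the median
    pixels = sorted(pixels, key=lambda p: p[channel])
    mid = len(pixels) // 2
    return median_cut(pixels[:mid], depth - 1) + median_cut(pixels[mid:], depth - 1)

# toy 'image': a run of reds and a run of blues, quantised to 2^2 = 4 colours
pixels = [(200 + i, i, i) for i in range(20)] + [(i, i, 200 + i) for i in range(20)]
palette = median_cut(pixels, depth=2)
print(palette)
```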

I asked Professor Hahne about this and she confirmed that her collaborator Professor Nick Chater of Warwick University is interested in this very question. When I have got this week out of the way I may have a look at his published papers and see if there is anything interesting there.

## More on Huffman encoding

As well as the basic form of Huffman coding, Huffman’s algorithm can also be used to encode blocks of symbols from a stream, delivering compression as well as coding (at the cost of codewords no longer being immediately decipherable).

Once again, with thanks to Information and Coding Theory, here’s a brief explanation.

With Huffman coding we can directly code a binary symbol stream $S = \{s_1, s_2\}$ where $s_1 = 0, s_2 = 1$.

But with encoding we can produce a code for $S^n$ where $n$ is arbitrary, e.g. $S^3 = \{s_1s_1s_1, s_1s_1s_2, s_1s_2s_1, s_1s_2s_2, s_2s_1s_1, s_2s_1s_2, s_2s_2s_1, s_2s_2s_2 \}$

So let us assume (e.g. we are transmitting a very dark image via our fax machine) that the probability of $s_1$ is $\frac{2}{3}$ and that of $s_2$ is $\frac{1}{3}$.

Then, sorting the eight blocks of $S^3$ by probability and applying Huffman’s reductions, we have the following distributions:

$S^3$: $\frac{8}{27}, \frac{4}{27}, \frac{4}{27}, \frac{4}{27}, \frac{2}{27}, \frac{2}{27}, \frac{2}{27}, \frac{1}{27}$

$S^{\prime}$: $\frac{8}{27}, \frac{4}{27}, \frac{4}{27}, \frac{4}{27}, \frac{3}{27}, \frac{2}{27}, \frac{2}{27}$

$S^{\prime\prime}$: $\frac{8}{27}, \frac{4}{27}, \frac{4}{27}, \frac{4}{27}, \frac{4}{27}, \frac{3}{27}$

$S^{\prime\prime\prime}$: $\frac{8}{27}, \frac{7}{27}, \frac{4}{27}, \frac{4}{27}, \frac{4}{27}$

$S^{\prime\prime\prime\prime}$: $\frac{8}{27}, \frac{8}{27}, \frac{7}{27}, \frac{4}{27}$

$S^{\prime\prime\prime\prime\prime}$: $\frac{11}{27}, \frac{8}{27}, \frac{8}{27}$

$S^{\prime\prime\prime\prime\prime\prime}$: $\frac{16}{27}, \frac{11}{27}$

$S^{\prime\prime\prime\prime\prime\prime\prime}$: $1$

We can now compute the average word length in the Huffman code – by simply adding the probabilities of the newly created ‘joint’ symbols: $\frac{3}{27} + \frac{4}{27} + \frac{7}{27} + \frac{8}{27} + \frac{11}{27} + \frac{16}{27} + 1 = \frac{76}{27}$

Which obviously looks like the opposite of compression! But, of course, we are encoding three symbols at a time, so the actual word length per symbol becomes $\frac{76}{81} \approx 0.938$ – in other words somewhat less than 1, and, reassuringly, just above the source entropy of roughly $0.918$ bits per symbol, which no code can beat.
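The reduction can be checked mechanically. Here is a small Python sketch using a heap, relying on the fact that the average codeword length of a Huffman code equals the sum of all the merged (‘joint’) probabilities:

```python
import heapq
from fractions import Fraction

# probabilities of the eight three-symbol blocks, in units of 1/27
weights = [8, 4, 4, 2, 4, 2, 2, 1]

# Huffman's reduction: repeatedly merge the two least probable symbols;
# summing every merged weight gives the total expected codeword length
heap = list(weights)
heapq.heapify(heap)
total = 0
while len(heap) > 1:
    a = heapq.heappop(heap)
    b = heapq.heappop(heap)
    total += a + b
    heapq.heappush(heap, a + b)

avg_per_block = Fraction(total, 27)
print(avg_per_block)       # average bits per three-symbol block
print(avg_per_block / 3)   # average bits per original symbol
```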

## A blast from the past: Huffman coding

Britain used to lead the world in the deployment of a cutting edge technology: the fax machine.

Back when I graduated from university in the late 1980s, fax machines were the technology every business had to have, and Britain had more of them per head of population than anywhere else.

Today I work in an office where we have decided having one is no longer necessary.

Why all this? Well, one thing I remember from the era of the fax machine is the frequent reference to “Huffman coding”. Indeed, back in the days when free software graphics libraries were in short supply, I investigated whether I could convert faxes into other graphics formats and kept coming across the term.

Now, thanks to Information and Coding Theory, I can fully appreciate the beauty of David A. Huffman’s coding scheme.

Here’s my attempt to explain it…

We have a symbol sequence $S = s_1, s_2, ... ,s_{n - 1}, s_n$; these have probabilities of occurring $p_1, p_2, ... , p_n$, where $\sum\limits_{i = 1}^n p_i = 1$. We rearrange the symbols so that $p_1 \geq p_2 \geq ... \geq p_n$.

Now take the two least probable symbols (or, if there are more than two, any two of the least probable symbols) and combine them, denoting the combination by one of the symbols picked, to produce a new symbol sequence $S^\prime$ with $(n - 1)$ members, e.g.: $S^\prime = s_1, s_2, ... s_{n - 1}$ with probabilities $p_1, p_2, ... (p_{n - 1} + p_n)$.

This process can then be repeated until we have simply one symbol with probability 1.

To encode this we can now ‘go backwards’ to produce an optimal code.

We start by assigning the empty symbol $\varepsilon$ to the final combined symbol; then, passing back up the tree, we expand the combined symbols, adding a ‘1’ to the ‘left’-hand symbol and a ‘0’ to the right (or vice versa): $\varepsilon 1 = 1$, $\varepsilon 0 = 0$.

Here’s a worked example: we have the (binary) symbols $S$: 0, 1, 10, 11, which have the probabilities of 0.5, 0.25, 0.125 and 0.125.

So $S^\prime$ = 0, 1, 10 with probabilities 0.5, 0.25, 0.25, $S^{\prime\prime}$ = 0, 1 with probabilities 0.5, 0.5 and $S^{\prime\prime\prime}$ = 0 with probability 1.

Now, to encode:

$C^{\prime\prime\prime}$: $0 = \varepsilon$

$C^{\prime\prime}$: $0 = 1$, $1 = 0$

$C^\prime$: $0 = 1$, $1 = 01$, $10 = 00$

$C$: $0 = 1$, $1 = 01$, $10 = 001$, $11 = 000$

This code has an average length of $\frac{1}{2} + \frac{2}{4} + \frac{3}{8} + \frac{3}{8} = \frac{7}{4}$
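As a sketch of the whole procedure, here is a small Python implementation of the algorithm above, run on the worked example. The exact bit patterns depend on which branch gets the ‘1’, so they may differ from the hand-derived code, but the codeword lengths – and hence the $\frac{7}{4}$ average – should match:

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a Huffman code; returns a dict {symbol: codeword}."""
    tick = count()  # tie-breaker so equal probabilities never compare symbols
    heap = [(p, next(tick), (sym,)) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in probabilities}
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        # 'going backwards up the tree': every symbol in each combined
        # group gains one more leading bit as the groups are merged
        for sym in group1:
            codes[sym] = "0" + codes[sym]
        for sym in group2:
            codes[sym] = "1" + codes[sym]
        heapq.heappush(heap, (p1 + p2, next(tick), group1 + group2))
    return codes

# the worked example: four symbols named '0', '1', '10' and '11'
probs = {"0": 0.5, "1": 0.25, "10": 0.125, "11": 0.125}
codes = huffman_code(probs)
print(codes)
avg = sum(p * len(codes[s]) for s, p in probs.items())
print(avg)  # 1.75, i.e. 7/4 bits per symbol
```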

(In fact pure Huffman coding is not generally used – as other forms e.g. adaptive Huffman, offer better performance, though the underlying principles are the same.)