# How to win at “Deal or No Deal”

Publicity photo from the television show Let’s Make a Deal. Pictured are host Monty Hall and announcer Jay Stewart with contestants. (Photo credit: Wikipedia)

I have never actually watched “Deal or No Deal”, but I am assuming it is the same as the “Let’s Make A Deal” TV gameshow described in Ian McEwan’s spy romp Sweet Tooth – which I have just read in the last 18 hours of plane flights, mid-air turnarounds (bad weather at the destination rather than anything more serious) and airport lounges.

The novel is voiced by a Cambridge maths graduate (and low grade MI5 spook) who at one point explains to her lover the “paradox of choice” based on “Let’s Make A Deal”:

You have three boxes, one contains a prize, the two others are empty.

You pick one and then the dealer reveals one other to be empty. Do you stick with your original choice or pick the other box?

The naive answer is to stick with your original choice. After all, the odds of the prize being in the box you first picked – or any other box – were one in three, and have now seemingly increased to one in two.

But that is wrong.

In fact you double your chances of winning by picking the other box.

Think of it this way. If you picked an empty box as your first choice – and the odds are that you did as there are two of them and only one with a prize – then the dealer has no choice but to reveal the other empty box and so the third box must be the one with the prize.

So the odds of you picking the prize in this way are $\frac{2}{3} \times 1 = \frac{2}{3}$. The odds that you picked the prize the first time round are just $\frac{1}{3}$.

Of course, this is not a guaranteed way to win. But you would win two times in every three games if you followed this strategy.
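For the sceptical, the two-in-three claim is easy to check by brute force. Here is a quick simulation sketch in Python (the function and its parameters are my own, purely illustrative):

```python
import random

def play(switch, trials=100_000, seed=1):
    """Simulate the three-box game; return the fraction of wins."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)
        choice = rng.randrange(3)
        # The dealer opens a box that is neither the contestant's
        # choice nor the prize (taking the first such box is fine:
        # it does not change the odds).
        opened = next(b for b in range(3) if b != choice and b != prize)
        if switch:
            # Move to the only remaining unopened box.
            choice = next(b for b in range(3) if b != choice and b != opened)
        wins += choice == prize
    return wins / trials
```

Sticking wins about a third of the time; switching wins about two thirds.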

Update: Having now read the Wikipedia entry on “Deal or No Deal” I can see it is similar to “Let’s Make a Deal”, but not the same and, in fact, rather more complex. I am not sure the above analysis of the so-called “Monty Hall Problem” (named after the presenter of “Let’s Make A Deal”) would be much of a guide to a contestant!

# More on Huffman encoding

Huffman tree generated from the exact frequencies in the sentence “this is an example of a huffman tree”. (Photo credit: Wikipedia)

As well as the basic form of Huffman coding, Huffman’s algorithm can also be used to encode blocks of a symbol stream, delivering compression as well as coding (at the cost of individual symbols not being decipherable until a whole block has been decoded).

Once again, with thanks to Information and Coding Theory, here’s a brief explanation.

With Huffman coding we can directly code a binary symbol stream $S = \{s_1, s_2\}$ where $s_1 = 0, s_2 = 1$.

But with encoding we can produce a code for $S^n$ where $n$ is arbitrary, e.g. $S^3 = \{s_1s_1s_1, s_1s_1s_2, s_1s_2s_1, s_1s_2s_2, s_2s_1s_1, s_2s_1s_2, s_2s_2s_1, s_2s_2s_2 \}$

So let us assume (e.g. we are transmitting a very dark image via our fax machine) that the probability of $s_1$ is $\frac{2}{3}$ and of $s_2$ is $\frac{1}{3}$.

Then we have the following probability distributions:

$S^3$: $\frac{8}{27} , \frac{4}{27} , \frac{4}{27} , \frac{2}{27} , \frac{4}{27} , \frac{2}{27} , \frac{2}{27} , \frac{1}{27}$

$S^{\prime}$: $\frac{8}{27}, \frac{4}{27}, \frac{4}{27}, \frac{2}{27}, \frac{4}{27}, \frac{2}{27}, \frac{3}{27}$

$S^{\prime\prime}$: $\frac{8}{27}, \frac{4}{27}, \frac{4}{27}, \frac{4}{27}, \frac{3}{27}, \frac{4}{27}$

$S^{\prime\prime\prime}$: $\frac{8}{27}, \frac{4}{27}, \frac{4}{27}, \frac{4}{27}, \frac{7}{27}$

$S^{\prime\prime\prime\prime}$: $\frac{8}{27}, \frac{4}{27}, \frac{7}{27}, \frac{8}{27}$

$S^{\prime\prime\prime\prime\prime}$: $\frac{8}{27}, \frac{8}{27}, \frac{11}{27}$

$S^{\prime\prime\prime\prime\prime\prime}$: $\frac{11}{27}, \frac{16}{27}$

$S^{\prime\prime\prime\prime\prime\prime\prime}$: $1$

(At each step the two smallest probabilities are merged, and each row still sums to $\frac{27}{27}$.)

We can now compute the average word length in the Huffman code – by simply adding the probabilities of the newly created ‘joint’ symbols:

$= \frac{3}{27} + \frac{4}{27} + \frac{7}{27} + \frac{8}{27} + \frac{11}{27} + \frac{16}{27} + 1 = \frac{76}{27}$

Which obviously looks like the opposite of compression! But, of course, we are encoding three symbols at a time, so the actual word length per symbol becomes $\frac{76}{81} \approx 0.938$ – in other words somewhat less than 1, though (as it must be) not less than the source entropy $H(S) = \frac{2}{3}\log_2\frac{3}{2} + \frac{1}{3}\log_2 3 \approx 0.918$ bits per symbol, which no code can beat.
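A quick way to check the reduction is to let the machine do the merges – a sketch using Python’s heapq, working in exact integer units of $\frac{1}{27}$ (the function name is mine):

```python
import heapq
from fractions import Fraction

def huffman_average_length(weights):
    """Sum of the merged-node weights, which equals the average
    codeword length (in the same units as the weights).

    The weights are integer numerators over a common denominator,
    so all the arithmetic stays exact."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b          # probability of the new 'joint' symbol
        heapq.heappush(heap, a + b)
    return total

# The eight blocks of S^3 with p(s1) = 2/3, p(s2) = 1/3, in 27ths:
numerators = [8, 4, 4, 2, 4, 2, 2, 1]
bits_per_block = Fraction(huffman_average_length(numerators), 27)
bits_per_symbol = bits_per_block / 3
```

Dividing by three symbols per block gives the per-symbol figure directly.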

# Why isn’t the universe of infinite density?

Brian Greene at the World Science Festival launch press conference (Photo credit: Wikipedia)

Another speculation produced by Brian Greene’s The Hidden Reality: Parallel Universes and the Deep Laws of the Cosmos.

Imagine the universe was infinite (along the “quilted multiverse” pattern – namely that it stretched on and on and we could only see a part of it). Assuming the “cosmological principle” applied – that one bit of the universe looks like any other – that would imply there were an infinite number of hydrogen atoms out there.

So, why is the universe not of infinite density? Because surely Schrödinger’s equation means that there is a finite probability that electrons could be in any given region of space? (Doesn’t it?)

For any given electron the probability in “most” regions of space is zero in any measurable sense. But if there are an infinite number of electrons then the probability at a given point that there is an electron there is infinite, isn’t it?

OK, I have obviously got something wrong here because nobody is dismissing the “quilted multiverse” idea so simply – but could someone explain what it is I have got wrong?

Update: Is this because space-time is a continuum and the number of electrons a countable infinity?

# How fair is my dime, part 2

Continuing my Bayesian experiment with a dime coin. Part one is here.

So, let’s look at the posterior (outcome) probability density function after the next two tosses. First a head:

Now we can see the results near the extreme possibilities (ie H = 1, H = 0) being eliminated completely. Interestingly, although, in reality, the evidence (two heads, two tails) points towards a fair coin, ie H = 0.5, the pdf here peaks with H below 0.5, reflecting our earlier ‘estimate’ of a coin biased towards tails.

Now, another tail:

The peak is becoming narrower now, with some suggestion that there is a bias to tails, but of course with the coin having only been tossed 5 times, some such “bias” is unavoidable.

And again…

And again…

And the ninth toss:

Here the peak is becoming pronounced (and it also looks like the splines used to create the graph are showing signs of Runge’s phenomenon).

And the final, tenth, toss:

This does suggest a bias but the pdf is still quite broad – so there is lots of room for further correction. Indeed the thing that strikes me most is how little the suggested bias is after five heads in a row.
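The updating behind these graphs is compact enough to sketch in Python. I am assuming a flat prior here, under which the posterior after $h$ heads and $t$ tails is a Beta$(h+1, t+1)$ distribution; the counts below are illustrative, not my actual toss record:

```python
from math import comb

def posterior_pdf(H, heads, tails):
    """Posterior density for the coin's bias H (probability of heads),
    starting from a uniform prior: a Beta(heads+1, tails+1) density."""
    n = heads + tails
    # Normalising constant: 1 / B(heads+1, tails+1) = (n+1) * C(n, heads)
    return (n + 1) * comb(n, heads) * H**heads * (1 - H) ** tails

# Illustrative record of ten tosses: 4 heads, 6 tails
heads, tails = 4, 6
map_estimate = heads / (heads + tails)             # where the pdf peaks
mean_estimate = (heads + 1) / (heads + tails + 2)  # posterior mean
```

With these counts the peak sits at 0.4 but the posterior mean is pulled slightly back towards 0.5 – the same gentle behaviour as in the graphs, where a handful of tosses moves the estimate surprisingly little.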

# Taylor expansion of a probability density function

To further improve my understanding of probabilities I am reading Data Analysis: A Bayesian Tutorial – which I thoroughly recommend for anyone with basic mathematical knowledge but wanting to grasp the key concepts in this area.

But I have a query about one of the concepts it introduces: a Taylor expansion of a probability density function around its maximum to give a confidence interval (this stuff matters for the day job as well as the computer science).

We have a probability density function $P(x)$ (in this context this is a posterior pdf from Bayes‘s equation, but that is not important, as far as I can tell, to the maths).

The maximum of $P(x)$ is at $x_0$, so $\frac{dP}{dx}\big|_{x_0} = 0$.

To get a smoother function we look at $L = \log_e(P)$. As the logarithm is strictly monotonic, $L$ peaks where $P$ does, and so $\frac{dL}{dx}\big|_{x_0} = 0$ too.

So expanding $L$, according to the book gives this:

$L = L(x_0) + \frac{1}{2}\frac{d^2L}{dx^2}\bigg|_{x_0}(x - x_0)^2 + \ldots$ (we ignore higher terms).

I am not sure why the derivatives are the coefficients of the expansion – I can read up on that later – but given that, I understand why there is no $\frac{dL}{dx}$ term, as $\frac{dL}{dx}\big|_{x_0} = 0$.

OK … well this is the power of blogging as a means of clarifying thought: because just as I was about to ask my question – why isn’t the first term dependent on $(x - x_0)$ – I realised the answer. The first term is, in fact, the zeroth term of the expansion, and so the dependency on $(x - x_0)$ is in fact a dependency on $(x - x_0)^0 = 1$.
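For what it is worth, the point of stopping at the quadratic term (a standard step, which the book goes on to develop properly) is that exponentiating back turns the truncated expansion into a Gaussian approximation of the posterior:

```latex
P(x) \approx P(x_0)\exp\left[\frac{1}{2}\frac{d^2L}{dx^2}\bigg|_{x_0}(x - x_0)^2\right]
```

Since $\frac{d^2L}{dx^2}\big|_{x_0}$ is negative at a maximum, this is a normal distribution with standard deviation $\sigma = \left(-\frac{d^2L}{dx^2}\big|_{x_0}\right)^{-1/2}$ – which is what supplies the confidence interval.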

# The binomial distribution, part 2

Image via Wikipedia

(Part 1 is here – these notes are to assist me, rather than contain any real news!)

So, if the probability that an event will happen to a single entity in a unit of time is $p$ and the probability it will not happen is $1 - p = q$, what is the probability that, across a large number of entities, exactly $k$ events will take place?

Considering the Uranium-235 example again, let’s say there are a very large number, $N$, of atoms: how many will decay and emit an $\alpha$-particle in a given second?

Well, we know what the mean number of decays we should expect is: $Np$. But this is a random process and so it will not always deliver that result; instead there will be a distribution of results around this mean.

What does this distribution look like – ie., what is its probability mass function (pmf)?

For exactly $k$ decays let’s call this $f(k; N,p)$.

To show where the pmf comes from, let’s look at a much simpler example – tossing a coin four times and seeing if we get exactly one head, ie $k=1, N=4$.

One way we could get this is like this: HTTT. The probability of that happening is $p^kq^{N-k} = pq^3 = \frac{1}{16}$. But that is not the only way exactly one head could be delivered: obviously there are four ways: HTTT, THTT, TTHT, TTTH and so the probability of exactly one head is $\frac{4}{16}$.

(For two heads we have six ways of arranging the outcome: HHTT, HTHT, HTTH, THHT, THTH, TTHH and so the probability is $\frac{6}{16}$. For three heads the probability is the same as three tails (ie for one head), and the probabilities for all heads and all tails are both $\frac {1}{16}$. Cumulatively this covers all the possibilities and adds up to $1$.)

The generalisation of this gives us a pmf thus: $f(k; N,p) = _NC_k\ p^kq^{(N-k)}$, where $_NC_k$ is the binomial coefficient and can be spoken as “N choose k” and is the number of ways of distributing $k$ successes from $N$ trials.

$_NC_k = \frac{N!}{k!(N-k)!}$
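The coin example gives an easy way to sanity-check the formula – a sketch comparing the pmf against brute-force enumeration of all $2^4$ toss sequences:

```python
from itertools import product
from math import comb

def binomial_pmf(k, N, p):
    """f(k; N, p) = NCk * p^k * q^(N-k)"""
    return comb(N, k) * p**k * (1 - p) ** (N - k)

# Brute force: count the sequences of four fair-coin tosses
# containing exactly k heads.
N, p = 4, 0.5
counts = [0] * (N + 1)
for seq in product("HT", repeat=N):
    counts[seq.count("H")] += 1
# counts is [1, 4, 6, 4, 1], matching 1/16, 4/16, 6/16, 4/16, 1/16
```

The enumerated counts are exactly the binomial coefficients, and the pmf values sum to 1 as they must.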

There are approximately $10^{21}$ Uranium atoms in a gramme of the substance and calculating factorials of such large numbers directly requires an awful lot of computing power – my GNU calculator has been at it for some time now, maxing out one CPU on this box for the last 14 minutes, so I guess I am going to have to pass on my hopes of showing you some of the odds.
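There is a way round the factorials, though: with $N$ enormous and $p$ tiny, the binomial is extremely well approximated by a Poisson distribution with mean $\lambda = Np$, and working in log space avoids the huge numbers entirely. A sketch – the per-atom decay probability here is my own rough figure of order $\ln 2$ divided by the half-life in seconds, used purely for illustration:

```python
from math import exp, lgamma, log

def poisson_pmf(k, lam):
    """P(exactly k events) for a Poisson distribution with mean lam,
    computed in log space so that neither lam**k nor k! overflows."""
    return exp(k * log(lam) - lam - lgamma(k + 1))

# Roughly 10^21 U-235 atoms in a gramme; assume a per-second decay
# probability of about 3e-17 per atom (illustrative figure only).
N, p = 1e21, 3e-17
lam = N * p  # expected decays per second

p_mean = poisson_pmf(30_000, lam)  # probability of exactly the mean count
```

Even at the most likely count the probability of hitting it exactly is small – of order $1/\sqrt{2\pi\lambda}$ – because the distribution is spread over a range of width roughly $\sqrt{\lambda}$ around the mean.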

# The binomial distribution, part 1

Image via Wikipedia

I think there are now going to be a few posts here which essentially are about me rediscovering some A level maths probability theory and writing it down as an aid to memory.

All of this is related to whether the length of time pages are part of the working set is governed by a stochastic (probabilistic) process or a deterministic one. Why does it matter? Well, if the process were stochastic then in low memory situations a first-in, first-out approach, or a simple single-queue LRU approach, to page replacement might work well in comparison to the 2Q LRU approach currently in use. It is an idea that is worth a little exploring, anyway.

So, now the first maths aide-memoire – simple random/probabilistic processes are binomial: something happens or it does not. If the probability of it happening in a unit time period is $p$ then the probability it will not happen is $1 - p = q$. For instance this might be the probability that an atom of Uranium-235 shows $\alpha$-particle decay (the probability that one U-235 atom will decay is governed by its half-life of 700 million years, ie., $2.21\times10^{16}$ seconds, giving a probability, if my maths is correct, of a given individual atom decaying in any particular second of approximately $\ln 2 / 2.21\times10^{16} \approx 3.1\times10^{-17}$).

(In operating systems terms my thinking is that if the time pages spent in a working set were governed by similar processes then there would be a half-life for every page that is read in. If we discarded pages once they had been in the system for such a half-life, or better yet some multiple of it, then we could have a simpler page replacement system – we would not need to use a CLOCK algorithm, just record the time a page entered the system, stick it in a FIFO queue and discard it when the entry time was more than a half-life ago.

An even simpler case might be to just discard pages once the stored number rose above a certain ‘half-life’ limit. Crude, certainly, but maybe the simplicity might compensate for the lack of sophistication.

Such a system would not work very well for a general/desktop operating system – as the graph for the MySQL daemon referred to in the previous blog shows, even one application can display different distributions of working set sizes. But what if you had a specialist system where the OS only ran one application – then tuning might work: perhaps that could even apply to mass electronics devices, such as Android phones – after all the Android (Dalvik) VM is what is being run each time.)
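The idea is simple enough to sketch – the class name and the half-life figure below are mine, purely illustrative: pages go into a FIFO queue stamped with their entry time, and anything resident longer than the half-life is discarded, with no reference tracking at all:

```python
from collections import deque

class HalfLifeFifo:
    """Toy page-discard policy: evict any page that has been
    resident for longer than a fixed half-life. Illustrative
    only - no CLOCK bits, no reference tracking."""

    def __init__(self, half_life):
        self.half_life = half_life
        self.queue = deque()  # (entry_time, page) in arrival order

    def page_in(self, page, now):
        self.queue.append((now, page))

    def evict_expired(self, now):
        """Drop every page older than the half-life; return them.

        The queue is in arrival order, so we only ever need to
        inspect its head - the appeal of the scheme."""
        evicted = []
        while self.queue and now - self.queue[0][0] > self.half_life:
            evicted.append(self.queue.popleft()[1])
        return evicted
```

Because eviction only ever looks at the head of the queue, the bookkeeping is constant-time per page – which is the simplicity being traded against the sophistication of 2Q LRU.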