# The long tail of a Zipf distribution

(Figure: the statistical meaning of the long tail. Photo credit: Wikipedia)

Back in the days of the first dot-com bubble, the talk was of the “long tail”: how web retailers could make a lot of money by selling small quantities of a very large number of different things.

True enough, one of the great survivors of those days – Amazon – does make money from the “long tail”, and no amount of protesting on behalf of small local bookshops changes the fact that you can, more or less, buy any book in print (and many that aren’t) from Amazon and have it delivered to your door.

One way of describing these long tails is the “Zipf distribution”, which, in its purest form, states that the frequency of an item (in the original formulation, a word in a language or corpus) is inversely proportional to its rank in the list. In other words:

$f \propto \frac{1}{R}$

So the frequency of the second most frequently occurring word or thing would be half that of the most frequent, the third one third, and so on.

We can generalise this into:

$f = \frac{k}{R^n}$ where $k$ and $n$ are some constants.

For instance, it is found that, for cities in many countries, the population $p$ varies as $R^{-1.07}$.

The important thing about this distribution, if you are an internet sales director, is that the area under the graph can be huge – so that while enormous numbers of sales can be found at the top end – think of the phenomenon of Fifty Shades of Grey earlier in 2012 – there can be plenty of money made selling The Annotated Turing too.

As a little thought experiment – assuming $n=1.07$ for Amazon/book sales – this means that Fifty Shades of Grey, now 34th in Amazon’s best seller list, is probably selling something like 11,500 copies a week (compared to 500,000 at its peak), while The Annotated Turing, ranked at 35,681, is selling maybe… 6 or 7. (I am guessing, though, that $n$ is probably greater than 1.07 for books.)

(I was in Istanbul airport recently and if the bookshops there are any guide, reports of that country’s soft Islamisation are over-cooked: Fifty Shades was piled in every corner.)

# My (computer science) book of the year


It wasn’t published in 2011 but The Annotated Turing: A Guided Tour Through Alan Turing’s Historic Paper on Computability and the Turing Machine is without doubt my computer science book of the year.

If books can change your life then The Annotated Turing changed mine – because it showed me just how strong the link between maths and computer science is and how fundamental maths is to an understanding of the nature of reality and truth. The world, in my eyes, has not been the same since I read this book last January.

If you are a computer science student then you must read this book!

And finally, Happy New Year to all.

# An idea I saw on normblog


Norman Geras is a great man. He’s a social, not a computer, scientist, and this year I have been mainly reading computer books. Still, here’s an idea I have pinched off his website (he got it from here). Match the prompts to books you have read in the last year – in my case mainly computer books.

One time on holiday: Understanding the Linux Virtual Memory Manager

Weekends at my house are: Code Complete

My neighbour is: Programming Pearls

My boss is: The Mythical Man Month

My superhero secret identity is: The Annotated Turing

You wouldn’t like me when I’m angry because: Mussolini’s Italy

I’d win a gold medal in: The Little Schemer

I’d pay good money for: Structure and Interpretation of Computer Programs

If I were Prime Minister I would: Groovy in Action

When I don’t have good books, I: P, NP, and NP-Completeness

Loud talkers at the cinema should be: Albion’s Fatal Tree

# In continued praise of “Programming Pearls”

I have read two books in the last year that have fundamentally changed the way I think about computers and programming – The Annotated Turing and Programming Pearls.

Programming Pearls in particular continues to inspire me in the way it makes you think about building better algorithms and making better use of data structures – and here’s a short tale about what I have just done (and why I did it) with my valext program.

Valext runs a program under step tracing to allow the detailed examination of patterns of memory use – interrogating the /proc/pid/pagemap at every step.

The way I was doing this was to query the /proc/pid/maps, find the blocks that were being mapped and then seek through /proc/pid/pagemap, reading off the page status.

For every page that was mapped I would build a data structure and stick it into a linked list:

```c
struct blocklist {
	uint64_t entry;	/* pagemap data for this page (field name illustrative) */
	struct blocklist *nextblock;
};
...
block = malloc(sizeof(struct blocklist));
if (!block)
	return;
block->nextblock = NULL;
if (*lastblock == NULL) {
	*lastblock = block;
} else {
	(*lastblock)->nextblock = block;
	*lastblock = block;
}
```


But the problem with this code – an $\mathcal{O}(n)$ algorithm – is that performance stinks. I was running it on a very underpowered laptop and it got to about four days of wallclock time, in which time only about 2 million steps had been taken.

So today I rewrote it with an essentially $\mathcal{O}(1)$ algorithm – I allocate an array of arbitrary size (512 elements in current code) and fill that up – with a 0 value being the guard (ie marking the end). If more than 512 elements are needed then another 512 will be allocated and chained to the top and so on…


```c
#include <regex.h>
#include <stdint.h>
#include <stdlib.h>

#define MEMBLOCK 512	/* pages stored per link of the chain */
#define PAGESHIFT 12	/* assuming 4K pages */

struct blockchain {
	int size;
	uint64_t *head;		/* array of page numbers, 0 marks the end */
	struct blockchain *tail;
};

uint64_t* blockalloc(int size)
{
	uint64_t *buf = calloc(size, sizeof(uint64_t));
	return buf;
}

struct blockchain *newchain(int size)
{
	struct blockchain *chain = calloc(1, sizeof(struct blockchain));
	if (!chain)
		return NULL;
	chain->head = blockalloc(size);
	if (!chain->head) {
		free(chain);
		return NULL;
	}
	chain->size = size;
	return chain;
}

/* recursively free the list */
void cleanchain(struct blockchain *chain)
{
	if (!chain)
		return;
	cleanchain(chain->tail);
	free(chain->head);
	free(chain);
}

/* set up a list: buf holds a line from /proc/pid/maps; every page in
   the mapping it describes is appended to the chain */
struct blockchain* getnextblock(struct blockchain** chain, char* buf)
{
	int match, t = 0;
	uint64_t from, to, i;
	const char* pattern = "^([0-9a-f]+)-([0-9a-f]+)";
	regmatch_t addresses[3];
	regex_t reg;

	if (regcomp(&reg, pattern, REG_EXTENDED) != 0)
		goto ret;
	match = regexec(&reg, buf, (size_t)3, addresses, 0);
	if (match == REG_NOMATCH || match == REG_ESPACE)
		goto cleanup;
	from = strtoull(&buf[addresses[1].rm_so], NULL, 16) >> PAGESHIFT;
	to = strtoull(&buf[addresses[2].rm_so], NULL, 16) >> PAGESHIFT;
	if (*chain == NULL) {
		*chain = newchain(MEMBLOCK);
		if (*chain == NULL)
			goto cleanup;
	}
	for (i = from; i < to; i++) {
		if (t >= MEMBLOCK - 1) {	/* keep the final 0 as a guard */
			struct blockchain *nxtchain = newchain(MEMBLOCK);
			if (!nxtchain)
				goto cleanup;
			(*chain)->tail = nxtchain;
			*chain = nxtchain;
			t = 0;
		}
		(*chain)->head[t] = i;
		t++;
	}
cleanup:
	regfree(&reg);
ret:
	return *chain;
}
```



As I am only testing this code now, I don’t know if it will be faster – though everything tells me it should be (and I am not entirely clear whether memory allocation is the real bottleneck in the program, or whether a Pentium III is just a Pentium III and there is nothing much I can do about that).

What I do know is that if it was not for the inspiration of Programming Pearls I probably would not even have bothered trying.

# The diagonal proof


I am just reading Computability and Logic and it (or at least the old, 1980 edition I am reading) has a rather laboured explanation of Cantor’s 1891 proof of the non-enumerability of the real numbers (the diagonal proof), so, to reassure myself and because it is interesting, I thought I’d set out a version here (closely modelled on that found in The Annotated Turing).

If you are already lost, here’s what it’s about: it is a proof that the infinite number of real numbers (that is to say, numbers with something after the decimal point, eg 3.1) is a bigger infinity than the infinite number of integers – or, if you prefer to think in terms of fractions, the rational numbers. One – the infinite number of integers or rationals – is countable; the other is not.

This is a pretty mind-blowing proof and it is also fair to say that a minority of mathematicians have issues with Georg Cantor‘s theory of the continuum, but that is for another blog post, perhaps.

So here’s the example:

Let’s start setting out the real numbers that are less than one:

0.00000000000.....
0.10000000000.....
0.20000000000.....
0.30000000000.....
0.40000000000.....
0.50000000000.....
0.60000000000.....
0.70000000000.....
0.80000000000.....
0.90000000000.....
0.01000000000.....
0.02000000000.....
.....
0.11000000000.....

And so on. Now this looks like it might be enumerable (countable) by assigning an integer to each element in the list – the first in the list is number 1, the second number 2, and so on. Hopefully you can see that this should, in theory, list all the numbers (except it doesn’t, as we will prove).

But let us take the number formed by the diagonal – the first digit of the first number, the second digit of the second, and so on. In our case this is:

0.**0**0000000000.....
0.1**0**000000000.....
0.20**0**00000000.....
0.300**0**0000000.....
0.4000**0**000000.....
0.50000**0**00000.....
0.600000**0**0000.....
0.7000000**0**000.....
0.80000000**0**00.....
0.900000000**0**0.....
0.0100000000**0**.....
0.020000000000**0**.....

ie 0.0000000000.......

And let’s add 1 to every digit, so we get 0.111111111111....: is this number in the list? We will now show that it cannot be by checking off the numbers in the list one by one.

First of all, we can eliminate the first number in the list as a match, because its first digit is 0 and our number has 1 as the first digit. Then we can eliminate the second number in the list in the same way (ie its second digit does not match). In fact it is clearly the case that we can eliminate number N from the list, because it will not match at digit N.

OK, you say, let’s fix that by adding this ‘new’ number to the list:

0.11111111111.....
0.00000000000.....
0.10000000000.....
0.20000000000.....
0.30000000000.....
0.40000000000.....
0.50000000000.....
0.60000000000.....
0.70000000000.....
0.80000000000.....
0.90000000000.....
0.01000000000.....
0.02000000000.....

But let’s diagonalise that one:

0.**1**11111111111.....
0.0**0**000000000.....
0.10**0**00000000.....
0.200**0**0000000.....
0.3000**0**000000.....
0.40000**0**00000.....
0.500000**0**0000.....
0.6000000**0**000.....
0.70000000**0**00.....
0.800000000**0**0.....
0.9000000000**0**.....
0.01000000000**0**.....
0.020000000000**0**.....

And we get 0.1000000000...... Once again, adding 1 to all the digits, we get a new number, 0.2111111111111.... – which we can show, in the same way, does not exist in the original list (set).

# The importance of Turing’s findings

Yesterday I was speaking to a friend – who knows about computers but not about computer science – who told me this blog was just too difficult: so I thought I’d seek to order my own thoughts after reading The Annotated Turing and assist humanity in general (!) by writing a piece that explains what I think is the most fundamental finding from Turing’s famous On Computable Numbers paper.

To begin, one must remember that Turing’s paper is not about electronic computers at all: indeed, while in the paper he refers to “computers”, what he means is a person who computes.

Turing’s paper does describe an entirely theoretical mechanical machine that replicates the actions of a “computer” in solving a mathematical process through simple steps. You could think of those steps in this form if you were the computer adding 1 to 1:

• Read in the first number: 1
• Read in the operation: +
• Read in the second number: 1
• Apply the operation to both numbers

Turing’s argument is that the only numbers we can compute are those that can be computed in this mechanical way. This, in itself, was – and in a way remains – a controversial statement because it says that at the heart of human consciousness is a mechanical process and that, fundamentally, there is nothing special about consciousness.

In a way this flies in the face of all our experience of (digital) computers. Whatever else we know about them, we know they cannot “think” – they are certainly mechanical (in the sense that they solve problems in the step-by-step way outlined above), but they seem so radically less powerful than our consciousness that our thinking, surely, must be built upon something else.

This – the idea that consciousness is more than mechanical – is rather out of favour these days, but it still has its adherents: look up the fascinating “hard problem of consciousness” if you want to read more. Here we are going to take the materialist approach and argue that the brain is indeed a mechanical processor, albeit one that (perhaps because of massive parallelism, or specific adaptation, or whatever) is hugely ahead of today’s electronic computers in certain fields (though the gap is clearly closing on just about all fronts).

So if we accept that the only means available of calculating is a mechanical process then we must also accept Turing’s finding that we can only calculate a tiny minority of numbers or problems.

To explain this we need to look at the concept of infinity, or more properly infinities, because, at least according to the widely accepted theory of Georg Cantor, there is more than one infinity – indeed there are an infinite number of infinities, all of a different size – or order, as the size is referred to.

Sounds like nonsense, doesn’t it? But here’s a mind experiment to help you understand.

Think of a ruler, marked out in centimetres (or inches if you are old fashioned). But this ruler stretches on infinitely – so an infinite number of centimetre markings are present. But each one of these markings can be associated with a natural number, zero, one, two and so on. This is an infinite list which is said to be enumerable (countable) – a number can be assigned to each member of the set of markings on the ruler. The number of members a set has is called its cardinality, and so this set has what is called the cardinality of aleph null: $\aleph_0$.

But think of that ruler again – how many points are there between the different markings? Clearly we can divide the ruler up over and over again, and that means we must have a bigger infinity than aleph null.

This infinity is of the order $2^{\aleph_0}$ – two raised to the power of aleph null – and we discuss why below.

An aside: the continuum hypothesis

According to the continuum hypothesis, aleph one ($\aleph_1$, the next biggest infinite cardinal after aleph null) is the so-called cardinality of the continuum.

And so, goes the theory, aleph one is of the same order as the number of real numbers. A real number is an integer followed by a decimal point and an infinite (aleph-null-order) sequence of digits – in the case of the natural numbers this is an infinite sequence of zeros, etc. Hence, the hypothesis states:

$2^{\aleph_0} = \aleph_1$

In any case the important number here is two to the power of aleph null – the order of the set of real numbers.

And why is this the order of the real numbers? Because it is the order of the power set of the natural numbers. The power set is the set of all subsets of a given set, and its cardinality is $2^{|S|}$ for a set $S$.

To explain why this is the order of the set of real numbers, think of a binary (0 or 1) expansion after the decimal point. Each digit is then countable – ie there is a first digit, a tenth digit and a ten millionth digit – so the digits are of order aleph null; but the number of ways in which the 0s and 1s can be arranged is the power set of the countable numbers, in other words $2^{\aleph_0}$: a number that, by definition, is too big to count.

Now, what Turing’s paper shows is that we can only build a machine that computes (in other words, calculates) a countable number of numbers – we can only calculate aleph null numbers or, alternatively, solve aleph null problems.

But the number of real numbers is two to the power of aleph null, a much bigger number (to repeat, it is actually too big to count).

This was a shattering result to a generation of mathematicians who were accustomed to the idea that all mathematical problems were solvable.

On 8 August 1900, one of the greatest mathematicians of the nineteenth and twentieth centuries, David Hilbert, told the Second International Congress of Mathematicians in Paris:

“However unapproachable these [mathematical] problems may seem to us and however helpless we stand before them, we have, nevertheless, firm conviction that their solution must follow by a finite number of purely logical processes … This conviction of the solvability of every mathematical problem is a powerful incentive to the worker. We hear within us the perpetual call. There is the problem. Seek its solution. You can find it by pure reason, for in mathematics there is no ignorabimus.”

But on 12 November 1936, the formal publication date of Turing’s paper, that confidence was destroyed for ever. An uncountable number of mathematical problems simply cannot be solved.

# Broken peer review is nothing new


I have been pressing on with The Annotated Turing and have now made it to page 163. And I see no reason to withdraw my earlier enthusiastic partial review: I just wish I’d read it a year or more ago as it has no immediate bearing on the MSc this year, though is still a great read when it comes to a proper understanding of computing.

But what strikes me is that the book details a number of errors in the original “On Computable Numbers…” paper. Most of these appear to be typographical or similar mistakes, but they actually render Turing’s written argument broken in several places. And – even though one of them was immediately obvious to me (actually, I assumed I had misunderstood Turing’s point until I read Charles Petzold‘s commentary) – they don’t seem to have been corrected in print for another decade.

Bluntly, that suggests that whoever reviewed the paper did not do too thorough a job on it. Of course, none of this means it was wrong to publish the paper – it is truly an intellectual milestone of the 20th, or any other, century – but it does suggest that recent controversies over the weakness of peer review and the caprice of the editors of scientific journals are nothing particularly new.