A further point about Scottish referendum polls


Free Celtic Nationalist Jesus Surfers Sunshine Vote Yes Van (Photo credit: GerardFerry)

One (smallish) point I left out of my discussion of opinion polling a few days ago was that the “margin of error” for a 95% confidence interval varies according to the reported score.

This range of possible error is actually highest for parties or opinions that score 50% – which is more or less where the “no” vote in the Scottish referendum is polling now. For a 1,000-sample poll with a 95% confidence interval, the range at 50% is:

\pm 2 \times \sqrt{\frac{0.5 \times 0.5}{1000}} = \pm 3.2\%

Good news for the yes camp? Not really, because apart from the obvious point that it is mathematically equally likely to be an over-estimate as an under-estimate, the corollary is that the error in smaller figures is less. For 30%, roughly where the yes campaign are, the error is:

\pm 2 \times \sqrt{\frac{0.3 \times 0.7}{1000}} = \pm 2.9\%
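More generally, for a reported share p and a sample of n respondents, the (two standard error) margin quoted with a 95% confidence interval is:

\pm 2 \times \sqrt{\frac{p(1-p)}{n}}

which is largest at p = 0.5 and shrinks as the reported share moves away from 50%.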


What not to say about the bad weather


Reduction of flood and associated extreme weather costs is the primary benefit of climate change mitigation. (Photo credit: Wikipedia)

The last six weeks have seen the UK hit by a number of storms – so surely climate change is to blame? Well, no, you cannot say that.

Not because climate change is not to blame. But because we just cannot definitively pin down a weather event to a change in climate. (Of course, this does not bother many on the other side of the argument, but those who defend science should judge themselves by higher standards.)

But we can, if the resources are available, calculate the statistical probability that specific weather events are linked to climate effects.

This is the abstract of a 2011 paper published in Nature:

Anthropogenic greenhouse gas contribution to flood risk in England and Wales in autumn 2000

Nature 470, 382–385 (17 February 2011)
doi:10.1038/nature09762
Received: 30 March 2010
Accepted: 10 December 2010
Published online: 16 February 2011

Interest in attributing the risk of damaging weather-related events to anthropogenic climate change is increasing [1]. Yet climate models used to study the attribution problem typically do not resolve the weather systems associated with damaging events [2] such as the UK floods of October and November 2000. Occurring during the wettest autumn in England and Wales since records began in 1766 [3, 4], these floods damaged nearly 10,000 properties across that region, disrupted services severely, and caused insured losses estimated at £1.3 billion (refs 5, 6). Although the flooding was deemed a ‘wake-up call’ to the impacts of climate change at the time [7], such claims are typically supported only by general thermodynamic arguments that suggest increased extreme precipitation under global warming, but fail [8, 9] to account fully for the complex hydrometeorology [4, 10] associated with flooding. Here we present a multi-step, physically based ‘probabilistic event attribution’ framework showing that it is very likely that global anthropogenic greenhouse gas emissions substantially increased the risk of flood occurrence in England and Wales in autumn 2000. Using publicly volunteered distributed computing [11, 12], we generate several thousand seasonal-forecast-resolution climate model simulations of autumn 2000 weather, both under realistic conditions, and under conditions as they might have been had these greenhouse gas emissions and the resulting large-scale warming never occurred. Results are fed into a precipitation-runoff model that is used to simulate severe daily river runoff events in England and Wales (proxy indicators of flood events). The precise magnitude of the anthropogenic contribution remains uncertain, but in nine out of ten cases our model results indicate that twentieth-century anthropogenic greenhouse gas emissions increased the risk of floods occurring in England and Wales in autumn 2000 by more than 20%, and in two out of three cases by more than 90%.

At the time of those floods I was a special adviser to Paul Murphy, the Secretary of State for Wales, and accompanied him when he visited stricken areas of North Wales. It would be an exaggeration to say we saw Biblical levels of distress and destruction, but they were unlike anything I had seen at first hand.

As the abstract makes clear the paper certainly does not say “floods were caused by climate change” – indeed it goes so far as to say that “observed UK fluvial-flood and high-flow trends for recent decades suggest no clear evidence for any change above that of natural variability [17, 18], mirroring the mixed picture in observed precipitation changes [19, 20].”

Moreover, the authors state that the very factors they say increased the risk of flooding from heavy rainfall may actually reduce the risk of some other flood events: climate change is complex – something else those who oppose the science that underlies it are often unwilling to concede as they focus on populist arguments about this week’s weather or short-term trends.

The computing for the project was organised through http://climateprediction.net.

You can read more about all of this in a news item on Nature’s website.

Glasgow - from Wikipedia

One thing the Scottish referendum campaign can do without: polling voodoo


Picturing 50 realisations of a 95%-confidence interval (Photo credit: Wikipedia)

In the year ahead one of the biggest – probably the biggest – political stories in the UK will be the September referendum on whether Scotland should leave the UK.

I am not going to comment here on what I hope the outcome will be – other than to say I hope and believe there will be a strong ‘no’ vote.

But I am going to take issue with how the campaign is reported and, in particular, the dismal way in which opinion polls are covered.

My ire has been provoked by a claim by a columnist in today’s Scotsman that a 1% change in one side’s support between two polls, taken in September and December, indicates the race is “tightening”.

My argument is that it indicates nothing of the sort. The two polls are essentially mathematically identical. I realise that “things just the same” does not, as a headline, sell many papers, but that does not make it acceptable to invent new mathematical facts where none exist. The fact that opinion polls today show essentially the same result as opinion polls of two months ago – and, in this case, of two years ago and twenty years ago – may be a journalistic disappointment, but it is also the reality.

So here is my brief guide to the mathematics of opinion polls. If you want to know more I strongly recommend the classic Statistics without Tears: An Introduction for Non-Mathematicians which, as the subtitle suggests, gives the reader a clear grounding without requiring a lot of maths knowledge.

I will begin with a few ground rules…

Firstly, remember what a poll is based on: not the truth about people’s opinions but what they say their opinions are. If some people systematically lie to pollsters (as, in certain cases, it is known they do because they might be afraid or ashamed to tell the truth) then your poll is flawed from the start. And the best you can say of any poll’s accuracy is that it is as good as the best poll can be.

Secondly, the best we can say about a poll is that, if conducted properly, it has a given degree of accuracy compared to any other poll. So when people talk of a “margin of error” in a poll, what they really mean is that 95% of all properly conducted polls will give an answer within that margin of error. (This is both an amplification of the first point and completely independent of it – if people lie then they will likely lie to all pollsters and so no polls are immune.)

Thirdly, it is a mathematical fact that for even the best conducted polls, we should expect one in twenty to give results outside that “margin of error” – this isn’t because we can expect pollsters to mess it up one time in twenty, but because of the mathematical rules of the universe in which we live. It is an unavoidable feature of opinion polling. And because it is unavoidable we do not know which of the polls is the “rogue” and whether any seeming shift (or non-shift, remember) is because of this “rogue” effect or because of a real change in what people are likely to say to opinion pollsters.

And now a little bit of maths…

Claims about polling accuracy are based on the fact that opinion poll results (surveys of a small part of the population from which we hope to draw conclusions about the whole population) will be distributed about the “real” result (ie the answer we’d get if we asked every single person) in a bell-shaped “normal distribution”. The maths of this “normal distribution” are very well understood and so we can make some well-grounded claims about the potential accuracy of our polls.

These include the fact that, above a basic minimum sample size, the margin of error in our poll (i.e., the error compared to other polls) varies with the inverse of the square root of the sample size. Or to be blunt about it, a poll with 2000 respondents is not twice as precise (i.e., with half the margin of error) as one with 1000, but merely 1.4 times more precise, while going from 500 to 2000 respondents shrinks the margin of error not by a factor of 4 but by a factor of 2 (you can tell straight away that the economics of large-scale polling is a bit perverse – if you go from a 1,000 to a 10,000 sample poll, your costs increase by a factor of 10, but the margin of error only shrinks by a factor of about 3).
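In other words the margin of error scales as \frac{1}{\sqrt{n}} for a sample of n , so doubling the sample from 1,000 to 2,000 only narrows it by a factor of:

\sqrt{\frac{2000}{1000}} \approx 1.4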

The “one-in-twenty will be rogue” rule comes from the fact that when we talk about the margin of error in a poll what we really mean is that in 95% of all polls the result will be in a band twice the size of the margin of error, centred on the result we have published. This 95% figure is the “confidence interval” (more precisely this is the band of two “standard errors” in each direction about the sample mean).

You may interject now and say “but that doesn’t mean a 1% difference is not real” and you would be right – if you are willing to live with a lower confidence interval or pay for a very much bigger sample. So, to make a 1% figure “real” we might be prepared to live with a margin of error of 0.5% on either side of the reported poll result. We could get that in two ways – shelling out to increase the sample size to roughly 40,000 (compared to the typical 1,000), which would keep our 95% confidence interval, or accepting that roughly three-quarters of 1,000-sample polls would give a result that was not within +/- 0.5% of our figure – or, crudely, we would be far more likely to be wrong than right when we claimed the 1% was a “real” shift (we would be left with only about a 25% confidence interval).
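To see where the roughly 40,000 figure comes from: keeping the 95% interval but demanding a 0.5% margin of error at a 50% score means solving

2 \times \sqrt{\frac{0.5 \times 0.5}{n}} = 0.005 \implies n = \frac{4 \times 0.25}{0.005^2} = 40000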

Mixed feelings about the Turing pardon


Alan Turing was not ashamed of being gay and made little or no effort to hide it. In today’s parlance he was “out” – if not to the world then certainly to a large number of people.

I wonder if he would ever have asked for a ‘pardon’ – because his view was certainly that he had done nothing that required a pardon.

The other factor, of course, is that thousands of people – many still alive – were prosecuted using the same repressive law under which Turing was victimised. Are they too to be pardoned? Or is it just that a high profile case, involving someone who cannot say anything that causes discomfort in response, is a handy pre-Christmas news sponge?

Update: Andrew Hodges makes the point much better than I can:

“Alan Turing suffered appalling treatment 60 years ago and there has been a very well intended and deeply felt campaign to remedy it in some way. Unfortunately, I cannot feel that such a ‘pardon’ embodies any good legal principle. If anything, it suggests that a sufficiently valuable individual should be above the law which applies to everyone else.

“It’s far more important that in the 30 years since I brought the story to public attention, LGBT rights movements have succeeded with a complete change in the law – for all. So, for me, this symbolic action adds nothing.

“A more substantial action would be the release of files on Turing’s secret work for GCHQ in the cold war. Loss of security clearance, state distrust and surveillance may have been crucial factors in the two years leading up to his death in 1954.”

A little plug for BINSIC


Thirty-three years ago my brother and I got a new cassette player for Christmas.

That allowed us to write and save games for our ZX80 computer (like many ZX80 owners we found that an older cassette player just didn’t work) and in 1981 I wrote a Z80 machine code version of Conway’s Game of Life – still my proudest programming achievement.

Last year I sought to complete the circle by writing an interpreter/DSL that mimicked ZX80 BASIC – called BINSIC, as “BINSIC Is Not Sinclair Instruction Code” – and then wrote a BASIC version of the Game of Life (see the screenshot below).

Game of Life in BINSIC

So, if you are from that generation of computer users, or even if you are not, why not give it a try – more details here: BINSIC

More on the Riemann sphere


In my previous discussion on the Riemann Sphere I talked of it as a closed sphere of infinite extent, but it can also be represented as a sphere of finite extent.

Display of complex number on the Riemann sphere. (Photo credit: Wikipedia)

In this case imagine a sphere centred on the origin and of unit radius. Then any complex number can be represented as a point on the sphere’s surface by simply drawing a line from the pole to that point on the (complex) plane – where the line intersects the sphere is the point that represents the number (numbers of magnitude less than one are represented by the underside of the sphere).
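For those who want the formula – using the standard convention of projecting from the north pole (0, 0, 1) of the unit sphere on to the plane through the equator – the complex number with real part x and imaginary part y corresponds to the point:

\left( \frac{2x}{x^2+y^2+1}, \frac{2y}{x^2+y^2+1}, \frac{x^2+y^2-1}{x^2+y^2+1} \right)

so numbers of magnitude less than one do indeed land on the lower half of the sphere.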

This is, in one sense, a consequence of the infinity of the continuum – we are projecting an infinite line on to a finite line, but as the points on both lines have the same cardinality there can be a one to one correspondence – it’s another thing of beauty, I think.

Doyle Spiral on Riemann Sphere from fedcomite (Flickr)

The beauty of the Riemann sphere


Rendering of the graph of the Sphere of Riemann (Photo credit: Wikipedia)

Reading Elliptic Tales: Curves, Counting, and Number Theory today, I came across the Riemann Sphere – a thing of beauty: a closed surface of infinite extent.

I will explain the maths of the sphere in a moment, but I am left wondering if one of two things apply:

(a) We are terrible at teaching maths in this country because although it produces such beautiful concepts too many children – perhaps a majority – think it is a thing of drudgery; or

(b) The appeal of maths is inherently limited and the number of people fascinated by a concept as seemingly contradictory as a closed surface of infinite extent is always going to be limited.

Anyway, here’s an explanation of the sphere…

We want to think about infinity but we will, at first, restrict our thinking to a single, straight line. We can say how far along this line we are by measuring the distance from an arbitrary starting point. So, we could be 1 unit along, \pi units along or -\pi units along and so on.

In fact we can represent where we are on the line by a ratio \frac{a}{b} and write each point in the form a:b . So \pi can be written as \pi:1 , for instance – we are not restricted to the rationals in this form.

Now, think of this as one way to get further and further from our arbitrary starting point – first we go 1 unit and so are at 1:1 , then we go to 1:\frac{1}{2} , then 1:\frac{1}{3} and so on – and we approach the limit 1:0 , which is the furthest we can possibly be and which we call infinitely far away from our starting point.

But what if we went in the other (negative) direction? Then we’d go to 1:-1 , 1:-\frac{1}{2} and so on … but at the limit we’d go to 1:0 – exactly the same place we’d go to if we started in the opposite direction: hence the “negative” and “positive” infinities are the same and this line is actually a loop.

To go from this loop to the sphere we need to consider complex numbers. Here every number is of the form a + b\Im, where \Im is the square root of -1.

Then we have to plot numbers on a plane, not a line, and so every number has two co-ordinates: one along the real axis of the form we discussed above – eg 1:\frac{1}{7} – and one along the imaginary axis of the form \Im:\frac{1}{7} . We thus get an infinite set of infinite closed loops – a sphere.

In the steps of László Bélády


Update: I have truncated this article for now (20 December) as there was an error in my LRU software that made LRU look like a much better performer than it really was. I’ll update this with the correct data shortly….

 

In 1966 László Bélády published “A study of replacement algorithms for virtual storage computers”, one of the truly epoch-making papers of operating system science – the first comprehensive examination of page replacement strategies for virtual memory computers.

These days all but the simplest embedded computing devices use some sort of virtual memory system, because it allows a computer to load bits of programs in and out of memory (relatively) seamlessly as needed – the programs see a faked – virtual – address, so the chunks can be loaded in and out of whatever piece of physical memory is available without worrying about getting them into exactly the same place every time.

But in 1966 virtual memory was a new and essentially experimental technology, and so Bélády’s examination of the different strategies for deciding which chunk (page) of memory was kept or replaced when new pages had to be loaded is the foundation stone of all the approaches that followed.

This last couple of weeks I have found myself walking in the steps of Bélády as I built software to examine the different performance characteristics of potential page replacement policies in a network-on-chip computer.

I have about 220GB of XML data which represents a record of the memory accesses of an 18-thread video processing application – and using that data I can test computer system performance under various different policies.
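To make the idea concrete, here is a minimal sketch in Python of the sort of calculation involved – counting the page faults a least-recently-used (LRU) policy would generate for a given trace of page accesses and a given number of page frames. This is an illustrative toy, not the project’s actual simulation code (which works from the XML trace described above).

from collections import OrderedDict

def lru_faults(trace, frames):
    # Count page faults for an LRU replacement policy, given a trace of
    # page numbers and a fixed number of physical page frames.
    resident = OrderedDict()  # pages currently in memory, least recently used first
    faults = 0
    for page in trace:
        if page in resident:
            resident.move_to_end(page)  # mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = True
    return faults

# A toy trace: the same accesses, simulated with different numbers of frames
trace = [1, 2, 3, 1, 4, 2, 5, 1, 2, 3]
for frames in (2, 3, 4):
    print(frames, "frames:", lru_faults(trace, frames), "faults")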

Give yourself a Christmas present: learn sed


A Shebang, also Hashbang or Sharp bang. (Photo credit: Wikipedia)

Text is at the core of The Unix Way – and all True Unix Hackers work from the command line. This much you know.

(If you don’t, get a copy of The Art of Unix Programming – there is an awful lot of rubbish in that book but it does do one thing well: explain the deep connection between text and Unix.)

In a practical sense this means to get the best from your Unix system (and this includes you if you are a Mac OSX user) you need to boost your command line skills. The first thing to do is, of course, become familiar with a text editor – either vi or emacs (I am a vi user, but refuse to engage in a religious war on this matter.)

Then, perhaps not the next thing, but one of the next things you should do is learn sed – the streaming editor – one of the many gifts to the world (including Unix, of course) from Bell Labs (I recently read The Idea Factory: Bell Labs and the Great Age of American Innovation and I suppose I really ought to get around to writing a review of that).

Sed comes from the 1970s but, as so often in computing, it feels to me that its time has come again – in the era of big data, a program that allows you to edit a file one line at a time, as opposed to trying to read as much of the file as possible into your computer’s memory, is exactly what is needed.

If you are sufficiently long in the tooth to have messed about with Microsoft’s edlin or any other line editor you might be forgiven for giving a hollow laugh at this point – but sed is a tool that genuinely repays the effort you have to make to learn it.

In the last few weeks I have been messing about with 220GB XML files and even the University of York’s big iron compute server cannot handle a buffered edit of a file that size – sed is the only realistic alternative (actually I thought about using my own hex editor – hexxed – which is also essentially a line editor, but a hex editor is really for messing about with binary files and I wouldn’t recommend it).

Sed has allowed me to fix errors deep inside very large files with just a few commands – eg:

LANG=C sed "51815253s@^.*\$@<instruction address='004cf024' size='03' />@" infile.xml > outfile.xml

Fixes line 51,815,253 in my file (the line identified by an XML fatal error). Earlier I had executed another line of sed to see what was wrong with that line:

LANG=C sed -n '51815253p' infile.xml

(The LANG=C prefix is because the breakage involved an alien locale seemingly being injected into my file.)

Sed allows you to do much more – for instance anything you can identify through a pattern can be altered. Let’s say you have (text) documents with your old email address – me@oldaddress.com – and you want to change that to your new address – me@newaddress.com …

sed 's/me@oldaddress\.com/me@newaddress\.com/g' mytext.txt > newtext.txt

Then check newtext.txt for correctness before using mv to replace the original.

But there is much, much more you can do with it.

Plus you get real cred as a Unix hacker if you know it.

Now, too many programs these days – especially anything from Redmond – go out of their way to suppress text formats. Text, after all, is resistant to the “embrace and extend” methodology – text wants to be free. But there is plenty of it out there still.

Books that teach you about sed are not so plentiful – I have been reading an old edition of sed & awk – which seems to be out of print – though you can buy a second hand copy for less than a quid excluding postage costs. Well worth the investment, I’d say.

A conundrum for a sed wizard


I am hoping to get maximum exposure for this problem, so I get a solution.

I have a large XML file with one corrupt line – line 35,185,222

A very simple sed script prints out the broken line – as a single line (this is important!):

sed -n '35185222p' infile.xml

Gives this as output:

<load address=’11c1�����ze=’08’ />

But if I change my sed script (sticking it in a file to avoid having to escape the quotes) like so:

35185222s@^.*$@<load address='11c1385b' size='08' />@p

sed -n -f seddy.sed infile.xml

The script fails to match – because sed stops matching at the first corrupt character and so never sees the whole of line 35185222.

I know this because a script like:

35185222s@^.*@XXX@p

will output

XXX�����ze=’08’ />

So how do I fix this?

 

Update: I have been reading sed & awk but have only just got to the c command – which I could have used. But I was also interested in a real fix – and thanks to Hugh and others in the comments I now know that comes from specifying the locale with LANG=C.