Modelling the US Presidential Election in R


In many ways this is a lot easier than modelling the British Labour Party leadership election – we only care about two candidates, Biden and Trump, and they capture the vast majority of the votes, so I feel reasonably confident about my first simplification: ignoring the votes of all others and simply focusing on the Democrats and the Republicans.

The next simplifying assumption, though, is the big one and one for which the case is disputed: that a US-wide poll can allow us to predict state-by-state results. What I don’t do is simply plug a nationwide opinion poll into a uniform swing model, though.
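For comparison, here is a minimal sketch of the uniform swing arithmetic that I am trying to improve on. The poll figures (47% and 41%) and the 2016 two-party Democratic share (51.1%) are the ones used in the script further down; the state and its 49% Democratic share are hypothetical, made up purely for illustration.

```r
# Uniform swing: apply the same national shift to every state.
dem_poll <- 0.47    # Biden in the national poll
rep_poll <- 0.41    # Trump in the national poll
dem_2016 <- 0.511   # Democratic share of the 2016 two-party vote

# Rescale the poll to a two-party share, ignoring all other candidates
dem_two_party <- dem_poll / (dem_poll + rep_poll)

# National swing to the Democrats since 2016
swing <- dem_two_party - dem_2016

# Hypothetical state where the Democrats took 49% of the two-party vote in 2016
state_dem_2016 <- 0.49
state_dem_projected <- state_dem_2016 + swing

cat("Swing:", round(100 * swing, 2), "points\n")
cat("Projected Democratic share:", round(100 * state_dem_projected, 2), "%\n")
```

Under this scheme every state moves by exactly the same amount, so a state is simply won or lost with no sense of how close the call was.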

Rather than just use such a simple arithmetic system I instead simulate the election 3,000,000 times – and each time randomly (stochastically) vary the inputs:

  • First of all I vary the Biden/Democrat score to simulate sampling error in the poll – assuming samples are normally distributed around the reported number and that 95% of all samples will give a result within 3 percentage points of the reported number. (As we are ignoring all third party effects I simply add the negative of this variation to the Republican score).
  • Secondly, state by state, I repeat this process but with a bigger range – instead of a standard deviation of about 1.5% for the poll, I use 2% for the state.
  • Thirdly – and this is entirely subjective – I make an estimate of the long-term trend in a given state and use this as a way of adding further randomness: e.g. if I assume that there has been a ‘core’ 0.5% shift to the Democrats then an additional random factor is added, drawn from a normal distribution with 0.005 as both its mean and its standard deviation (in theory, of course, the draw could be negative, though that is unlikely).
  • Then – to calculate the likely outcome I look at the probabilities of the different totals of Electoral Votes across all three million simulations.
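The first three steps can be sketched for a single state as follows. This is an illustration rather than the full model: the state and its 49.5% baseline are hypothetical, the run count is cut to 100,000, and the trend draw uses the trend as both mean and standard deviation, which is how the script further down does it.

```r
# Single-state sketch of the three sources of variation.
set.seed(42)               # reproducible draws
n_sims <- 100000           # far fewer than the 3,000,000 runs in the real thing

poll_sd  <- 0.03 / 1.96    # 95% of samples within 3 points gives an sd of about 1.5%
state_sd <- 0.02           # wider spread at state level
trend    <- 0.005          # assumed 0.5% long-term shift to the Democrats

# Hypothetical state: Democratic two-party share after applying the national swing
dem_baseline <- 0.495

poll_error  <- rnorm(n_sims, mean = 0, sd = poll_sd)
state_error <- rnorm(n_sims, mean = 0, sd = state_sd)
trend_shift <- rnorm(n_sims, mean = trend, sd = abs(trend))

dem_share <- dem_baseline + poll_error + state_error + trend_shift
rep_share <- 1 - dem_share   # third parties ignored, so the shares mirror each other

# Fraction of simulations in which the Democrats come out ahead
p_dem_win <- mean(dem_share > rep_share)
cat("Democratic win probability:", round(100 * p_dem_win, 1), "%\n")
```

Because all the draws are whole vectors, there is no loop over simulations at all, only (in the full version) a loop over states.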

The states I think are swinging to the Democrats (all 0.5% swings unless stated otherwise) are:

Alabama, Alaska, Arizona, California, Colorado (0.2%), Delaware, DC (0.01%), Florida, Georgia, Hawaii (0.2%), Idaho, Illinois, Kansas, Louisiana, District 1 Maine (0.1%), Maryland, Massachusetts, Mississippi, Montana (0.2%), Nebraska (0.2%), District 2 Nebraska, Nevada (0.1%), New Mexico (0.2%), New York (0.1%), North Carolina, Oklahoma (0.1%), Oregon, South Carolina (0.1%), Texas, Utah, Vermont, Virginia.

Swinging to the Republicans (again 0.5% unless stated otherwise):

Arkansas (0.1%), Connecticut (0.2%), Indiana, Iowa, Kentucky (0.2%), Maine, District 2 Maine, Michigan, Minnesota, Missouri, District 1 Nebraska (0.1%), District 3 Nebraska (0.1%), New Hampshire (0.2%), New Jersey (0.1%), North Dakota, Ohio, Pennsylvania, Rhode Island, South Dakota, Tennessee (0.2%), West Virginia, Wisconsin, Wyoming.

(Obviously a lot of these swings make no real difference because the state is locked up for one side or the other anyway.)

So these are the sort of results generated (based on a YouGov America poll published on 30 August which had Biden at 47% and Trump at 41%) – as you can see there is a massive amount of spurious precision:

Alabama : Biden has 0.0005333333 % chance of winning
Alaska : Biden has 2.1635 % chance of winning
Arizona : Biden has 62.34907 % chance of winning
Arkansas : Biden has 0.0003333333 % chance of winning
California : Biden has 100 % chance of winning
Colorado : Biden has 98.54147 % chance of winning
Connecticut : Biden has 99.985 % chance of winning
Delaware : Biden has 99.96987 % chance of winning
DC : Biden has 100 % chance of winning
Florida : Biden has 75.8858 % chance of winning
Georgia : Biden has 47.0369 % chance of winning
Hawaii : Biden has 100 % chance of winning
Idaho : Biden has 0 % chance of winning
Illinois : Biden has 99.9998 % chance of winning
Indiana : Biden has 0.07203333 % chance of winning
Iowa : Biden has 5.133933 % chance of winning
Kansas : Biden has 0.07423333 % chance of winning
Kentucky : Biden has 6.666667e-05 % chance of winning
Louisiana : Biden has 0.2554 % chance of winning
Maine : Biden has 93.11677 % chance of winning
Maine1 : Biden has 99.99787 % chance of winning
Maine2 : Biden has 10.71063 % chance of winning
Maryland : Biden has 100 % chance of winning
Massachusetts : Biden has 100 % chance of winning
Michigan : Biden has 76.00387 % chance of winning
Minnesota : Biden has 81.57777 % chance of winning
Mississippi : Biden has 0.8029667 % chance of winning
Missouri : Biden has 0.0724 % chance of winning
Montana : Biden has 0.155 % chance of winning
Nebraska : Biden has 0 % chance of winning
Nebraska1 : Biden has 0.0053 % chance of winning
Nebraska2 : Biden has 75.95797 % chance of winning
Nebraska3 : Biden has 0 % chance of winning
Nevada : Biden has 91.20317 % chance of winning
New_Hampshire : Biden has 79.82397 % chance of winning
New_Jersey : Biden has 99.98613 % chance of winning
New_Mexico : Biden has 99.8471 % chance of winning
New_York : Biden has 99.99997 % chance of winning
North_Carolina : Biden has 62.3821 % chance of winning
North_Dakota : Biden has 0 % chance of winning
Ohio : Biden has 19.68823 % chance of winning
Oklahoma : Biden has 0 % chance of winning
Oregon : Biden has 99.97153 % chance of winning
Pennsylvania : Biden has 62.3612 % chance of winning
Rhode_Island : Biden has 99.99227 % chance of winning
South_Carolina : Biden has 3.431833 % chance of winning
South_Dakota : Biden has 0 % chance of winning
Tennessee : Biden has 3.333333e-05 % chance of winning
Texas : Biden has 19.67437 % chance of winning
Utah : Biden has 0.0032 % chance of winning
Vermont : Biden has 100 % chance of winning
Virginia : Biden has 98.81823 % chance of winning
Washington : Biden has 99.99973 % chance of winning
West_Virginia : Biden has 0 % chance of winning
Wisconsin : Biden has 62.35563 % chance of winning
Wyoming : Biden has 0 % chance of winning

In terms of Electoral Votes we have this:

Here the red vertical line is the winning post of 270 EVs and this shows Biden has a slightly greater than 80% chance of getting there.

What does that mean? Well, you can think of it like the weather forecast (indeed the methodologies are similar) – if the Met Office said there was an 80% chance of it raining, would you wear a rain coat?

The blue line marks a cumulative probability of 0.5, so the point where it crosses the curve gives us the median number of EVs Biden wins across the simulations – in this case just under 310.
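As a rough sketch of how these two numbers fall out of a set of simulated totals, here is a toy version. Everything in it is invented for illustration: the 250 "safe" EVs, the three contested states and their win probabilities. It also treats the states as independent, which the real model does not, because there the national poll error is shared across every state.

```r
# Toy map: turn per-simulation results into a win probability and a median.
set.seed(1)
n_sims <- 50000

dem_safe_evs  <- 250                               # hypothetical safe Democratic EVs
contested_evs <- c(A = 29, B = 20, C = 16)         # made-up contested states
dem_win_prob  <- c(A = 0.60, B = 0.55, C = 0.45)   # made-up win probabilities

# One row per simulation, one column per contested state (TRUE = Democratic win)
wins <- sapply(dem_win_prob, function(p) runif(n_sims) < p)

# Democratic electoral votes in each simulation
ev_totals <- dem_safe_evs + as.vector((wins * 1) %*% contested_evs)

# Share of simulations reaching the winning post, and the middle outcome
p_reach_270 <- mean(ev_totals >= 270)
median_evs  <- median(ev_totals)

cat("Chance of 270 or more:", round(100 * p_reach_270, 1), "%\n")
cat("Median electoral votes:", median_evs, "\n")
```

On the cumulative plot the first number is one minus the height of the curve just to the left of 270, and the second is where the curve crosses 0.5.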

Here’s the (slightly scrappy) code – unlike the Labour Party code I used earlier in the year I am making proper use of R’s vectorisation capabilities, so although I am running a lot of simulations it only takes a few seconds. A github repo (with the baseline data) will follow in due course.

(If you want to know more about statistics I cannot recommend this book too highly.)




#!/usr/bin/env Rscript

#library("ggplot2")

#args<-commandArgs(trailingOnly=TRUE)

samples<-3000000
dem<-0.47
rep<-0.41
route270<-data.frame(State=as.character(0), EVs=as.integer(0), Chance=as.double(0), stringsAsFactors=FALSE)
theScore<-data.frame(EVs=as.integer(0))

dem16<-0.511
rep16<-0.489

total2p<-(dem+rep)
corDem<-(dem/total2p)
corRep<-(rep/total2p)

swing<-corDem - dem16

us2016<-read.csv(file='US.csv', stringsAsFactors = FALSE)

demDiff<-dem-rnorm(samples, dem, 0.03/1.96)
repDiff<--demDiff


#state odds
for (i in 1:nrow(us2016))
{
  stateDiff<-rnorm(samples, 0, 0.02)
  #generate additional factor from the subjective long-term trend (Trend is held in percentage points)
  trendFactor<-us2016[i, ]$Trend
  localDemDiff<-rnorm(samples, trendFactor/100, abs(trendFactor)/100)
  localRepDiff<- -localDemDiff
  demProjection<-us2016[i,]$D16 + swing * 100 + demDiff * 100 + localDemDiff * 100 + stateDiff * 100
  repProjection<-us2016[i,]$R16 - swing * 100 + repDiff * 100 + localRepDiff * 100 - stateDiff * 100
  demVote<-us2016[i,]$Turnout * demProjection/100
  repVote<-us2016[i,]$Turnout * repProjection/100
  demVictoryMargin<-demVote - repVote
  demWin<-(demVictoryMargin > 0) * us2016[i,]$EVs
  theScore<-cbind(theScore, demWin)
  z<-sum(demVictoryMargin > 0)
  if (z > 0) {
    route270<-rbind(route270, c(as.character(us2016[i,]$State), as.integer(us2016[i,]$EVs), as.double(z/samples)))
  }
  cat(as.character(us2016[i,]$State),": Biden has ", z/(samples/100), "% chance of winning\n")
}
route270<-route270[-1,]
#drop the zero-filled placeholder column, keeping every simulation row
theScore<-theScore[,-1]
answers<-rowSums(theScore)
rr<-ecdf(answers)
plot(rr, main="Biden electoral votes", xlab="Electoral votes", ylab="Cumulative probability")
abline(v=270, col="red")
abline(h=0.5, col="blue")

Curses on ncurses



Every programmer will be familiar with something like this…

A little while back I wrote a program that simulates – crudely but effectively – a multicore NoC device. I use it to model the execution times of different page replacement algorithms.

The input is XML generated via a step by step trace of a working program. The actual instructions being traced do not matter – what I care about are the memory access patterns.

To allow me to test more models more quickly I have now written some R code that generates a semi-random access pattern based, very loosely indeed, on the patterns seen in the real program. The advantage is I can test against a set number of memory accesses but with a range of pseudo-random access patterns, so although I am not running models against the “real” access pattern, neither am I taking three weeks per experiment.

But when I used the artificially generated access patterns, my program crashed with a seg fault. Even more confusingly, when I ran the code in GDB, the GNU Debugger, it worked if I stepped through it, but if I just ran it in the debugger then it crashed just as it did without the debugger.

After a few hours I realised why – in my artificial patterns, the first thing the first thread does is spawn all the other threads to be used. In real world code, of course, these spawns take place after quite some code has been executed.

Every thread spawn causes the ncurses code I am using to update the screen. When using ‘real’ access patterns these updates take place comfortably after the ncurses environment has been set up (by a separate thread), but with the artificial patterns the thread updates are the first thing that gets posted to the screen, even before ncurses has been set up – hence the crash.

If I step through the code then the ncurses thread runs ahead and sets up the screen before I hit the thread update code and again it works.

The solution? Use a condition variable and a mutex to ensure that nothing executes before the ncurses environment is fully established.

Not a big deal – but perhaps, at some point in the future, someone will be struggling to understand why their code – which previously worked so well – has now stopped processing what seems to be well-formed input. Hope this helps!

My first R program


Having used Groovy (which makes the scripting environment feel familiar) and some Scheme (via Structure and Interpretation of Computer Programs), R does not feel completely alien, but it still feels like a steep learning curve.

But here’s my short script –

unpatched <- read.csv("~/unpatched.txt")
unpatchcons <- transform(unpatched, realm=realm*60 + reals)
attach(unpatchcons)
linelog<-lm(realm~size)
plot(size, realm, log="y")
abline(reg=linelog, untf=TRUE, col="blue",lty=3)
detach(unpatchcons)

And here’s the graph (of Linux kernel compile times) it generates – the blue line is obviously a very bad fit!

Linux kernel compile times
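One thing that might be worth trying, if the times grow roughly exponentially with size (an assumption on my part, the data may not support it), is to fit a straight line to the logarithm of the time rather than to the time itself. The sketch below uses synthetic data built to behave that way; only the column names size and realm are borrowed from the script above, and all the values are invented.

```r
# Log-linear fit on synthetic data (NOT the real compile times).
set.seed(7)
size  <- seq(10, 300, by = 10)
realm <- exp(1 + 0.02 * size + rnorm(length(size), sd = 0.05))
timings <- data.frame(size = size, realm = realm)

# A straight line in the raw times, as in the original script
linear_fit <- lm(realm ~ size, data = timings)

# A straight line in log(time), which is a straight line on a log axis
log_fit <- lm(log(realm) ~ size, data = timings)

# Plot on a log y axis and overlay the back-transformed log fit
plot(timings$size, timings$realm, log = "y", xlab = "size", ylab = "realm")
lines(timings$size, exp(fitted(log_fit)), col = "blue", lty = 3)

print(coef(log_fit))
```

Passing data = timings to lm() also removes the need for attach() and detach().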

I love R already, but I have not got a clue how to use it



Thanks to Professor Paul A. Rubin, I have discovered GNU R and already I love it and think it is just the tool I need to plot a few of the graphs (of turnaround times for tasks) in my MSc project.

But there is one problem – I don’t really know how to use it and the online notes do not seem to me to be written in a particularly helpful, tutorial-like, way. All I want to do is plot some data, and do not really want to do it interactively either.
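For what it is worth, a non-interactive plot seems to need very little: a minimal sketch, assuming the script is run with Rscript and the graph is written to a PDF. The turnaround figures and the file name are made up for illustration, and the file goes into R's temporary directory only so that the example can run anywhere.

```r
#!/usr/bin/env Rscript
# Plot some data to a file without opening an interactive session.

# Hypothetical turnaround times (seconds) for ten tasks
tasks <- data.frame(
  task       = 1:10,
  turnaround = c(12, 15, 11, 19, 23, 22, 30, 28, 35, 41)
)

# Open a PDF graphics device; everything plotted until dev.off() goes to this file
out_file <- file.path(tempdir(), "turnaround.pdf")
pdf(out_file, width = 7, height = 5)

plot(tasks$task, tasks$turnaround, type = "b",
     xlab = "Task", ylab = "Turnaround time (s)",
     main = "Task turnaround times")

# Close the device so the file is flushed to disk
invisible(dev.off())

cat("Plot written to", out_file, "\n")
```

Saved as, say, plot.R, this would be run from the shell with Rscript plot.R.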

Still, I guess it will come with time. I have ordered this – R in a Nutshell – but I am not sure that is going to do the job, either.

Try and fail, try again and fail better.