Modelling the US Presidential Election in R


In many ways this is a lot easier than modelling the British Labour Party leadership election – we only care about two candidates, Biden and Trump, and they capture the vast majority of the votes, so I feel reasonably confident about my first simplification: ignoring the votes of all others and simply focusing on the Democrats and the Republicans.

The next simplifying assumption, though, is the big one and one for which the case is disputed: that a US-wide poll can allow us to predict state-by-state results. What I don’t do is simply plug a nationwide opinion poll into a uniform swing model, though.
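
For reference, a plain uniform swing calculation of the kind described above might look like the sketch below. The poll shares (47% and 41%) and the 2016 two-party baseline (51.1%) are the figures used later in the post; the three states and their 2016 shares are hypothetical, invented purely for illustration.

```r
# Poll and baseline figures as used later in the post.
dem_poll <- 0.47   # Biden share in the poll
rep_poll <- 0.41   # Trump share in the poll
dem16    <- 0.511  # Democratic two-party share in 2016

# Rescale the poll to a two-party share, ignoring all other candidates.
two_party_dem <- dem_poll / (dem_poll + rep_poll)

# National swing to the Democrats since 2016 (about 2.3 points).
swing <- two_party_dem - dem16

# Hypothetical 2016 two-party Democratic shares for three made-up states.
state_dem16 <- c(StateA = 0.47, StateB = 0.495, StateC = 0.53)

# Uniform swing: add the same national swing to every state.
state_projection <- state_dem16 + swing
dem_wins <- state_projection > 0.5

print(round(swing, 4))
print(dem_wins)
```

Every state moves by exactly the same amount, so the outcome is a single yes/no per state with no sense of how likely it is.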

Rather than just use such a simple arithmetic system I instead simulate the election 3,000,000 times – and each time randomly (stochastically) vary the inputs:

  • First of all I vary the Biden/Democrat score to simulate sampling error in the poll – assuming samples are normally distributed around the reported number and that 95% of all samples will give a result within +/- 3 percentage points of the reported number. (As we are ignoring all third party effects I simply apply the opposite of this variation to the Republican score.)
  • Secondly, state by state, I repeat this process but with a bigger range – instead of a standard deviation of about 1.5% for the poll, I use 2% for the state.
  • Thirdly – and this is entirely subjective – I make an estimate of the long-term trend in a given state and use this as a way of adding further randomness: e.g. if I assume that there has been a ‘core’ 0.5% shift to the Democrats then an additional random factor is added, drawn from a normal distribution with 0.005 as both the mean and the standard deviation (in theory, of course, the draw could be negative, though that is unlikely).
  • Then, to calculate the likely outcome, I look at the probabilities of the different totals of Electoral Votes across all three million simulations.
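
The three random terms in the list above can be sketched for a single state as follows. This is a minimal illustration rather than the full script: the seed, the reduced number of simulations and the variable names are assumptions, while the standard deviations come from the description above.

```r
set.seed(1)          # arbitrary seed, for reproducibility only
n_sims <- 100000     # far fewer than the post's 3,000,000

# A 95% range of +/- 3 points corresponds to 1.96 standard deviations.
poll_sd <- 0.03 / 1.96   # about 0.0153

# Step 1: national sampling error in the poll.
national_shift <- rnorm(n_sims, mean = 0, sd = poll_sd)

# Step 2: state-level noise with the wider 2-point standard deviation.
state_shift <- rnorm(n_sims, mean = 0, sd = 0.02)

# Step 3: trend term for a state with an assumed 0.5-point Democratic drift
# (mean and standard deviation both 0.005).
trend_shift <- rnorm(n_sims, mean = 0.005, sd = 0.005)

# Total movement applied to the Democratic share in this state.
total_shift <- national_shift + state_shift + trend_shift

# Independent normal terms add in variance, so the expected spread is:
combined_sd <- sqrt(poll_sd^2 + 0.02^2 + 0.005^2)

cat("Simulated sd:", sd(total_shift), " Expected sd:", combined_sd, "\n")
```

The combined standard deviation of roughly 2.6 points shows how much wider the state-level uncertainty is than the poll's own sampling error.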

The states I think are swinging to the Democrats (all 0.5% swings unless stated otherwise) are:

Alabama, Alaska, Arizona, California, Colorado (0.2%), Delaware, DC (0.01%), Florida, Georgia, Hawaii (0.2%), Idaho, Illinois, Kansas, Louisiana, District 1 Maine (0.1%), Maryland, Massachusetts, Mississippi, Montana (0.2%), Nebraska (0.2%), District 2 Nebraska, Nevada (0.1%), New Mexico (0.2%), New York (0.1%), North Carolina, Oklahoma (0.1%), Oregon, South Carolina (0.1%), Texas, Utah, Vermont, Virginia.

Swinging to the Republicans (again 0.5% unless stated otherwise):

Arkansas (0.1%), Connecticut (0.2%), Indiana, Iowa, Kentucky (0.2%), Maine, District 2 Maine, Michigan, Minnesota, Missouri, District 1 Nebraska (0.1%), District 3 Nebraska (0.1%), New Hampshire (0.2%), New Jersey (0.1%), North Dakota, Ohio, Pennsylvania, Rhode Island, South Dakota, Tennessee (0.2%), West Virginia, Wisconsin, Wyoming.

(Obviously a lot of these swings make no real difference because the state is locked up for one side or the other anyway.)

So these are the sort of results generated (based on a YouGov America poll published on 30 August which had Biden at 47% and Trump at 41%) – as you can see there is a massive amount of spurious precision:

Alabama : Biden has 0.0005333333 % chance of winning
Alaska : Biden has 2.1635 % chance of winning
Arizona : Biden has 62.34907 % chance of winning
Arkansas : Biden has 0.0003333333 % chance of winning
California : Biden has 100 % chance of winning
Colorado : Biden has 98.54147 % chance of winning
Connecticut : Biden has 99.985 % chance of winning
Delaware : Biden has 99.96987 % chance of winning
DC : Biden has 100 % chance of winning
Florida : Biden has 75.8858 % chance of winning
Georgia : Biden has 47.0369 % chance of winning
Hawaii : Biden has 100 % chance of winning
Idaho : Biden has 0 % chance of winning
Illinois : Biden has 99.9998 % chance of winning
Indiana : Biden has 0.07203333 % chance of winning
Iowa : Biden has 5.133933 % chance of winning
Kansas : Biden has 0.07423333 % chance of winning
Kentucky : Biden has 6.666667e-05 % chance of winning
Louisiana : Biden has 0.2554 % chance of winning
Maine : Biden has 93.11677 % chance of winning
Maine1 : Biden has 99.99787 % chance of winning
Maine2 : Biden has 10.71063 % chance of winning
Maryland : Biden has 100 % chance of winning
Massachusetts : Biden has 100 % chance of winning
Michigan : Biden has 76.00387 % chance of winning
Minnesota : Biden has 81.57777 % chance of winning
Mississippi : Biden has 0.8029667 % chance of winning
Missouri : Biden has 0.0724 % chance of winning
Montana : Biden has 0.155 % chance of winning
Nebraska : Biden has 0 % chance of winning
Nebraska1 : Biden has 0.0053 % chance of winning
Nebraska2 : Biden has 75.95797 % chance of winning
Nebraska3 : Biden has 0 % chance of winning
Nevada : Biden has 91.20317 % chance of winning
New_Hampshire : Biden has 79.82397 % chance of winning
New_Jersey : Biden has 99.98613 % chance of winning
New_Mexico : Biden has 99.8471 % chance of winning
New_York : Biden has 99.99997 % chance of winning
North_Carolina : Biden has 62.3821 % chance of winning
North_Dakota : Biden has 0 % chance of winning
Ohio : Biden has 19.68823 % chance of winning
Oklahoma : Biden has 0 % chance of winning
Oregon : Biden has 99.97153 % chance of winning
Pennsylvania : Biden has 62.3612 % chance of winning
Rhode_Island : Biden has 99.99227 % chance of winning
South_Carolina : Biden has 3.431833 % chance of winning
South_Dakota : Biden has 0 % chance of winning
Tennessee : Biden has 3.333333e-05 % chance of winning
Texas : Biden has 19.67437 % chance of winning
Utah : Biden has 0.0032 % chance of winning
Vermont : Biden has 100 % chance of winning
Virginia : Biden has 98.81823 % chance of winning
Washington : Biden has 99.99973 % chance of winning
West_Virginia : Biden has 0 % chance of winning
Wisconsin : Biden has 62.35563 % chance of winning
Wyoming : Biden has 0 % chance of winning

In terms of Electoral Votes we have this:

Here the red vertical line is the winning post of 270 EVs and this shows Biden has a slightly greater than 80% chance of getting there.

What does that mean? Well, you can think of it like the weather forecast (indeed the methodologies are similar) – if the Met Office said there was an 80% chance of it raining, would you wear a rain coat?

The blue line marks the 50% point of the cumulative distribution, so where it crosses the curve gives the median number of EVs Biden wins across the simulations – in this case just under 310.

Here’s the (slightly scrappy) code – unlike the Labour Party code I used earlier in the year I am making proper use of R’s vectorisation capabilities, so although I am running a lot of simulations it only takes a few seconds. A GitHub repo (with the baseline data) will follow in due course.

(If you want to know more about statistics I cannot recommend this book too highly.)




#!/usr/bin/env Rscript

#library("ggplot2")

#args<-commandArgs("trailingOnly=TRUE")

samples<-3000000
dem<-0.47
rep<-0.41
route270<-data.frame(State=as.character(0), EVs=as.integer(0), Chance=as.double(0), stringsAsFactors=FALSE)
theScore<-data.frame(EVs=as.integer(0))

dem16<-0.511
rep16<-0.489

total2p<-(dem+rep)
corDem<-(dem/total2p)
corRep<-(rep/total2p)

swing<-corDem - dem16

us2016<-read.csv(file='US.csv', stringsAsFactors = FALSE)

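# national sampling error: zero-mean normal draws with a 95% range of +/- 3 points
demDiff<-dem-rnorm(samples, dem, 0.03/1.96)
# with third parties ignored, the Republican error is the mirror image
repDiff<- -demDiff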


#state odds
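for (i in 1:nrow(us2016))
{
  # state-level noise: wider than the national poll error (sd of 2 points)
  stateDiff<-rnorm(samples, 0, 0.02)
  # generate additional factor: subjective long-term trend for this state,
  # used as both the mean and (in absolute value) the standard deviation
  trendFactor<-us2016[i, ]$Trend
  localDemDiff<-rnorm(samples, trendFactor/100, abs(trendFactor)/100)
  localRepDiff<- -localDemDiff
  # projected shares in percentage points: 2016 result plus national swing and the random terms
  demProjection<-us2016[i,]$D16 + swing * 100 + demDiff * 100 + localDemDiff * 100 + stateDiff * 100
  repProjection<-us2016[i,]$R16 - swing * 100 + repDiff * 100 + localRepDiff * 100 - stateDiff * 100
  # convert shares into votes using the state's turnout figure
  demVote<-us2016[i,]$Turnout * demProjection/100
  repVote<-us2016[i,]$Turnout * repProjection/100
  demVictoryMargin<-demVote - repVote
  # electoral votes won by Biden in each simulation (0 when he loses the state)
  demWin<-(demVictoryMargin > 0) * us2016[i,]$EVs
  theScore<-cbind(theScore, demWin)
  # number of simulations in which Biden carries the state
  z<-sum(demVictoryMargin > 0)
  if (z > 0) {
    route270<-rbind(route270, c(as.character(us2016[i,]$State), as.integer(us2016[i,]$EVs), as.double(z/samples)))
  }
  cat(as.character(us2016[i,]$State),": Biden has ", z/(samples/100), "% chance of winning\n")
}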
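# drop the placeholder first row of route270
route270<-route270[-1,]
# drop the placeholder first column of theScore (removing a row here would discard a simulation)
theScore<-theScore[,-1]
# total Biden electoral votes in each simulation
answers<-rowSums(theScore)
rr<-ecdf(answers)
plot(rr, main="Biden electoral votes", xlab="Electoral votes", ylab="Cumulative probability")
# red: the 270-vote winning post; blue: the 50% cumulative probability level
abline(v=270, col="red")
abline(h=0.5, col="blue")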
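
The two figures read off the chart (the chance of reaching 270 and the typical number of EVs) can also be computed directly. The sketch below uses a synthetic stand-in for the `answers` vector, with a made-up mean and spread, so the printed numbers are illustrative only and are not results from the model.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical stand-in for `answers`: simulated Democratic electoral-vote
# totals, one per run. The mean and sd here are invented for illustration.
answers <- round(rnorm(10000, mean = 310, sd = 45))

# Share of simulations in which the Democrats reach the winning post.
win_prob <- mean(answers >= 270)

# The same figure from the empirical CDF: 1 - P(total <= 269).
rr <- ecdf(answers)
win_prob_ecdf <- 1 - rr(269)

# Where the h = 0.5 line crosses the ECDF is the median; the mean is the
# expectation in the strict sense.
median_ev <- median(answers)
mean_ev   <- mean(answers)

cat("P(270 or more):", win_prob, "\n")
cat("Median EVs:", median_ev, " Mean EVs:", mean_ev, "\n")
```

For a roughly symmetric distribution the median and mean are close, but for a skewed or lumpy one they can differ, which is why it helps to report which of the two a chart line represents.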

Ten Great Ideas About Chance: A Review


Unfortunately Ten Great Ideas About Chance is a disappointment.

The central idea of the book is to look at ten key mathematical-philosophical ideas in probability and, using the history of the idea, explain what they are about and why they matter.

It’s not that the book doesn’t have some very interesting material, but it fails to hit its target over and over again and, unfortunately, even contains some obvious and – presumably – not so obvious errors.

This review states it so much better than I can, so here is an extract:

The chapters are invariably a mix of
1. a trivial example that does not penetrate enough the intended topic because it contains too much of the familiar and too little of the topic that’s being introduced
2. references to original texts that are impenetrable nineteenth century translations into English from eighteenth century originals written in French or German or Latin
3. statements of complex results that would take fifty pages to arrive at if the proofs were shown
4. cheerleading

So what I re-lived by reading this book is my Freshman Year nightmare math class where three times a week I’d follow the first five minutes of the lecture only to subsequently find myself furiously copying from the board so I can read my lecture notes later at home and try to make sense of them.

Learnt this week… 24 January


My friend and former colleague Adam Higgitt every Friday posts a list of “five things I have learned this week”. It’s popular and good fun – especially as Adam is not afraid of an argument if you challenge some of his claims.

For a while I tried to do the same thing myself, but failed miserably.

I am not going to try again, but I am proposing to try something different, if inspired by Adam.

So here is the first list of things “learnt this week”: scientific or mathematical facts and amusements. I will aim for five, but this week just did not make it.

1. A random walk can be used to build a binomial distribution – but not a very good one!

Imagine a left-right ruled line centred on zero and a marker that can, in every time step, move either left or right by one step, where the probability of moving left, p_l, and of moving right, p_r, are the same: i.e., p_l = p_r = 0.5. At the “beginning of time” the marker stands at 0.

Then if we count the times the marker is at any given position they will be distributed binomially (well, as we approach an infinite time). (Strictly, the binomial result applies to where the marker ends up after a fixed number of steps, across many independent walks, rather than to the visit counts of one long walk.) The BASIC code below (which I wrote using BINSIC) should give you an idea (this code runs the risk of an overflow, of course, and the most interesting thing about it is how unlike a binomial distribution the results can be).


10 DIM A(1001)
12 FOR I = 1 TO 1001
14 LET A(I) = 0
16 NEXT I
20 LET POS = 500
30 FOR I = 1 TO 50000
40 LET X = RND * 2
50 IF X > 1 THEN LET POS = POS + 1 ELSE LET POS = POS - 1
60 LET A(POS) = A(POS) + 1
70 NEXT I
80 PRINT "*****BINOMIAL DISTRIBUTION*****"
90 FOR I = 1 TO 1001
95 LET X = I - 500
110 PRINT X," ",A(I)
120 NEXT I
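
For comparison with the BASIC listing above, here is a minimal R sketch of the textbook version of the result: run many independent walks of a fixed length and look at where each one ends. The number of walks, the number of steps and the seed are arbitrary choices for illustration.

```r
set.seed(7)        # arbitrary seed, for reproducibility only
n_walks <- 20000   # independent walks (assumed value)
n_steps <- 100     # steps per walk (assumed value)

# Each row is one walk; each entry is a step of -1 (left) or +1 (right),
# chosen with equal probability.
steps <- matrix(sample(c(-1L, 1L), n_walks * n_steps, replace = TRUE),
                nrow = n_walks)

# Final position of each walk.
end_pos <- rowSums(steps)

# A walk that ends at position x made (x + n_steps) / 2 steps to the right.
rights <- (end_pos + n_steps) / 2

# Observed frequency of each possible number of right-steps (0..n_steps).
observed <- tabulate(rights + 1, nbins = n_steps + 1) / n_walks

# Binomial probabilities for the same counts.
expected <- dbinom(0:n_steps, size = n_steps, prob = 0.5)

max_gap <- max(abs(observed - expected))
cat("Largest gap between observed and binomial frequencies:", max_gap, "\n")
```

The end positions match the binomial probabilities closely, whereas the visit counts from a single long walk are dominated by wherever that one walk happened to wander.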

Here’s a chart of the values generated by similar code (actually run for about 70,000 times):

Not much like a binomial distribution

2. Things that are isomorphic have a one-to-one relationship

Up to this point I just had an informal “things that look different but are related through a reversible transformation” idea in my head. But that’s not fully correct.

A simple example might be the logarithms. Every positive real number has a unique real logarithm, and every real number is the logarithm of exactly one positive real number.


Is the time pages are in the working set stochastic?


Reading about the Monte Carlo method has set me thinking about this and how, if at all, it might be applied to page reclaim in the Linux kernel.

In my MSc report my results show that working set size is not normally distributed – despite occasional claims to the contrary in computer science text books. But it is possible that a series of normal distributions is overlaid – see the graphic below:

Working set size for MySQL daemon

The first question is: how do I design an experiment to verify that these are, indeed, a series of normal distributions?
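
One possible starting point, offered only as a sketch and not as the experiment itself: on synthetic data, a normality test such as Shapiro-Wilk rejects an overlay of two normal components even though each component was drawn from a normal distribution. The component means, spreads and sample sizes below are made up, and the sketch assumes the components can be separated before testing (for example by program phase), which real working-set data may not allow.

```r
set.seed(3)  # arbitrary seed, for reproducibility only

# Synthetic stand-in for working-set sizes: two normal components overlaid.
# All parameters here are invented for illustration.
component_a <- rnorm(1000, mean = 2000, sd = 100)
component_b <- rnorm(1000, mean = 3000, sd = 100)
overlay <- c(component_a, component_b)

# Shapiro-Wilk tests the null hypothesis that a sample is normal
# (base R's shapiro.test accepts between 3 and 5000 observations).
p_overlay <- shapiro.test(overlay)$p.value
p_a <- shapiro.test(component_a)$p.value
p_b <- shapiro.test(component_b)$p.value

cat("Overlay p-value:    ", p_overlay, "\n")
cat("Component p-values: ", p_a, p_b, "\n")
```

A tiny p-value for the overlay alongside unremarkable p-values for the separated components would be consistent with the "series of normal distributions" idea, though it would not prove it.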

(I may find out how I have done in the degree in the next week or so – wish me luck)