**I started doing this for fun and that’s still my motivation – so please do not take this seriously and even if I do slip into using the word “prediction”, above all – this is not a prediction.**

Anyway, my aim is to model the potential outcome of the first round of the ballot of the Labour leadership election using the concrete data that we actually have – namely Labour membership data (newly disclosed to the Daily Mirror) and nominations made by all-member meetings (as reported by the @CLPNominations Twitter account). The model is built using the R programming language and the code is available below.

So dealing with the assumptions made…

**On membership** – unlike before, I now have an up-to-date figure for membership, and it’s easy to look up the number of CLPs in each region/country and therefore get an average membership per CLP. What I now also do is distribute the modelled memberships as a Gaussian (normal) distribution around this average (in layperson’s terms, I assume there is a range of higher and lower memberships clustered around this average in a bell curve shape). Totally arbitrarily, I chose the standard deviation of this distribution (a measure of how broad the curve is) as 40% of the mean.
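For anyone who wants to see what that looks like, here is a minimal sketch for one region (London, using the membership and CLP figures from the code further down). The seed, and the clamping of negative draws to zero, are my own additions for illustration rather than part of the model as run.

```r
# Sketch: spread a region's membership across its CLPs as a bell curve
set.seed(1)                       # arbitrary seed, for reproducibility only
membership <- 115943              # London membership
clps <- 73                        # number of London CLPs
avMembership <- membership / clps # average members per CLP

# Standard deviation is 40% of the mean (the arbitrary choice described above)
clpSizes <- rnorm(clps, mean = avMembership, sd = 0.4 * avMembership)

# Assumption: a normal draw can be negative, so clamp at zero and round
clpSizes <- pmax(0, round(clpSizes))

print(round(avMembership))
print(summary(clpSizes))
```

With a standard deviation this wide, the odd CLP will come out very small indeed, which is exactly where a less supported candidate has the best chance of sneaking a win.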

(Never tire of recommending this brilliant book – Statistics Without Tears – for anyone with more than a passing interest in polling and sampling.)

Why does this matter? It’s easier, if support is randomly distributed, for relatively less supported candidates to win a nomination in a smaller CLP and vice versa.

**On nominations** – having robustly held up for most of the week the simple Zipf model I have been using for the nominations started to creak a bit last night – essentially Lisa Nandy, despite reports of some very good performances in terms of votes won, underperformed relative to the model to the benefit of Rebecca Long-Bailey (Keir Starmer kept his proportional share steady). Emily Thornberry had a bad night.

However, I am going to cheat a little bit and juggle the coefficient and the rankings – the coefficient falls to 1.1 (which might indicate that the *rate* at which Starmer’s lead is increasing is slowing but still means that *it is increasing*), Nandy drops to 4th place and Thornberry falls to 10th. And I hope that the weekend – when I expect many more nominations – makes it all a bit clearer. It’s a kludge but we aren’t taking this all **that** seriously, are we?

Then we can estimate that if every CLP made a nomination Starmer would take 360, Long-Bailey 172, Nandy 80 and Thornberry 29.
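As a rough illustration of the Zipf idea, the sketch below gives each candidate a weight of 1/rank^s and normalises the weights into shares. The function name `zipfShares`, the assumption that Starmer sits at rank 1 and Long-Bailey at rank 2, and the total of 645 CLPs (the sum of the CLP counts in the code further down) are mine; it lands close to, but not exactly on, the figures quoted above, so treat it as a sketch of the approach rather than the exact calculation.

```r
# Sketch of Zipf-style nomination shares: weight = 1 / rank^s
zipfShares <- function(ranks, s) {
  weights <- 1 / ranks^s
  weights / sum(weights)          # normalise so the shares sum to 1
}

# Assumed ranks: Starmer 1st, Long-Bailey 2nd, Nandy 4th, Thornberry 10th
ranks <- c(Starmer = 1, RLB = 2, Nandy = 4, Thornberry = 10)
shares <- zipfShares(ranks, s = 1.1)

totalCLPs <- 645                  # assumed total number of CLPs
print(round(shares, 3))
print(round(shares * totalCLPs))  # implied nominations per candidate
```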

What we want to do is get our model of support to match (reasonably closely) this outcome – but you may have noticed I’ve had to cheat again – because many (but by no means all) of the nomination meetings have been decided by preference balloting because one candidate has not polled at least 50% + 1 vote on the first ballot. Them’s the breaks I’m afraid – modelling the preference voting requires making political decisions which go well beyond this simple maths model and that’s not my purpose here. So I am just treating all of this as though it was a first-past-the-post process.

By trial and error I have found that setting the shares of support to the figures below gives a pretty good match:

- **Starmer** 26.75%
- **Long-Bailey** 25.65%
- **Nandy** 24.70%
- **Thornberry** 22.90%

Obviously these are very tightly grouped results – and that reflects another deep flaw in the model, I’m afraid. We make no allowances at all for clusters of regional support and so have to try to draw out the result from a fully random distribution. So, for instance, if we thought that Nandy or Long-Bailey could pick up a lot of nominations in their home region (the North West) then they might be able to hit our target with lower levels of support (the same applies to Starmer in London). But that is a level of sophistication beyond this model.
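If it seems odd that such tightly grouped shares can produce such lopsided nomination counts, the sketch below simulates a lot of identical CLPs on a first-past-the-post basis. The 500 voters per CLP (roughly the average membership times a 60% turnout) and the 10,000 repetitions are hypothetical values I have picked for illustration, and `rmultinom` is a quicker stand-in for the voter-by-voter loop in the full code.

```r
# Sketch: how often does each candidate top the vote in a single CLP?
set.seed(42)                      # arbitrary seed, for reproducibility only
shares <- c(Starmer = 0.2675, RLB = 0.2565, Nandy = 0.2470, Thornberry = 0.2290)
voters <- 500                     # hypothetical number of members voting
nCLPs <- 10000                    # hypothetical number of simulated CLPs

# Each column of 'votes' holds one CLP's vote count per candidate
votes <- rmultinom(nCLPs, size = voters, prob = shares)

# which.max picks the first maximum, so ties go to the earlier-listed candidate
winners <- names(shares)[apply(votes, 2, which.max)]
winRate <- table(factor(winners, levels = names(shares))) / nCLPs
print(round(winRate, 3))
```

A small edge in support is amplified once only the winner of each meeting counts, which is why the shares need to sit so close together to match the nominations.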

The figures above typically generate a result like this:

```
[1] "In London, Starmer won 43, RLB won 21, Nandy won 9, Thornberry won 0"
[1] "In Scotland, Starmer won 36, RLB won 21, Nandy won 12, Thornberry won 4"
[1] "In WMids, Starmer won 29, RLB won 19, Nandy won 5, Thornberry won 6"
[1] "In EMids, Starmer won 26, RLB won 11, Nandy won 7, Thornberry won 2"
[1] "In Yorkshire, Starmer won 28, RLB won 19, Nandy won 5, Thornberry won 2"
[1] "In North, Starmer won 21, RLB won 3, Nandy won 5, Thornberry won 0"
[1] "In NWest, Starmer won 43, RLB won 20, Nandy won 11, Thornberry won 0"
[1] "In SEast, Starmer won 51, RLB won 21, Nandy won 9, Thornberry won 3"
[1] "In Eastern, Starmer won 36, RLB won 12, Nandy won 10, Thornberry won 0"
[1] "In SWest, Starmer won 35, RLB won 14, Nandy won 4, Thornberry won 2"
[1] "In Wales, Starmer won 16, RLB won 18, Nandy won 4, Thornberry won 2"
[1] "Starmer won 364, RLB won 179, Nandy won 81, Thornberry won 21"
```

But we can go further now and look at the range of outcomes grouped around these shares – in other words use some “Monte Carlo methods” to estimate what the probabilities of certain outcomes are.

To do this here we use the ‘predicted’ shares above as the mean of a normal distribution (with a standard deviation of one percentage point in each case). In simple terms that means that while our central case is that Starmer has the support of 26.75% of members, we might expect that in roughly one case in six he has support of less than 25.75% and in one case in six he has more than 27.75% – and similar stipulations apply to the other candidates. We then run the simulation 1000 times and look at the distribution of outcomes.
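The “one case in six” figure is just the tail of a normal distribution beyond one standard deviation. A quick sketch to check it, using Starmer’s central share (the seed and variable names are mine):

```r
# Sketch: check the "one in six" claim for a normal distribution
centralShare <- 0.2675            # Starmer's central case
sdShare <- 0.01                   # one percentage point

# Theoretical probability of landing more than one sd below the mean
pBelow <- pnorm(centralShare - sdShare, mean = centralShare, sd = sdShare)
print(round(pBelow, 4))           # about 0.1587, i.e. roughly one in six

# The same thing empirically, from 1000 simulated shares
set.seed(7)                       # arbitrary seed, for reproducibility only
draws <- rnorm(1000, mean = centralShare, sd = sdShare)
print(mean(draws < centralShare - sdShare))  # fraction of low draws
print(mean(draws > centralShare + sdShare))  # fraction of high draws
```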

In fact the (lazy, but it’s only for fun) way I have done this means that the variation is likely to be bigger for Long-Bailey and Nandy than for Starmer. The code jiggles the cumulative boundaries between candidates rather than the shares themselves, so Starmer’s support can go up and down but only moves at one end, whereas if Starmer’s boundary falls and Long-Bailey’s rises she can get up to double the benefit. This is an artefact of the way I have coded this up but I will keep it because (a) I don’t want to wait another hour to finish this by re-running the code (R is great but nobody has ever suggested it is fast) and (b) Starmer is the favourite so he should feel a bit more pressure! The difference can be seen in the shape of the density curves for the candidates in the featured image for this page – Long-Bailey’s and Nandy’s curves are broader and shorter, showing their results taking a broader range of values even if Starmer’s mean is well ahead.
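To put a rough number on that artefact, the sketch below draws the three cumulative boundaries independently (as the full code does) and measures the spread of each candidate’s implied share. The boundary names `t1` to `t3` and the number of draws are mine. A share squeezed between two moving boundaries should show a spread of about the square root of two times 0.01.

```r
# Sketch: spread of each candidate's share when boundaries move independently
set.seed(123)                     # arbitrary seed, for reproducibility only
n <- 100000                       # many draws, just to compare the spreads

t1 <- rnorm(n, 0.2675, 0.01)      # Starmer / Long-Bailey boundary
t2 <- rnorm(n, 0.524, 0.01)       # Long-Bailey / Nandy boundary
t3 <- rnorm(n, 0.771, 0.01)       # Nandy / Thornberry boundary

spread <- c(Starmer = sd(t1),          # one moving boundary
            RLB = sd(t2 - t1),         # two moving boundaries
            Nandy = sd(t3 - t2),       # two moving boundaries
            Thornberry = sd(1 - t3))   # one moving boundary
print(round(spread, 4))
```

Thornberry, like Starmer, only has one moving boundary, so her share is as steady as his.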

I have assumed a 60% turnout and so we get Starmer’s maximum vote as 99,319 and his minimum as 76,685. For Long-Bailey the figures are 97,937 and 69,396, and Nandy 97,011 and 66,246. The notional results for Thornberry are 86,034 and 63,672.
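For a sense of where the middle of those ranges sits, here is the back-of-envelope arithmetic: total membership times turnout times the central share. It is only a sketch of the central case and ignores all the random variation in the simulation.

```r
# Sketch: central-case vote totals at 60% turnout
Membership <- c(115943, 20123, 39296, 34001, 50562, 27971, 73250, 66183,
                40943, 46530, 26894)   # regional membership figures
turnout <- 0.6
shares <- c(Starmer = 0.2675, RLB = 0.2565, Nandy = 0.2470, Thornberry = 0.2290)

# Total membership x turnout x share of support
expectedVotes <- round(sum(Membership) * turnout * shares)
print(sum(Membership))
print(expectedVotes)
```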

**Another problem**…barring a big change in circumstances Emily Thornberry won’t be on the final ballot – she’s not getting anything like enough nominations – and so her supporters will have the choice of voting for someone else or just not bothering. Again this is a political question and not one for here.

This “Thornberry problem” makes what follows pretty worthless, unfortunately – certainly if it really is the case that there are 60,000 – 70,000 would-be Thornberry voters out there who will be forced to do something they would prefer not to… but here is the non-predictive prediction: **Keir Starmer** has a 61.9% chance of topping the poll, **Rebecca Long-Bailey** has a 28.2% chance and **Lisa Nandy** has a 9.9% chance of winning the first ballot. Emily Thornberry does not top the poll in any of the 1000 simulations run.

Code is below – have fun with it.

```r
#!/usr/bin/env Rscript

# Regions with their total membership and number of CLPs
region <- c('London', 'Scotland', 'WMids', 'EMids', 'Yorkshire', 'North',
            'NWest', 'SEast', 'Eastern', 'SWest', 'Wales')
Membership <- c(115943, 20123, 39296, 34001, 50562, 27971, 73250, 66183,
                40943, 46530, 26894)
CLPs <- c(73, 73, 59, 46, 54, 29, 74, 84, 58, 55, 40)

# Cumulative shares: Starmer, + Long-Bailey, + Nandy, + Thornberry
sharesX <- c(0.2675, 0.524, 0.771, 1.0)

# One row of national vote totals per simulation
nSims <- 1000
results.data <- data.frame(Starmer = numeric(nSims), RLB = numeric(nSims),
                           Nandy = numeric(nSims), Thornberry = numeric(nSims))

for (x in 1:nSims) {
  # Jiggle each cumulative boundary independently (sd of 0.01)
  starmerShare <- rnorm(1, sharesX[1], 0.01)
  rlbShare <- rnorm(1, sharesX[2], 0.01)
  nandyShare <- rnorm(1, sharesX[3], 0.01)
  shares <- c(starmerShare, rlbShare, nandyShare, 1.0)

  # National tallies: nominations won (W) and votes cast
  starmerW <- rlbW <- nandyW <- etW <- 0
  votesStarmer <- votesRLB <- votesNandy <- votesET <- 0

  for (reg in seq_along(region)) {
    # Nominations won in this region
    starmerC <- rlbC <- nandyC <- etC <- 0

    # CLP sizes: normal around the regional average, sd = 40% of the mean
    avMembership <- Membership[reg] / CLPs[reg]
    distMembership <- rnorm(CLPs[reg], avMembership, avMembership / 2.5)

    for (p in 1:CLPs[reg]) {
      # Votes cast in this CLP
      starmerV <- rlbV <- nandyV <- etV <- 0

      # seq_len() guards against zero or negative sizes, which 1:n would
      # wrongly turn into a count-down loop
      for (v in seq_len(max(0, floor(distMembership[p])))) {
        # 60% turnout: skip members who do not vote
        turnout <- runif(1)
        if (turnout > 0.6) {
          next
        }
        # Assign the vote according to the cumulative boundaries
        ans <- runif(1)
        if (ans <= shares[1]) {
          starmerV <- starmerV + 1
          next
        }
        if (ans <= shares[2]) {
          rlbV <- rlbV + 1
          next
        }
        if (ans <= shares[3]) {
          nandyV <- nandyV + 1
          next
        }
        etV <- etV + 1
      }

      votesStarmer <- votesStarmer + starmerV
      votesRLB <- votesRLB + rlbV
      votesNandy <- votesNandy + nandyV
      votesET <- votesET + etV

      # First past the post: ties go to the earlier-listed candidate
      if (max(starmerV, rlbV, nandyV, etV) == starmerV) {
        starmerC <- starmerC + 1
        starmerW <- starmerW + 1
        next
      }
      if (max(rlbV, nandyV, etV) == rlbV) {
        rlbC <- rlbC + 1
        rlbW <- rlbW + 1
        next
      }
      if (max(nandyV, etV) == nandyV) {
        nandyC <- nandyC + 1
        nandyW <- nandyW + 1
        next
      }
      etC <- etC + 1
      etW <- etW + 1
    }

    regionalResult <- sprintf(
      "In %s, Starmer won %i, RLB won %i, Nandy won %i, Thornberry won %i",
      region[reg], starmerC, rlbC, nandyC, etC)
    print(regionalResult)
  }

  result <- sprintf(
    "Starmer won %i, RLB won %i, Nandy won %i, Thornberry won %i",
    starmerW, rlbW, nandyW, etW)
  print(result)
  votesOutcomes <- sprintf("Starmer: %i RLB: %i Nandy: %i Thornberry: %i",
                           votesStarmer, votesRLB, votesNandy, votesET)
  print(votesOutcomes)

  # Store this simulation's national vote totals
  results.data[x, ] <- c(votesStarmer, votesRLB, votesNandy, votesET)
}
```
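The code above stops at collecting the vote totals, so here is a sketch of one way to turn a table like `results.data` into “chance of topping the poll” figures. The five rows in `toyResults` are made-up numbers purely to show the mechanics, not output from the model, and the use of `max.col` is my choice for illustration.

```r
# Sketch: share of simulations in which each candidate tops the poll
# toyResults is a hypothetical stand-in for results.data (values invented)
toyResults <- data.frame(
  Starmer    = c(90, 80, 85, 70, 88),
  RLB        = c(85, 86, 80, 75, 80),
  Nandy      = c(80, 82, 79, 78, 81),
  Thornberry = c(70, 75, 72, 71, 74)
)

# For each row (simulation), find the column (candidate) with the most votes
winnerIndex <- max.col(as.matrix(toyResults), ties.method = "first")
winners <- names(toyResults)[winnerIndex]

# Proportion of simulations won, keeping candidates who never win
topProb <- table(factor(winners, levels = names(toyResults))) / nrow(toyResults)
print(round(100 * topProb, 1))
```

Swapping `toyResults` for the real `results.data` after a full run should give percentages of the kind quoted above.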