Another go at modelling the Labour leadership election


I started doing this for fun and that’s still my motivation – so please do not take this seriously and even if I do slip into using the word “prediction”, above all – this is not a prediction.

Anyway my aim is to model the potential outcome of the first round of the ballot of the Labour leadership election using the concrete data that we actually have – namely Labour membership data (newly disclosed to the Daily Mirror) and nominations made by all-member meetings (as reported by the @CLPNominations twitter account). The model is built using the R programming language and the code is available below.

So dealing with the assumptions made…

On membership – unlike before I now have an up-to-date figure for membership, and it’s easy to look up the number of CLPs in each region/country and therefore get an average membership. But what I now also do is distribute the modelled memberships as a Gaussian (normal) distribution around this average (in layperson’s terms, I assume there is a range of higher and lower memberships clustered around the average in a bell curve shape). Totally arbitrarily, I chose the standard deviation of this distribution (a measure of how broad the curve is) to be 40% of the mean.
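
A minimal sketch of that assumption, using the London figures that appear in the code below (73 CLPs, 115,943 members):

avMembership <- 115943 / 73   # average CLP membership in London
clpMemberships <- rnorm(73, mean = avMembership, sd = 0.4 * avMembership)
summary(round(clpMemberships))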

(Never tire of recommending this brilliant book – Statistics Without Tears – for anyone with more than a passing interest in polling and sampling.)

Why does this matter? It’s easier, if support is randomly distributed, for relatively less supported candidates to win a nomination in a smaller CLP and vice versa.

On nominations – having held up robustly for most of the week, the simple Zipf model I have been using for the nominations started to creak a bit last night – essentially Lisa Nandy, despite reports of some very good performances in terms of votes won, underperformed relative to the model, to the benefit of Rebecca Long-Bailey (Keir Starmer kept his proportional share steady). Emily Thornberry had a bad night.

However I am going to cheat a little bit and juggle the coefficient and the rankings – the coefficient falls to 1.1 (which might indicate that the rate at which Starmer’s lead is increasing is slowing, but still means that it is increasing) and Nandy drops to 4th place (Thornberry falls to 10th). And I will hope that the weekend – when I expect many more nominations – makes it all a bit clearer. It’s a kludge, but we aren’t taking this all that seriously, are we?

Then we can estimate that if every CLP made a nomination Starmer would take 360, Long-Bailey 172, Nandy 80 and Thornberry 29.

What we want to do is get our model of support to match (reasonably closely) this outcome – but you may have noticed I’ve had to cheat again – because many (but by no means all) of the nomination meetings have been decided by preference balloting, because no candidate polled at least 50% + 1 of the votes on the first ballot. Them’s the breaks I’m afraid – modelling the preference voting requires making political decisions which go well beyond this simple maths model and that’s not my purpose here. So I am just treating all of this as though it was a first-past-the-post process.

By trial and error I have found that setting the shares of support to the figures below gives a pretty good match:

Starmer 26.75%
Long-Bailey 25.65%
Nandy 24.70%
Thornberry 22.90%
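
These are the same figures that appear in the code below as the cumulative thresholds in sharesX:

cumsum(c(0.2675, 0.2565, 0.2470, 0.2290))
# 0.2675 0.5240 0.7710 1.0000 - i.e. the sharesX vector used in the code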

Obviously these are very tightly grouped results – and that reflects another deep flaw in the model I’m afraid – we make no allowance at all for clusters of regional support and so have to try to draw out the result from a fully random distribution. So, for instance, if we thought that Nandy or Long-Bailey could pick up a lot of nominations in their home region (the North West) then they might be able to hit our target with lower levels of support (the same applies to Starmer in London). But that is a level of sophistication beyond this model.

The figures above typically generate a result like this:

[1] "In London, Starmer won 43, RLB won 21, Nandy won 9, Thornberry won 0"

[1] "In Scotland, Starmer won 36, RLB won 21, Nandy won 12, Thornberry won 4"

[1] "In WMids, Starmer won 29, RLB won 19, Nandy won 5, Thornberry won 6"

[1] "In EMids, Starmer won 26, RLB won 11, Nandy won 7, Thornberry won 2"

[1] "In Yorkshire, Starmer won 28, RLB won 19, Nandy won 5, Thornberry won 2"

[1] "In North, Starmer won 21, RLB won 3, Nandy won 5, Thornberry won 0"

[1] "In NWest, Starmer won 43, RLB won 20, Nandy won 11, Thornberry won 0"

[1] "In SEast, Starmer won 51, RLB won 21, Nandy won 9, Thornberry won 3"

[1] "In Eastern, Starmer won 36, RLB won 12, Nandy won 10, Thornberry won 0"

[1] "In SWest, Starmer won 35, RLB won 14, Nandy won 4, Thornberry won 2"

[1] "In Wales, Starmer won 16, RLB won 18, Nandy won 4, Thornberry won 2"

[1] "Starmer won 364, RLB won 179, Nandy won 81, Thornberry won 21"

But we can go further now and look at the range of outcomes grouped around these shares – in other words use some “Monte Carlo methods” to estimate what the probabilities of certain outcomes are.

To do this here we use the ‘predicted’ shares above as the mean of a normal distribution (with a standard deviation of 1% in each case). In simple terms that means that while our central case is that Starmer has the support of 26.75% of members, we might expect that in roughly one case in six he has support of less than 25.75% and in one case in six he has more than 27.75% – and similar stipulations apply to the other candidates. We then run the simulation 1000 times and look at the distribution of outcomes.

In fact the (lazy, but it’s only for fun) way I have done this means that the variation is likely to be bigger for Long-Bailey and Nandy than for Starmer – Starmer’s support can go up and down but only moves at one end – if Starmer’s support falls and Long-Bailey’s rises she can get up to double the benefit. This is an artefact of the way I have coded this up, but I will keep it because (a) I don’t want to wait another hour to finish this by re-running the code (R is great but nobody has ever suggested it is fast) and (b) Starmer is the favourite so he should feel a bit more pressure! The difference can be seen in the shape of the density curves for the candidates in the featured image for this page – Long-Bailey’s and Nandy’s broader, shorter curves show their results taking a wider range of values even though Starmer’s mean is well ahead.
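
To see why, here is a minimal sketch of the effect (separate from the model code below): each cumulative cut-off is drawn independently, so the implied share of a ‘middle’ candidate is the difference of two noisy draws and its spread is roughly 1.4 times wider:

set.seed(1)
starmerCut <- rnorm(10000, 0.2675, 0.01)
rlbCut     <- rnorm(10000, 0.5240, 0.01)
nandyCut   <- rnorm(10000, 0.7710, 0.01)
sd(starmerCut)           # Starmer's share: SD about 0.010
sd(rlbCut - starmerCut)  # Long-Bailey's implied share: SD about 0.014
sd(nandyCut - rlbCut)    # Nandy's implied share: SD about 0.014
sd(1 - nandyCut)         # Thornberry's implied share: SD about 0.010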

I have assumed a 60% turnout and so we get Starmer’s maximum vote as 99,319 and his minimum as 76,685. For Long-Bailey the figures are 97,937 and 69,396, and Nandy 97,011 and 66,246. The notional results for Thornberry are 86,034 and 63,672.
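
As a rough cross-check of where those totals come from (using the membership figures in the code below):

sum(c(115943, 20123, 39296, 34001, 50562, 27971,
      73250, 66183, 40943, 46530, 26894)) * 0.6   # about 325,000 votes cast at 60% turnout
0.2675 * 325000                                   # Starmer's central case: roughly 87,000 votes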

Another problem…barring a big change in circumstances Emily Thornberry won’t be on the final ballot – she’s not getting anything like enough nominations – and so her supporters will have the choice of voting for someone else or just not bothering. Again this is a political question and not one for here.

This “Thornberry problem” makes what follows pretty worthless, unfortunately – certainly if it really is the case that there are 60,000 – 70,000 would-be Thornberry voters out there who will be forced to do something they would prefer not to… but here is the non-predictive prediction:
Keir Starmer has a 61.9% chance of topping the poll, Rebecca Long-Bailey has a 28.2% chance and Lisa Nandy has a 9.9% chance of winning the first ballot. Emily Thornberry does not top the poll in any of the 1000 simulations run.

Code is below – have fun with it.

#!/usr/bin/env Rscript

region<-c('London', 'Scotland', 'WMids', 'EMids', 'Yorkshire', 
	  'North', 'NWest', 'SEast', 'Eastern', 'SWest', 'Wales')
Membership<-c(115943, 20123, 39296, 34001, 50562, 27971, 
	      73250, 66183, 40943, 46530, 26894)
CLPs<-c(73, 73, 59, 46, 54, 29, 74, 84, 58, 55, 40)

sharesX<-c(0.2675, 0.524, 0.771, 1.0) # cumulative vote shares: Starmer, then +RLB, +Nandy, +Thornberry
results.data<-data.frame(Starmer = integer(), RLB = integer(),
			 Nandy = integer(), Thornberry = integer(), 
			 stringsAsFactors=FALSE)

for (x in 1:1000) {

starmerShare<-rnorm(1, sharesX[1], 0.01)
rlbShare<-rnorm(1, sharesX[2], 0.01)
nandyShare<-rnorm(1, sharesX[3], 0.01)
shares<-c(starmerShare, rlbShare, nandyShare, 1.0)

starmerW<-0
rlbW<-0
nandyW<-0
etW<-0

votesStarmer<-0
votesRLB<-0
votesNandy<-0
votesET<-0

for (reg in 1:11)
{
	nameRegion<-region[reg]
	starmerC<-0
	rlbC<-0
	nandyC<-0
	etC<-0
	avMembership<-Membership[reg]/CLPs[reg]
	distMembership<-rnorm(CLPs[reg], avMembership, avMembership/2.5) # CLP sizes spread around the average, SD = 40% of mean
	for (p in 1:CLPs[reg])
	{
		starmerV<-0
		rlbV<-0
		nandyV<-0
		etV<-0
		for (v in 1:distMembership[p])
		{
			turnout<-runif(1) # 60% turnout: members drawing above 0.6 do not vote
			if (turnout > 0.6) {
				next
			}
			ans<-runif(1)
			if (ans <= shares[1]) {
				starmerV = starmerV + 1
				next
			}
			if (ans <= shares[2]) {
				rlbV = rlbV + 1
				next
			}
			if (ans <= shares[3]) {
				nandyV = nandyV + 1
				next
			}
			etV = etV + 1
		}
		votesStarmer<-votesStarmer + starmerV
		votesRLB<-votesRLB + rlbV
		votesNandy<-votesNandy + nandyV
		votesET<-votesET + etV
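		# Award the CLP nomination first-past-the-post: most votes wins (ties resolved in ballot order)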
		if (max(starmerV, rlbV, nandyV, etV) == starmerV) {
			starmerC = starmerC + 1
			starmerW = starmerW + 1
			next
		}
		if (max(rlbV, nandyV, etV) == rlbV) {
			rlbC = rlbC + 1
			rlbW = rlbW + 1
			next
		}
		if (max(nandyV, etV) == nandyV) {
			nandyC = nandyC + 1
			nandyW = nandyW + 1
			next
		}
		etC = etC + 1
		etW = etW + 1
	}
	regionalResult<-sprintf(
	"In %s, Starmer won %i, RLB won %i, Nandy won %i, Thornberry won %i",
       	region[reg], starmerC, rlbC, nandyC, etC)
	print(regionalResult)
}
result<-sprintf(
	"Starmer won %i, RLB won %i, Nandy won %i, Thornberry won %i \n",
       	starmerW, rlbW, nandyW, etW);
print(result)
votesOutcomes<-sprintf("Starmer: %i   RLB: %i   Nandy: %i   Thornberry: %i",
		       votesStarmer, votesRLB, votesNandy, votesET)
print(votesOutcomes)
results.data<-rbind(results.data, c(votesStarmer, votesRLB, 
				    votesNandy, votesET))
}
names(results.data)=c('Starmer', 'RLB', 'Nandy', 'Thornberry')

Mathematically modelling the overall Labour result


The Zipf model I outlined here looks to be reasonably robust – though maybe the coefficient needs to drop to somewhere between 1.25 and 1.29 – but can we use this result to draw any conclusions on the actual result itself?

That’s what I am going to try to do here – but be warned there are a whole host of assumptions in here and this isn’t really anything other than a mathematical diversion.

The idea is this: if supporters of any given candidate are randomly distributed across all Constituency Labour Parties (dubious – discuss) and we make certain assumptions about the sizes of Constituency Labour Parties, what level of support tends to generate the sort of results for nomination meetings that we are seeing?

On the size of the 11 Labour party regions and countries we also assume a Zipf distribution, and so work on the basis that 339,306 members vote and that in the biggest region (nominally London, but we’re not basing this on real membership figures for London, just using a simple model) that means 120,000 voters, with 9,223 in the smallest region. These figures decline using a coefficient of 1.07 over the rank of the ‘region’ (1.07 is the figure seen across the globe for the rank-size distribution of major national cities).

Each one of these notional regions has 56 CLPs which range in size from 2143 voters for the biggest to 165 at the smallest.
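
As a sketch of that set-up (my reconstruction of the arithmetic, not code taken from the model):

regionVoters <- 120000 / (1:11)^1.07  # Zipf over region rank, coefficient 1.07
round(sum(regionVoters))              # about 339,300 voters in total
round(regionVoters[11])               # about 9,220 in the smallest region
round(regionVoters / 56)              # per-CLP sizes, closely matching clpSize in the code below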

The target we are trying to hit is the Zipf prediction (for a notional 616 nominations) of Starmer 358 nominations, Long-Bailey 145 nominations, Nandy 86 nominations and Thornberry 25 nominations.

OK, you’ve heard all the blah – here’s the bit you really came for – what does it say about support? Well, it’s sort of good news for Keir Starmer who, this model suggests, is getting about 27% support. Rebecca Long-Bailey is picking up 25.5% so is close behind, but Lisa Nandy is not far off either at 25.0%, while Emily Thornberry has 22.5%. On a typical run (as the process is random the precise numbers vary) this gives Starmer 335 nominations, Long-Bailey 154, Nandy 101 and Thornberry 26 – the precise figures don’t matter so much beyond showing that it’s close.

Now, YouGov’s poll – which I’d trust much more than my prognostications – had very different figures, with Starmer on 46% first preferences and Long-Bailey on 32%.

So why the difference and why do I trust the poll more than this model?

Firstly, and most importantly, because support for candidates isn’t randomly distributed – I reason Long-Bailey and Nandy are likely to have disproportionately more supporters in the North West and Starmer in London – and there are many more members in London.

And secondly, because, as I’ve already said, the model makes far too many assumptions.

On the other hand – I do think Nandy has been doing better than the initial polling suggested so this model is probably right to suggest she’s doing relatively well.

Code (in R) used is shown below… but the bottom line is: this guess isn’t likely to be a very good one.

#!/usr/bin/env Rscript

clpSize<-c(2142, 1021, 661, 486, 383, 315, 267, 232, 204, 182, 165) # uniform CLP size in each notional region (Zipf over region rank)
shares<-c(0.275, 0.525, 0.775, 1.0) # cumulative support: Starmer, then +RLB, +Nandy, +Thornberry

starmerW<-0
rlbW<-0
nandyW<-0
etW<-0

for (reg in 1:11)
{
	starmerC<-0
	rlbC<-0
	nandyC<-0
	etC<-0
	for (p in 1:56)
	{
		starmerV<-0
		rlbV<-0
		nandyV<-0
		etV<-0
		for (v in 1:clpSize[reg])
		{
			ans<-runif(1)
			if (ans <= shares[1]) {
				starmerV = starmerV + 1
				next
			}
			if (ans <= shares[2]) {
				rlbV = rlbV + 1
				next
			}
			if (ans <= shares[3]) {
				nandyV = nandyV + 1
				next
			}
			etV = etV + 1
		}
		if (max(starmerV, rlbV, nandyV, etV) == starmerV) {
			starmerC = starmerC + 1
			starmerW = starmerW + 1
			next
		}
		if (max(rlbV, nandyV, etV) == rlbV) {
			rlbC = rlbC + 1
			rlbW = rlbW + 1
			next
		}
		if (max(nandyV, etV) == nandyV) {
			nandyC = nandyC + 1
			nandyW = nandyW + 1
			next
		}
		etC = etC + 1
		etW = etW + 1
	}
}
result<-sprintf("Starmer won %i, RLB won %i, Nandy won %i, Thornberry won %i \n", starmerW, rlbW, nandyW, etW);
print(result)

Mathematically modelling the Labour leadership nomination race


No politics here – just some maths.

But if we use a Zipf distribution (see here for more about that) we get a pretty good fit for the three front runners – Keir Starmer, who currently has 43 nominations from Constituency Labour Parties, Rebecca Long-Bailey, who has 17, and Lisa Nandy, who has 10 – if we use a coefficient of 1.35 over their rank.

All three of these are on the ballot anyway because of trade union and other support, so the question is whether fourth placed candidate Emily Thornberry, currently with just three nominations, can make it.

The bad news for her is that this (admittedly simple) model suggests not. Indeed she is already seriously under-performing based on her rank. If the coefficient is correct she ought to be on 7 or 8 nominations – but right now she is performing as if she was in seventh place.
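
A quick sketch of that claim (my own back-of-the-envelope version, anchored on Starmer’s 43 nominations):

round(43 / (1:7)^1.35)
# 43 17 10  7  5  4  3 - so 4th place "should" be on about 7; 3 nominations looks more like 7th place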

If her performance remains at this level it’s essentially mathematically impossible for her to make the ballot threshold of 33 nominations.

So – a prediction: if (around) 400 CLPs nominate then the model points to 236 for Starmer, 92 for Rebecca Long-Bailey, 53 for Lisa Nandy and 17 for Emily Thornberry.

Update: People better informed than me suggest 400 is a low figure for the number of nominating constituencies and for 500 the figures are: Keir Starmer 295, Rebecca Long-Bailey 116, Lisa Nandy 67 and Emily Thornberry 21. For Thornberry to make the field (on current performance remember) there would have to be 750 nominations – which is about 100 more than the mathematically possible maximum. So either Thornberry’s performance will have to significantly improve or she is out.
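
For anyone who wants to reproduce the arithmetic, here is a minimal sketch of my reading of the calculation – Zipf weights with a coefficient of 1.35, Thornberry treated as though in seventh place, scaled to the number of nominating CLPs (exact figures will vary slightly with rounding):

zipfNoms <- function(N, coeff = 1.35, ranks = c(1, 2, 3, 7)) {
  w <- ranks^(-coeff)   # Zipf weight for each candidate's effective rank
  N * w / sum(w)        # share N nominations out in proportion
}
round(zipfNoms(400))    # roughly 236, 93, 54, 17
round(zipfNoms(500))    # roughly 296, 116, 67, 21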

Convolutional network (again)


Black king on black square with random filters

With time on my hands I have returned to working on an old project – attempting to build a convolutional network that will solve chess puzzles.

(A convolutional network is a type of neural network – a modelled ‘artificial intelligence’ that can be used to detect patterns or undertake similar tasks.)

Here I am not using ‘AI’ to solve the chess puzzle itself (though there are very large libraries of chess endings and positions available, so I suppose that would be possible), but to read the chess position in the puzzle.

Thus the task is to classify squares on the board.

I tried this a couple of years ago and got nowhere, but reading this book – “Machine Learning: An Applied Mathematics Introduction” – has persuaded me to have another go, reducing the dimensions of the answer I am seeking from 25 to 9 (without any loss of information).

At the moment I am just in the process of building the “feed forward” network – i.e. the neural network that, once trained, will take an image as input and then give a nine-dimensional answer.

These answers can be thought of, perhaps not too accurately but not totally unreasonably, as a measure of likelihood that the input picture falls into a given category (e.g. by giving a number between 0 and 1 under the category of white square, or pawn, or black piece etc.).

The input picture is passed through a series of filters that are designed to extract features of the image and then, at the end, the AI considers all the results and gives its view as to the classification of the image.
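
As a rough illustration of what a single filter does (a toy sketch in R, not the project’s actual code), sliding a small random filter across an image matrix produces a ‘feature map’ like the panels in the image above:

convolve2d <- function(img, filt) {
  k <- nrow(filt)
  out <- matrix(0, nrow(img) - k + 1, ncol(img) - k + 1)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      out[i, j] <- sum(img[i:(i + k - 1), j:(j + k - 1)] * filt)  # weighted sum over the patch
    }
  }
  out
}
img  <- matrix(runif(64 * 64), 64, 64)  # stand-in for a 64x64 greyscale square image
filt <- matrix(rnorm(9), 3, 3)          # a randomly initialised 3x3 filter
featureMap <- convolve2d(img, filt)
image(featureMap)                       # visualise the (untrained) filter's response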

In my AI there are 50 fibres (i.e. 50 chains of filters) and the image at the top of the page shows the results of passing the image – a black king on a black square – through the top two layers. So the first 50 images are from the top rank of filters and the bottom 50 from the second rank. I plan to implement another three layers of filters (though of smaller dimensions – the idea being that they can concentrate their information) before the final “fully connected” layer (where all 50 fibres exchange information) that delivers the result.

The images here are produced from randomly assigned filters so essentially contain no real “intelligence” at all – but if you magnify the image you’ll see that even these random filters produce interesting results.

Training the network is vital, of course – and that’s where it all failed last time. I’m back to reading Springer’s “Guide to Convolutional Neural Networks” – which is one of their better books, but still full of shoddy editing (though I’d recommend persisting with it).

The training is through ‘back propagation’ – essentially adjusting the network to minimise errors by testing it against a set of known results. Getting a large set of pictures to do the training against is maybe even more difficult than getting the maths of the training right. Even if I recycle the images from last time I will need a lot more.
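
To give a flavour of the idea (a toy sketch, nothing to do with the actual network): even training a single sigmoid unit is just a matter of repeatedly nudging its weights downhill against the error on known examples – back propagation applies the same gradient step layer by layer through a full network.

sigmoid <- function(z) 1 / (1 + exp(-z))
set.seed(42)
x <- matrix(rnorm(200), ncol = 2)      # 100 training examples, 2 features
y <- as.numeric(x[, 1] + x[, 2] > 0)   # known answers to train against
w <- c(0, 0); b <- 0; rate <- 0.1
for (epoch in 1:500) {
  p   <- sigmoid(x %*% w + b)                             # forward pass
  err <- as.vector(p) - y                                 # error against the known results
  w   <- w - rate * as.vector(t(x) %*% err) / length(y)   # gradient step on the weights
  b   <- b - rate * mean(err)                             # gradient step on the bias
}
mean((as.vector(p) > 0.5) == y)        # training accuracy - close to 1 once trained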

The problem with parallelisation


Just over 15 years ago a big change in general computing occurred – computer hardware more or less stopped getting faster.

The previous three decades of ever-faster computer hardware had been supported by the ability to etch ever-smaller transistors on silicon. This, together with the ability to use bigger silicon wafers, led to what is known as Moore’s Law – a doubling of the number of transistors on a chip (latterly) every 24 months or so.

But as each transistor uses power, and as faster computers use more power at a given voltage, the magic of ever-faster computing was also dependent on Dennard Scaling – namely that as transistors got smaller, their energy use also fell. But Dennard Scaling more or less ground to a halt at the start of the century, and by 2003/4 computer manufacturers had to come to terms with the fact that while chips could still pack in more transistors, they couldn’t keep getting faster.

The implications of this have been profound. First of all the previously highly buoyant desktop computing market more or less fell off a cliff. If computers no longer doubled in speed every 24 – 36 months there was no need to keep buying new ones. (Smaller, lighter computers did become more viable and cheaper though – this is why laptops were for show-offs in the 20th century but are commonplace today.)

Instead of building ever-faster chips, the extra transistors appear as additional processors (cores) on the same chip: the idea being that if we can break computing tasks down into parallel sub-tasks, we can still get better performance from systems that use multiple cores running at slower speeds instead of one big core running at a super-fast speed. Of course, in any case, we can’t actually get that super-fast speed, as the chip would use too much energy, get too hot and then be in danger of ceasing to function properly at all.

But concurrency is hard – it’s very difficult to write efficient parallel code, especially because of what is known as Amdahl’s Law. This essentially says that it doesn’t matter how much parallel code you have if you keep having to execute significant bits on a single processor – and (this is the killer for multi-core systems) that single processor is now slower because of your multi-core design.

Parallel code needed for multi-core speedup

For my PhD I made a projection (using a formula found in this paper) of just how parallel code has to be to get better performance (see above, where f is the fraction of code that is parallel) and the results are sobering. For code that is 99.9% parallel, using 1000 processors (each of which is about 250 times slower than the one faster chip they collectively replace) we can double the speed, more or less. But if the code were only 99% parallel, far from doubling the effective speed of execution, we would end up more than halving it. And 99% parallel code is very, very difficult to achieve – imagine running 1000 tasks in parallel but only spending 1% of all time co-ordinating the processes.
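
A rough sketch of that arithmetic, using the textbook form of Amdahl’s Law scaled by the relative speed of the slower cores (the paper’s Pareto-frontier formula is more involved, but the numbers land in the same territory):

# Speedup versus the single fast core: n slow cores each running at relative
# speed s, with a fraction f of the work parallelisable (Amdahl's Law)
speedup <- function(f, n, s) s / ((1 - f) + f / n)
speedup(f = 0.999, n = 1000, s = 1/250)  # about 2.0 - roughly double the speed
speedup(f = 0.99,  n = 1000, s = 1/250)  # about 0.36 - less than half the speed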

The difficulty is compounded by the fact that even if the cores have multiplied, the other systems on the computer have not – generally speaking we still have one memory and one storage device, and so cores have to queue to access these (this is, essentially, what my PhD was about).

When I started the PhD as a part-time student in 2012 the firm expectation in industry was that we would by now be well into the era of 1024-core chips. That simply hasn’t happened because, at least in part, there is no firm commercial reason for it: even if processors with that many cores could be mass-produced (and they probably could), they would actually be slower in the real world than processors with much smaller numbers of cores in all but some specialised domains.

(For those that are interested I expect the PhD thesis to be online shortly, I’ll post a link when it is.)

Addendum: A few people have questioned where the figure for “250 times slower” comes from – someone even accused me of making it up! In the referenced paper there is a formula for the Pareto frontier for core performance – in other words the limit of maximal efficiency. This is where the factor for individual core slowdown comes from – in fact the 250 figure relates to 1024 cores but the 1000 core figure would be similar. Simple maths shows that the maximum speed this assembly could manage is 4 times greater than a single high-speed core but that depends on having 100% parallel code.

It’s also worth noting that this formula is based on a particular node of chip manufacture – the so-called 45nm (nanometre) node (the node number used to relate to the typical size of components etched on silicon but is now just a generic term for a particular manufacturing/fabrication process). Leading edge chips are now down to the 7nm node so higher degrees of efficiency are (probably) possible as a result: but I haven’t got a formula to calculate those and the basic principle – that we need highly parallel code to get the most from many-core designs – doesn’t alter.