The problem with parallelisation

Just over 15 years ago a big change in general computing occurred – computer hardware more or less stopped getting faster.

The previous three decades of ever-faster computer hardware had been supported by the ability to etch ever-smaller transistors on silicon. This, together with an ability to use bigger silicon wafers leads to what is known as Moore’s Law – a doubling of the number of transistors on a chip (latterly) every 24 months or so.

But as each transistor used power and as faster computers use more power at a given voltage, the magic of ever-faster computing was also dependent on Dennard Scaling – namely that as transistors got smaller, their energy use also fell. But Dennard Scaling more or less ground to a halt at the start of the century and by 2003/4 computing manufacturers had to come to terms with the fact that while chips could still pack more transistors in, they couldn’t keep getting faster.

The implications of this have been profound. First of all the previously highly buoyant desktop computing market more or less fell off a cliff. If computers no longer doubled in speed every 24 – 36 months there was no need to keep buying new ones. (Smaller, lighter computers did become more viable and cheaper though – this is why laptops were for show-offs in the 20th century but are commonplace today.)

Instead of building ever-faster chips, the extra transistors appear as additional processors (cores) on the same chip: the idea being that if we can break computing tasks down into parallel sub-tasks then we can still get better performance from our systems if they use multiple cores running at slower speeds instead of one big core running at a super-fast speed. Of course, in any case, we can’t actually get that super-fast speed as the chip uses too much energy, gets too hot and then is in danger of ceasing to function properly at all.

But concurrency is hard – it’s very difficult to write efficient parallel code, especially because of what is known as Amdahl’s Law – which essentially says it doesn’t matter how much parallel code you have if you keep having to execute significant bits on a single processor and (and this is the killer for multi-core systems) that single processor is now slower because of your multi-core design.

Parallel code needed for multi-core speedup

For my PhD I made a projection (using a formula found in this paper) for just how parallel code had to be to get better performance (see above, where f is the fraction of code that is parallel) and the results are sobering. For code that is 99.9% parallel then using 1000 processors (each of which is about 250 times slower than the one faster chip they collectively replace) we can double the speed, more or less. But if the code was only 99% parallel, far from doubling the effective speed of execution, we end up more than halving it. And 99% parallel code is very, very difficult to do – imagine running 1000 tasks in parallel but only spending 1% of all time co-ordinating the processes.

The difficulty is compounded by the fact that even if the cores have multiplied then the other systems on the computer have not – generally speaking we still have one memory and one storage device and so cores have to queue to access these (this is, essentially, what my PhD was about).

When I started the PhD as a part-time student in 2012 the firm expectation in industry was that we would by now be well into the era of 1024-core chips. That simply hasn’t happened because, at least in part, there is no firm commercial reason for it: even if processors with that many cores could be mass-produced (and they probably could), they would actually be slower in the real world than processors with many smaller numbers of cores in all but some specialised domains.

(For those that are interested I expect the PhD thesis to be online shortly, I’ll post a link when it is.)

Addendum: A few people have questioned where the figure for “250 times slower” comes from – someone even accused me of making it up! In the referenced paper there is a formula for the Pareto frontier for core performance – in other words the limit of maximal efficiency. This is where the factor for individual core slowdown comes from – in fact the 250 figure relates to 1024 cores but the 1000 core figure would be similar. Simple maths shows that the maximum speed this assembly could manage is 4 times greater than a single high-speed core but that depends on having 100% parallel code.

It’s also worth noting that this formula is based on a particular node of chip manufacture – the so-called 45nm (nanometre) node (the node number used to relate to the typical size of components etched on silicon but is now just a generic term for a particular manufacturing/fabrication process). Leading edge chips are now down to the 7nm node so higher degrees of efficiency are (probably) possible as a result: but I haven’t got a formula to calculate those and the basic principle – that we need highly parallel code to get the most from many-core designs – doesn’t alter.

Author: Adrian McMenamin

Talk to the hand

2 thoughts on “The problem with parallelisation”

  1. I’m curious what your take is on the slightly tangential growth in computing platforms specific to ML. Specifically the initial application of GPGPU to the problem which then transitioned to more ML-specific CU on GPUs to the latest announcement from Cerebras and its WSE.

Comments are closed.