The problem with parallelisation

Just over 15 years ago a big change in general computing occurred – computer hardware more or less stopped getting faster.

The previous three decades of ever-faster computer hardware had been supported by the ability to etch ever-smaller transistors on silicon. This, together with an ability to use bigger silicon wafers leads to what is known as Moore’s Law – a doubling of the number of transistors on a chip (latterly) every 24 months or so.

But as each transistor used power and as faster computers use more power at a given voltage, the magic of ever-faster computing was also dependent on Dennard Scaling – namely that as transistors got smaller, their energy use also fell. But Dennard Scaling more or less ground to a halt at the start of the century and by 2003/4 computing manufacturers had to come to terms with the fact that while chips could still pack more transistors in, they couldn’t keep getting faster.

The implications of this have been profound. First of all the previously highly buoyant desktop computing market more or less fell off a cliff. If computers no longer doubled in speed every 24 – 36 months there was no need to keep buying new ones. (Smaller, lighter computers did become more viable and cheaper though – this is why laptops were for show-offs in the 20th century but are commonplace today.)

Instead of building ever-faster chips, the extra transistors appear as additional processors (cores) on the same chip: the idea being that if we can break computing tasks down into parallel sub-tasks then we can still get better performance from our systems if they use multiple cores running at slower speeds instead of one big core running at a super-fast speed. Of course, in any case, we can’t actually get that super-fast speed as the chip uses too much energy, gets too hot and then is in danger of ceasing to function properly at all.

But concurrency is hard – it’s very difficult to write efficient parallel code, especially because of what is known as Amdahl’s Law – which essentially says it doesn’t matter how much parallel code you have if you keep having to execute significant bits on a single processor and (and this is the killer for multi-core systems) that single processor is now slower because of your multi-core design.

Parallel code needed for multi-core speedup

For my PhD I made a projection (using a formula found in this paper) for just how parallel code had to be to get better performance (see above, where f is the fraction of code that is parallel) and the results are sobering. For code that is 99.9% parallel then using 1000 processors (each of which is about 250 times slower than the one faster chip they collectively replace) we can double the speed, more or less. But if the code was only 99% parallel, far from doubling the effective speed of execution, we end up more than halving it. And 99% parallel code is very, very difficult to do – imagine running 1000 tasks in parallel but only spending 1% of all time co-ordinating the processes.

The difficulty is compounded by the fact that even if the cores have multiplied then the other systems on the computer have not – generally speaking we still have one memory and one storage device and so cores have to queue to access these (this is, essentially, what my PhD was about).

When I started the PhD as a part-time student in 2012 the firm expectation in industry was that we would by now be well into the era of 1024-core chips. That simply hasn’t happened because, at least in part, there is no firm commercial reason for it: even if processors with that many cores could be mass-produced (and they probably could), they would actually be slower in the real world than processors with many smaller numbers of cores in all but some specialised domains.

(For those that are interested I expect the PhD thesis to be online shortly, I’ll post a link when it is.)

Addendum: A few people have questioned where the figure for “250 times slower” comes from – someone even accused me of making it up! In the referenced paper there is a formula for the Pareto frontier for core performance – in other words the limit of maximal efficiency. This is where the factor for individual core slowdown comes from – in fact the 250 figure relates to 1024 cores but the 1000 core figure would be similar. Simple maths shows that the maximum speed this assembly could manage is 4 times greater than a single high-speed core but that depends on having 100% parallel code.

It’s also worth noting that this formula is based on a particular node of chip manufacture – the so-called 45nm (nanometre) node (the node number used to relate to the typical size of components etched on silicon but is now just a generic term for a particular manufacturing/fabrication process). Leading edge chips are now down to the 7nm node so higher degrees of efficiency are (probably) possible as a result: but I haven’t got a formula to calculate those and the basic principle – that we need highly parallel code to get the most from many-core designs – doesn’t alter.

A plot that helps explain why the desktop computer industry is in trouble

Desktop computer sales are falling.

Desktop (and laptop) machines are just not improving in speed at the same rate as they have historically, and so sales are on the slide – there is no need to buy a faster machine when the one you have already is not much slower than an expensive replacement.

(I am writing this on a six year old laptop – it is, relatively speaking, quite slow, but still not so slow that I have been bothered to buy a faster machine.)

This lack of increased speed up is because, although “Moore’s Law” of doubling the number of transistors on a piece of silicon still applies, the related effect of “Dennard Scaling” – which kept the power needed by those transistors under control – has broken down.

The solution for this ought to be to use more but less energy-hungry, chips (CPUs). If we can get our software to efficiently use an array of not-so-fast chips rather than a single fast chip then we can still see faster computers. This is why computers (an increasingly mobile phones, because they too will soon be in trouble) are now sold as “multicore” devices.

But this has its limits. Unless we can find better ways for these chips to co-operate in software then current hardware models limit the ability to speed machines up.

And that, sort of, is where my chart comes in.

slowWhat the chart shows is that, using current software designs, adding more processors to tackling a problem can actually decrease the speed at which the problem is processed, and certainly shows no sign of speeding it up.

The chart – not to be taken as gospel as the data is very rough and ready and not properly normalized – shows how many instructions (in a simulator) an array of processors will execute per million or so processing cycles – essentially 10 means100,000), 20 means 200,000 and so on.

But the piece of information the chart does not show is that as we move from left to right the number of processors being applied to the problem is being increased and yet the number of instructions being executed is either more or less constant or even – as the first few processors are added – falling. After about 29,000,000,000 ticks the number of processors being applied begins to fall (from a peak of 16 to 11 at the far right) but again the number of instructions being processed is close to constant on average (the red line).

This isn’t because the code is hopelessly inefficient (for instance the processors are added or removed as new threads of execution come on-stream or finish) but because they compete with each other to access the available memory. This is an example of what computer scientists call “thrashing” – as the cores force each others’ preferred pieces of memory out of the system in an effort execute their own code (and this takes time), only to have the same thing done to them. Processors spend far more time waiting for their code and data to be fetched into memory than they do actually executing it.

Unless and until this fundamental problem is fixed or bypassed then the desktop computing industry will be in recession.

A supercomputer on every desktop? Not yet, anyway

My PhD is about operating systems on Network-on-Chip (NoC) computers. I have not actually done any research yet, so don’t expect anything here – but I have been playing with some existing data and I think it gives some interesting results.

NoCs are part of the semiconductor industry’s response to the failure of “Dennard scaling”: Moore‘s law says we can double the number of transistors on a chip every 18 – 24 months and we are still able to do that. Dennard scaling was the thing that made that super useful – because it meant the power requirements for the processors stayed constant even as they acquired more transistors. Now it has broken down, building faster chips becomes that much more difficult because, bluntly, they would burn up unless we limited the power.

NoCs aim to get round this by replacing one fast and power hungry processor on a single chip with several less powerful processors on the same chip – the idea being if we can attack the problem with several slower processors we can get the job done more quickly than if we used just one faster processor.

But there’s a catch, a big one, as captured in Seymour Cray‘s question:

would you rather plough a field with two strong bulls or a thousand chickens?

NoCs do not replace one fast chip with a dozen not quite as fast chips – they parcel out the power eaten by that one chip to the whole network on the chip – it’s not quite as bad as dividing the computing power by the number of chips (for that was the case there would be no advantage at all), but it is not fantastically above that.

Using work published by Hadi Esmaeilzadeh from the University of Washington along with others from the University of Wisconsin – Madison, the University of Texas at Austin and Microsoft Research, my projection is that, if we took one of today’s fast chips and parcelled up the power, then we would see computing power decline like this:

  • One processor: 100% computing power
  • Two processors: 65% computing power each
  • Four processors: 38% computing power each
  • Eight processors: 21% computing power each
  • Sixteen processors: 11% computing power each
  • Thirty-two processors: 6% computing power each
  • Sixty-four processors: 3% computing power each

Now, 64 x 3 = 192, so that might look like quite a good deal – a 92% speed up. But it is not that simple because some part of the code, even if just the bit that starts and ends your process, can only run on one processor even if all the rest can be split into 64 equal parts. And the pieces that will only run on one processor are now 33 times slower than they were before. The key balance is this: how much code can you run at nearly twice the speed (92% speed up) versus how much do you have to run at 33 times slower than before?

The answer is that you have to run a lot of code in the fast zone before you really see a big advantage.

45nm NoC modelled

As the graph suggests you would need to have about 99.9% of your code capable of running in parallel before you saw a guaranteed speedup with 64 processors in your NoC. Plenty of such code exists – such as in image handling and so on – but you are not likely to be running too much of it on your desktop computer at any given time (except perhaps when you are running a streaming video application) and the big disadvantage is that when you are not running the parallel code you are stuck with the 3% performance.

(Actually, it’s not quite as simple as that, as you may have a faster processor equipped to do the single thread of execution stuff, but then your computer starts to become more expensive.)

In the future chips will get faster – but maybe not that much faster. In a decade’s time they could be between 400% and 34% faster than they are today, depending on whether you are optimistic or pessimistic (realistic?) about processor technologies. That will help, but still not enough to put this technology in your desktop – as opposed to your games console or your television set or maybe a specialised processor in a top of the range machine.

So don’t expect your personal supercomputer too soon.

The end of Dennard scaling

Lots of people have heard of Moore’s law and probably think it means computers get twice as fast every 18 months or two years or so.

(Replica of the first transistor, US public domain picture)

Of course, that’s not quite what it really says – which is that we can put twice as many transistors on a chip every two years or so. And for most of the last fifty years that was pretty much equivalent to saying that computers did get twice as fast.

(And, in reality, “Moore’s law” isn’t really a law at all – it began as an observation on the way that commercial pressures were shaping the semi-conductor market, with chip/solid state electronics manufacturers competing by making their offerings more powerful. But it is also a mistake to state, as sometimes happens, that the “law” says we can pack twice as many transistors into a given space every two years – because that is not true either. In fact in the early 70s transistor density was doubling every three years or so, but chip sizes were also increasing, so transistor numbers doubled every 18 months, but by the 1990s the pace of chip growth slowed and so the time taken to double transistor numbers increased to 24 months.)

It’s now nearly a decade since what has been called “the breakdown of Moore’s law” and the switch to multicore processors instead of ever faster single chips. But again, this is wrong. Moore’s law has not broken down at all – transistor numbers are continuing to increase. What has happened is that it is no longer possible to keep running these transistors at ever faster speeds.

Because accompanying the decrease in transistor size which was the fundamental driver what is popularly described as Moore’s law was another, related, effect – “Dennard scaling” (named after Robert Dennard who led the research team from IBM which first described this effect in a 1974 paper). The key effect of Dennard scaling was that as transistors got smaller the power density was constant – so if there was a reduction in a transistor’s linear size by 2, the power it used fell by 4 (with voltage and current both halving).

This is what has broken down – not the ability to etch smaller transistors, but the ability to drop the voltage and the current they need to operate reliably. (In fact the reduction in linear scale of transistors ran slightly ahead of the reduction of other other components at the start of this century, leading to 3+ Ghz chips a bit faster than was expected – though that is all we got to.)

What has happened is that static power losses have increased every more rapidly as a proportion of overall power supplied as voltages have dropped. And static power losses heat the chip, further increasing static power loss and threatening thermal runaway – and complete breakdown.

Reasons for increased static power loss include complex quantum effects as component size falls and the chemical composition of chips is changed to handle the smaller sizes. And there seems to be no way out. “Moore’s law”, as popularly understood in the sense of ever faster chips, is dead and even if we don’t get the science, we understand all the same – as the recent catastrophic fall in PC sales has demonstrated: nobody is interested in buying a new computer for their desktop when the one they already have is not obsolete.

Instead we are buying smaller devices (as Moore’s law makes that possible too) and researchers (me included) are looking at the alternatives. My research is not in electronics, but software for multicore devices, but the electronics researchers have not completely given up hope they can use other methods to build faster chips, but there is nothing to show for it yet (outside the lab – see the links below for some recent developments).

Update: I have edited slightly for sense and to reflect the fact that when Gordon Moore first spoke about what was later termed his “law” there were no integrated circuits available and that recent developments by IBM point to at least the potential of solving the problem by developing new materials that might eliminate or minimise static power loss (see the links below).