The problem with parallelisation

Just over 15 years ago a big change in general computing occurred – computer hardware more or less stopped getting faster.

The previous three decades of ever-faster computer hardware had been supported by the ability to etch ever-smaller transistors on silicon. This, together with an ability to use bigger silicon wafers leads to what is known as Moore’s Law – a doubling of the number of transistors on a chip (latterly) every 24 months or so.

But as each transistor used power and as faster computers use more power at a given voltage, the magic of ever-faster computing was also dependent on Dennard Scaling – namely that as transistors got smaller, their energy use also fell. But Dennard Scaling more or less ground to a halt at the start of the century and by 2003/4 computing manufacturers had to come to terms with the fact that while chips could still pack more transistors in, they couldn’t keep getting faster.

The implications of this have been profound. First of all the previously highly buoyant desktop computing market more or less fell off a cliff. If computers no longer doubled in speed every 24 – 36 months there was no need to keep buying new ones. (Smaller, lighter computers did become more viable and cheaper though – this is why laptops were for show-offs in the 20th century but are commonplace today.)

Instead of building ever-faster chips, the extra transistors appear as additional processors (cores) on the same chip: the idea being that if we can break computing tasks down into parallel sub-tasks then we can still get better performance from our systems if they use multiple cores running at slower speeds instead of one big core running at a super-fast speed. Of course, in any case, we can’t actually get that super-fast speed as the chip uses too much energy, gets too hot and then is in danger of ceasing to function properly at all.

But concurrency is hard – it’s very difficult to write efficient parallel code, especially because of what is known as Amdahl’s Law – which essentially says it doesn’t matter how much parallel code you have if you keep having to execute significant bits on a single processor and (and this is the killer for multi-core systems) that single processor is now slower because of your multi-core design.

Parallel code needed for multi-core speedup

For my PhD I made a projection (using a formula found in this paper) for just how parallel code had to be to get better performance (see above, where f is the fraction of code that is parallel) and the results are sobering. For code that is 99.9% parallel then using 1000 processors (each of which is about 250 times slower than the one faster chip they collectively replace) we can double the speed, more or less. But if the code was only 99% parallel, far from doubling the effective speed of execution, we end up more than halving it. And 99% parallel code is very, very difficult to do – imagine running 1000 tasks in parallel but only spending 1% of all time co-ordinating the processes.

The difficulty is compounded by the fact that even if the cores have multiplied then the other systems on the computer have not – generally speaking we still have one memory and one storage device and so cores have to queue to access these (this is, essentially, what my PhD was about).

When I started the PhD as a part-time student in 2012 the firm expectation in industry was that we would by now be well into the era of 1024-core chips. That simply hasn’t happened because, at least in part, there is no firm commercial reason for it: even if processors with that many cores could be mass-produced (and they probably could), they would actually be slower in the real world than processors with many smaller numbers of cores in all but some specialised domains.

(For those that are interested I expect the PhD thesis to be online shortly, I’ll post a link when it is.)

Addendum: A few people have questioned where the figure for “250 times slower” comes from – someone even accused me of making it up! In the referenced paper there is a formula for the Pareto frontier for core performance – in other words the limit of maximal efficiency. This is where the factor for individual core slowdown comes from – in fact the 250 figure relates to 1024 cores but the 1000 core figure would be similar. Simple maths shows that the maximum speed this assembly could manage is 4 times greater than a single high-speed core but that depends on having 100% parallel code.

It’s also worth noting that this formula is based on a particular node of chip manufacture – the so-called 45nm (nanometre) node (the node number used to relate to the typical size of components etched on silicon but is now just a generic term for a particular manufacturing/fabrication process). Leading edge chips are now down to the 7nm node so higher degrees of efficiency are (probably) possible as a result: but I haven’t got a formula to calculate those and the basic principle – that we need highly parallel code to get the most from many-core designs – doesn’t alter.

A plot that helps explain why the desktop computer industry is in trouble

Desktop computer sales are falling.

Desktop (and laptop) machines are just not improving in speed at the same rate as they have historically, and so sales are on the slide – there is no need to buy a faster machine when the one you have already is not much slower than an expensive replacement.

(I am writing this on a six year old laptop – it is, relatively speaking, quite slow, but still not so slow that I have been bothered to buy a faster machine.)

This lack of increased speed up is because, although “Moore’s Law” of doubling the number of transistors on a piece of silicon still applies, the related effect of “Dennard Scaling” – which kept the power needed by those transistors under control – has broken down.

The solution for this ought to be to use more but less energy-hungry, chips (CPUs). If we can get our software to efficiently use an array of not-so-fast chips rather than a single fast chip then we can still see faster computers. This is why computers (an increasingly mobile phones, because they too will soon be in trouble) are now sold as “multicore” devices.

But this has its limits. Unless we can find better ways for these chips to co-operate in software then current hardware models limit the ability to speed machines up.

And that, sort of, is where my chart comes in.

slowWhat the chart shows is that, using current software designs, adding more processors to tackling a problem can actually decrease the speed at which the problem is processed, and certainly shows no sign of speeding it up.

The chart – not to be taken as gospel as the data is very rough and ready and not properly normalized – shows how many instructions (in a simulator) an array of processors will execute per million or so processing cycles – essentially 10 means100,000), 20 means 200,000 and so on.

But the piece of information the chart does not show is that as we move from left to right the number of processors being applied to the problem is being increased and yet the number of instructions being executed is either more or less constant or even – as the first few processors are added – falling. After about 29,000,000,000 ticks the number of processors being applied begins to fall (from a peak of 16 to 11 at the far right) but again the number of instructions being processed is close to constant on average (the red line).

This isn’t because the code is hopelessly inefficient (for instance the processors are added or removed as new threads of execution come on-stream or finish) but because they compete with each other to access the available memory. This is an example of what computer scientists call “thrashing” – as the cores force each others’ preferred pieces of memory out of the system in an effort execute their own code (and this takes time), only to have the same thing done to them. Processors spend far more time waiting for their code and data to be fetched into memory than they do actually executing it.

Unless and until this fundamental problem is fixed or bypassed then the desktop computing industry will be in recession.

A supercomputer on every desktop? Not yet, anyway

My PhD is about operating systems on Network-on-Chip (NoC) computers. I have not actually done any research yet, so don’t expect anything here – but I have been playing with some existing data and I think it gives some interesting results.

NoCs are part of the semiconductor industry’s response to the failure of “Dennard scaling”: Moore‘s law says we can double the number of transistors on a chip every 18 – 24 months and we are still able to do that. Dennard scaling was the thing that made that super useful – because it meant the power requirements for the processors stayed constant even as they acquired more transistors. Now it has broken down, building faster chips becomes that much more difficult because, bluntly, they would burn up unless we limited the power.

NoCs aim to get round this by replacing one fast and power hungry processor on a single chip with several less powerful processors on the same chip – the idea being if we can attack the problem with several slower processors we can get the job done more quickly than if we used just one faster processor.

But there’s a catch, a big one, as captured in Seymour Cray‘s question:

would you rather plough a field with two strong bulls or a thousand chickens?

NoCs do not replace one fast chip with a dozen not quite as fast chips – they parcel out the power eaten by that one chip to the whole network on the chip – it’s not quite as bad as dividing the computing power by the number of chips (for that was the case there would be no advantage at all), but it is not fantastically above that.

Using work published by Hadi Esmaeilzadeh from the University of Washington along with others from the University of Wisconsin – Madison, the University of Texas at Austin and Microsoft Research, my projection is that, if we took one of today’s fast chips and parcelled up the power, then we would see computing power decline like this:

  • One processor: 100% computing power
  • Two processors: 65% computing power each
  • Four processors: 38% computing power each
  • Eight processors: 21% computing power each
  • Sixteen processors: 11% computing power each
  • Thirty-two processors: 6% computing power each
  • Sixty-four processors: 3% computing power each

Now, 64 x 3 = 192, so that might look like quite a good deal – a 92% speed up. But it is not that simple because some part of the code, even if just the bit that starts and ends your process, can only run on one processor even if all the rest can be split into 64 equal parts. And the pieces that will only run on one processor are now 33 times slower than they were before. The key balance is this: how much code can you run at nearly twice the speed (92% speed up) versus how much do you have to run at 33 times slower than before?

The answer is that you have to run a lot of code in the fast zone before you really see a big advantage.

45nm NoC modelled

As the graph suggests you would need to have about 99.9% of your code capable of running in parallel before you saw a guaranteed speedup with 64 processors in your NoC. Plenty of such code exists – such as in image handling and so on – but you are not likely to be running too much of it on your desktop computer at any given time (except perhaps when you are running a streaming video application) and the big disadvantage is that when you are not running the parallel code you are stuck with the 3% performance.

(Actually, it’s not quite as simple as that, as you may have a faster processor equipped to do the single thread of execution stuff, but then your computer starts to become more expensive.)

In the future chips will get faster – but maybe not that much faster. In a decade’s time they could be between 400% and 34% faster than they are today, depending on whether you are optimistic or pessimistic (realistic?) about processor technologies. That will help, but still not enough to put this technology in your desktop – as opposed to your games console or your television set or maybe a specialised processor in a top of the range machine.

So don’t expect your personal supercomputer too soon.

The end of Dennard scaling

Lots of people have heard of Moore’s law and probably think it means computers get twice as fast every 18 months or two years or so.

(Replica of the first transistor, US public domain picture)

Of course, that’s not quite what it really says – which is that we can put twice as many transistors on a chip every two years or so. And for most of the last fifty years that was pretty much equivalent to saying that computers did get twice as fast.

(And, in reality, “Moore’s law” isn’t really a law at all – it began as an observation on the way that commercial pressures were shaping the semi-conductor market, with chip/solid state electronics manufacturers competing by making their offerings more powerful. But it is also a mistake to state, as sometimes happens, that the “law” says we can pack twice as many transistors into a given space every two years – because that is not true either. In fact in the early 70s transistor density was doubling every three years or so, but chip sizes were also increasing, so transistor numbers doubled every 18 months, but by the 1990s the pace of chip growth slowed and so the time taken to double transistor numbers increased to 24 months.)

It’s now nearly a decade since what has been called “the breakdown of Moore’s law” and the switch to multicore processors instead of ever faster single chips. But again, this is wrong. Moore’s law has not broken down at all – transistor numbers are continuing to increase. What has happened is that it is no longer possible to keep running these transistors at ever faster speeds.

Because accompanying the decrease in transistor size which was the fundamental driver what is popularly described as Moore’s law was another, related, effect – “Dennard scaling” (named after Robert Dennard who led the research team from IBM which first described this effect in a 1974 paper). The key effect of Dennard scaling was that as transistors got smaller the power density was constant – so if there was a reduction in a transistor’s linear size by 2, the power it used fell by 4 (with voltage and current both halving).

This is what has broken down – not the ability to etch smaller transistors, but the ability to drop the voltage and the current they need to operate reliably. (In fact the reduction in linear scale of transistors ran slightly ahead of the reduction of other other components at the start of this century, leading to 3+ Ghz chips a bit faster than was expected – though that is all we got to.)

What has happened is that static power losses have increased every more rapidly as a proportion of overall power supplied as voltages have dropped. And static power losses heat the chip, further increasing static power loss and threatening thermal runaway – and complete breakdown.

Reasons for increased static power loss include complex quantum effects as component size falls and the chemical composition of chips is changed to handle the smaller sizes. And there seems to be no way out. “Moore’s law”, as popularly understood in the sense of ever faster chips, is dead and even if we don’t get the science, we understand all the same – as the recent catastrophic fall in PC sales has demonstrated: nobody is interested in buying a new computer for their desktop when the one they already have is not obsolete.

Instead we are buying smaller devices (as Moore’s law makes that possible too) and researchers (me included) are looking at the alternatives. My research is not in electronics, but software for multicore devices, but the electronics researchers have not completely given up hope they can use other methods to build faster chips, but there is nothing to show for it yet (outside the lab – see the links below for some recent developments).

Update: I have edited slightly for sense and to reflect the fact that when Gordon Moore first spoke about what was later termed his “law” there were no integrated circuits available and that recent developments by IBM point to at least the potential of solving the problem by developing new materials that might eliminate or minimise static power loss (see the links below).

What really drives Moore’s Law

Gordon Moore on a fishing trip
Gordon Moore on a fishing trip (Photo credit: Wikipedia)

Moore’s Law” is one of the more widely understood concepts in computer hardware. Many ordinary people, including those with little or no understanding of what goes in to making an integrated circuit, understand the idea that computer hardware becomes better (and cheaper) in some sort of geometric way. But, actually, the rate and reasons for this “law” have been in flux ever since it was first proposed in 1965.

As part of the very first baby steps on my road to getting a PhD I have to prepare a literature review and so that means demonstrating an understanding of the processes that mean while Gordon E. Moore‘s predictions about increased transistor density have broadly held-up, the era of ever faster single (or even small number) processor computers is decisively over.

So I have read this paper – Establishing Moore’s Law – which gives some interesting perspectives on what we now refer to as “Moore’s Law”.

The term “Moore’s Law”: Although coined by Carver Mead of CalTech the term was popularised by Robert Noyce, Moore’s co-founder at Intel, in an article in the Scientific American in 1977.

The original formulation was a prediction based on the need of IC makers to compete: In 1965, when Moore first formulated what we now call his “law”, it was actually a prediction based on the need of manufacturers of integrated circuits – then a relatively new and experimental technology – to compete with the manufacturers of single electronic components. His prediction was that manufacturers would need to radically cut the costs and increase the complexity of their products if they were to successfully compete.

The ‘law’ broke down in the late 1960s : From around 1965 to 1969 the ‘law’ didn’t work in the sense that the growth in chip speed and complexity did not match the prediction. In Moore’s view this was because manufacturers did not produce chips “whose complexity came close to the potential limit”.

The ‘law’ is not based on any fundamental character of ICs but on a wide interplay of technology and commercial factors: This is the overall theme of the paper – the early gains were based on the need for ICs to compete at all, then came gains founded on the introduction of computer aided design and then from a state sponsored drive by Japanese companies to gain a foothold in the market (in fact they came to dominate the market in memory technology). Then came the Microsoft-Intel alliance and on to the present…

A (final) speculation from “The Hidden Reality”

I have just finished Brian Greene’s The Hidden Reality: Parallel Universes and the Deep Laws of the Cosmos, so want to take this opportunity to once again endorse it and also to restate or reprise one of his speculations in the book: this time one grounded firmly in computing rather than string theory or conventional cosmology.

The key question at the heart of this speculation is this: do you think that, given we have seen computational power double every 18 – 24 months or so for around a century now, we will reach the point where we will be able to convince computer-based agents that they are in a physically real world when they are, in fact, in a computer-based simulation?

And if you answer yes to that question, do you realise that you are more or less forced to accept that you are one such agent and what you see around you as real is, in fact, a computer-generated “virtual reality”?

The reasoning is simple: once the technology to build a sufficiently convincing virtual world exists then such worlds are likely to rapidly grow in number and there will be far many more ‘people’ who are agents than people who are physically real. By simple averaging you and I are far more likely to be such agents than to be real.

It’s a truly mind-bending thought.

Saying ‘no’ to the question requires putting a hard limit on human progress in the field of computation. How can that accord with our experience of ever faster, more powerful and ubiquitous computing?

Well, there are some reasons – but even they are open to attack.

Reality appears smooth and continuous, but smooth and continuous numbers are not computable. Computable numbers are of finite extent – as otherwise the computer would never finish computing them. For instance no computer can ever compute \pi, only render an approximation. Indeed most numbers are “transcendental numbers” and inherently not computable.

But this argument – which sounds like a sure-fire winner for knocking down the idea that our reality is, in fact, a simulation, has a weakness: we cannot measure smoothness either. Our measurements are discrete and while it appears to us that the world is smooth and continuous, maybe it is not – it is just that the computed approximation is beyond our ability to (presently) measure it.

If, at some point in the future, we discovered a finite limit to measurement that was beyond physical explanation it would surely confirm we were in a simulation.

Have we reached “peak silicon” and what can we do about it?

Moore’s Law states that the number of transistors that can be squeezed into a given slice of silicon doubles every two years (or 18 months) – something I wrote about recently and where I declared “More transistors means greater speed, more and cheaper memory and so on … ”

Except, maybe not. As the graph below, shamelessly grabbed from Herb Stutter’s “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software“, shows, while Moore’s Law (the green graph) holds true, the other associated improvements that we have come to expect to parallel it, such as a similar increase in efficiency per watt (royal blue graph) and clock speed (navy blue) have not. In short, we can build cheaper chips but they are not necessarily much faster.

Herb Sutter's graph of CPU performanceAnd, as this article recounts, we are now talking about “dark silcon” – bits of chips that have to remain unpowered while other parts are in use so as to ensure the whole chip does not fry or fail due to too high power consumption.

So, if we have reached the point of “peak silicon” what can we do about it?

The chip manufacturers have responded by packing more and more cores into their devices and that works up to a point – we do not even need to have very parallel coding outside the operating system to take some advantage of that on even a typical multitasking desktop machine. But none of us are doubling the number of video renderers, MP3 decoders, database queries and spreadsheet calculations we run in parallel every 18 months, so the “Moore paradigm” of computing power doubling in that time will be lost.

A more fundamental alternative is to rewrite our software so that it becomes inherently designed to take advantage of multicore machines. Writing effective parallel software is not easy, but it can be done for lots of tasks. But even then there are limits – “Amdahl’s law” reminds us that parallelisation will only speed the parts of a program that can be run in parallel: if say we had a piece of code that must be run in serial and takes 5 seconds, and some code that currently takes 55 seconds but could be made perfectly parallel, then if we had 2 processors it takes 5 seconds (serial time), plus 27.5 seconds for the parallel code, doubling the processors but not quite halving the time, with a 46% saving. Doubling the number of processors again (to 4) cuts total computing time to 18.75 seconds but the proportional saving has dropped to 42%. In other words, the “Moore paradigm” also disappears.

The third thing we can do is look for better algorithms: the recent announcement of a vastly improved fast fourier transform (FFT) algorithm shows what can be done here – algorithmic improvement can vastly outstrip hardware speedup. But currently for many problems (those in NP space) there is no prior known algorithm available and computing power can be simply dedicated to going through all the possible algorithms looking for the one that works (we do not know what algorithms solves an NP problem but once a solution is found we can verify it ‘easily’). Assuming, as most mathematicians are said to do, that P does not equal NP (ie there is no yet to be discovered algorithm that cracks NP problems) this at least means that “peak silicon” will keep internet commerce safe for the foreseeable future but it is bad news in lots of other ways.

There is a fourth option, of course, which is to get a better physics – either for silcon fabrication, quantum computing or some other physics based innovation. Right now, though, these are probably still the least likely options but as the links below show, lots of people are working .

Better algorithms, not better hardware are what make your computer faster, faster

Moore's Law, The Fifth Paradigm.
Image via Wikipedia

Many people have heard of Moore’s Law – which states that the number of transistors that can be packed into a piece of silicon doubles every two years (or 18 months in another formulation). More transistors means greater speed, more and cheaper memory and so on … so in the thirty two years since I first saw a real “micro-computer” raw processing power has increased by over 30000 times on even the conservative formulation of the law.

The contrast with other engineering improvements is enormous: typical lightbulbs have become six or seven times more energy efficient in that period – something that LED technologies might double again in the next few years. Great, but nothing like computing.

But this contribution from the electronics engineers, while of course impressive, is minuscule in comparison to some of the improvements won by the mathematicians and the computer scientists who have perfected better, faster, methods to solve the common problems. These improvements in algorithms are not universal, unlike Moore’s Law, but can be much, more bigger – often in the factor of millions of times faster.

We are on the verge of having another such improvement revealed – to the “fast fourier transform” (FFT) algorithm which allows signals to be broken down into their components for compression and other purposes.

The FFT improvement is said to be (the paper is yet to be published and only the abstract is in the public domain) around a factor of 10 – which could have big implications for any application where fat data has to be pushed over a thin line and some loss of quality is permissible (mobile telephony is one obvious example but home sensors could be another).

The FFT image compression algorithm was patented some years ago, but as a pure mathematical procedure such a patient has no application in UK (or EU) law (and surely that is right – how can it be possible to patent something which already exists?) – it is not clear what the legal position of this innovation will be.

Once again, though, all this shows the fundamental importance of maths and science education to us all – and the importance of pure mathematical and scientific research. A little more emphasis on this and a little less on the “needs of industry” (very few employers are interested in research as opposed to getting their products out of the door a bit faster or cheaper) would be a welcome step forward for policy makers.

Algorithms matter … a lot

One of the things that most annoyed me about the article on graphics was the author’s subsequent attempts (once he came under criticism) to justify bad technological advice through the excuse that falling hardware prices meant that his advice to use inappropriately optimised formats did not matter.

For the truth is that cheaper hardware expands the use of technology at both the “fat” end (ie your 4GB quad core desktop) and the “thin” end – your mobile phone and lower still. As computing becomes ubiquitous then appropriate optimisation will be as important as it ever was.

And appropriate optimisation means efficient and effective algorithms.

In fact, suggests a report quoted here, improvement in algorithm efficiency has contributed more – by an order of magnitude – to advances in some branches of computing than falling hardware prices (I am using price here as an analogue for the general Moore’s Law advance).

The report only appears to cite two specific examples of how algorithmic improvement has far outstripped the silicon, so I would be wary of claiming in general that algorithms have contributed a 30 – 43 times greater improvement to computing efficiency in the 15 years between 1988 and 2003 (something not bothering the slashdot summariser, so this is now likely to become the myth), it is a remarkable finding nonetheless.