The problem with parallelisation


Just over 15 years ago a big change in general computing occurred – computer hardware more or less stopped getting faster.

The previous three decades of ever-faster computer hardware had been supported by the ability to etch ever-smaller transistors on silicon. This, together with the ability to use bigger silicon wafers, led to what is known as Moore’s Law – a doubling of the number of transistors on a chip (latterly) every 24 months or so.

But as each transistor uses power, and as faster chips use more power at a given voltage, the magic of ever-faster computing also depended on Dennard Scaling – namely that as transistors got smaller, their energy use also fell. Dennard Scaling, though, more or less ground to a halt at the start of the century, and by 2003/4 chip manufacturers had to come to terms with the fact that while chips could still pack in more transistors, they couldn’t keep getting faster.

The implications of this have been profound. First of all, the previously highly buoyant desktop computing market more or less fell off a cliff: if computers no longer doubled in speed every 24–36 months, there was no need to keep buying new ones. (Smaller, lighter computers did become more viable and cheaper though – this is why laptops were for show-offs in the 20th century but are commonplace today.)

Instead of building ever-faster chips, the extra transistors now appear as additional processors (cores) on the same chip. The idea is that if we can break computing tasks down into parallel sub-tasks, we can still get better performance from our systems by using multiple cores running at slower speeds instead of one big core running at a super-fast speed. In any case, we can’t actually have that super-fast speed: the chip would use too much energy, get too hot and be in danger of ceasing to function properly at all.

But concurrency is hard – it is very difficult to write efficient parallel code, especially because of what is known as Amdahl’s Law. This essentially says that it doesn’t matter how much parallel code you have if you keep having to execute significant bits on a single processor, and (this is the killer for multi-core systems) that single processor is now slower because of your multi-core design.

Parallel code needed for multi-core speedup

For my PhD I made a projection (using a formula found in this paper) of just how parallel code has to be to get better performance (see above, where f is the fraction of code that is parallel), and the results are sobering. For code that is 99.9% parallel, using 1000 processors (each of which is about 250 times slower than the one fast chip they collectively replace) more or less doubles the speed. But if the code is only 99% parallel, far from doubling the effective speed of execution, we end up more than halving it. And 99% parallel code is very, very difficult to do – imagine running 1000 tasks in parallel but only spending 1% of all time co-ordinating the processes.
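To make the arithmetic behind those figures concrete, here is a minimal sketch: it assumes the projection reduces to Amdahl’s Law with a uniform per-core slowdown factor of about 250 (the factor discussed in the addendum below) – the formula in the referenced paper is more sophisticated than this.

    // Minimal sketch only: Amdahl's Law with every core assumed to run at
    // 1/s of the speed of the single fast chip the n cores replace.
    #include <cstdio>

    // Speedup relative to the single fast core, for parallel fraction f.
    double speedup(double f, double n, double s) {
        return 1.0 / (s * ((1.0 - f) + f / n));
    }

    int main() {
        const double n = 1000.0; // number of cores
        const double s = 250.0;  // each core ~250x slower than the fast chip
        std::printf("f = 1.000: %.2fx\n", speedup(1.000, n, s)); // ~4.00 - the ceiling
        std::printf("f = 0.999: %.2fx\n", speedup(0.999, n, s)); // ~2.00 - roughly doubles
        std::printf("f = 0.990: %.2fx\n", speedup(0.990, n, s)); // ~0.36 - worse than the fast chip
        return 0;
    }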

The difficulty is compounded by the fact that even though the cores have multiplied, the other systems on the computer have not – generally speaking we still have one memory and one storage device, and so cores have to queue to access them (this is, essentially, what my PhD was about).

When I started the PhD as a part-time student in 2012, the firm expectation in industry was that we would by now be well into the era of 1024-core chips. That simply hasn’t happened because, at least in part, there is no firm commercial reason for it: even if processors with that many cores could be mass-produced (and they probably could), they would actually be slower in the real world than processors with much smaller numbers of cores in all but a few specialised domains.

(For those who are interested, I expect the PhD thesis to be online shortly – I’ll post a link when it is.)

Addendum: A few people have questioned where the figure for “250 times slower” comes from – someone even accused me of making it up! In the referenced paper there is a formula for the Pareto frontier for core performance – in other words, the limit of maximal efficiency. This is where the factor for individual core slowdown comes from – in fact the 250 figure relates to 1024 cores, but the 1000-core figure would be similar. Simple maths shows that the maximum speed this assembly could manage is 4 times greater than a single high-speed core (1000 cores each running at roughly 1/250th of the speed gives at most 1000/250 = 4 times the throughput), but that depends on having 100% parallel code.

It’s also worth noting that this formula is based on a particular node of chip manufacture – the so-called 45nm (nanometre) node (the node number used to relate to the typical size of components etched on silicon but is now just a generic term for a particular manufacturing/fabrication process). Leading-edge chips are now down to the 7nm node, so higher degrees of efficiency are (probably) possible as a result: but I haven’t got a formula to calculate those, and the basic principle – that we need highly parallel code to get the most from many-core designs – doesn’t alter.

Struggling


Die of an Intel 80486DX2 microprocessor (actual size: 12×6.75 mm) in its packaging. (Photo credit: Wikipedia)

Been a while since I’ve written here – been avoiding writing about politics, which has obviously not been so great for me in the last couple of weeks… but now I have something else to ruminate on.

I have reached a milestone, or perhaps basecamp, in my PhD research: having a model for memory management that needs further exploration. (Hopefully there may even be a paper on this soon.)

Some of that exploration will have to be in hardware, and that’s really beyond me, but I can and should build a software model to test how a many-core system built using this new model might operate.

So far I have been testing or proving concepts with OVPSim, but it does not allow me to build a truly asynchronous multi-core model, so I need to do that in software myself.

But where to begin? I have a list of the classes I might want to have in C++:

  • Computer – which would aggregate…
    • DRAM
    • Storage
    • NoC – which would aggregate…
      • Mesh
      • Tiles – which would aggregate…
        • CPU
        • Cache
        • Ports (to the Mesh)

I hope you can see how quickly this becomes complex – and all we are talking about here is a simple software framework to allow me to do the research (i.e., delivering the software, complex as it is, is only the very start).
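As a very rough illustration – the class and member names below are placeholders rather than a settled design – a bare skeleton of that aggregation might start something like this:

    // Hypothetical skeleton only: names and members are placeholders for
    // the aggregation in the list above, not a settled design.
    #include <vector>

    class Cpu {};
    class Cache {};
    class Port {};                       // a tile's connection to the mesh

    class Tile {
    public:
        Cpu cpu;
        Cache cache;
        std::vector<Port> ports;         // ports onto the mesh
    };

    class Mesh {};

    class NoC {
    public:
        Mesh mesh;
        std::vector<Tile> tiles;         // tiles sit on the network-on-chip
    };

    class Dram {};
    class Storage {};

    class Computer {
    public:
        Dram dram;
        Storage storage;
        NoC noc;                         // the computer aggregates the NoC
    };

    int main() {
        Computer machine;                // even the empty skeleton builds
        (void)machine;
        return 0;
    }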

I am struggling to know where to begin – C++ seems like the logical choice for this, but it’s not proving to be much fun, particularly because my CPU class has to be able to “execute” some code. I thought about using a DSL, but may stick to the XML output I got from my hacked Valgrind Lackey, as at least I can then use existing XML engines.

Should I build from the XML up – e.g., get a CPU class that can parse the XML and pass the requests up the chain (via the cache, up to the Mesh and on to the DRAM, etc.) – or what?
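For what it’s worth, the “XML up” option might look a little like the sketch below – the record format and the handleAccess interface are invented purely for illustration; the real trace from my hacked Valgrind Lackey will look different:

    // Sketch of the "XML up" idea: each record parsed from the trace is
    // handed to the CPU, which passes it up a chain of memory levels.
    #include <cstdint>

    struct MemoryAccess {                // one record parsed out of the XML trace
        std::uint64_t address;
        bool          isWrite;
    };

    class MemoryLevel {                  // anything a request can be passed up to
    public:
        explicit MemoryLevel(MemoryLevel* up = nullptr) : next(up) {}
        virtual ~MemoryLevel() = default;
        virtual void handleAccess(const MemoryAccess& a) {
            if (next) next->handleAccess(a);   // default: pass it up the chain
        }
    private:
        MemoryLevel* next;
    };

    class Dram  : public MemoryLevel {};   // end of the chain
    class Mesh  : public MemoryLevel { public: using MemoryLevel::MemoryLevel; };
    class Cache : public MemoryLevel { public: using MemoryLevel::MemoryLevel; };

    class Cpu {
    public:
        explicit Cpu(MemoryLevel* firstLevel) : memory(firstLevel) {}
        void execute(const MemoryAccess& a) { memory->handleAccess(a); }
    private:
        MemoryLevel* memory;
    };

    int main() {
        Dram  dram;
        Mesh  mesh(&dram);
        Cache cache(&mesh);
        Cpu   cpu(&cache);
        cpu.execute({0x1000, false});    // a read parsed from the trace would go here
        return 0;
    }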

Curses on ncurses


gdb icon, created for the Open Icon Library (Photo credit: Wikipedia)

Every programmer will be familiar with something like this…

A little while back I wrote a program that simulates – crudely but effectively – a multicore NoC (Network-on-Chip) device. I use it to model the execution times of different page replacement algorithms.

The input is XML generated via a step-by-step trace of a working program. The actual instructions being traced do not matter – what I care about are the memory access patterns.

To allow me to test more models more quickly, I have now written some R code that generates a semi-random access pattern based, very loosely indeed, on the patterns seen in the real program. The advantage is that I can test against a set number of memory accesses but with a range of pseudo-random access patterns, so although I am not running models against the “real” access pattern, neither am I taking three weeks per experiment.
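The R code itself isn’t shown here, but the general idea is something like the following C++ sketch – the mix ratio, address range and output format are invented for the example, not taken from the real generator:

    // Illustrative sketch only (the real generator is written in R):
    // a fixed number of accesses mixing stride-like local references
    // with occasional random jumps elsewhere in the address space.
    #include <cstdint>
    #include <cstdio>
    #include <random>

    int main() {
        std::mt19937_64 rng(42);                           // seeded, so runs repeat
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        std::uniform_int_distribution<std::uint64_t> anywhere(0, (1ull << 32) - 1);

        const int accesses = 1000;                         // a set number of accesses
        std::uint64_t addr = anywhere(rng);
        for (int i = 0; i < accesses; ++i) {
            if (coin(rng) < 0.9)
                addr += 8;                                 // local, stride-like access
            else
                addr = anywhere(rng);                      // occasional random jump
            std::printf("<access addr=\"0x%llx\" />\n",    // invented XML-ish output
                        static_cast<unsigned long long>(addr));
        }
        return 0;
    }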

But when I used the artificially generated access patterns, my program crashed with a seg fault. Even more confusingly, when I ran the code in GDB, the GNU Debugger, it worked if I stepped through it, but if I simply ran it under the debugger it crashed just as it did outside the debugger.

After a few hours I realised why – in my artificial patterns, the first thing the first thread does is spawn all the other threads to be used. In real-world code, of course, these spawns take place after quite a lot of code has been executed.

Every thread spawn causes the ncurses code I am using to update the screen. When using ‘real’ access patterns these updates take place comfortably after the ncurses environment has been set up (by a separate thread), but with the artificial patterns, the thread updates are the first thing posted to the screen, before ncurses has even been set up – hence the crash.

If I step through the code, the ncurses thread runs ahead and sets up the screen before I hit the thread-update code, so again it works.

The solution? Use a condition variable and a mutex to ensure that nothing executes before the ncurses environment is fully established.
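A minimal sketch of what that fix looks like, assuming C++11 threads (the details in my actual code differ): worker threads wait on the condition variable until the thread doing the ncurses setup signals that the screen is ready.

    // Workers block on a condition variable until the ncurses setup is done.
    #include <condition_variable>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex ui_mutex;
    std::condition_variable ui_ready_cv;
    bool ui_ready = false;

    void setup_ncurses() {
        // ... initscr(), windows, colours etc. would go here ...
        std::lock_guard<std::mutex> lock(ui_mutex);
        ui_ready = true;
        ui_ready_cv.notify_all();        // release any workers already waiting
    }

    void worker() {
        {
            std::unique_lock<std::mutex> lock(ui_mutex);
            ui_ready_cv.wait(lock, [] { return ui_ready; }); // block until setup done
        }
        // ... safe to post screen updates from here on ...
    }

    int main() {
        std::thread ui(setup_ncurses);
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back(worker);
        for (auto& w : workers) w.join();
        ui.join();
        return 0;
    }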

Not a big deal – but perhaps, at some point in the future, this will be found by someone struggling to understand why their code – which previously worked so well – has stopped processing what seems to be well-formed input. Hope it helps!

Discretion and safety


Here in the UK we have, this week, seen a minor political and family tragedy played out in the case of Vicky Pryce and Chris Huhne.


In short, Huhne, appointed in May 2010 as a Liberal Democrat member of the Cabinet, left his wife, Pryce, for another woman in 2011. She then sought revenge by telling journalists that, around a decade ago, she had lied to the police on his behalf to take the blame for a speeding incident. The matter went to court: Huhne pleaded guilty to perverting the course of justice and resigned as an MP (he had quit the Cabinet when charged), and last week Pryce, who had sought to rely on the archaic defence of “marital coercion”, was convicted of the same offence. They both face the likelihood of jail when sentenced.

I knew Huhne a little through my work with Britain in Europe and I am sad to see his career ruined in this way, but that’s not what I am going to write about.

Because what really ruined Huhne was his decision to break the speed limit. If he had not been able to do that then he would still be a Cabinet minister even now. And it is realistic to believe that we will be able to stop people making such decisions in the next decade or so – if we want to.

At first glance it seems obvious that we should. Speeding cars kill people – rather more serious a matter than a domestic and political tragedy – and stopping them would save lives, probably disproportionately amongst the young (as drivers, passengers and pedestrians) and the poor.

The technological potential will certainly exist in a decade’s time. Today’s cars are already filled with microprocessors, and the ongoing shift in chip fabrication to multicore systems means that cheap and reliable safety-critical systems will be possible. Such systems do not just need to control a car’s speed – they also need to be able to survive component failure, and the response to the problem of the “power wall” in chip manufacture – build lots of lower-powered chips into devices – means that the necessary redundancy will likely become available.

But will we do it? I have a bad feeling that we won’t.

Motorists are one of the most powerful lobbies in western societies. They tend to be wealthier and better educated than pedestrians, and they also have an ideological pull in a debate where “libertarianism” has more supporters in the political class than it does in the real world (i.e., politicians are more political than ordinary people and so see these debates in ideological terms). The freedom of “white van man” to drive like a maniac will turn into some “save Britain” issue for some people, I am sure.

The most comparable issue, to me, seems to be doctors. Doctors kill people – not deliberately, but because we give them too much discretion. If we sent more doctors on to hospital wards with computers that told them what to do, instead of relying on doctors’ professional knowledge, we would almost certainly treat more people in a better way. Of course, we would not need to send doctors out at all in those circumstances – we could send nurses or some other form of “para-doctor” to do the job. These people would have enough medical training to be aware if the machine were malfunctioning and to carry out its instructions properly, but they would not require seven years at university medical school.

And the money we saved could be invested in better training for the doctors we did have, better public health, better drugs or better public parks – whatever.

But, again, doctors are a powerful constituency and quite willing and able to scare people with inaccurate stories about evil computers and the like: only the bravest of politicians takes them on – and even then not for too long.

This failure to control discretion is going to become an ever bigger elephant in the living room in many aspects of our lives in future. We have already seen how the internet has been smashing deference to medical professionals by making amateur analysis ever more powerful (I do not pretend this is without its problems). But the revolution will become more intense when it enters the world of the professionals themselves. As another example, think of lawyers: do we really need them for tasks like property conveyancing if we can build expert systems that can both manage a conveyancing algorithm and access the public data needed to ensure due diligence is applied?

“Go and invent something”


This, in so many words, is what my supervisor said to me this afternoon – his point being that while I had successfully presented my literature review, the time had come to start looking at real things that could be put on Network-on-Chip systems.

So, some thinking required over Christmas.

I do have my parallel filesystem idea to look further at, but he’s also suggested one or two other areas to look at.

Anyway, Kansas is going bye bye.