The problem with parallelisation


Just over 15 years ago a big change in general computing occurred – computer hardware more or less stopped getting faster.

The previous three decades of ever-faster computer hardware had been supported by the ability to etch ever-smaller transistors on silicon. This, together with the ability to use bigger silicon wafers, led to what is known as Moore’s Law – a doubling of the number of transistors on a chip (latterly) every 24 months or so.

But because each transistor uses power, and because faster computers use more power at a given voltage, the magic of ever-faster computing was also dependent on Dennard Scaling – namely that as transistors got smaller, their energy use also fell. Dennard Scaling, though, more or less ground to a halt at the start of the century, and by 2003/4 chip manufacturers had to come to terms with the fact that while chips could still pack more transistors in, they couldn’t keep getting faster.

The implications of this have been profound. First of all, the previously highly buoyant desktop computing market more or less fell off a cliff: if computers no longer doubled in speed every 24-36 months, there was no need to keep buying new ones. (Smaller, lighter computers did become more viable and cheaper, though – this is why laptops were for show-offs in the 20th century but are commonplace today.)

Instead of going into ever-faster chips, the extra transistors now appear as additional processors (cores) on the same chip. The idea is that if we can break computing tasks down into parallel sub-tasks, we can still get better performance from our systems by using multiple cores running at slower speeds instead of one big core running at a super-fast speed. In any case, we can’t actually have that super-fast core: it would use too much energy, get too hot and then be in danger of ceasing to function properly at all.

But concurrency is hard – it’s very difficult to write efficient parallel code, not least because of what is known as Amdahl’s Law. This essentially says that it doesn’t matter how much parallel code you have if you keep having to execute significant bits of it on a single processor – and (this is the killer for multi-core systems) that single processor is now slower because of your multi-core design.
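
In its standard form Amdahl’s Law can be written as follows, where f is the fraction of the work that can be parallelised and n is the number of processors:

```latex
% Amdahl's Law: speed-up on n processors with parallel fraction f
S(n) = \frac{1}{(1 - f) + \frac{f}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - f}
```

Even with unlimited processors the speed-up is capped at 1/(1-f): code that is 95% parallel can never run more than 20 times faster than the serial version.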

Parallel code needed for multi-core speedup

For my PhD I made a projection (using a formula found in this paper) of just how parallel code has to be to get better performance (see above, where f is the fraction of code that is parallel) and the results are sobering. For code that is 99.9% parallel, using 1000 processors (each of which is about 250 times slower than the one fast chip they collectively replace) we can more or less double the speed. But if the code is only 99% parallel then, far from doubling the effective speed of execution, we end up more than halving it. And 99% parallel code is very, very difficult to achieve – imagine running 1000 tasks in parallel but spending only 1% of all time co-ordinating the processes.
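
The back-of-the-envelope version of that calculation looks like this – a sketch that simply applies Amdahl’s Law to cores each running at 1/250th of the fast core’s speed, not the full Pareto-frontier formula from the paper:

```python
# Sketch of the speed-up estimate quoted above (illustrative only - not the
# Pareto-frontier formula from the referenced paper).
def speedup(f, n, slowdown=250.0):
    """Speed-up over one fast core when the work runs on n cores,
    each 'slowdown' times slower, with a fraction f of parallel code."""
    time_on_small_cores = slowdown * ((1.0 - f) + f / n)  # fast-core time = 1
    return 1.0 / time_on_small_cores

for f in (1.0, 0.999, 0.99):
    print(f"f = {f:5.3f}: speed-up ~ {speedup(f, 1000):.2f}x")

# Roughly:
#   f = 1.000: speed-up ~ 4.00x
#   f = 0.999: speed-up ~ 2.00x
#   f = 0.990: speed-up ~ 0.36x  (i.e. less than half the speed of the fast core)
```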

The difficulty is compounded by the fact that even though the cores have multiplied, the other systems on the computer have not – generally speaking we still have one memory and one storage device, so the cores have to queue to access them (this is, essentially, what my PhD was about).

When I started the PhD as a part-time student in 2012, the firm expectation in industry was that we would by now be well into the era of 1024-core chips. That simply hasn’t happened, at least in part because there is no firm commercial reason for it: even if processors with that many cores could be mass-produced (and they probably could), they would actually be slower in the real world than processors with far smaller numbers of cores in all but a few specialised domains.

(For those who are interested, I expect the PhD thesis to be online shortly – I’ll post a link when it is.)

Addendum: A few people have questioned where the figure for “250 times slower” comes from – someone even accused me of making it up! In the referenced paper there is a formula for the Pareto frontier for core performance – in other words, the limit of maximal efficiency. This is where the factor for individual core slowdown comes from – in fact the 250 figure relates to 1024 cores, but the figure for 1000 cores would be similar. Simple maths then gives the maximum speed this assembly could manage: 1000 cores each running at 1/250th of the fast core’s speed deliver at most 1000/250 = 4 times its throughput – and that depends on having 100% parallel code.

It’s also worth noting that this formula is based on a particular node of chip manufacture – the so-called 45nm (nanometre) node (the node number used to relate to the typical size of components etched on silicon but is now just a generic term for a particular manufacturing/fabrication process). Leading-edge chips are now down to the 7nm node, so higher degrees of efficiency are (probably) possible as a result: but I haven’t got a formula to calculate those, and the basic principle – that we need highly parallel code to get the most from many-core designs – doesn’t alter.

Conv-nets are hard


Months ago I started work on a convolutional neural network to recognise chess puzzles. This evening, after mucking about with the learning phase for weeks, I thought I had scored a breakthrough – that magic moment when, during learning, a tracked value suddenly flips from the wrong result to the right one.

Brilliant, I thought – this is about to actually work, and I started tracking another value. Only to come back to the original and see that it had all gone wrong again.

False minima abound in training – which is essentially about finding the right coefficients for a large set of non-linear equations, each with many thousands of parameters. Or maybe it wasn’t a false minimum at all but the real one, just operating over an extremely small range of parameter values.

Will I ever find it again? And if I do, can I find it for the other 25 classification results too?

(As an aside: I made the code parallel to speed it up, but it’s a classic example of Amdahl’s Law – even on a machine with many more processors than the 26 threads I need, and with no shortage of memory, the speed-up is only between 3 and 4, even with the most heavy-duty calculations run in parallel.)
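
Turning that observation around gives a rough idea of what the measured speed-up implies – a sketch that simply assumes plain Amdahl’s Law, rather than a profile of my actual code:

```python
# Given a measured speed-up S on n processors, Amdahl's Law
#   S = 1 / ((1 - f) + f / n)
# can be rearranged to estimate the parallel fraction f of the run time.
def parallel_fraction(S, n):
    return n * (S - 1) / (S * (n - 1))

for S in (3.0, 4.0):
    f = parallel_fraction(S, n=26)
    print(f"speed-up {S}x on 26 threads => ~{f:.0%} parallel, ~{1 - f:.0%} serial")

# A 3-4x speed-up on 26 threads suggests the code is only roughly 70-80%
# parallel - the co-ordination overhead dominates.
```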

None of the easy things work


This is a post about my PhD research: in fact it is a sort of public rumination, an attempt to clarify my thoughts in writing before I take the next step.

It’s also possibly an exercise in procrastination: a decision to write about what I might do next, rather than to get on with doing it, but I am going to suppress that thought for now.

I am looking for ways to make “network on chip” systems more viable as general-use (or any use, I suppose) computing platforms. These systems are a hardware response to the hardware problem that is causing such difficulties for big software and hardware manufacturers alike: namely that we cannot seem to make faster computers any more.

The problem we have is that while we can still get more transistors on a chip (i.e., “Moore’s Law” still applies), we cannot keep operating them at ever faster speeds (i.e., “Dennard Scaling” has broken down) because they get too hot.

In response we can either build better small devices (mobile phones, tablets) or try to build faster parallel computing devices (so instead of one very fast chip we have several moderately fast chips and try to have better software that makes good use of their ability to compute things in parallel).

Network-on-chip (NoC) processors are a major step along the road to parallel processing – we put many processing units on a single piece of silicon rather than have them co-operate via external hardware. But the software has not caught up, and we just cannot keep these chips busy enough to get the benefit their parallelism might offer.

That is where I hope to make a difference, even if just at the margins. Can I find a way to make the NoC chips busier, specifically by keeping them fed with data and code from the computer memory fast enough?

I have tried the obvious and simple methods – essentially adaptations of approaches that have been used for most of the last 50 years in conventional serial computing devices – and the answer is ‘no’ if that is all that is on offer.

Messing about with the standard algorithms used to feed code and data to the processors hits a solid brick wall: the chips have a limited amount of ‘fast’ local memory and the time it takes to keep that refreshed with up-to-date code and data places a fundamental limit on performance.

Many computer science students will be familiar with “Amdahl’s Law”, which stipulates that, for parallel code, the elements that have to be run in serial (even if just setting up the parallel section) place a fundamental limit on how much extra performance we can squeeze out by throwing more and more parallel processors at the problem. We have a different, if related, problem here: we can apply more and more parallel processors to the problem, but the performance remains constant, because even though we are running parallel code we are limited by memory performance.

This limit – which implies that as we use more processors they become individually less efficient – even hits the so-called “clairvoyant” or optimal (OPT) memory management/page replacement algorithm: OPT knows which memory page it is most efficient to replace but is still limited by the fundamental barrier of limited on-chip memory.
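
For anyone unfamiliar with the two policies in the plot below, here is a toy sketch of LRU and OPT on a short reference string – purely illustrative, nothing like the scale of the real traces. LRU evicts the page unused for longest, while OPT looks into the future and evicts the page whose next use is furthest away.

```python
# Toy comparison of LRU and OPT (Belady's clairvoyant algorithm) page
# replacement - illustrative only.
def count_faults(refs, frames, policy):
    resident, faults = [], 0
    for i, page in enumerate(refs):
        if page in resident:
            if policy == "LRU":            # move to the 'most recent' end
                resident.remove(page)
                resident.append(page)
            continue
        faults += 1
        if len(resident) >= frames:
            if policy == "LRU":
                victim = resident[0]       # least recently used
            else:                          # OPT: next use furthest in the future
                future = refs[i + 1:]
                victim = max(resident,
                             key=lambda p: future.index(p) if p in future
                             else len(future) + 1)
            resident.remove(victim)
        resident.append(page)
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print("LRU faults:", count_faults(refs, 3, "LRU"))   # 10 faults on this string
print("OPT faults:", count_faults(refs, 3, "OPT"))   # 7 faults on this string
```

OPT never does worse than LRU, but as the plot shows, with too little local memory even perfect foresight cannot break through the ceiling.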

The limit is manifest in the straight lines we can see in the plot here – the steeper slope of OPT means it runs faster, but after the first few processors are brought to bear on the problem (the number of processors in use climbs for the first few billion instructions) the rate of instructions executed per ‘tick’ (an analogue of time) is constant.

OPT and LRU compared – both fundamentally stymied by memory shortage

Getting NoCs to run faster, and so releasing the benefits of the potentially massive parallelism they could bring, depends on beating this memory barrier (and lots of other things too, but one at a time!). So, what are the options?

Well, one thing I can rule out is trying to cache a particular piece of a memory page (in traditional operating systems memory is shifted about the system in blocks called pages – typically 4096 bytes long). Caches typically store memory in 16-byte “lines” – hardware reads from the backing memory store in 16-byte blocks in most cases – and so I tested to see whether there was a pattern in which 16-byte line was most likely to be used (see previous blog post). My instinct from looking at the plot is that this will not work.
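
The sort of check involved can be sketched like this – a simplified illustration rather than my actual analysis code – bucketing each access by which 16-byte line of its 4096-byte page it touches:

```python
from collections import Counter

PAGE_SIZE = 4096
LINE_SIZE = 16   # so 256 candidate lines per page

def line_histogram(addresses):
    """Count how often each 16-byte line offset within a page is touched."""
    counts = Counter()
    for addr in addresses:
        counts[(addr % PAGE_SIZE) // LINE_SIZE] += 1
    return counts

# e.g. fed with addresses parsed from a memory reference string:
hist = line_histogram([0x1000, 0x1008, 0x1abc, 0x2010, 0x2014])
print(hist.most_common(5))   # a flat histogram means no line is 'special'
```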

Similarly, a look at which pages were being used doesn’t reveal any immediately obvious pattern – some pages are used heavily by code, some are not – nothing surprising there.

So, the easy things do not work. Now I need to look at the hard things.

I think I need to escape from the page paradigm – one thing to look at is the size of the memory objects that are actually accessed. 4KB pages are simple to handle – load a block in, push it out – but they could be (probably are) very inefficient. Might it be better to base our memory caching system on object sizes? That’s what I plan to check.


How slow is a fast computer?



I am working on a simulation environment for a NoC. The idea is that we can take a memory reference string – generated by Valgrind – and then test how long it will take to execute with different memory models for the NoC. It’s at an early stage and still quite crude.
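
For the curious, Valgrind’s lackey tool (run with --trace-mem=yes) is one way to generate this sort of reference string, and reading it back takes only a few lines – a simplified sketch rather than my actual Groovy/Java pipeline:

```python
# Rough sketch of reading a lackey-style memory trace, i.e. output from
#   valgrind --tool=lackey --trace-mem=yes ./program
# Each record looks like "I  0023C790,2" or " L BE801950,4": an access type
# (I = instruction fetch, L = load, S = store, M = modify), an address in hex
# and a size in bytes.
def read_trace(path):
    with open(path) as trace:
        for line in trace:
            kind = line[:2].strip()
            if kind not in ("I", "L", "S", "M"):
                continue                      # skip banner/summary lines
            addr_hex, size = line[2:].strip().split(",")
            yield kind, int(addr_hex, 16), int(size)

# e.g. how many distinct 4096-byte pages does one thread's trace touch?
# pages = {addr // 4096 for _, addr, _ in read_trace("thread0.trace")}
```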

The data set it is working with is of the order of 200GB, though that covers 18 threads of execution and so, very roughly speaking, it is 18 data sets of just over 10GB each. I have written some parallel Groovy/Java code to handle it and the code seems to work, though there is a lot of work to be done.

I am running it on the University of York’s compute server – a beast with 32 cores and a lot of memory. But it is slow, slow, slow. My current estimate is that it would take about 10 weeks to crunch a whole dataset. The code is slow because we have to synchronise the threads to model the inherent parallelism of the NoC. The whole thing is a demonstration – with a vengeance – of Amdahl’s Law.

Even in as long a project as a PhD I don’t have 10 weeks per dataset going free, so this is a problem!

Have we reached “peak silicon” and what can we do about it?


Moore’s Law states that the number of transistors that can be squeezed into a given slice of silicon doubles every two years (or 18 months) – something I wrote about recently and where I declared “More transistors means greater speed, more and cheaper memory and so on … ”

Except, maybe not. As the graph below, shamelessly grabbed from Herb Sutter’s “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software“, shows, while Moore’s Law (the green graph) holds true, the other associated improvements we have come to expect alongside it, such as a similar increase in efficiency per watt (royal blue graph) and clock speed (navy blue), have not followed. In short, we can build cheaper chips but they are not necessarily much faster.

Herb Sutter’s graph of CPU performance

And, as this article recounts, we are now talking about “dark silicon” – bits of chips that have to remain unpowered while other parts are in use so as to ensure the whole chip does not fry or fail due to too high power consumption.

So, if we have reached the point of “peak silicon” what can we do about it?

The chip manufacturers have responded by packing more and more cores into their devices, and that works up to a point – we do not even need very parallel code outside the operating system to take some advantage of that on a typical multitasking desktop machine. But none of us are doubling the number of video renderers, MP3 decoders, database queries and spreadsheet calculations we run in parallel every 18 months, so the “Moore paradigm” of computing power doubling in that time will be lost.

A more fundamental alternative is to rewrite our software so that it is inherently designed to take advantage of multicore machines. Writing effective parallel software is not easy, but it can be done for lots of tasks. Even then there are limits: “Amdahl’s Law” reminds us that parallelisation will only speed up the parts of a program that can be run in parallel. Say we have a piece of code that must run in serial and takes 5 seconds, plus some code that currently takes 55 seconds but could be made perfectly parallel. With 2 processors the job takes 5 seconds (the serial time) plus 27.5 seconds for the parallel code – doubling the processors but not quite halving the time, a 46% saving. Doubling the number of processors again (to 4) cuts the total computing time to 18.75 seconds, but the proportional saving drops to 42%. In other words, the “Moore paradigm” also disappears.
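
Spelling out that arithmetic (nothing clever, just the numbers from the example above):

```python
# Worked example from the paragraph above: 5s of unavoidably serial code
# plus 55s of perfectly parallelisable code.
SERIAL, PARALLEL = 5.0, 55.0

previous = SERIAL + PARALLEL              # 60s on a single processor
for n in (2, 4, 8):
    total = SERIAL + PARALLEL / n
    saving = (previous - total) / previous
    print(f"{n:>2} processors: {total:.2f}s total ({saving:.0%} saving on the previous step)")
    previous = total

# 2 processors: 32.50s (46%); 4: 18.75s (42%); 8: 11.88s (37%) - each
# doubling of processor count buys proportionally less.
```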

The third thing we can do is look for better algorithms: the recent announcement of a vastly improved fast Fourier transform (FFT) algorithm shows what can be done here – algorithmic improvement can vastly outstrip hardware speedup. But for many problems (those in NP space) no efficient algorithm is currently known, and computing power simply has to be thrown at grinding through the possible solutions looking for one that works (we do not know of an algorithm that solves an NP problem quickly, but once a solution is found we can verify it ‘easily’). Assuming, as most mathematicians are said to do, that P does not equal NP (i.e., there is no yet-to-be-discovered algorithm that cracks NP problems quickly), this at least means that “peak silicon” will keep internet commerce safe for the foreseeable future, but it is bad news in lots of other ways.
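
As a reminder of just how much an algorithmic improvement can be worth, compare the classic O(n log n) FFT with the naive O(n²) transform it superseded – this is the textbook Cooley-Tukey algorithm, not the new one referred to above:

```python
import cmath, random, time

def naive_dft(x):
    """Textbook O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fft(x):
    """Radix-2 Cooley-Tukey FFT, O(n log n); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    twiddled = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + twiddled[k] for k in range(n // 2)] +
            [even[k] - twiddled[k] for k in range(n // 2)])

signal = [random.random() for _ in range(2048)]
for name, transform in (("naive DFT", naive_dft), ("FFT", fft)):
    start = time.perf_counter()
    transform(signal)
    print(f"{name}: {time.perf_counter() - start:.2f}s")
# Even at n = 2048 the FFT is typically a couple of orders of magnitude faster.
```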

There is a fourth option, of course, which is to get better physics – whether for silicon fabrication, quantum computing or some other physics-based innovation. Right now, though, these are probably still the least likely options, but as the links below show, lots of people are working on them.