A week further on and not much C++ has been written – and now I think I need to make a new start.
Up to this point I have been trying to write a software model of the hardware and my thought was I could think put a software-modelling layer on top of that. But that simply is not going to work – it is just too complex.
Instead I am going to have to make some policy decisions in the software – essentially over how I model the local memory on the chip: each tile will process memory reads and writes and needs to know where that memory is – it could be in the global off-chip memory store or it could be on-chip.
The difference matters because, at least in theory, the on-chip memory is speedily accessible, while the off-chip memory is 50 to 100 to 500 times “further away”. Because memory accesses exhibit locality it makes sense to ship blocks of addressed memory from the global to the local store – but doing so takes time and if there are a lot of memory movements then we get thrashing.
What I now have to do is think of what policy I will use to decide what memory gets stored locally (or, more likely, what policy I use to map the addresses). I’ll start by once again reviewing papers that propose some schemes for existing Networks-on-Chip.
In other news: I have had a paper (of which I am co-author and first named author) accepted by OSPERTS 15 – so I will be off to Sweden to get mauled by the audience there in early July. It will be an experience, and I am looking forward to it, but I also think it might be not so much a baptism, but a destruction by fire.
I hope you can see how quickly this becomes complex – and all we are talking about here is a simple software framework to allow me to do the research (ie., delivering the software, complex as it is, is only the very start.)
I am struggling to know where to begin – C++ seems like the logical choice for this, but it’s not proving to be much fun. Particularly because my CPU class has to be able to “execute” some code – I thought about using a DSL but may stick to the XML output I got from my hacked Valgrind Lackey – as at least I can then use existing XML engines.
Should I build from the XML up – eg., get a CPU class that can hack the XML and pass the requests up the chain (eg via the cache up to the Mesh and up to the DRAM etc), or what?
In my PhD world a year’s worth of software experimentation has proved what we all knew already … that systems using traditional memory models struggle in the Network-on-Chip environment and so I am now trying something slightly different.
My “model” (it’s all in software) is of a 16 core system, with each core having a small amount of on-chip memory (32k), which are combined together to form a flat memory space. Memory in this space can be accessed quickly, memory outside it, in the next level up in the hierarchy, is roughly 100 times further away.
Using any form of traditional paging model (including Belady’s optimal page replacement algorithm) this system starts to thrash on even moderate loads – the cost of moving pages in and out of the local memory determines performance and so there is no benefit from adding additional processors (in fact it just slows the individual processors down).
Such an outcome makes any promise of improved performance from parallelism void – it does not really matter how efficiently you have parallelised the code (some corner cases excepted – eg if all chips were accessing the same memory at the same time), you are trapped by a memory I/O bound.
So now I want to look at alternatives beyond the usual 4k (or 2k) paging – but I have been struggling all week to get the locking semantics of my code right. Concurrency is hard.
The one thing that debugging parallel code and locks teaches you again and again is never to assume that some event will be so rare you don’t need to bother about it: because when you are executing millions of instructions a second, even rare events tend to happen.
It has also taught me to check return values – code that will “always” work in a single threaded environment may actually turn out to be quite a tricky customer when running in parallel with other instances of itself or when it is accessing shared memory.
But, finally, the main lesson this week has been about going atomic.
I have a tendency to think – if I can release that lock for a few lines of code that might improve overall performance and I can just lock it again a little later. Beware of that thought.
If you need to make a series of actions atomic you need to hold the same lock across them all – releasing it for even a few lines breaks atomicity and will quite likely break your code.
This is a post about my PhD research: in fact it is a sort of public rumination, an attempt to clarify my thoughts in writing before I take the next step.
It’s also possibly an exercise in procrastination: a decision to write about what I might do next, rather than to get on with doing it, but I am going to suppress that thought for now.
I am looking for ways to make “network on chip” systems more viable as general use (or any use, I suppose) computing platforms. These systems are a hardware response to the hardware problem that is causing such difficulties for big software and hardware manufacturers alike: namely that we cannot seem to make faster computers any more.
The problem we have is that while we can still get more transistors on a chip (i.e., that “Moore’s Law” still applies), we cannot keep operating them at faster speed (i.e., “Dennard Scaling” has broken down) as they get too hot.
In response we can either build better small devices (mobile phones, tablets) or try to build faster parallel computing devices (so instead of one very fast chip we have several moderately fast chips and try to have better software that makes good use of their ability to compute things in parallel).
Network-on-chip (NoC) processors are a major step along the road of having parallel processors – we put more processing units on a single piece of silicon rather than have them co-operate via external hardware. But the software has not caught up and we just cannot keep these chips busy enough to get the benefit their parallelism might offer.
That is where I hope to make a difference, even if just at the margins. Can I find a way to make the NoC chips busier, specifically by keeping them fed with data and code from the computer memory fast enough?
I have tried the obvious and simple methods: essentially adaptations of methods that have been used for most of the last 50 years in conventional serial computer devices and the answer is ‘no’ if that is all that is on offer.
Messing about with the standard algorithms used to feed code and data to the processors hits a solid brick wall: the chips have a limited amount of ‘fast’ local memory and the time it takes to keep that refreshed with up-to-date code and data places a fundamental limit on performance.
So, while many computer science students might be familiar with “Amdahl’s Law” which stipulates that, for parallel code, the elements that have to be run in serial (even if just setting up the parallel section) place a fundamental limit on how much extra performance we can squeeze out by throwing more and more parallel processors at the problem – we have a different, if related, problem here. Because we can apply more and more parallel processors to the problem but the performance remains constant, because even though we are running parallel code, we are limited by memory performance.
This limit – which implies that as we use more processors they become individually less efficient – even hits the so-called “clairvoyant” or optimal (OPT) memory management/page replacement algorithm: OPT knows which memory page it is most efficient to replace but is still limited by the fundamental barrier of limited on-chip memory.
The limit is manifest in the straight lines we can see in the plot here – the steeper slope of OPT means it runs faster but after the first few processors are brought to bear on the problem (the number of processors being used climbs for the first few billion instructions) the rate of instructions executed per ‘tick’ (an analogue of time) is constant.
Getting NoCs to run faster and so releasing the benefits from the potentially massive parallelism they could bring, depends on beating this memory barrier (and lots of other things too, but one at a time!). So, what are the options?
Well, one thing I can rule out is trying to cache a particular piece of a memory page (in traditional operating systems memory is shifted about the system in blocks called pages – typically 4096 bytes long). Caches typically store memory in 16 byte “lines” – hardware reads from the backing memory store in 16 byte blocks in most cases – and so I tested to see if there was a pattern in which 16 byte line was most likely to be used (see previous blog post). My instinct from looking at the plot is that will not work.
Similarly, a look at which pages were being used doesn’t reveal any immediately obvious pattern – some pages are used heavily by code, some are not – nothing surprising there.
So, the easy things do not work. Now I need to look at the hard things.
I think I need to escape from the page paradigm – one thing to look at is the size of the memory objects that are accessed. 4k pages are simple to handle – load a block in, push it out: but they could be (probably are) very inefficient. Might it be better to base our memory caching system on object sizes? That’s what I plan to check.
My PhD is about operating systems on Network-on-Chip (NoC) computers. I have not actually done any research yet, so don’t expect anything here – but I have been playing with some existing data and I think it gives some interesting results.
NoCs are part of the semiconductor industry’s response to the failure of “Dennard scaling”: Moore‘s law says we can double the number of transistors on a chip every 18 – 24 months and we are still able to do that. Dennard scaling was the thing that made that super useful – because it meant the power requirements for the processors stayed constant even as they acquired more transistors. Now it has broken down, building faster chips becomes that much more difficult because, bluntly, they would burn up unless we limited the power.
NoCs aim to get round this by replacing one fast and power hungry processor on a single chip with several less powerful processors on the same chip – the idea being if we can attack the problem with several slower processors we can get the job done more quickly than if we used just one faster processor.
But there’s a catch, a big one, as captured in Seymour Cray‘s question:
would you rather plough a field with two strong bulls or a thousand chickens?
NoCs do not replace one fast chip with a dozen not quite as fast chips – they parcel out the power eaten by that one chip to the whole network on the chip – it’s not quite as bad as dividing the computing power by the number of chips (for that was the case there would be no advantage at all), but it is not fantastically above that.
Using work published by Hadi Esmaeilzadeh from the University of Washington along with others from the University of Wisconsin – Madison, the University of Texas at Austin and Microsoft Research, my projection is that, if we took one of today’s fast chips and parcelled up the power, then we would see computing power decline like this:
One processor: 100% computing power
Two processors: 65% computing power each
Four processors: 38% computing power each
Eight processors: 21% computing power each
Sixteen processors: 11% computing power each
Thirty-two processors: 6% computing power each
Sixty-four processors: 3% computing power each
Now, 64 x 3 = 192, so that might look like quite a good deal – a 92% speed up. But it is not that simple because some part of the code, even if just the bit that starts and ends your process, can only run on one processor even if all the rest can be split into 64 equal parts. And the pieces that will only run on one processor are now 33 times slower than they were before. The key balance is this: how much code can you run at nearly twice the speed (92% speed up) versus how much do you have to run at 33 times slower than before?
The answer is that you have to run a lot of code in the fast zone before you really see a big advantage.
As the graph suggests you would need to have about 99.9% of your code capable of running in parallel before you saw a guaranteed speedup with 64 processors in your NoC. Plenty of such code exists – such as in image handling and so on – but you are not likely to be running too much of it on your desktop computer at any given time (except perhaps when you are running a streaming video application) and the big disadvantage is that when you are not running the parallel code you are stuck with the 3% performance.
(Actually, it’s not quite as simple as that, as you may have a faster processor equipped to do the single thread of execution stuff, but then your computer starts to become more expensive.)
In the future chips will get faster – but maybe not that much faster. In a decade’s time they could be between 400% and 34% faster than they are today, depending on whether you are optimistic or pessimistic (realistic?) about processor technologies. That will help, but still not enough to put this technology in your desktop – as opposed to your games console or your television set or maybe a specialised processor in a top of the range machine.
So don’t expect your personal supercomputer too soon.
The market for desktop computers is in desperate trouble (and that for laptops not much healthier) – the latest sign being the decision to take Dell private.
The issue is not that we don’t need desktops and laptops anymore, but rather that we do not need new ones: while Moore’s Law continues to increase the number of transistors we can fit on silicon, we cannot drive those transistors at ever faster rates as we cannot dissipate the heat.
So instead of having an option of shelling out to buy a new desktop (or laptop) to match the speed of our rivals’ machines, we can soldier on with the old machines, get a smaller, low energy device (such as a tablet – Moore’s Law won’t deliver faster devices but will deliver smaller ones of equivalent computing power or lower power consumption) or maybe buy a multicore device (but these too have their limits – bus based designs start to eat up power as they get more processors and the speed increase from putting on an extra processor falls far off a linear increase).
We might, of course, just opt for a no more powerful machine but just one that looks better – something Apple has profited from.
In the end this means that the economics of desktop computers is likely to shift fundamentally and as the market consolidates prices may even start rising.
There are still technological advances that will drive performance improvements – faster storage is the obvious example. But the golden age of the PC is over – indeed it probably has been for a few years now.
Microsoft’s desperation to get Windows 8 out the door and across all the possible platforms is one of the reactions to this: but right now I have to wonder if Redmond’s finest will still be with us in ten years. Win8 seems to be something of a turkey and is not making any headway in the smartphone/tablet world and if we do not buy new machines every 24 months, why should we shell out for a new copy of “Office”? And, of course, Linux is still nibbling away.
In the longer term new hardware designs – such as thousands of CPUs on a “network on a chip” could turn things upside down again (I should be researching this now and not writing this blog) – but to fully exploit the power of such systems we are going to need to rethink most of our software and programming models. And it’s still not clear to me if those sorts of machines will ever get to the desktop (as opposed to powering an ever more powerful internet of things through embedded computers).
Finally got to the bottom of my issue with power saving and scheduling on my Pentium D machine (essentially a dual core Pentium 4).
It seems apparently lowering heat output (the Pentium D is a notoriously hot running processor), the “ondemand” frequency scheduler is not likely to save power in the real world and has been deliberately broken by the kernel maintainers.
p4-clockmod is NOT true CPU frequency scaling, it just forces the CPU to idle on a periodic duty cycle and has no effect on CPU frequency. The clock modulation feature is basically just engaging the same mechanism the CPU uses to reduce heat output when it gets too hot, and which is not meant as a power saving mechanism. When engaged, it does reduce heat output and power usage, but not as much as it reduces system performance, and means the system will simply take longer to return to idle. In short, using p4-clockmod can only increase power usage in any real workload.