For the last week or so I have been writing and then debugging (and so mainly debugging) a least-recently-used (LRU) page replacement system on my Microblaze simulation.
Perhaps I shouldn’t have bothered – I had a working first-in-first-out (FIFO) system after all. But no one seriously uses FIFO, so I had to write some LRU code.
I thought I had it working tonight – it ran through the target exercise in about 6 million instructions (as the MMU in a Microblaze is crude, memory gets loaded in and out ‘by hand’ and so the instruction count measures time/efficiency) when the system had 32 4k local pages and in about 10.5 million instructions when it had 24 4k pages available locally – all seems fine: less pages means more time is required.
But then things started to get weird – testing the system with 28 pages took about 9 million instructions, but when I tried to use 26 pages I had to break the code after it had executed 14 trillion instructions.
Indeed it seems to only work for a very limited number of page counts. How odd – though a typically debuggers story. A week in, finally thinking I’d cracked it when some really strange behaviour manifests itself.
Update: It appears to be an unaligned data exeception issue. Somewhere along the line a piece of code relies on the LRU queue to be a multiple of 4 in length would be my guess…
I am writing this down partly as a way of seeing if it makes sense…
Minor faults we can live with: if we have to reaffix a mapping back into the TLB it takes a few cycles but that is not a heavy burden. Major faults, however, are a different matter: they take many thousands of processor cycles to fix and have to be avoided at all costs.
Yet we have a basic problem in that we cannot mark each and every page access with a timestamp, so we have to guess at which page was least recently used – and so typically we use the sweeping hand of a CLOCK algorithm to do this.
But I have an even bigger problem – I have no time source, and hence no timer interrupts, to use.
So instead I want to try this:
All entries in the page table (PT) will have two validity bits, one we call VALID (V) and another AVAILABLE (A);
When a page is first mapped in the PT, both V and A will be marked as on.
When we first run out of space in the PT we will start to advance through the PT on each access turning off a V bit. So, if we pin the first three entries in the PT (eg for ‘kernel’ code, the PT itself and a stack), the first time we go round we might mark V off for the 16 entries in the PT from 2 – 18
Then, when we are mapping a page for which no mapping exists, we take the first entry where V is off
But before that we check whether we have any entries where A is on that map our page, in which case we switch V on for that mapping and we are off
Having written that out I can see we only need the A bit if we think that a 0x00 <–> 0x00 mapping might be a valid one, but we don’t have to make that assumption and so we can restrict this to just the V bit.
I have now written some very basic but functional paging code for the Microblaze but have realised I am, essentially, going to have to start a lot of it again.
The Microblaze puts page tables firmly under software control and its memory management unit (MMU) concentrates on managing page references through its translation lookaside buffers (TLBs) instead. So a reference to a virtual address that is mapped in a TLB is handled invisibly to the user or programmer, but a reference to a memory address that may be mapped in a page table but not in the TLB generates an exception and everything is dumped into the lap of the programmer – this is what is often known as a “soft fault” (a “hard fault” being when there is no version of the page being referenced available in physical memory at all).
The number of TLBs is quite limited – just 64 – so to handle (or in my case, simulate), say, 512kB of physical memory through 4kB page frames you need to expect to handle quite a few soft as well as hard faults and so you need to take a view on which mappings get evicted from the TLB and – if you are referencing more than 512kB of program and data – which pages get evicted from the page table and, hence, physical memory.
This is where the difficulties start, because there is no simple way to know when a read through a TLB mapping takes place – the very point is that it is invisible (and hence quick) to the programmer or any software she or he may have written. The problem with that is that – for most situations – the only guide you have to future behaviour in terms of memory references is past behaviour: so it would be very helpful to know whether a page has been recently accessed (on the basis that if it was accessed recently then it will be accessed again soon).
The standard response to this sort of problem is some form of “CLOCK” algorithm – which was first implemented, I believe, for the Multics operating system.
Multics can be thought of as the estranged (and now late) father of Unix – the “Uni” being a play on the “Multi” of the earlier system – and both direction and through its offspring its influence on computers has been profound and one of our inheritances is CLOCK, some version of which is almost certainly running in the computer, phone or tablet on which you are reading this.
The principle of CLOCK is simple. A “clock hand” sweeps regularly through the page mappings marking them invalid, then on a subsequent attempt to reuse the page mapping the valid bit has to be reset (meaning the page has been used recently) or alternatively if a new mapping is needed then we can through out the first page in the list of mappings where the mapping is marked as invalid.
And this, or some version of it is what I am now going to have to implement for the Microblaze. The obvious thing to do is to have some sort of timer interrupt drive the time clock – though I am not even sure the Microblaze has a timer interrupt available – I’m guessing it doesn’t – as it would expect those to come from the board, so this could be tricky!
In my PhD world a year’s worth of software experimentation has proved what we all knew already … that systems using traditional memory models struggle in the Network-on-Chip environment and so I am now trying something slightly different.
My “model” (it’s all in software) is of a 16 core system, with each core having a small amount of on-chip memory (32k), which are combined together to form a flat memory space. Memory in this space can be accessed quickly, memory outside it, in the next level up in the hierarchy, is roughly 100 times further away.
Using any form of traditional paging model (including Belady’s optimal page replacement algorithm) this system starts to thrash on even moderate loads – the cost of moving pages in and out of the local memory determines performance and so there is no benefit from adding additional processors (in fact it just slows the individual processors down).
Such an outcome makes any promise of improved performance from parallelism void – it does not really matter how efficiently you have parallelised the code (some corner cases excepted – eg if all chips were accessing the same memory at the same time), you are trapped by a memory I/O bound.
So now I want to look at alternatives beyond the usual 4k (or 2k) paging – but I have been struggling all week to get the locking semantics of my code right. Concurrency is hard.
The one thing that debugging parallel code and locks teaches you again and again is never to assume that some event will be so rare you don’t need to bother about it: because when you are executing millions of instructions a second, even rare events tend to happen.
It has also taught me to check return values – code that will “always” work in a single threaded environment may actually turn out to be quite a tricky customer when running in parallel with other instances of itself or when it is accessing shared memory.
But, finally, the main lesson this week has been about going atomic.
I have a tendency to think – if I can release that lock for a few lines of code that might improve overall performance and I can just lock it again a little later. Beware of that thought.
If you need to make a series of actions atomic you need to hold the same lock across them all – releasing it for even a few lines breaks atomicity and will quite likely break your code.
Every programmer will be familiar with something like this…
A little while back I wrote a program that simulates – crudely but effectively – a multicore NoC device. I use it to model the execution times of different page replacement algorithms.
The input is XML generated via a step by step trace of a working program. The actually instructions being traced do not matter – what I care about are the memory access patterns.
To allow me to test more models more quickly I have now written some R code that generates a semi-random access pattern based, very loosely indeed, on the patterns seen in the real program. The advantage is I can test against a set number of memory accesses but with a range of pseudo-random access patterns, so although I am not running models against the “real” access pattern, neither am I taking three weeks per experiment.
But when I used the artificially generated access patterns, my program crashed with a seg fault. But even more confusingly, when I ran the code in GDB, the GNU Debugger, if I stepped through the code it worked, but I just ran the code in debugger then it crashed just as it did without using the debugger.
After a few hours I realised why – in my artificial patterns, the first thing the first thread does is spawn all the other threads to be used. In real world code, of course, these spawns take place after quite some code has been executed.
Every code spawn causes the ncurses code I am using to update the screen. When using ‘real’ access patterns these updates take place comfortably after all the ncurses environment has been set up (by a separate thread), but in the artificial code, the thread updates are the first thing that get posted to the screen, even before ncurses has been set up – hence the crash.
If I step through the code then the ncurses thread runs ahead and sets up the screen before I hit the thread update code and again it works.
The solution? Use a condition variable and a mutex to ensure that nothing executes before the ncurses environment is fully established.
Not a big deal – but perhaps, at some point in the future someone struggling to understand why their code – which previously worked so well – has now stopped processing what seems to be well-formed input. Hope this helps!
This is a blog post where I am going to write about things as a way of clarifying, in my own mind, what the best way of tackling a problem is.
So far, in research for my PhD, I have concentrated on establishing some base points for potential performance of Network-on-Chip systems running multithreaded code.
Nearly nine months ago I hacked at Valgrind‘s Lackey tool to ensure it produced XML output recording every memory reference made by a piece of code running under it. This was really basic stuff – Lackey recognises four primatives – code for code-in-execution, and load, store and modify (a combined load and store) for read-write memory. So typically you get blocks of code followed by some read-write records and then some more code. I don’t know what the operands are, just the primative type, the address and the size of the piece of memory being used (whether for code or read-write operations).
I then used that to record the output of one of the Parsec parallel benchmarks – a 16 thread (actually it executes 18 threads) piece of video transcoding. In the real world this ran in seconds, under Lackey it took a few days and output about 200GB of XML.
That XML can then be broken down into thread-specific strands of execution – 18 of these in all, of various sizes, but all of the order of several GB at least.
These are then plugged in to some sort of simulator. The basic hardware model being simulated has remained the same throughout (mostly, I did fiddle with it a bit a while back but decided that wasn’t worth it). So we have 16 cores sharing a flat 512kB memory space (this is very loosely based on the Parallella system, but it is not meant to be any sort of simulation of it). There is no caching and no sense that any part of the memory is further from one core than another.
What does alter is the page replacement algorithm used. I first tried FIFO and the code ran for many weeks and completed in about 60 billion simulated ticks – if a memory reference is to a page in the 512kB then it is deemed to take 1 tick to complete, if the reference is to a page (a 4k page size has been used thus far), it takes 100 ticks per 16 byte line to load (25600 ticks for a whole 4k page) – and plainly we have to decide what page gets evicted if our 512k store is already full.
Messing about with various LRU models showed that a two queue LRU did give a little better performance than a simple single LRU queue, and that completed in around 50 billion ticks (and two weeks or so of running).
I then built – more or less starting from scratch – a version of the simulator that modelled Belady’s OPT. That required some very large binary trees to be used – along with 250GB of RAM – and completed the code in about 22 billion ticks (and about three weeks in wall clock time).
All these models showed one piece of common behaviour – thrashing, as demonstrated by the fact that adding additional cores to the execution did not increase the amount of code being executed: instead each individual core had to handle more page faults as the cores competed for the small pool of memory.
I now have two pieces of code running which aim to measure the (in)efficiency of these “traditional” paging approaches – come back in a few weeks to see what they show.
So, while they run I need to get on to the next stage, which is testing some alternative approaches. But I have a problem – I cannot wait three weeks for each experiment to run. There simply is not any time for that.
The alternatives boil down to chopping up sections of my current XML from the benchmark, or writing a traffic generator.
The traffic generator idea has a lot to be said for it – my supervisor certainly is in favour – but it is not without weaknesses: the degree of locality between the different threads executing is really quite important – a lot of locality and the fault count falls and code gets executed fast – poor locality and the fault count rockets.
But how do I model that: I am not sure there is any literature out there that discusses this problem in much detail – multithreading is hard and for that reason rational people avoid it!
But using the chopped up XML is not without issues either – it’s inflexible, it elevates one instance of executing code to be a model for all executing code and so is just as vulnerable to the question of locality.
I’ve had a tough week. Ran a half marathon (quite badly, but I finished – see the picture) on Sunday, hobbled around York University (blister on my foot) on Monday before returning to London and coping with the added stress of a new job.
All the while the failure of my latest piece of code to work was annoying me too – eating away at me and making me wonder if the whole PhD thing was a bit of Quixotic nonsense or even Panzan stupidity.
Anyone who has chased down a bug for days will know the feeling – the root cause must be out there somewhere, even if you have been through your code dozens of times and can see nothing wrong.
Sitting on a long commute to my place of (temporary) work this morning I realised that my problem could only be down to one thing – a race condition.
I am emulating an LRU/2 type page replacement policy – with two page queues – a “high” and a “low”: pages are initially assigned to low, but can be pushed into high on a second reference. Pages only leave via low (they can be pushed out of high into low if we run out of space in high and so on).
With one thread running there was no problem, but when a second thread came on board the application fell over – and I realised it was because the manipulation of the LRU queues was not atomic.
And, a few dozen code changes later (done on the reverse journey back to London) – and some accounting for C++ template strangeness (it seems that an iterator across a map, means the map is not const – that took me a while to work out) and the problem has gone away and suddenly academic endeavour feels worth it again.
This is a post about my PhD research: in fact it is a sort of public rumination, an attempt to clarify my thoughts in writing before I take the next step.
It’s also possibly an exercise in procrastination: a decision to write about what I might do next, rather than to get on with doing it, but I am going to suppress that thought for now.
I am looking for ways to make “network on chip” systems more viable as general use (or any use, I suppose) computing platforms. These systems are a hardware response to the hardware problem that is causing such difficulties for big software and hardware manufacturers alike: namely that we cannot seem to make faster computers any more.
The problem we have is that while we can still get more transistors on a chip (i.e., that “Moore’s Law” still applies), we cannot keep operating them at faster speed (i.e., “Dennard Scaling” has broken down) as they get too hot.
In response we can either build better small devices (mobile phones, tablets) or try to build faster parallel computing devices (so instead of one very fast chip we have several moderately fast chips and try to have better software that makes good use of their ability to compute things in parallel).
Network-on-chip (NoC) processors are a major step along the road of having parallel processors – we put more processing units on a single piece of silicon rather than have them co-operate via external hardware. But the software has not caught up and we just cannot keep these chips busy enough to get the benefit their parallelism might offer.
That is where I hope to make a difference, even if just at the margins. Can I find a way to make the NoC chips busier, specifically by keeping them fed with data and code from the computer memory fast enough?
I have tried the obvious and simple methods: essentially adaptations of methods that have been used for most of the last 50 years in conventional serial computer devices and the answer is ‘no’ if that is all that is on offer.
Messing about with the standard algorithms used to feed code and data to the processors hits a solid brick wall: the chips have a limited amount of ‘fast’ local memory and the time it takes to keep that refreshed with up-to-date code and data places a fundamental limit on performance.
So, while many computer science students might be familiar with “Amdahl’s Law” which stipulates that, for parallel code, the elements that have to be run in serial (even if just setting up the parallel section) place a fundamental limit on how much extra performance we can squeeze out by throwing more and more parallel processors at the problem – we have a different, if related, problem here. Because we can apply more and more parallel processors to the problem but the performance remains constant, because even though we are running parallel code, we are limited by memory performance.
This limit – which implies that as we use more processors they become individually less efficient – even hits the so-called “clairvoyant” or optimal (OPT) memory management/page replacement algorithm: OPT knows which memory page it is most efficient to replace but is still limited by the fundamental barrier of limited on-chip memory.
The limit is manifest in the straight lines we can see in the plot here – the steeper slope of OPT means it runs faster but after the first few processors are brought to bear on the problem (the number of processors being used climbs for the first few billion instructions) the rate of instructions executed per ‘tick’ (an analogue of time) is constant.
Getting NoCs to run faster and so releasing the benefits from the potentially massive parallelism they could bring, depends on beating this memory barrier (and lots of other things too, but one at a time!). So, what are the options?
Well, one thing I can rule out is trying to cache a particular piece of a memory page (in traditional operating systems memory is shifted about the system in blocks called pages – typically 4096 bytes long). Caches typically store memory in 16 byte “lines” – hardware reads from the backing memory store in 16 byte blocks in most cases – and so I tested to see if there was a pattern in which 16 byte line was most likely to be used (see previous blog post). My instinct from looking at the plot is that will not work.
Similarly, a look at which pages were being used doesn’t reveal any immediately obvious pattern – some pages are used heavily by code, some are not – nothing surprising there.
So, the easy things do not work. Now I need to look at the hard things.
I think I need to escape from the page paradigm – one thing to look at is the size of the memory objects that are accessed. 4k pages are simple to handle – load a block in, push it out: but they could be (probably are) very inefficient. Might it be better to base our memory caching system on object sizes? That’s what I plan to check.
Interestingly, the two series are very close to isomorphic (i.e., they have essentially the same shape if one scales the x-plot appropriately).
This is based on running a short piece of code to decode 32 frames of video – the memory model used was 4k pages with a pooled 512k on chip (each chip can access any part of the pool at the cost of 1 tick) and unlimited memory 100 ticks away per 128 bits of read-in (so for a 4k page it takes 25,600 ticks to fault the page in).
This is a bit of a crude simplification of the memory model seen on, say, a Tilera TILE64 system, but it’s close enough (though this simulation had 16 cores).
This is not a story of a great debugging triumph – but it is one that points to a great debugging truth – study of the bug before you start to pore over your code is more likely to get you to a solution faster.
In OPT the idea is that, when we need to make room in main memory for a new page, we select for removal the page with the longest “resuse distance” – in other words, the one we will have to wait the longest (perhaps forever) before needing to use again. This algorithm is sometimes called the “clairvoyant” algorithm because it requires foreknowledge of which memory page will get used when. That does not happen very often in general use computing, but can often be the case in embedded computing, where the code does exactly the same thing over and over again.
In my case I am using a memory trace from a transcoding of 32 frames of video – a small (in terms of time on a real computing device) example of the sort of parallel task you might see in embedded devices. In the real world this runs for a few seconds – but it also generates 220GB of XML trace records spread across 18 threads.
With a single thread it’s easy to work out the reuse distance – you just look at how long it will be before a page gets referenced: you could even do this statically ahead of runtime and just the result up if you wanted.
That is not true in multithreaded code – one thread might run fast or low (eg while waiting for IO) and so the only way to do it is to measure the reuse distances for each page and for every thread:
For each thread calculate the minimum reuse distance of each page;
Then pick the page with the maximum minimum reuse distance across all threads.
I wrote some code that seemed to do this and on my testing subset of the 220GB XML it seemed to deliver good results. But whenever I ran it against the full trace it started brightly but then performance – by which I mean how fast the simulator ran through the traces in terms of the synthetic ticks generated by the simulator, or bluntly the simulated performance – just seemed to get worse and worse.
In fact the longer a simulated thread seemed to be running, the worse its performance got and the “fastest” thread was always the most recently spawned one, and the more threads that ran the worse this problem got.
Now, the combination of severely limited memory (in this case we were simulating a 16 core NoC with 32Kb local memory per core, which is pooled into one flat 512Kb space), performance can go downhill quite badly as the thread count climbs – though this fall off was catastrophic and it was plain that OPT was going to be worse than “least recently used” (LRU) – and that just cannot be correct! I have not sat down to write a mathematical proof of that but instinctively I know it to be true…
Reading through the code did not draw my attention to any obvious flaws, so I had to sit down and think about what the bug showed me – it worked well on short runs, the most recent thread seemed to do well and the recent threads in general did better than the longer established threads.
Even writing this down now makes it seem obvious – my code was in some way biased towards more recently created threads. And so instead searching through all the code looking for errors, I could instead home in on those parts of code that scanned through each thread.
I found such an error quite quickly but testing again showed that while the problem was diminished, it was not by much – I still had not found what was really to blame.
Another scan through the key sections of the code revealed the real error: when a thread tried to page in memory it only examined the reuse distances of itself and those threads created after it.
Thread data was stored in a linked list, but instead of starting the scan at the head of the list, the scan began at the pointer to the current thread. The result was that the most recent thread had a “perfect” OPT experience – on every page in its reuse distances were fully considered, but at the other extreme the first thread’s reuse distances were only considered when it came to page in memory – so the pages it used but were not used by any other thread were fair game – they appeared to have an infinite reuse distance and so were almost certainly the ones chosen for replacement more or less every time.
Fixing the code so that the scan began with the head of the linked list and not just the local pointer fixed the problem and the OPT simulator is now running – my guess is that it is going to show OPT to be two to three times more efficient than LRU.