Every programmer will be familiar with something like this…
A little while back I wrote a program that simulates – crudely but effectively – a multicore NoC device. I use it to model the execution times of different page replacement algorithms.
The input is XML generated via a step by step trace of a working program. The actually instructions being traced do not matter – what I care about are the memory access patterns.
To allow me to test more models more quickly I have now written some R code that generates a semi-random access pattern based, very loosely indeed, on the patterns seen in the real program. The advantage is I can test against a set number of memory accesses but with a range of pseudo-random access patterns, so although I am not running models against the “real” access pattern, neither am I taking three weeks per experiment.
But when I used the artificially generated access patterns, my program crashed with a seg fault. But even more confusingly, when I ran the code in GDB, the GNU Debugger, if I stepped through the code it worked, but I just ran the code in debugger then it crashed just as it did without using the debugger.
After a few hours I realised why – in my artificial patterns, the first thing the first thread does is spawn all the other threads to be used. In real world code, of course, these spawns take place after quite some code has been executed.
Every code spawn causes the ncurses code I am using to update the screen. When using ‘real’ access patterns these updates take place comfortably after all the ncurses environment has been set up (by a separate thread), but in the artificial code, the thread updates are the first thing that get posted to the screen, even before ncurses has been set up – hence the crash.
If I step through the code then the ncurses thread runs ahead and sets up the screen before I hit the thread update code and again it works.
The solution? Use a condition variable and a mutex to ensure that nothing executes before the ncurses environment is fully established.
Not a big deal – but perhaps, at some point in the future someone struggling to understand why their code – which previously worked so well – has now stopped processing what seems to be well-formed input. Hope this helps!
Paging and virtual memory is at the heart of just about any computing device – more complex than a DVD player – we use everyday.
Paging is the memory management system based on the idea that we can divide the real memory of our computer into a sequence of smallish (typically 4,096 bytes) of “page frames” and then load the bits of data and program in and out of those frames (in “pages”) as we need them.
So, you can have pages from various different running programs in the page frames at any given time and then use a “virtual memory” system to map the pages placed in an arbitrary frame to the memory address the program thinks the page should be resident in.
It is not the only system we could use – “segments”, which involve moving large chunks (as opposed to small pages) of memory about is one approach, while “overlays” – using part of the memory space as sort of scratchpad working area – is another. More recently, with bigger “traditional” computers very large pages have been used as a way of making, at least in theory, more efficient use of memory now measured in billions (as opposed to a few tens) of bytes.
But paging is easily the most widely used approach and has been integral to the development of “multitasking” and similar shared resources approaches to computing – because paging allows us to just have the useful bits of a program and its data in memory we can have many more programs “running” at a given time.
But my PhD research is pointing me towards some of the weaknesses of the paging approach.
At the heart of the case for paging is the idea of “locality” in a computer’s use of resources: if you use one memory address at one instant there is a high probability you will use a nearby address very soon: think of any sort of sequential document or record and you can see why that idea is grounded in very many use cases of computing devices.
Locality means that it ought to make sense to read in memory in blocks and not just one little drop at a time.
But this principle may be in opposition to efficient usage of memory when competition for space in fierce: such as for the limited local memory resources we have on a Network-on-Chip computer.
Right now I am collecting data to measure the efficiency of 4k pages on such (simulated) devices. With 16 simulated cores trying to handle up to 18 threads of execution competition for pages is intense and the evidence suggests that they are resident, in some cases at least, for many fewer “ticks” than it takes to load them from the next lowest level in the memory hierarchy. On top of that many pages show that the principle of locality can be quite a weak one – pages of code are, in general, quite likely to demonstrate high locality (especially in loops) but pages of read-write memory may not.
I don’t have all the data to hand – essentially I am transforming one 200GB XML file into another XML file which will likely be around the same size and that takes time, even on quite a high spec computer (especially when you have to share resources with other researchers). But I expect some interesting results.
But compile time error after compile timer error followed as I was told that I could not use this code with unsigned long or with const long and so on … but surely these are simple conversions I thought…
…Just me thinking like a C programmer again. Passing by reference is just that – there is no copying or conversion available and so long and unsigned long are fundamentally incompatible.
The solution? Just pass by value – afterall the saving in passing a more or less primative type like long by reference must be pretty minimal – passing the stream by reference is the real saving here (actually, it’s also practical – as we want the output to go to the real stream and not a copy: in C we’d use a pointer here).
This is a post about my PhD research: in fact it is a sort of public rumination, an attempt to clarify my thoughts in writing before I take the next step.
It’s also possibly an exercise in procrastination: a decision to write about what I might do next, rather than to get on with doing it, but I am going to suppress that thought for now.
I am looking for ways to make “network on chip” systems more viable as general use (or any use, I suppose) computing platforms. These systems are a hardware response to the hardware problem that is causing such difficulties for big software and hardware manufacturers alike: namely that we cannot seem to make faster computers any more.
The problem we have is that while we can still get more transistors on a chip (i.e., that “Moore’s Law” still applies), we cannot keep operating them at faster speed (i.e., “Dennard Scaling” has broken down) as they get too hot.
In response we can either build better small devices (mobile phones, tablets) or try to build faster parallel computing devices (so instead of one very fast chip we have several moderately fast chips and try to have better software that makes good use of their ability to compute things in parallel).
Network-on-chip (NoC) processors are a major step along the road of having parallel processors – we put more processing units on a single piece of silicon rather than have them co-operate via external hardware. But the software has not caught up and we just cannot keep these chips busy enough to get the benefit their parallelism might offer.
That is where I hope to make a difference, even if just at the margins. Can I find a way to make the NoC chips busier, specifically by keeping them fed with data and code from the computer memory fast enough?
I have tried the obvious and simple methods: essentially adaptations of methods that have been used for most of the last 50 years in conventional serial computer devices and the answer is ‘no’ if that is all that is on offer.
Messing about with the standard algorithms used to feed code and data to the processors hits a solid brick wall: the chips have a limited amount of ‘fast’ local memory and the time it takes to keep that refreshed with up-to-date code and data places a fundamental limit on performance.
So, while many computer science students might be familiar with “Amdahl’s Law” which stipulates that, for parallel code, the elements that have to be run in serial (even if just setting up the parallel section) place a fundamental limit on how much extra performance we can squeeze out by throwing more and more parallel processors at the problem – we have a different, if related, problem here. Because we can apply more and more parallel processors to the problem but the performance remains constant, because even though we are running parallel code, we are limited by memory performance.
This limit – which implies that as we use more processors they become individually less efficient – even hits the so-called “clairvoyant” or optimal (OPT) memory management/page replacement algorithm: OPT knows which memory page it is most efficient to replace but is still limited by the fundamental barrier of limited on-chip memory.
The limit is manifest in the straight lines we can see in the plot here – the steeper slope of OPT means it runs faster but after the first few processors are brought to bear on the problem (the number of processors being used climbs for the first few billion instructions) the rate of instructions executed per ‘tick’ (an analogue of time) is constant.
Getting NoCs to run faster and so releasing the benefits from the potentially massive parallelism they could bring, depends on beating this memory barrier (and lots of other things too, but one at a time!). So, what are the options?
Well, one thing I can rule out is trying to cache a particular piece of a memory page (in traditional operating systems memory is shifted about the system in blocks called pages – typically 4096 bytes long). Caches typically store memory in 16 byte “lines” – hardware reads from the backing memory store in 16 byte blocks in most cases – and so I tested to see if there was a pattern in which 16 byte line was most likely to be used (see previous blog post). My instinct from looking at the plot is that will not work.
Similarly, a look at which pages were being used doesn’t reveal any immediately obvious pattern – some pages are used heavily by code, some are not – nothing surprising there.
So, the easy things do not work. Now I need to look at the hard things.
I think I need to escape from the page paradigm – one thing to look at is the size of the memory objects that are accessed. 4k pages are simple to handle – load a block in, push it out: but they could be (probably are) very inefficient. Might it be better to base our memory caching system on object sizes? That’s what I plan to check.
This follows on from the previous post – here are the plots.
These are based on a run of the PARSEC benchmark suite x264 program – encoding 32 frames of video using 18 threads – 16 worker threads: the plots show how often each 16 byte “line” is used – whether as an instruction is read or the memory is used for read-write storage. Sixteen bytes is both the size of a typical cache line and a read from a DDR memory.
The code plot might suggest there is some pattern – between about segments 100 (offset 0x640 inside the page) and 200 (offset 0xC80) there is an increased hit rate) but my guess is that is an artefact of the particular code being used here, rather a general issue (possible explanations would be a particular library function being very heavily used): though conceivably it could be an issue with the GCC compiler or linker.
That might be worth further exploration, but not for me. From visual inspection I am concluding that the distribution of accesses inside a 4k page doesn’t justify trying to pre-cache particular 16 byte “lines”.
This is not a story of a great debugging triumph – but it is one that points to a great debugging truth – study of the bug before you start to pore over your code is more likely to get you to a solution faster.
In OPT the idea is that, when we need to make room in main memory for a new page, we select for removal the page with the longest “resuse distance” – in other words, the one we will have to wait the longest (perhaps forever) before needing to use again. This algorithm is sometimes called the “clairvoyant” algorithm because it requires foreknowledge of which memory page will get used when. That does not happen very often in general use computing, but can often be the case in embedded computing, where the code does exactly the same thing over and over again.
In my case I am using a memory trace from a transcoding of 32 frames of video – a small (in terms of time on a real computing device) example of the sort of parallel task you might see in embedded devices. In the real world this runs for a few seconds – but it also generates 220GB of XML trace records spread across 18 threads.
With a single thread it’s easy to work out the reuse distance – you just look at how long it will be before a page gets referenced: you could even do this statically ahead of runtime and just the result up if you wanted.
That is not true in multithreaded code – one thread might run fast or low (eg while waiting for IO) and so the only way to do it is to measure the reuse distances for each page and for every thread:
For each thread calculate the minimum reuse distance of each page;
Then pick the page with the maximum minimum reuse distance across all threads.
I wrote some code that seemed to do this and on my testing subset of the 220GB XML it seemed to deliver good results. But whenever I ran it against the full trace it started brightly but then performance – by which I mean how fast the simulator ran through the traces in terms of the synthetic ticks generated by the simulator, or bluntly the simulated performance – just seemed to get worse and worse.
In fact the longer a simulated thread seemed to be running, the worse its performance got and the “fastest” thread was always the most recently spawned one, and the more threads that ran the worse this problem got.
Now, the combination of severely limited memory (in this case we were simulating a 16 core NoC with 32Kb local memory per core, which is pooled into one flat 512Kb space), performance can go downhill quite badly as the thread count climbs – though this fall off was catastrophic and it was plain that OPT was going to be worse than “least recently used” (LRU) – and that just cannot be correct! I have not sat down to write a mathematical proof of that but instinctively I know it to be true…
Reading through the code did not draw my attention to any obvious flaws, so I had to sit down and think about what the bug showed me – it worked well on short runs, the most recent thread seemed to do well and the recent threads in general did better than the longer established threads.
Even writing this down now makes it seem obvious – my code was in some way biased towards more recently created threads. And so instead searching through all the code looking for errors, I could instead home in on those parts of code that scanned through each thread.
I found such an error quite quickly but testing again showed that while the problem was diminished, it was not by much – I still had not found what was really to blame.
Another scan through the key sections of the code revealed the real error: when a thread tried to page in memory it only examined the reuse distances of itself and those threads created after it.
Thread data was stored in a linked list, but instead of starting the scan at the head of the list, the scan began at the pointer to the current thread. The result was that the most recent thread had a “perfect” OPT experience – on every page in its reuse distances were fully considered, but at the other extreme the first thread’s reuse distances were only considered when it came to page in memory – so the pages it used but were not used by any other thread were fair game – they appeared to have an infinite reuse distance and so were almost certainly the ones chosen for replacement more or less every time.
Fixing the code so that the scan began with the head of the linked list and not just the local pointer fixed the problem and the OPT simulator is now running – my guess is that it is going to show OPT to be two to three times more efficient than LRU.