Traffic generation options


English: Read Only Memory
English: Read Only Memory (Photo credit: Wikipedia)

This is a blog post where I am going to write about things as a way of clarifying, in my own mind, what the best way of tackling a problem is.

So far, in research for my PhD, I have concentrated on establishing some base points for potential performance of Network-on-Chip systems running multithreaded code.

Nearly nine months ago I hacked at Valgrind‘s Lackey tool to ensure it produced XML output recording every memory reference made by a piece of code running under it. This was really basic stuff – Lackey recognises four primatives – code for code-in-execution, and load, store and modify (a combined load and store) for read-write memory. So typically you get blocks of code followed by some read-write records and then some more code. I don’t know what the operands are, just the primative type, the address and the size of the piece of memory being used (whether for code or read-write operations).

I then used that to record the output of one of the Parsec parallel benchmarks – a 16 thread (actually it executes 18 threads) piece of video transcoding. In the real world this ran in seconds, under Lackey it took a few days and output about 200GB of XML.

That XML can then be broken down into thread-specific strands of execution – 18 of these in all, of various sizes, but all of the order of several GB at least.

These are then plugged in to some sort of simulator. The basic hardware model being simulated has remained the same throughout (mostly, I did fiddle with it a bit a while back but decided that wasn’t worth it). So we have 16 cores sharing a flat 512kB memory space (this is very loosely based on the Parallella system, but it is not meant to be any sort of simulation of it). There is no caching and no sense that any part of the memory is further from one core than another.

What does alter is the page replacement algorithm used. I first tried FIFO and the code ran for many weeks and completed in about 60 billion simulated ticks – if a memory reference is to a page in the 512kB then it is deemed to take 1 tick to complete, if the reference is to a page (a 4k page size has been used thus far), it takes 100 ticks per 16 byte line to load (25600 ticks for a whole 4k page) – and plainly we have to decide what page gets evicted if our 512k store is already full.

Messing about with various LRU models showed that a two queue LRU did give a little better performance than a simple single LRU queue, and that completed in around 50 billion ticks (and two weeks or so of running).

I then built – more or less starting from scratch – a version of the simulator that modelled Belady’s OPT. That required some very large binary trees to be used – along with 250GB of RAM – and completed the code in about 22 billion ticks (and about three weeks in wall clock time).

All these models showed one piece of common behaviour – thrashing, as demonstrated by the fact that adding additional cores to the execution did not increase the amount of code being executed: instead each individual core had to handle more page faults as the cores competed for the small pool of memory.

I now have two pieces of code running which aim to measure the (in)efficiency of these “traditional” paging approaches – come back in a few weeks to see what they show.

So, while they run I need to get on to the next stage, which is testing some alternative approaches. But I have  a problem – I cannot wait three weeks for each experiment to run. There simply is not any time for that.

The alternatives boil down to chopping up sections of my current XML from the benchmark, or writing a traffic generator.

The traffic generator idea has a lot to be said for it – my supervisor certainly is in favour – but it is not without weaknesses: the degree of locality between the different threads executing is really quite important – a lot of locality and the fault count falls and code gets executed fast – poor locality and the fault count rockets.

But how do I model that: I am not sure there is any literature out there that discusses this problem in much detail – multithreading is hard and for that reason rational people avoid it!

But using the chopped up XML is not without issues either – it’s inflexible, it elevates one instance of executing code to be a model for all executing code and so is just as vulnerable to the question of locality.

Deductive debugging


An illustration of multithreading where the ma...
An illustration of multithreading where the master thread forks off a number of threads which execute blocks of code in parallel. (Photo credit: Wikipedia)

This is not a story of a great debugging triumph – but it is one that points to a great debugging truth – study of the bug before you start to pore over your code is more likely to get you to a solution faster.

My example is my code to simulate Belady’s “OPT” page replacement algorithm for a multithreaded environment.

In OPT the idea is that, when we need to make room in main memory for a new page, we select for removal the page with the longest “resuse distance” – in other words, the one we will have to wait the longest (perhaps forever) before needing to use again. This algorithm is sometimes called the “clairvoyant” algorithm because it requires foreknowledge of which memory page will get used when. That does not happen very often in general use computing, but can often be the case in embedded computing, where the code does exactly the same thing over and over again.

In my case I am using a memory trace from a transcoding of 32 frames of video – a small (in terms of time on a real computing device) example of the sort of parallel task you might see in embedded devices. In the real world this runs for a few seconds – but it also generates 220GB of XML trace records spread across 18 threads.

With a single thread it’s easy to work out the reuse distance – you just look at how long it will be before a page gets referenced: you could even do this statically ahead of runtime and just the result up if you wanted.

That is not true in multithreaded code – one thread might run fast or low (eg while waiting for IO) and so the only way to do it is to measure the reuse distances for each page and for every thread:

  • For each thread calculate the minimum reuse distance of each page;
  • Then pick the page with the maximum minimum reuse distance across all threads.

I wrote some code that seemed to do this and on my testing subset of the 220GB XML it seemed to deliver good results. But whenever I ran it against the full trace it started brightly but then performance – by which I mean how fast the simulator ran through the traces in terms of the synthetic ticks generated by the simulator, or bluntly the simulated performance – just seemed to get worse and worse.

In fact the longer a simulated thread seemed to be running, the worse its performance got and the “fastest” thread was always the most recently spawned one, and the more threads that ran the worse this problem got.

Now, the combination of severely limited memory (in this case we were simulating a 16 core NoC with 32Kb local memory per core, which is pooled into one flat 512Kb space), performance can go downhill quite badly as the thread count climbs – though this fall off was catastrophic and it was plain that OPT was going to be worse than “least recently used” (LRU) – and that just cannot be correct! I have not sat down to write a mathematical proof of that but instinctively I know it to be true…

Reading through the code did not draw my attention to any obvious flaws, so I had to sit down and think about what the bug showed me – it worked well on short runs, the most recent thread seemed to do well and the recent threads in general did better than the longer established threads.

Even writing this down now makes it seem obvious – my code was in some way biased towards more recently created threads. And so instead searching through all the code looking for errors, I could instead home in on those parts of code that scanned through each thread.

I found such an error quite quickly but testing again showed that while the problem was diminished, it was not by much – I still had not found what was really to blame.

Another scan through the key sections of the code revealed the real error: when a thread tried to page in memory it only examined the reuse distances of itself and those threads created after it.

Thread data was stored in a linked list, but instead of starting the scan at the head of the list, the scan began at the pointer to the current thread. The result was that the most recent thread had a “perfect” OPT experience – on every page in its reuse distances were fully considered, but at the other extreme the first thread’s reuse distances were only considered when it came to page in memory – so the pages it used but were not used by any other thread were fair game – they appeared to have an infinite reuse distance and so were almost certainly the ones chosen for replacement more or less every time.

Fixing the code so that the scan began with the head of the linked list and not just the local pointer fixed the problem and the OPT simulator is now running – my guess is that it is going to show OPT to be two to three times more efficient than LRU.

 

Enhanced by Zemanta

In the steps of László Bélády


Update: I have truncated this article for now (20 December) as there was an error in my LRU software that made LRU look like a much better performer than it really was. I’ll update this with the correct data shortly….

 

In 1966 László Bélády published “A study of replacement algorithms for virtual storage computers”, one of the truly epoch making papers for operating system science – the first comprehensive examination of page replacement strategies for virtual memory computers.

These days all but the simplest embedded computing devices will use some sort of virtual memory system because it allows computing devices to (relatively) seamlessly load bits of computer programs in and out of memory as needed – the programs see a faked – virtual – address and so the chunks can be loaded in an out of whatever piece of memory is available without worrying about having to get the chunks into exactly the same place every time.

But in 1966 virtual memory was a new and essentially experimental technology and so Belady’s examination of the different strategies for deciding which chunk (page) of memory was kept or replaced when new pages were required to be loaded is the foundation stone of all the approaches that followed.

This last couple of weeks I have found myself walking in the steps of Bélády as I built software to examine the different performance characteristics of potential page replacement policies in a network-on-chip computer.

I have about 220GB of XML data which represents a record of the memory accesses of an 18 threaded video processing application – and using that data I can test computer system performance using various different policies.