# Traffic generation options

This is a blog post where I am going to write about things as a way of clarifying, in my own mind, what the best way of tackling a problem is.

So far, in research for my PhD, I have concentrated on establishing some base points for potential performance of Network-on-Chip systems running multithreaded code.

Nearly nine months ago I hacked at Valgrind‘s Lackey tool to ensure it produced XML output recording every memory reference made by a piece of code running under it. This was really basic stuff – Lackey recognises four primatives – code for code-in-execution, and load, store and modify (a combined load and store) for read-write memory. So typically you get blocks of code followed by some read-write records and then some more code. I don’t know what the operands are, just the primative type, the address and the size of the piece of memory being used (whether for code or read-write operations).

I then used that to record the output of one of the Parsec parallel benchmarks – a 16 thread (actually it executes 18 threads) piece of video transcoding. In the real world this ran in seconds, under Lackey it took a few days and output about 200GB of XML.

That XML can then be broken down into thread-specific strands of execution – 18 of these in all, of various sizes, but all of the order of several GB at least.

These are then plugged in to some sort of simulator. The basic hardware model being simulated has remained the same throughout (mostly, I did fiddle with it a bit a while back but decided that wasn’t worth it). So we have 16 cores sharing a flat 512kB memory space (this is very loosely based on the Parallella system, but it is not meant to be any sort of simulation of it). There is no caching and no sense that any part of the memory is further from one core than another.

What does alter is the page replacement algorithm used. I first tried FIFO and the code ran for many weeks and completed in about 60 billion simulated ticks – if a memory reference is to a page in the 512kB then it is deemed to take 1 tick to complete, if the reference is to a page (a 4k page size has been used thus far), it takes 100 ticks per 16 byte line to load (25600 ticks for a whole 4k page) – and plainly we have to decide what page gets evicted if our 512k store is already full.

Messing about with various LRU models showed that a two queue LRU did give a little better performance than a simple single LRU queue, and that completed in around 50 billion ticks (and two weeks or so of running).

I then built – more or less starting from scratch – a version of the simulator that modelled Belady’s OPT. That required some very large binary trees to be used – along with 250GB of RAM – and completed the code in about 22 billion ticks (and about three weeks in wall clock time).

All these models showed one piece of common behaviour – thrashing, as demonstrated by the fact that adding additional cores to the execution did not increase the amount of code being executed: instead each individual core had to handle more page faults as the cores competed for the small pool of memory.

I now have two pieces of code running which aim to measure the (in)efficiency of these “traditional” paging approaches – come back in a few weeks to see what they show.

So, while they run I need to get on to the next stage, which is testing some alternative approaches. But I have  a problem – I cannot wait three weeks for each experiment to run. There simply is not any time for that.

The alternatives boil down to chopping up sections of my current XML from the benchmark, or writing a traffic generator.

The traffic generator idea has a lot to be said for it – my supervisor certainly is in favour – but it is not without weaknesses: the degree of locality between the different threads executing is really quite important – a lot of locality and the fault count falls and code gets executed fast – poor locality and the fault count rockets.

But how do I model that: I am not sure there is any literature out there that discusses this problem in much detail – multithreading is hard and for that reason rational people avoid it!

But using the chopped up XML is not without issues either – it’s inflexible, it elevates one instance of executing code to be a model for all executing code and so is just as vulnerable to the question of locality.

# In the steps of László Bélády

Update: I have truncated this article for now (20 December) as there was an error in my LRU software that made LRU look like a much better performer than it really was. I’ll update this with the correct data shortly….

In 1966 László Bélády published “A study of replacement algorithms for virtual storage computers”, one of the truly epoch making papers for operating system science – the first comprehensive examination of page replacement strategies for virtual memory computers.

These days all but the simplest embedded computing devices will use some sort of virtual memory system because it allows computing devices to (relatively) seamlessly load bits of computer programs in and out of memory as needed – the programs see a faked – virtual – address and so the chunks can be loaded in an out of whatever piece of memory is available without worrying about having to get the chunks into exactly the same place every time.

But in 1966 virtual memory was a new and essentially experimental technology and so Belady’s examination of the different strategies for deciding which chunk (page) of memory was kept or replaced when new pages were required to be loaded is the foundation stone of all the approaches that followed.

This last couple of weeks I have found myself walking in the steps of Bélády as I built software to examine the different performance characteristics of potential page replacement policies in a network-on-chip computer.

I have about 220GB of XML data which represents a record of the memory accesses of an 18 threaded video processing application – and using that data I can test computer system performance using various different policies.

# The binomial distribution, part 1

I think there are now going to be a few posts here which essentially are about me rediscovering some A level maths probability theory and writing it down as an aid to memory.

All of this is related as to whether the length of time pages are part of the working set is governed by a stochastic (probabilistic) process or a deterministic process. Why does it matter? Well, if the process was stochastic then in low memory situations a first-in, first-out approach, or simple single queue LRU approach to page replacement might work well in comparison to the 2Q LRU approach currently in use. It is an idea that is worth a little exploring, anyway.

So, now the first maths aide memoire – simple random/probabilistic processes are binomial – something happens or it does not. If the probability of it happening in a unit time period is $p$ (update: is this showing up as ‘nm’? It’s meant to be ‘p’!) then the probability it will not happen is $1 - p = q$.  For instance this might be the probability that an atom of Uranium 235 shows $\alpha$particle decay (the probability that one U 235 atom will decay is given by its half-life of 700 million years ie., $2.21\times10^{16}$ seconds, or a probability, if my maths is correct, of a given individual atom decaying in any particular second of approximately $4.4\times10^{-16}$.

(In operating systems terms my thinking is that if the time pages spent in a working set were governed by similar processes then there will be a half life for every page that is read in. If we discarded pages after they were in the system after such a half life, or better yet some multiple of the half life, then we could have a simpler page replacement system – we would not need to use a CLOCK algorithm, just record the time a page entered the system and stick it in a FIFO queue and discard it when the entry time was more than a half life ago.

An even simpler case might be to just discard pages once the stored number reached above a certain ‘half life’ limit. Crude, certainly, but maybe the simplicity might compensate for the lack of sophistication.

Such a system would not work very well for a general/desktop operating system – as the graph for the MySQL daemon referred to in the previous blog shows, even one application could seem to show different distributions of working set sizes. But what if you had a specialist system where the OS only ran one application – then tuning might work: perhaps that could even apply to mass electionics devices, such as Android phones – after all the Android (Dalvik) VM is what is being run each time.)