Work changes and a determination to actually finish my PhD mean I really should make a bit of an effort here and so I will.
Here is a puzzle that has been bothering me about binary trees which has come from my PhD research…
In that research I am investigating how to implement virtual memory on a many-core Network-on-Chip (NoC) system. Essentially I have been building and running emulators.
The core of the emulation is the “memory tree” – the connection between the cores of the NoC and the global memory controller. The “Bluetree” memory tree is a binary tree arrangement, with the root node (level 0) connecting to the memory controller and the leaves at the NoC tiles (with each tile having a single processor).
At every non-leaf level of the tree there are multiplexors (mux) with a buffer that can hold a single memory packet – so if two requests for global memory arrive at once and the buffer is free there needs to be an arbitration process to decide which packet gets to occupy the buffer and which has to wait.
We have 128 leaf nodes – as in this tree…
With this many leaf nodes, the question of the arbitration policy of the muxes is important. A naïve approach would be to say, if two packets contest for the buffer, let the packet on the left win the first time and then pass the packet on the right: after all this guarantees no starvation.
But it is actually a hopeless policy … the leftmost leaf node (shall number from zero) faces no opposition, the rightmost (number 127) loses out to every other node.
But number of leaf node alone is not a sufficient way of describing the relative number of blocks – for instance leaf node 16 appears to be in a better place than leaf node 15 – as going up the tree leaf node 15 can be blocked at level 6 (the root node is at level 0), at level 5, level 4, level 3 and level 2: while leaf node 16 is only blocked at level 1.
Practical testing of this naïvely fair approach in the emulator gives results like the following… as can be seen the right hand part of the tree performs appallingly but even there lots of variation can be seen:
My question is this: what is the correct (i.e., correct mathematically) way of describing the order of the leaf nodes?
A week further on and not much C++ has been written – and now I think I need to make a new start.
Up to this point I have been trying to write a software model of the hardware and my thought was I could think put a software-modelling layer on top of that. But that simply is not going to work – it is just too complex.
Instead I am going to have to make some policy decisions in the software – essentially over how I model the local memory on the chip: each tile will process memory reads and writes and needs to know where that memory is – it could be in the global off-chip memory store or it could be on-chip.
The difference matters because, at least in theory, the on-chip memory is speedily accessible, while the off-chip memory is 50 to 100 to 500 times “further away”. Because memory accesses exhibit locality it makes sense to ship blocks of addressed memory from the global to the local store – but doing so takes time and if there are a lot of memory movements then we get thrashing.
What I now have to do is think of what policy I will use to decide what memory gets stored locally (or, more likely, what policy I use to map the addresses). I’ll start by once again reviewing papers that propose some schemes for existing Networks-on-Chip.
In other news: I have had a paper (of which I am co-author and first named author) accepted by OSPERTS 15 – so I will be off to Sweden to get mauled by the audience there in early July. It will be an experience, and I am looking forward to it, but I also think it might be not so much a baptism, but a destruction by fire.
Lying in bed this morning and puzzling over what to do …
At first I thought what I should do is copy one of the existing operating system models for NoCs, but that simply would not be flexible enough.
What I have to do is model the hardware (including the modifications to the MMU I want to see) as, essentially, some form of black box, and build other layers – including the memory tree – above that. That means I need to separate the CPU/tiles from global memory: sounds simple in theory but implementing this is going to be very far from easy.
What’s the best way to speed up your computing experience? As many know the most cost-effective way is often not to buy a new computer with a faster processor but to add more memory to an existing system.
The plot below, based on some results from my PhD research shows just how this works…
In my case I am measuring how long it takes a processor of a fixed speed to execute a given program while I vary the amount of memory available.
My research centres on “network-on-chip” systems so the issue here is how much memory is available locally (i.e., on the chip). Because if instructions and data are not locally available they have to be fetched in from some more “distant” store (whether that is system global memory or even a disk system). And if space on the chip is limited then you generally have to evict some other piece of memory to make way for the piece needed immediately. Yet if you need to evicted piece again in the future you have to take the time needed to fetch it back in again and so on.
In this case we are using 4096 byte (4K) blocks of memory – i.e., 4K pages. So when it says 10 pages on the x-axis that means there is 40K available locally and so on.
I am testing all this on the OVPsim emulator and I have no independent means of accurately timing how long the process takes – but I can count the number of instructions it takes to complete the task.
Factors that affect the number of instructions taken are – in order of likely importance:
the time taken to fetch whole pages in and out of memory – a “hard fault” occurs when a page has to be evicted and a new page brought in instead (the earliest hard faults are not as slow as the later ones as until the full number of pages are used no eviction is required);
the time it takes to ensure that pages that are in memory but which are not immediately available for execution (because the number of translation mappings that tell the central processor where they are is limited – to 8 in this test case – and so for each fault we first have to test if the page is available but unmapped). If an unmapped page is available we have a “soft fault” which is much quicker to process than a hard fault as no eviction etc is required;
picking the next page that will be evicted if have a hard fault – in this case we aim to pick the page that was least recently used, but even then we have to use an approximation. This process is not the same as evicting the page – it merely indicates which one is the best candidate for eviction should that become necessary.
The number of hard faults is reduced by increasing the amount of memory available (until, of course, you have so much memory you never need evict a page). But as you increase the amount of memory you also make checking for soft faults and picking the next candidate for eviction slower – because there are more pages to check.
And this is what we see in the plot. When memory is in short supply adding even a single extra page can make a big difference. There is then a long range where the decrease is not so great but constant. This reflects the fact that memory addresses are not needed at random and that programs show ‘locality of preference’ (ie if you need one address at one moment you are quite likely to need a nearby address at the next). This locality means that adding extra pages can have a limited impact once you have enough to cover the range of likely picks in a reasonable time frame. Then adding extra memory means that the changes between the different phases of locality becomes smoother, so we still see a reduction in time but not anywhere near as pronounced as before.
Then we get to the point – about 30 pages in this case – where we are close to covering the full range of addresses used, even when we switch between phases. In this case we see a sudden fall again until – at about 33 – 35 pages – we seem to have covered every single address our program will ever use.
After that having more memory makes the system slightly slower (NB in most real-world desk and laptop systems adding more memory does not slow down the system – so don’t worry about that!).
The lesson: if your computer is really slow and you don’t want to replace it, add more memory. But if you computer already has so much memory that it doesn’t know what to do with it and it is slow, you have a bigger problem!
In my PhD world a year’s worth of software experimentation has proved what we all knew already … that systems using traditional memory models struggle in the Network-on-Chip environment and so I am now trying something slightly different.
My “model” (it’s all in software) is of a 16 core system, with each core having a small amount of on-chip memory (32k), which are combined together to form a flat memory space. Memory in this space can be accessed quickly, memory outside it, in the next level up in the hierarchy, is roughly 100 times further away.
Using any form of traditional paging model (including Belady’s optimal page replacement algorithm) this system starts to thrash on even moderate loads – the cost of moving pages in and out of the local memory determines performance and so there is no benefit from adding additional processors (in fact it just slows the individual processors down).
Such an outcome makes any promise of improved performance from parallelism void – it does not really matter how efficiently you have parallelised the code (some corner cases excepted – eg if all chips were accessing the same memory at the same time), you are trapped by a memory I/O bound.
So now I want to look at alternatives beyond the usual 4k (or 2k) paging – but I have been struggling all week to get the locking semantics of my code right. Concurrency is hard.
The one thing that debugging parallel code and locks teaches you again and again is never to assume that some event will be so rare you don’t need to bother about it: because when you are executing millions of instructions a second, even rare events tend to happen.
It has also taught me to check return values – code that will “always” work in a single threaded environment may actually turn out to be quite a tricky customer when running in parallel with other instances of itself or when it is accessing shared memory.
But, finally, the main lesson this week has been about going atomic.
I have a tendency to think – if I can release that lock for a few lines of code that might improve overall performance and I can just lock it again a little later. Beware of that thought.
If you need to make a series of actions atomic you need to hold the same lock across them all – releasing it for even a few lines breaks atomicity and will quite likely break your code.
Every programmer will be familiar with something like this…
A little while back I wrote a program that simulates – crudely but effectively – a multicore NoC device. I use it to model the execution times of different page replacement algorithms.
The input is XML generated via a step by step trace of a working program. The actually instructions being traced do not matter – what I care about are the memory access patterns.
To allow me to test more models more quickly I have now written some R code that generates a semi-random access pattern based, very loosely indeed, on the patterns seen in the real program. The advantage is I can test against a set number of memory accesses but with a range of pseudo-random access patterns, so although I am not running models against the “real” access pattern, neither am I taking three weeks per experiment.
But when I used the artificially generated access patterns, my program crashed with a seg fault. But even more confusingly, when I ran the code in GDB, the GNU Debugger, if I stepped through the code it worked, but I just ran the code in debugger then it crashed just as it did without using the debugger.
After a few hours I realised why – in my artificial patterns, the first thing the first thread does is spawn all the other threads to be used. In real world code, of course, these spawns take place after quite some code has been executed.
Every code spawn causes the ncurses code I am using to update the screen. When using ‘real’ access patterns these updates take place comfortably after all the ncurses environment has been set up (by a separate thread), but in the artificial code, the thread updates are the first thing that get posted to the screen, even before ncurses has been set up – hence the crash.
If I step through the code then the ncurses thread runs ahead and sets up the screen before I hit the thread update code and again it works.
The solution? Use a condition variable and a mutex to ensure that nothing executes before the ncurses environment is fully established.
Not a big deal – but perhaps, at some point in the future someone struggling to understand why their code – which previously worked so well – has now stopped processing what seems to be well-formed input. Hope this helps!
This is a blog post where I am going to write about things as a way of clarifying, in my own mind, what the best way of tackling a problem is.
So far, in research for my PhD, I have concentrated on establishing some base points for potential performance of Network-on-Chip systems running multithreaded code.
Nearly nine months ago I hacked at Valgrind‘s Lackey tool to ensure it produced XML output recording every memory reference made by a piece of code running under it. This was really basic stuff – Lackey recognises four primatives – code for code-in-execution, and load, store and modify (a combined load and store) for read-write memory. So typically you get blocks of code followed by some read-write records and then some more code. I don’t know what the operands are, just the primative type, the address and the size of the piece of memory being used (whether for code or read-write operations).
I then used that to record the output of one of the Parsec parallel benchmarks – a 16 thread (actually it executes 18 threads) piece of video transcoding. In the real world this ran in seconds, under Lackey it took a few days and output about 200GB of XML.
That XML can then be broken down into thread-specific strands of execution – 18 of these in all, of various sizes, but all of the order of several GB at least.
These are then plugged in to some sort of simulator. The basic hardware model being simulated has remained the same throughout (mostly, I did fiddle with it a bit a while back but decided that wasn’t worth it). So we have 16 cores sharing a flat 512kB memory space (this is very loosely based on the Parallella system, but it is not meant to be any sort of simulation of it). There is no caching and no sense that any part of the memory is further from one core than another.
What does alter is the page replacement algorithm used. I first tried FIFO and the code ran for many weeks and completed in about 60 billion simulated ticks – if a memory reference is to a page in the 512kB then it is deemed to take 1 tick to complete, if the reference is to a page (a 4k page size has been used thus far), it takes 100 ticks per 16 byte line to load (25600 ticks for a whole 4k page) – and plainly we have to decide what page gets evicted if our 512k store is already full.
Messing about with various LRU models showed that a two queue LRU did give a little better performance than a simple single LRU queue, and that completed in around 50 billion ticks (and two weeks or so of running).
I then built – more or less starting from scratch – a version of the simulator that modelled Belady’s OPT. That required some very large binary trees to be used – along with 250GB of RAM – and completed the code in about 22 billion ticks (and about three weeks in wall clock time).
All these models showed one piece of common behaviour – thrashing, as demonstrated by the fact that adding additional cores to the execution did not increase the amount of code being executed: instead each individual core had to handle more page faults as the cores competed for the small pool of memory.
I now have two pieces of code running which aim to measure the (in)efficiency of these “traditional” paging approaches – come back in a few weeks to see what they show.
So, while they run I need to get on to the next stage, which is testing some alternative approaches. But I have a problem – I cannot wait three weeks for each experiment to run. There simply is not any time for that.
The alternatives boil down to chopping up sections of my current XML from the benchmark, or writing a traffic generator.
The traffic generator idea has a lot to be said for it – my supervisor certainly is in favour – but it is not without weaknesses: the degree of locality between the different threads executing is really quite important – a lot of locality and the fault count falls and code gets executed fast – poor locality and the fault count rockets.
But how do I model that: I am not sure there is any literature out there that discusses this problem in much detail – multithreading is hard and for that reason rational people avoid it!
But using the chopped up XML is not without issues either – it’s inflexible, it elevates one instance of executing code to be a model for all executing code and so is just as vulnerable to the question of locality.
Paging and virtual memory is at the heart of just about any computing device – more complex than a DVD player – we use everyday.
Paging is the memory management system based on the idea that we can divide the real memory of our computer into a sequence of smallish (typically 4,096 bytes) of “page frames” and then load the bits of data and program in and out of those frames (in “pages”) as we need them.
So, you can have pages from various different running programs in the page frames at any given time and then use a “virtual memory” system to map the pages placed in an arbitrary frame to the memory address the program thinks the page should be resident in.
It is not the only system we could use – “segments”, which involve moving large chunks (as opposed to small pages) of memory about is one approach, while “overlays” – using part of the memory space as sort of scratchpad working area – is another. More recently, with bigger “traditional” computers very large pages have been used as a way of making, at least in theory, more efficient use of memory now measured in billions (as opposed to a few tens) of bytes.
But paging is easily the most widely used approach and has been integral to the development of “multitasking” and similar shared resources approaches to computing – because paging allows us to just have the useful bits of a program and its data in memory we can have many more programs “running” at a given time.
But my PhD research is pointing me towards some of the weaknesses of the paging approach.
At the heart of the case for paging is the idea of “locality” in a computer’s use of resources: if you use one memory address at one instant there is a high probability you will use a nearby address very soon: think of any sort of sequential document or record and you can see why that idea is grounded in very many use cases of computing devices.
Locality means that it ought to make sense to read in memory in blocks and not just one little drop at a time.
But this principle may be in opposition to efficient usage of memory when competition for space in fierce: such as for the limited local memory resources we have on a Network-on-Chip computer.
Right now I am collecting data to measure the efficiency of 4k pages on such (simulated) devices. With 16 simulated cores trying to handle up to 18 threads of execution competition for pages is intense and the evidence suggests that they are resident, in some cases at least, for many fewer “ticks” than it takes to load them from the next lowest level in the memory hierarchy. On top of that many pages show that the principle of locality can be quite a weak one – pages of code are, in general, quite likely to demonstrate high locality (especially in loops) but pages of read-write memory may not.
I don’t have all the data to hand – essentially I am transforming one 200GB XML file into another XML file which will likely be around the same size and that takes time, even on quite a high spec computer (especially when you have to share resources with other researchers). But I expect some interesting results.
This is a post about my PhD research: in fact it is a sort of public rumination, an attempt to clarify my thoughts in writing before I take the next step.
It’s also possibly an exercise in procrastination: a decision to write about what I might do next, rather than to get on with doing it, but I am going to suppress that thought for now.
I am looking for ways to make “network on chip” systems more viable as general use (or any use, I suppose) computing platforms. These systems are a hardware response to the hardware problem that is causing such difficulties for big software and hardware manufacturers alike: namely that we cannot seem to make faster computers any more.
The problem we have is that while we can still get more transistors on a chip (i.e., that “Moore’s Law” still applies), we cannot keep operating them at faster speed (i.e., “Dennard Scaling” has broken down) as they get too hot.
In response we can either build better small devices (mobile phones, tablets) or try to build faster parallel computing devices (so instead of one very fast chip we have several moderately fast chips and try to have better software that makes good use of their ability to compute things in parallel).
Network-on-chip (NoC) processors are a major step along the road of having parallel processors – we put more processing units on a single piece of silicon rather than have them co-operate via external hardware. But the software has not caught up and we just cannot keep these chips busy enough to get the benefit their parallelism might offer.
That is where I hope to make a difference, even if just at the margins. Can I find a way to make the NoC chips busier, specifically by keeping them fed with data and code from the computer memory fast enough?
I have tried the obvious and simple methods: essentially adaptations of methods that have been used for most of the last 50 years in conventional serial computer devices and the answer is ‘no’ if that is all that is on offer.
Messing about with the standard algorithms used to feed code and data to the processors hits a solid brick wall: the chips have a limited amount of ‘fast’ local memory and the time it takes to keep that refreshed with up-to-date code and data places a fundamental limit on performance.
So, while many computer science students might be familiar with “Amdahl’s Law” which stipulates that, for parallel code, the elements that have to be run in serial (even if just setting up the parallel section) place a fundamental limit on how much extra performance we can squeeze out by throwing more and more parallel processors at the problem – we have a different, if related, problem here. Because we can apply more and more parallel processors to the problem but the performance remains constant, because even though we are running parallel code, we are limited by memory performance.
This limit – which implies that as we use more processors they become individually less efficient – even hits the so-called “clairvoyant” or optimal (OPT) memory management/page replacement algorithm: OPT knows which memory page it is most efficient to replace but is still limited by the fundamental barrier of limited on-chip memory.
The limit is manifest in the straight lines we can see in the plot here – the steeper slope of OPT means it runs faster but after the first few processors are brought to bear on the problem (the number of processors being used climbs for the first few billion instructions) the rate of instructions executed per ‘tick’ (an analogue of time) is constant.
Getting NoCs to run faster and so releasing the benefits from the potentially massive parallelism they could bring, depends on beating this memory barrier (and lots of other things too, but one at a time!). So, what are the options?
Well, one thing I can rule out is trying to cache a particular piece of a memory page (in traditional operating systems memory is shifted about the system in blocks called pages – typically 4096 bytes long). Caches typically store memory in 16 byte “lines” – hardware reads from the backing memory store in 16 byte blocks in most cases – and so I tested to see if there was a pattern in which 16 byte line was most likely to be used (see previous blog post). My instinct from looking at the plot is that will not work.
Similarly, a look at which pages were being used doesn’t reveal any immediately obvious pattern – some pages are used heavily by code, some are not – nothing surprising there.
So, the easy things do not work. Now I need to look at the hard things.
I think I need to escape from the page paradigm – one thing to look at is the size of the memory objects that are accessed. 4k pages are simple to handle – load a block in, push it out: but they could be (probably are) very inefficient. Might it be better to base our memory caching system on object sizes? That’s what I plan to check.
Now I have mapped the speed of OPT and LRU using a traditional two level memory hierarchy, my task is to find something better that might make NoC a more viable computing platform.
One thought that occurred to me was that the preference of C and similar languages for page aligned memory in certain situations might make it feasible or reasonable to cache the first 128 bits (16 bytes) in faster access memory.
So I wrote some code to test which 16 byte “segments” of a page were accessed by code.
The list that follows is for read/write accesses (i.e. not instructions). I will get around to plotting a graph, but it looks like that idea – which was always a bit of a straw clutching exercise – is not going to fly.