Reaching a decision

English: Distributed Memory
English: Distributed Memory (Photo credit: Wikipedia)

A week further on and not much C++ has been written – and now I think I need to make a new start.

Up to this point I have been trying to write a software model of the hardware and my thought was I could think put a software-modelling layer on top of that. But that simply is not going to work – it is just too complex.

Instead I am going to have to make some policy decisions in the software – essentially over how I model the local memory on the chip: each tile will process memory reads and writes and needs to know where that memory is – it could be in the global off-chip memory store or it could be on-chip.

The difference matters because, at least in theory, the on-chip memory is speedily accessible, while the off-chip memory is 50 to 100 to 500 times “further away”. ¬†Because memory accesses exhibit locality it makes sense to ship blocks of addressed memory from the global to the local store – but doing so takes time and if there are a lot of memory movements then we get thrashing.

What I now have to do is think of what policy I will use to decide what memory gets stored locally (or, more likely, what policy I use to map the addresses). I’ll start by once again reviewing papers that propose some schemes for existing Networks-on-Chip.

In other news: I have had a paper (of which I am co-author and first named author) accepted by OSPERTS 15 – so I will be off to Sweden to get mauled by the audience there in early July. It will be an experience, and I am looking forward to it, but I also think it might be not so much a baptism, but a destruction by fire.

Further thoughts on the simulation task

A Motorola 68451 MMU (for the 68010)
A Motorola 68451 MMU (for the 68010) (Photo credit: Wikipedia)

Lying in bed this morning and puzzling over what to do …

At first I thought what I should do is copy one of the existing operating system models for NoCs, but that simply would not be flexible enough.

What I have to do is model the hardware (including the modifications to the MMU I want to see) as, essentially, some form of black box, and build other layers – including the memory tree – above that. That means I need to separate the CPU/tiles from global memory: sounds simple in theory but implementing this is going to be very far from easy.

How much memory do you need?

What’s the best way to speed up your computing experience? As many know the most cost-effective way is often not to buy a new computer with a faster processor but to add more memory to an existing system.

The plot below, based on some results from my PhD research shows just how this works…

timingsIn my case I am measuring how long it takes a processor of a fixed speed to execute a given program while I vary the amount of memory available.

My research centres on “network-on-chip” systems so the issue here is how much memory is available locally (i.e., on the chip). Because if instructions and data are not locally available they have to be fetched in from some more “distant” store (whether that is system global memory or even a disk system). And if space on the chip is limited then you generally have to evict some other piece of memory to make way for the piece needed immediately. Yet if you need to evicted piece again in the future you have to take the time needed to fetch it back in again and so on.

In this case we are using 4096 byte (4K) blocks of memory – i.e., 4K pages. So when it says 10 pages on the x-axis that means there is 40K available locally and so on.

I am testing all this on the OVPsim emulator and I have no independent means of accurately timing how long the process takes – but I can count the number of instructions it takes to complete the task.

Factors that affect the number of instructions taken are – in order of likely importance:

  • the time taken to fetch whole pages in and out of memory – a “hard fault” occurs when a page has to be evicted and a new page brought in instead (the earliest hard faults are not as slow as the later ones as until the full number of pages are used no eviction is required);
  • the time it takes to ensure that pages that are in memory but which are not immediately available for execution (because the number of translation mappings that tell the central processor where they are is limited – to 8 in this test case – and so for each fault we first have to test if the page is available but unmapped). If an unmapped page is available we have a “soft fault” which is much quicker to process than a hard fault as no eviction etc is required;
  • picking the next page that will be evicted if have a hard fault – in this case we aim to pick the page that was least recently used, but even then we have to use an approximation. This process is not the same as evicting the page – it merely indicates which one is the best candidate for eviction should that become necessary.

The number of hard faults is reduced by increasing the amount of memory available (until, of course, you have so much memory you never need evict a page). But as you increase the amount of memory you also make checking for soft faults and picking the next candidate for eviction slower – because there are more pages to check.

And this is what we see in the plot. When memory is in short supply adding even a single extra page can make a big difference. There is then a long range where the decrease is not so great but constant. This reflects the fact that memory addresses are not needed at random and that programs show ‘locality of preference’ (ie if you need one address at one moment you are quite likely to need a nearby address at the next). This locality means that adding extra pages can have a limited impact once you have enough to cover the range of likely picks in a reasonable time frame. Then adding extra memory means that the changes between the different phases of locality becomes smoother, so we still see a reduction in time but not anywhere near as pronounced as before.

Then we get to the point – about 30 pages in this case – where we are close to covering the full range of addresses used, even when we switch between phases. In this case we see a sudden fall again until – at about 33 – 35 pages – we seem to have covered every single address our program will ever use.

After that having more memory makes the system slightly slower (NB in most real-world desk and laptop systems adding more memory does not slow down the system – so don’t worry about that!).

The lesson: if your computer is really slow and you don’t want to replace it, add more memory. But if you computer already has so much memory that it doesn’t know what to do with it and it is slow, you have a bigger problem!

Going atomic … or concurrency is hard

SVG Graph Illustrating Amdahl's Law
SVG Graph Illustrating Amdahl’s Law (Photo credit: Wikipedia)

In my PhD world a year’s worth of software experimentation has proved what we all knew already … that systems using traditional memory models struggle in the Network-on-Chip environment and so I am now trying something slightly different.

My “model” (it’s all in software) is of a 16 core system, with each core having a small amount of on-chip memory (32k), which are combined together to form a flat memory space. Memory in this space can be accessed quickly, memory outside it, in the next level up in the hierarchy, is roughly 100 times further away.

Using any form of traditional paging model (including Belady’s optimal page replacement algorithm) this system starts to thrash on even moderate loads – the cost of moving pages in and out of the local memory determines performance and so there is no benefit from adding additional processors (in fact it just slows the individual processors down).

Such an outcome makes any promise of improved performance from parallelism void – it does not really matter how efficiently you have parallelised the code (some corner cases excepted – eg if all chips were accessing the same memory at the same time), you are trapped by a memory I/O bound.

So now I want to look at alternatives beyond the usual 4k (or 2k) paging – but I have been struggling all week to get the locking semantics of my code right. Concurrency is hard.

The one thing that debugging parallel code and locks teaches you again and again is never to assume that some event will be so rare you don’t need to bother about it: because when you are executing millions of instructions a second, even rare events tend to happen.

It has also taught me to check return values – code that will “always” work in a single threaded environment may actually turn out to be quite a tricky customer when running in parallel with other instances of itself or when it is accessing shared memory.

But, finally, the main lesson this week has been about going atomic.

I have a tendency to think – if I can release that lock for a few lines of code that might improve overall performance and I can just lock it again a little later. Beware of that thought.

If you need to make a series of actions atomic you need to hold the same lock across them all – releasing it for even a few lines breaks atomicity and will quite likely break your code.

Curses on ncurses

gdb icon, created for the Open Icon Library
gdb icon, created for the Open Icon Library (Photo credit: Wikipedia)

Every programmer will be familiar with something like this…

A little while back I wrote a program that simulates – crudely but effectively – a multicore NoC device. I use it to model the execution times of different page replacement algorithms.

The input is XML generated via a step by step trace of a working program. The actually instructions being traced do not matter – what I care about are the memory access patterns.

To allow me to test more models more quickly I have now written some R code that generates a semi-random access pattern based, very loosely indeed, on the patterns seen in the real program. The advantage is I can test against a set number of memory accesses but with a range of pseudo-random access patterns, so although I am not running models against the “real” access pattern, neither am I taking three weeks per experiment.

But when I used the artificially generated access patterns, my program crashed with a seg fault. But even more confusingly, when I ran the code in GDB, the GNU Debugger, if I stepped through the code it worked, but I just ran the code in debugger then it crashed just as it did without using the debugger.

After a few hours I realised why – in my artificial patterns, the first thing the first thread does is spawn all the other threads to be used. In real world code, of course, these spawns take place after quite some code has been executed.

Every code spawn causes the ncurses code I am using to update the screen. When using ‘real’ access patterns these updates take place comfortably after all the ncurses environment has been set up (by a separate thread), but in the artificial code, the thread updates are the first thing that get posted to the screen, even before ncurses has been set up – hence the crash.

If I step through the code then the ncurses thread runs ahead and sets up the screen before I hit the thread update code and again it works.

The solution? Use a condition variable and a mutex to ensure that nothing executes before the ncurses environment is fully established.

Not a big deal – but perhaps, at some point in the future someone struggling to understand why their code – which previously worked so well – has now stopped processing what seems to be well-formed input. Hope this helps!

Traffic generation options

English: Read Only Memory
English: Read Only Memory (Photo credit: Wikipedia)

This is a blog post where I am going to write about things as a way of clarifying, in my own mind, what the best way of tackling a problem is.

So far, in research for my PhD, I have concentrated on establishing some base points for potential performance of Network-on-Chip systems running multithreaded code.

Nearly nine months ago I hacked at Valgrind‘s Lackey tool to ensure it produced XML output recording every memory reference made by a piece of code running under it. This was really basic stuff – Lackey recognises four primatives – code for code-in-execution, and load, store and modify (a combined load and store) for read-write memory. So typically you get blocks of code followed by some read-write records and then some more code. I don’t know what the operands are, just the primative type, the address and the size of the piece of memory being used (whether for code or read-write operations).

I then used that to record the output of one of the Parsec parallel benchmarks – a 16 thread (actually it executes 18 threads) piece of video transcoding. In the real world this ran in seconds, under Lackey it took a few days and output about 200GB of XML.

That XML can then be broken down into thread-specific strands of execution – 18 of these in all, of various sizes, but all of the order of several GB at least.

These are then plugged in to some sort of simulator. The basic hardware model being simulated has remained the same throughout (mostly, I did fiddle with it a bit a while back but decided that wasn’t worth it). So we have 16 cores sharing a flat 512kB memory space (this is very loosely based on the Parallella system, but it is not meant to be any sort of simulation of it). There is no caching and no sense that any part of the memory is further from one core than another.

What does alter is the page replacement algorithm used. I first tried FIFO and the code ran for many weeks and completed in about 60 billion simulated ticks – if a memory reference is to a page in the 512kB then it is deemed to take 1 tick to complete, if the reference is to a page (a 4k page size has been used thus far), it takes 100 ticks per 16 byte line to load (25600 ticks for a whole 4k page) – and plainly we have to decide what page gets evicted if our 512k store is already full.

Messing about with various LRU models showed that a two queue LRU did give a little better performance than a simple single LRU queue, and that completed in around 50 billion ticks (and two weeks or so of running).

I then built – more or less starting from scratch – a version of the simulator that modelled Belady’s OPT. That required some very large binary trees to be used – along with 250GB of RAM – and completed the code in about 22 billion ticks (and about three weeks in wall clock time).

All these models showed one piece of common behaviour – thrashing, as demonstrated by the fact that adding additional cores to the execution did not increase the amount of code being executed: instead each individual core had to handle more page faults as the cores competed for the small pool of memory.

I now have two pieces of code running which aim to measure the (in)efficiency of these “traditional” paging approaches – come back in a few weeks to see what they show.

So, while they run I need to get on to the next stage, which is testing some alternative approaches. But I have  a problem РI cannot wait three weeks for each experiment to run. There simply is not any time for that.

The alternatives boil down to chopping up sections of my current XML from the benchmark, or writing a traffic generator.

The traffic generator idea has a lot to be said for it – my supervisor certainly is in favour – but it is not without weaknesses: the degree of locality between the different threads executing is really quite important – a lot of locality and the fault count falls and code gets executed fast – poor locality and the fault count rockets.

But how do I model that: I am not sure there is any literature out there that discusses this problem in much detail – multithreading is hard and for that reason rational people avoid it!

But using the chopped up XML is not without issues either – it’s inflexible, it elevates one instance of executing code to be a model for all executing code and so is just as vulnerable to the question of locality.

After paging?

Diagram of relationship between the virtual an...
Diagram of relationship between the virtual and physical address spaces. (Photo credit: Wikipedia)

Paging and virtual memory is at the heart of just about any computing device – more complex than a DVD player – we use everyday.

Paging is the memory management system based on the idea that we can divide the real memory of our computer into a sequence of smallish (typically 4,096 bytes) of “page frames” and then load the bits of data and program in and out of those frames (in “pages”) as we need them.

So, you can have pages from various different running programs in the page frames at any given time and then use a “virtual memory” system to map the pages placed in an arbitrary frame to the memory address the program thinks the page should be resident in.

It is not the only system we could use – “segments”, which involve moving large chunks (as opposed to small pages) of memory about is one approach, while “overlays” – using part of the memory space as sort of scratchpad working area – is another. More recently, with bigger “traditional” computers very large pages have been used as a way of making, at least in theory, more efficient use of memory now measured in billions (as opposed to a few tens) of bytes.

But paging is easily the most widely used approach and has been integral to the development of “multitasking” and similar shared resources approaches to computing – because paging allows us to just have the useful bits of a program and its data in memory we can have many more programs “running” at a given time.

But my PhD research is pointing me towards some of the weaknesses of the paging approach.

At the heart of the case for paging is the idea of “locality” in a computer’s use of resources: if you use one memory address at one instant there is a high probability you will use a nearby address very soon: think of any sort of sequential document or record and you can see why that idea is grounded in very many use cases of computing devices.

Locality means that it ought to make sense to read in memory in blocks and not just one little drop at a time.

But this principle may be in opposition to efficient usage of memory when competition for space in fierce: such as for the limited local memory resources we have on a Network-on-Chip computer.

Right now I am collecting data to measure the efficiency of 4k pages on such (simulated) devices. With 16 simulated cores trying to handle up to 18 threads of execution competition for pages is intense and the evidence suggests that they are resident, in some cases at least, for many fewer “ticks” than it takes to load them from the next lowest level in the memory hierarchy. On top of that many pages show that the principle of locality can be quite a weak one – pages of code are, in general, quite likely to demonstrate high locality (especially in loops) but pages of read-write memory may not.

I don’t have all the data to hand – essentially I am transforming one 200GB XML file into another XML file which will likely be around the same size and that takes time, even on quite a high spec computer (especially when you have to share resources with other researchers). But I expect some interesting results.

Enhanced by Zemanta