Just for once I did not rush to an online forum and say I had found a bug in a product – and I was right not too.
Having tried three different cross compiler toolchains I convinced myself that the issue was plainly neither compiler (or, to be more accurate in this case, assembler) output but some process in my code that was causing corruption. And sure enough I found that I was mangling the physical addresses of my page frames.
Thanks to the way OVPsim operates – by default it provides physical mappings on demand for the full 4GB address space of a 32 bit register – this mangling did not generate a page fault, but it did mean that for certain sequences of instructions – particularly those where lots of page faults were likely to occur, memory was being corrupted.
Changing one line of assembly – so that virtual address output was written to virtual address slot and not the physical address slot in my simple list of page table entries fixed that.
My LRU queue bug is continuing to puzzle me – and it’s not as simple as a data misalignment. In fact it does not appear to be a data misalignment issue at all: before I was trapping a lot of hardware exceptions under that header because it was a common fault when I got the code wrong, but a closer examination showed it to be an illegal opcode exception.
How that could be caused by the size of the local memory we were simulating was beyond me – but perhaps some code was being pushed out of alignment and an illegal instruction created, I thought.
But that does not appear to be the issue at all – in fact the really puzzling thing is that the exact same string of opcodes at the same addresses runs without a problem in the version with the functional memory sizes as with the “broken” memory sizes.
The only difference seems to be that when the broken code (ie the setup with the non number of 4k memory pages) raises an illegal opcode exception, the good code raises a page fault.
It looks like it might be a bug in the simulator itself – and having written that I am hoping that the bad workman’s curse now befalls me and I quickly find it was all my fault to begin with. But as of now I am drawing a blank.
For the last week or so I have been writing and then debugging (and so mainly debugging) a least-recently-used (LRU) page replacement system on my Microblaze simulation.
Perhaps I shouldn’t have bothered – I had a working first-in-first-out (FIFO) system after all. But no one seriously uses FIFO, so I had to write some LRU code.
I thought I had it working tonight – it ran through the target exercise in about 6 million instructions (as the MMU in a Microblaze is crude, memory gets loaded in and out ‘by hand’ and so the instruction count measures time/efficiency) when the system had 32 4k local pages and in about 10.5 million instructions when it had 24 4k pages available locally – all seems fine: less pages means more time is required.
But then things started to get weird – testing the system with 28 pages took about 9 million instructions, but when I tried to use 26 pages I had to break the code after it had executed 14 trillion instructions.
Indeed it seems to only work for a very limited number of page counts. How odd – though a typically debuggers story. A week in, finally thinking I’d cracked it when some really strange behaviour manifests itself.
Update: It appears to be an unaligned data exeception issue. Somewhere along the line a piece of code relies on the LRU queue to be a multiple of 4 in length would be my guess…
This is a blog post where I am going to write about things as a way of clarifying, in my own mind, what the best way of tackling a problem is.
So far, in research for my PhD, I have concentrated on establishing some base points for potential performance of Network-on-Chip systems running multithreaded code.
Nearly nine months ago I hacked at Valgrind‘s Lackey tool to ensure it produced XML output recording every memory reference made by a piece of code running under it. This was really basic stuff – Lackey recognises four primatives – code for code-in-execution, and load, store and modify (a combined load and store) for read-write memory. So typically you get blocks of code followed by some read-write records and then some more code. I don’t know what the operands are, just the primative type, the address and the size of the piece of memory being used (whether for code or read-write operations).
I then used that to record the output of one of the Parsec parallel benchmarks – a 16 thread (actually it executes 18 threads) piece of video transcoding. In the real world this ran in seconds, under Lackey it took a few days and output about 200GB of XML.
That XML can then be broken down into thread-specific strands of execution – 18 of these in all, of various sizes, but all of the order of several GB at least.
These are then plugged in to some sort of simulator. The basic hardware model being simulated has remained the same throughout (mostly, I did fiddle with it a bit a while back but decided that wasn’t worth it). So we have 16 cores sharing a flat 512kB memory space (this is very loosely based on the Parallella system, but it is not meant to be any sort of simulation of it). There is no caching and no sense that any part of the memory is further from one core than another.
What does alter is the page replacement algorithm used. I first tried FIFO and the code ran for many weeks and completed in about 60 billion simulated ticks – if a memory reference is to a page in the 512kB then it is deemed to take 1 tick to complete, if the reference is to a page (a 4k page size has been used thus far), it takes 100 ticks per 16 byte line to load (25600 ticks for a whole 4k page) – and plainly we have to decide what page gets evicted if our 512k store is already full.
Messing about with various LRU models showed that a two queue LRU did give a little better performance than a simple single LRU queue, and that completed in around 50 billion ticks (and two weeks or so of running).
I then built – more or less starting from scratch – a version of the simulator that modelled Belady’s OPT. That required some very large binary trees to be used – along with 250GB of RAM – and completed the code in about 22 billion ticks (and about three weeks in wall clock time).
All these models showed one piece of common behaviour – thrashing, as demonstrated by the fact that adding additional cores to the execution did not increase the amount of code being executed: instead each individual core had to handle more page faults as the cores competed for the small pool of memory.
I now have two pieces of code running which aim to measure the (in)efficiency of these “traditional” paging approaches – come back in a few weeks to see what they show.
So, while they run I need to get on to the next stage, which is testing some alternative approaches. But I have a problem – I cannot wait three weeks for each experiment to run. There simply is not any time for that.
The alternatives boil down to chopping up sections of my current XML from the benchmark, or writing a traffic generator.
The traffic generator idea has a lot to be said for it – my supervisor certainly is in favour – but it is not without weaknesses: the degree of locality between the different threads executing is really quite important – a lot of locality and the fault count falls and code gets executed fast – poor locality and the fault count rockets.
But how do I model that: I am not sure there is any literature out there that discusses this problem in much detail – multithreading is hard and for that reason rational people avoid it!
But using the chopped up XML is not without issues either – it’s inflexible, it elevates one instance of executing code to be a model for all executing code and so is just as vulnerable to the question of locality.
Interestingly, the two series are very close to isomorphic (i.e., they have essentially the same shape if one scales the x-plot appropriately).
This is based on running a short piece of code to decode 32 frames of video – the memory model used was 4k pages with a pooled 512k on chip (each chip can access any part of the pool at the cost of 1 tick) and unlimited memory 100 ticks away per 128 bits of read-in (so for a 4k page it takes 25,600 ticks to fault the page in).
This is a bit of a crude simplification of the memory model seen on, say, a Tilera TILE64 system, but it’s close enough (though this simulation had 16 cores).
The graphs look similar but the differences are important – this one (the correct one), appears to confirm that Peter Denning‘s findings about the working set model versus LRU still hold good, at least in broad terms – though this still suggests LRU has better performance characteristics than might be expected.
But it’s late now and I am going to bed – perhaps more later.
Once I published the graph on the previous blog entry I more or less immediately realised it was wrong – it’s not that the curves are wrong: it’s that they are plotted on different scales.
The red line plots lifetime curve using the working set of the process based on pages accessed in a certain time frame – this gives an average working set size () which is plotted along the x axis.
The blue line is the lifetime curve with a maximum working set of a fixed size (ie it is a simple LRU type page replacement simulation). But it is not scaled correctly against the x axis. Something of a waste of 31 hours of wallclock time!
Happily my 12 core box is now online and so I should have a replot done shortly – my best guess is that it may not change too much, things will be interesting if it does.