Hard fault count drives performance

Just to emphasise how hard faults are determining for performance – here is a plot of hard faults versus page count for the same application mentioned in the previous post.

hard faults The pattern is very similar, though it should be noted that increasing the page count does still keep the fault count coming down at the far end – but not by enough to outweigh the negative effects of having larger page tables to search through when checking for faults and looking for the least recently used page and the like.


Squashing the LRU bug

English: Typical actions taken upon a virtual-...
English: Typical actions taken upon a virtual-physical address translation. (Photo credit: Wikipedia)

Just for once I did not rush to an online forum and say I had found a bug in a product – and I was right not too.

Having tried three different cross compiler toolchains I convinced myself that the issue was plainly neither compiler (or, to be more accurate in this case, assembler) output but some process in my code that was causing corruption. And sure enough I found that I was mangling the physical addresses of my page frames.

Thanks to the way OVPsim operates – by default it provides physical mappings on demand for the full 4GB address space of a 32 bit register – this mangling did not generate a page fault, but it did mean that for certain sequences of instructions – particularly those where lots of page faults were likely to occur, memory was being corrupted.

Changing one line of assembly – so that virtual address output was written to virtual address slot and not the physical address slot in my simple list of page table entries fixed that.

So now the code works – at least I think it does!

Curiouser and curiouser – the case of the LRU bug

A 256Kx4 Dynamic RAM chip on an early PC memor...
A 256Kx4 Dynamic RAM chip on an early PC memory card. (Photo by Ian Wilson) (Photo credit: Wikipedia)

My LRU queue bug is continuing to puzzle me – and it’s not as simple as a data misalignment. In fact it does not appear to be a data misalignment issue at all: before I was trapping a lot of hardware exceptions under that header because it was a common fault when I got the code wrong, but a closer examination showed it to be an illegal opcode exception.

How that could be caused by the size of the local memory we were simulating was beyond me – but perhaps some code was being pushed out of alignment and an illegal instruction created, I thought.

But that does not appear to be the issue at all – in fact the really puzzling thing is that the exact same string of opcodes at the same addresses runs without a problem in the version with the functional memory sizes as with the “broken” memory sizes.

The only difference seems to be that when the broken code (ie the setup with the non mod 4 number of 4k memory pages) raises an illegal opcode exception, the good code raises a page fault.

It looks like it might be a bug in the simulator itself – and having written that I am hoping that the bad workman’s curse now befalls me and I quickly find it was all my fault to begin with. But as of now I am drawing a blank.

Simulating global and local memory on OVP

The Simons' BASIC start-up screen
The Simons’ BASIC start-up screen (Photo credit: Wikipedia)

Had a good meeting with my PhD supervisor today: he was in London – I didn’t have to make a flying visit to York.

So the next steps with my OVPsim Microblaze code is to model global and local memory – by default OVPsim treats all memory as local, mapping the full 32-bit address space and sparsely filling that as needed. I have imposed an additional constraint of only mapping a few pages, but these are still all “local”: so while the code takes time to execute, what is in effect, a FIFO page replacement algorithm, there is no time for page loads.

The way round this seems to be to build my global memory as a memory-mapped peripheral device – I can then impose time delays on reads and writes.

But I suppose I am writing this blog instead of writing that code…

After paging?

Diagram of relationship between the virtual an...
Diagram of relationship between the virtual and physical address spaces. (Photo credit: Wikipedia)

Paging and virtual memory is at the heart of just about any computing device – more complex than a DVD player – we use everyday.

Paging is the memory management system based on the idea that we can divide the real memory of our computer into a sequence of smallish (typically 4,096 bytes) of “page frames” and then load the bits of data and program in and out of those frames (in “pages”) as we need them.

So, you can have pages from various different running programs in the page frames at any given time and then use a “virtual memory” system to map the pages placed in an arbitrary frame to the memory address the program thinks the page should be resident in.

It is not the only system we could use – “segments”, which involve moving large chunks (as opposed to small pages) of memory about is one approach, while “overlays” – using part of the memory space as sort of scratchpad working area – is another. More recently, with bigger “traditional” computers very large pages have been used as a way of making, at least in theory, more efficient use of memory now measured in billions (as opposed to a few tens) of bytes.

But paging is easily the most widely used approach and has been integral to the development of “multitasking” and similar shared resources approaches to computing – because paging allows us to just have the useful bits of a program and its data in memory we can have many more programs “running” at a given time.

But my PhD research is pointing me towards some of the weaknesses of the paging approach.

At the heart of the case for paging is the idea of “locality” in a computer’s use of resources: if you use one memory address at one instant there is a high probability you will use a nearby address very soon: think of any sort of sequential document or record and you can see why that idea is grounded in very many use cases of computing devices.

Locality means that it ought to make sense to read in memory in blocks and not just one little drop at a time.

But this principle may be in opposition to efficient usage of memory when competition for space in fierce: such as for the limited local memory resources we have on a Network-on-Chip computer.

Right now I am collecting data to measure the efficiency of 4k pages on such (simulated) devices. With 16 simulated cores trying to handle up to 18 threads of execution competition for pages is intense and the evidence suggests that they are resident, in some cases at least, for many fewer “ticks” than it takes to load them from the next lowest level in the memory hierarchy. On top of that many pages show that the principle of locality can be quite a weak one – pages of code are, in general, quite likely to demonstrate high locality (especially in loops) but pages of read-write memory may not.

I don’t have all the data to hand – essentially I am transforming one 200GB XML file into another XML file which will likely be around the same size and that takes time, even on quite a high spec computer (especially when you have to share resources with other researchers). But I expect some interesting results.

Enhanced by Zemanta

Thrash reduction no longer a priority for Linux kernel devs?

Version 3.5 of the Linux kernel has been released.

freshly installed ipod linux, booting. during ...
freshly installed ipod linux, booting. during Wikipedia:Workshop Köln (Photo credit: Wikipedia)

One of the changes it includes is the removal of the “swap token” code – one of the very few ‘local’ memory management policies grafted on to the ‘global’ page replacement mechanisms in the kernel.

There are various technical reasons offered for the removal of the code – on which I am not qualified to comment – but the borrow line is that it was broken in any case, so the removal seems to make sense.

What does slightly disturb me, though, is the comment that Rik Van Riel, the key figure in kernel memory management code, makes:

The days of sub-1G memory systems with heavy use of swap are over.
If we ever need thrashing reducing code in the future, we will have to
implement something that does scale.

I think the days of sub-1G systems are far from over. In fact I suspect there are more of them, and more of them running Linux, than ever before and that trend is not going to stop.

He’s right of course about the need to find that code that works – my own efforts (in my MSc report) didn’t crack this problem, but I do think there is more that can be done.

The binomial distribution, part 1

Image via Wikipedia

I think there are now going to be a few posts here which essentially are about me rediscovering some A level maths probability theory and writing it down as an aid to memory.

All of this is related as to whether the length of time pages are part of the working set is governed by a stochastic (probabilistic) process or a deterministic process. Why does it matter? Well, if the process was stochastic then in low memory situations a first-in, first-out approach, or simple single queue LRU approach to page replacement might work well in comparison to the 2Q LRU approach currently in use. It is an idea that is worth a little exploring, anyway.

So, now the first maths aide memoire – simple random/probabilistic processes are binomial – something happens or it does not. If the probability of it happening in a unit time period is p (update: is this showing up as ‘nm’? It’s meant to be ‘p’!) then the probability it will not happen is 1 - p = q .  For instance this might be the probability that an atom of Uranium 235 shows \alpha particle decay (the probability that one U 235 atom will decay is given by its half-life of 700 million years ie., 2.21\times10^{16} seconds, or a probability, if my maths is correct, of a given individual atom decaying in any particular second of approximately 4.4\times10^{-16} .

(In operating systems terms my thinking is that if the time pages spent in a working set were governed by similar processes then there will be a half life for every page that is read in. If we discarded pages after they were in the system after such a half life, or better yet some multiple of the half life, then we could have a simpler page replacement system – we would not need to use a CLOCK algorithm, just record the time a page entered the system and stick it in a FIFO queue and discard it when the entry time was more than a half life ago.

An even simpler case might be to just discard pages once the stored number reached above a certain ‘half life’ limit. Crude, certainly, but maybe the simplicity might compensate for the lack of sophistication.

Such a system would not work very well for a general/desktop operating system – as the graph for the MySQL daemon referred to in the previous blog shows, even one application could seem to show different distributions of working set sizes. But what if you had a specialist system where the OS only ran one application – then tuning might work: perhaps that could even apply to mass electionics devices, such as Android phones – after all the Android (Dalvik) VM is what is being run each time.)

The graph is wrong

Once I published the graph on the previous blog entry I more or less immediately realised it was wrong – it’s not that the curves are wrong: it’s that they are plotted on different scales.

The red line plots lifetime curve using the working set of the process based on pages accessed in a certain time frame – this gives an average working set size (\theta ) which is plotted along the x axis.

The blue line is the lifetime curve with a maximum working set of a fixed size (ie it is a simple LRU type page replacement simulation). But it is not scaled correctly against the x axis. Something of a waste of 31 hours of wallclock time!

Happily my 12 core box is now online and so I should have a replot done shortly – my best guess is that it may not change too much, things will be interesting if it does.

Working sets compared

NB: This graph is wrong – read more here

This shows the ‘lifetime curve’ of Xterm using a time based \theta for determining the working set (ie having a working set of variable size based on recency of access) and a fixed-sized working set.

The y-axis measures the average number of instructions executed between faults.

Working set sizes for Xterm

Lifecycle of a Linux page … 2

Took another day’s worth of work to get this right … last night’s effort had a few mistakes in it as well as being not such a great piece of Metapost. So here’s the correct version:

Lifecycle of a pageframe in Linux

As before, the Metapost of this image is available at http://github.com/mcmenaminadrian