I am writing this down partly as a way of seeing if it makes sense…
Minor faults we can live with: if we have to reaffix a mapping back into the TLB it takes a few cycles but that is not a heavy burden. Major faults, however, are a different matter: they take many thousands of processor cycles to fix and have to be avoided at all costs.
Yet we have a basic problem in that we cannot mark each and every page access with a timestamp, so we have to guess at which page was least recently used – and so typically we use the sweeping hand of a CLOCK algorithm to do this.
But I have an even bigger problem – I have no time source, and hence no timer interrupts, to use.
So instead I want to try this:
All entries in the page table (PT) will have two validity bits, one we call VALID (V) and another AVAILABLE (A);
When a page is first mapped in the PT, both V and A will be marked as on.
When we first run out of space in the PT we will start to advance through the PT on each access, turning off a V bit. So, if we pin the first three entries in the PT (eg for ‘kernel’ code, the PT itself and a stack), the first time we go round we might mark V off for the 16 entries in the PT from 3 – 18
Then, when we are mapping a page for which no mapping exists, we take the first entry where V is off
But before that we check whether we have any entries where A is on that map our page, in which case we switch V on for that mapping and we are off
Having written that out I can see we only need the A bit if we think that a 0x00 <–> 0x00 mapping might be a valid one, but we don’t have to make that assumption and so we can restrict this to just the V bit.
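A minimal sketch of the V-bit-only version of the scheme, written in Python purely to check the logic. The names (PageTable, access, the "minor"/"major" labels) are my own illustration, and I am assuming the hand advances one slot per access once the table is full, which is only one reading of the description above.

```python
from dataclasses import dataclass
from typing import Optional

PINNED = 3  # assumption: entries 0..2 are pinned (kernel code, the PT, a stack)


@dataclass
class Entry:
    page: Optional[int] = None  # None means the slot has never been used
    valid: bool = False         # the V bit


class PageTable:
    def __init__(self, size: int, pinned: int = PINNED) -> None:
        self.entries = [Entry() for _ in range(size)]
        self.pinned = pinned
        self.hand = pinned  # the sweep never touches pinned slots

    def _usable(self):
        return self.entries[self.pinned:]

    def _full(self) -> bool:
        return all(e.page is not None for e in self._usable())

    def _sweep(self) -> None:
        """Turn off one V bit and advance the hand, wrapping past pinned slots."""
        self.entries[self.hand].valid = False
        self.hand += 1
        if self.hand == len(self.entries):
            self.hand = self.pinned

    def access(self, page: int) -> str:
        # No timer is available, so the sweep is driven by accesses instead.
        if self._full():
            self._sweep()
        # A mapping that is still present only needs its V bit restored.
        for e in self._usable():
            if e.page == page:
                if e.valid:
                    return "hit"
                e.valid = True
                return "minor"
        # No mapping exists: take a free slot, else the first slot with V off.
        victim = next((e for e in self._usable() if e.page is None), None)
        if victim is None:
            victim = next(e for e in self._usable() if not e.valid)
        victim.page = page
        victim.valid = True
        return "major"
```

With a six-entry table (three usable slots), accessing pages 10, 11, 12, 10, 13, 11 in turn gives three cold major faults, then a minor fault for 10 (its V bit had just been swept off but the mapping was intact), then major faults for 13 and 11 as swept entries are reused.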
Had a letter from Birkbeck today telling me I had passed all my exams – so the issue now is finishing the MSc project and so getting the degree (I suppose, in theory at least as I have yet to be formally awarded it, I could now claim to have reached postgraduate diploma level).
Right now, on the project, I am testing small kernel patches to see if a localised page replacement policy makes any noticeable difference for systems that are, or are in danger of, thrashing (ie spending nearly all their time waiting for paging to and from disk as opposed to doing any real computing).
The first patch I tried, forcing the largest and a large (not necessarily second-largest) process to write any dirty pages to disk, had no noticeable effect – having thought about it a bit more, I now realise it is unlikely that file backed pages are going to be dirty in this way for many processes, so actually all this code is likely to have done little but degrade performance.
What I really need is an aggressive rifle through the biggest process’s page stack, essentially accelerating the hands of the CLOCK (of the CLOCK page replacement algorithm) for this process (or, diminishing the handspread, to use a term often seen in this context) compared to those in general. So that’s next.
This is the first “normal” – not abroad or just back, not jet lagged and so on – weekend I’ve been able to have at home in a month and it has also been the first time in that period where I have been able to spend some time looking further at my proposed MSc project – on extending working set heuristics in the Linux kernel.
The good news is that I am once more convinced of the utility of, and enthusiastic about the implementation of, the idea. At the risk of looking very naive in six months (or six weeks) time even in my own eyes – here is the core idea:
Peter Denning’s 1968 and 1970 papers on the working set and virtual memory made some bold claims – calling global page replacement algorithms “in general sub-optimal” and asserting that the working set method is the best practical guarantee against thrashing.
Windows NT and its derivatives (XP, Vista, 7 etc) reflect their heritage from VMS in using a working set based replacement policy.
In contrast Linux (and the Unix family generally) uses global replacement policies: indeed a fairly simple clock algorithm stands at the centre of Linux’s page replacement policies. Kernel developers say the policy works well in practice and that, in effect, the active “least recently used” list of cached pages – against which the clock algorithm runs – is a list of pages in the working sets of running processes.
My essential idea is to seek to trim the active list on a process-by-process basis when the system is under high load (the long delay in execution caused by a page fault hopefully making it efficient to execute the extra code in the hope of reducing the number of page faults). Pages from the active list that are owned by the processes with the biggest memory footprint will be dropped into the inactive list, so making it more likely they will be eventually swapped out.
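To make the trimming idea concrete, here is a toy user-space model in Python. It is not kernel code and does not reflect the kernel's real data structures. The function name trim_active_list, the (owner, page) tuples and the max_demote limit are all made up for illustration, and I am assuming that "biggest footprint" can be approximated by the number of pages a process has on the active list and that the oldest pages sit at the left of the list.

```python
from collections import Counter, deque


def trim_active_list(active: deque, inactive: deque, max_demote: int):
    """Demote up to max_demote pages of the biggest process to the inactive list.

    Both lists hold (owner, page) tuples. Returns the owner that was trimmed,
    or None if the active list is empty.
    """
    if not active:
        return None
    # Assumption: footprint is approximated by the count of active pages.
    footprint = Counter(owner for owner, _ in active)
    biggest, _ = footprint.most_common(1)[0]

    kept = deque()
    demoted = 0
    for owner, page in active:  # leftmost entries are treated as oldest
        if owner == biggest and demoted < max_demote:
            inactive.append((owner, page))
            demoted += 1
        else:
            kept.append((owner, page))
    active.clear()
    active.extend(kept)
    return biggest
```

For example, with an active list of pages owned by a, b, a, a, c and max_demote set to 2, the two oldest pages of a move to the inactive list and the rest stay where they are.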
The second aspect of the application of a working set heuristic will be to alter the scheduling priorities of processes depending on their memory footprint. There are a few options here and I have not looked at this closely enough yet, but things to test could include:
Increasing the priority of the smallest processes – on the basis these might reach the end of execution more quickly and so release memory back to the pool
Radically lowering the priorities of the processes whose pages are being swapped out – on the basis that they do not have a working set of resources available and so, as Denning argued forty years ago, should not be able to run
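The two options above could be prototyped along these lines before touching the scheduler. This is only a sketch: the function adjust_priorities and the boost and penalty values are invented for illustration. The one real detail relied on is that Linux nice values run from -20 (highest priority) to 19 (lowest).

```python
NICE_MIN, NICE_MAX = -20, 19  # range of Linux nice values


def adjust_priorities(footprints, swapping, boost=5, penalty=15):
    """Return a hypothetical nice value per pid.

    footprints maps pid -> memory footprint (any consistent unit);
    swapping is the set of pids whose pages are being swapped out.
    """
    nice = {pid: 0 for pid in footprints}
    if footprints:
        # Option 1: raise the priority of the smallest process.
        smallest = min(footprints, key=footprints.get)
        nice[smallest] = max(NICE_MIN, nice[smallest] - boost)
    # Option 2: radically lower the priority of processes losing pages.
    for pid in swapping:
        if pid in nice:
            nice[pid] = min(NICE_MAX, nice[pid] + penalty)
    return nice
```

So with footprints of 400, 50 and 900 for pids 1, 2 and 3, and pid 3 being swapped out, pid 2 would be nudged to nice -5 and pid 3 pushed down to nice 15, while pid 1 is left alone.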
In practical terms I am still some way off writing any kernel code. I have, though, written some user tools (still need polishing) to display the memory footprint of Linux processes in a red-black tree (the representation used internally by the kernel). Following Eric S Raymond (on Unix programming not politics!), the tools are partitioned into single applications that do different things – but run together they can generate graphics such as the one below: