# Thrash reduction no longer a priority for Linux kernel devs?

Version 3.5 of the Linux kernel has been released.

freshly installed ipod linux, booting. during Wikipedia:Workshop Köln (Photo credit: Wikipedia)

One of the changes it includes is the removal of the “swap token” code – one of the very few ‘local’ memory management policies grafted on to the ‘global’ page replacement mechanisms in the kernel.

There are various technical reasons offered for the removal of the code – on which I am not qualified to comment – but the borrow line is that it was broken in any case, so the removal seems to make sense.

What does slightly disturb me, though, is the comment that Rik Van Riel, the key figure in kernel memory management code, makes:

The days of sub-1G memory systems with heavy use of swap are over.
If we ever need thrashing reducing code in the future, we will have to
implement something that does scale.

I think the days of sub-1G systems are far from over. In fact I suspect there are more of them, and more of them running Linux, than ever before and that trend is not going to stop.

He’s right of course about the need to find that code that works – my own efforts (in my MSc report) didn’t crack this problem, but I do think there is more that can be done.

# The binomial distribution, part 1

Image via Wikipedia

I think there are now going to be a few posts here which essentially are about me rediscovering some A level maths probability theory and writing it down as an aid to memory.

All of this is related as to whether the length of time pages are part of the working set is governed by a stochastic (probabilistic) process or a deterministic process. Why does it matter? Well, if the process was stochastic then in low memory situations a first-in, first-out approach, or simple single queue LRU approach to page replacement might work well in comparison to the 2Q LRU approach currently in use. It is an idea that is worth a little exploring, anyway.

So, now the first maths aide memoire – simple random/probabilistic processes are binomial – something happens or it does not. If the probability of it happening in a unit time period is $p$ (update: is this showing up as ‘nm’? It’s meant to be ‘p’!) then the probability it will not happen is $1 - p = q$.  For instance this might be the probability that an atom of Uranium 235 shows $\alpha$-particle decay (the probability that one U 235 atom will decay is given by its half-life of 700 million years ie., $2.21\times10^{16}$ seconds, or a probability, if my maths is correct, of a given individual atom decaying in any particular second of approximately $4.4\times10^{-16}$.

(In operating systems terms my thinking is that if the time pages spent in a working set were governed by similar processes then there will be a half life for every page that is read in. If we discarded pages after they were in the system after such a half life, or better yet some multiple of the half life, then we could have a simpler page replacement system – we would not need to use a CLOCK algorithm, just record the time a page entered the system and stick it in a FIFO queue and discard it when the entry time was more than a half life ago.

An even simpler case might be to just discard pages once the stored number reached above a certain ‘half life’ limit. Crude, certainly, but maybe the simplicity might compensate for the lack of sophistication.

Such a system would not work very well for a general/desktop operating system – as the graph for the MySQL daemon referred to in the previous blog shows, even one application could seem to show different distributions of working set sizes. But what if you had a specialist system where the OS only ran one application – then tuning might work: perhaps that could even apply to mass electionics devices, such as Android phones – after all the Android (Dalvik) VM is what is being run each time.)

# The graph is wrong

Once I published the graph on the previous blog entry I more or less immediately realised it was wrong – it’s not that the curves are wrong: it’s that they are plotted on different scales.

The red line plots lifetime curve using the working set of the process based on pages accessed in a certain time frame – this gives an average working set size ($\theta$) which is plotted along the x axis.

The blue line is the lifetime curve with a maximum working set of a fixed size (ie it is a simple LRU type page replacement simulation). But it is not scaled correctly against the x axis. Something of a waste of 31 hours of wallclock time!

Happily my 12 core box is now online and so I should have a replot done shortly – my best guess is that it may not change too much, things will be interesting if it does.

# Working sets compared

NB: This graph is wrong - read more here

This shows the ‘lifetime curve’ of Xterm using a time based $\theta$ for determining the working set (ie having a working set of variable size based on recency of access) and a fixed-sized working set.

The y-axis measures the average number of instructions executed between faults.

# Lifecycle of a Linux page … 2

Took another day’s worth of work to get this right … last night’s effort had a few mistakes in it as well as being not such a great piece of Metapost. So here’s the correct version:

As before, the Metapost of this image is available at http://github.com/mcmenaminadrian

# Rediscovering enthusiasm

This is the first “normal” – not abroad or just back, not jet lagged and so on – weekend I’ve been able to have at home in a month and it has also been the first time in that period where I have been able to expend some time to looking further at my proposed MSc project – on extending working set heuristics in the Linux kernel.

The good news is that I am once more convinced of the utility of, and enthusiastic about the implementation of, the idea. At the risk of looking very naive in six months (or six weeks) time even in my own eyes – here is the core idea:

Peter Denning’s 1968 and 1970 papers on the working set and virtual memory made some bold claims – calling global page replacement algorithms “in general sub-optimal” and asserting that the working set method is the best practical guarantee against thrashing.

Windows NT and its derivatives (XP, Vista, 7 etc) reflect their heritage from VMS in using a working set based replacement policy.

In contrast Linux (and the Unix family generally) use global replacement policies: indeed a fairly simple clock algorithm stands at the centre of Linux’s page replacement policies. Kernel developers say the policy works well in practice and that, in effect, the active “least recently used” list of cached pages – against which the clock algorithm runs, is a list of pages in the working sets of running processes.

My essential idea is to seek to trim the active list on a process-by-process basis when the system is under high load (the long delay in execution caused by a page fault hopefully making it efficient to execute the extra code in the hope of reducing the number of page faults.) Pages from the active list that are owned by the processes with the biggest memory footprint will be dropped into the inactive list, so making it more likely they will be eventually swapped out.

The second aspect of the application of a working set heuristic will be to alter the scheduling priorities of processes depending on their memory footprint. There are a few options here and I have not looked at this closely enough yet, but things to test could include:

• Increasing the priority of the smallest processes – on the basis these might reach the end of execution more quickly and so release memory back to the pool
• Radically lowering the priorities of the processes whose pages are being swapped out – on the basis that they do not have a working set of resources available and so, as Denning argued forty years ago, should not be able to run

In practical terms I am still some way off writing any kernel code. I have, though, written some user tools (still need polishing) to display the memory footprint of Linux processes in a red-black tree (the representation used internally by the kernel). Following Eric S Raymond (on Unix programming not politics!), the tools are partitioned into single applications that do different things – but run together they can generate graphics such as the one below:

So, on we go…