# Thrash reduction no longer a priority for Linux kernel devs?

Version 3.5 of the Linux kernel has been released.

One of the changes it includes is the removal of the “swap token” code – one of the very few ‘local’ memory management policies grafted on to the ‘global’ page replacement mechanisms in the kernel.

There are various technical reasons offered for the removal of the code – on which I am not qualified to comment – but the bottom line is that it was broken in any case, so the removal seems to make sense.

What does slightly disturb me, though, is the comment that Rik van Riel, the key figure in kernel memory management code, makes:

> The days of sub-1G memory systems with heavy use of swap are over.
> If we ever need thrashing reducing code in the future, we will have to
> implement something that does scale.

I think the days of sub-1G systems are far from over. In fact I suspect there are more of them, and more of them running Linux, than ever before and that trend is not going to stop.

He’s right, of course, about the need to find code that works – my own efforts (in my MSc report) didn’t crack this problem, but I do think there is more that can be done.

# In praise of Peter Denning

It’s often said one should not meet one’s heroes as, all too often, they turn out to be, well, just a bit too human. To be sure, in politics I have often been up close to people who were seen as ethereal beings by many but were indeed a bit ordinary close up (though you can also see in some people an inexpressible quality of brilliance – Tony Blair and Ken Clarke both had this).

Well, I have never met Peter J. Denning, the discoverer of the “working set method”, and the man whose work formed the intellectual backdrop to my MSc project last year. But I have now exchanged a few emails with him and I do want to say that they have all increased my admiration for him as one of the great foundational figures of modern computing.

He sought me out after he came across my MSc. I have to say my first thought was that he was likely to tear a strip off me (actually my very first reaction was to think he was a recruitment consultant wasting his time – it simply never occurred to me that the Peter Denning emailing me would be that Peter Denning) – as I had described his formulation of the space-time product for memory management as flawed. In fact he just pointed out that I was criticising an approximation which he accepted did not fully represent the space-time needed to manage a working set method of page replacement, but also pointed out he had accounted for this in other papers (and he had).

He and I then exchanged a few emails about memory management issues and about my current research interests.

A great man. A giant of computing. And nice to boot. Who would have thought it?

# Working set heuristics and the Linux kernel: my MSc report

My MSc project was titled “Applying Working Set Heuristics to the Linux Kernel” and my aim was to test some local page replacement policies in Linux, which uses a global page replacement algorithm, based on the “2Q” principle.

There is a precedent for this: the so-called “swap token” is a local page replacement policy that has been used in the Linux kernel for some years.

My aim was to see if a local replacement policy graft could help tackle “thrashing” (when a computer spends so much time trying to manage memory resources – generally swapping pages back and forth to disk – that it makes little or no progress with the task itself).

The full report (uncorrected – the typos have made me shudder all the same) is linked at the end; what follows is a relatively brief and simplified summary.

Fundamentally I tried two approaches: acting on large processes when the number of free pages fell to one of the watermark levels used in the kernel and acting on the process last run or most likely to run next.

For the first, my thinking – backed by some empirical evidence – was that the largest process tended to consume much more memory than even the second largest. For the second, the thought was that making the process next to run more memory efficient would make the system as a whole run faster and that, in any case, the process next to run was also quite likely (and again some empirical evidence supported this) to be the biggest consumer of memory in the system.

To begin I reviewed the theory that underlies the claims for the superiority of the working set approach to memory management – particularly that it can run optimally with lower resource use than an LRU (least recently used) policy.
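
For reference, the working set itself is defined over a sliding window, of length $\theta$, of the process’s own execution (its virtual time):

$$W(t, \theta) = \{\,\text{pages referenced in the interval } (t - \theta,\, t\,]\,\}$$

Under a pure working set policy the resident set at time $t$ is exactly $W(t, \theta)$, so a process’s memory allocation grows and shrinks with its locality rather than being fixed in advance.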

Peter Denning, the discoverer of the “working set” method and its chief promoter, argued that programs in execution do not smoothly and slowly change their fields of locality, but transition from region to region rapidly and frequently.

The evidence I collected – using the Valgrind program and some software I wrote to interpret its output – showed that Denning’s arguments appear valid for today’s programs.

Here, for instance, is the memory access pattern of Mozilla Firefox:

Working set size can therefore vary rapidly, as this graph shows:

It can be seen that peaks of working set size often occur at the point of phase transition – as the process will be accessing memory from the two phases at the same time or in rapid succession.

Denning’s argument is that the local policy suggested by the working set method allows for this rapid change of locality – as the memory space allocated to a given program is free to go up and down (subject to the overall constraint on resources, of course).

He also argued that the working set method will – at least in theory – deliver a better space-time product (a measure of overall memory use) than a local LRU policy. Again my results confirmed his earlier findings in that they showed that, for a given average size of a set of pages in memory, the working set method will ensure longer times between page faults, compared to a local LRU policy – as shown in this graph:

Here the red line marks the theoretical performance of a working set replacement policy and the blue line that of a local LRU policy. The y-axis marks the average number of instructions executed between page faults, the x-axis the average resident set size. The working set method clearly outperforms the LRU policy at low resident set values.

The ‘knee’ in either plot – where $\frac{dy}{dx}$ is maximised – is also the point of lowest space-time product, and this occurs at a much lower resident set value for the working set method than for local LRU.
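
In my notation, and as a simplification, the space-time product being minimised here is the memory a process holds multiplied by the time it takes to run, including the time spent waiting on faults:

$$ST \approx \bar{s}\,(T + F\,D)$$

where $\bar{s}$ is the mean resident set size, $T$ the virtual (compute) time, $F$ the number of page faults and $D$ the average delay in servicing one. Dropping the $F D$ term gives the simpler approximation I return to, critically, towards the end of the report.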

So, if Denning’s claims for the working set method are valid, why is it that no mainstream operating system uses it? VMS and Windows NT (which share a common heritage) use a local page replacement policy, but both are closer to the page-fault-frequency replacement algorithm – which varies fixed allocations based on fault counts – than a true working set-based replacement policy.

The working set method is just too difficult to implement – pages need to be marked with the time they were last used and, to really secure the claimed space-time product benefit, they also need to be evicted from memory at a specified time. Doing any of that would require specialised hardware or complex software or both, so approximations must be used.

## “Clock pressure”

For my experiments I concentrated on manipulating the “CLOCK” element of the page replacement algorithm: this removes or downgrades pages if they have not been accessed in the time between alternate sweeps of an imaginary second hand of an equally imaginary clock. “Clock pressure” could be increased – ie., pages made more vulnerable to eviction – by systematically marking them as unaccessed, while pages could be preserved in memory by marking them all as having been accessed.
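
In the kernel the “accessed” state is just a per-page flag, so the manipulation itself is simple. A rough sketch of the idea – not the actual patches: `for_each_resident_page()` is a hypothetical stand-in for the page-table walk the real code has to do, though the page-flag macros are the kernel’s own:

```c
#include <linux/mm.h>
#include <linux/page-flags.h>
#include <linux/sched.h>

/* Sketch only: for_each_resident_page() is hypothetical shorthand
 * for walking the task's page tables. */
static void adjust_clock_pressure(struct task_struct *victim, bool increase)
{
	struct page *page;

	for_each_resident_page(victim, page) {
		if (increase)
			ClearPageReferenced(page);	/* looks idle – more likely to be evicted */
		else
			SetPageReferenced(page);	/* looks busy – more likely to be kept */
	}
}
```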

The test environment was compiling the Linux kernel – and I showed that the time taken for this was highly dependent on the memory available in a system:

The red line suggests that, for all but the lowest memory, the compile time is proportional to $M^{-4}$ where $M$ is the system memory. I don’t claim this is a fundamental relationship, merely what was observed in this particular set-up (I have a gut feeling it is related to the number of active threads – this kernel was built using the -j3 switch and at the low memory end the swapper was probably more active than the build, but again I have not explored this).

## Watermarks

The first set of patches I tried was based on waiting for free memory in the system to sink to one of the “watermarks” the kernel uses to trigger page replacement. My patches looked for the largest process and then either increased clock pressure – ie., made the pages from this large process more likely to be removed – or decreased it, ie., made it more likely these pages would be preserved in memory.
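
Finding the largest process is essentially what the kernel’s out-of-memory killer already does; a hedged sketch of that selection step (not the actual patch – the locking is simplified and races with exiting tasks are glossed over):

```c
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>

/* Sketch: pick the user process with the largest resident set. */
static struct task_struct *largest_task(void)
{
	struct task_struct *p, *biggest = NULL;
	unsigned long rss, max_rss = 0;

	rcu_read_lock();
	for_each_process(p) {
		if (!p->mm)
			continue;		/* kernel thread – no user pages */
		rss = get_mm_rss(p->mm);
		if (rss > max_rss) {
			max_rss = rss;
			biggest = p;
		}
	}
	rcu_read_unlock();
	return biggest;
}
```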

In fact the result in either case was similar – at higher memory values there seemed to be a small but noticeable decline in performance but at low memory values performance declined sharply – possibly because moving pages from one of the “queues” of cached pages involves locking (though, as later results showed, also likely because the process simply is not optimal in its interaction with the existing mechanisms to keep or evict pages).

The graph below shows a typical result of an attempt to increase clock pressure – patched times are marked with a blue cross.

The second approach was to interact with the “completely fair scheduler” (CFS) and increase or decrease clock pressure on the process least likely to run or most likely to run.

The CFS orders processes in a red-black tree (a semi-balanced tree) and the rightmost node is the process least likely to run next and the leftmost the process most likely to run next (as it has run for the shortest amount of virtual time).
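
Picking out those two processes is cheap because the kernel’s red-black tree API keeps both ends of the tree to hand. A sketch, using the field names of the 3.x-era scheduler the patches were written against (they have moved since, and `struct cfs_rq` is private to the scheduler, so code like this has to live inside kernel/sched):

```c
#include <linux/rbtree.h>
#include <linux/sched.h>

/* Sketch: the entities at either end of a CPU's CFS run queue. */
static struct sched_entity *edge_entity(struct cfs_rq *cfs_rq, bool leftmost)
{
	struct rb_node *node = leftmost ? rb_first(&cfs_rq->tasks_timeline)
					: rb_last(&cfs_rq->tasks_timeline);

	return node ? rb_entry(node, struct sched_entity, run_node) : NULL;
}
```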

As before the idea was to either free memory (increase clock pressure) or hold needed pages in memory (decrease clock pressure). The flowchart below illustrates the mechanism used for the leftmost process (and decreasing clock pressure):

But again the results were generally similar – a general decline, and a sharp decline at low memory values.

(In fact, locking in memory of the leftmost process actually had little effect – as shown below:)

But when the same approach was taken to the rightmost process – ie the process that has run for the longest time (and presumably may also run for a long time in the future), the result was a catastrophic decline in performance at small memory values:

And what is behind the slowdown? Using profiling tools the biggest reason seems to be that the wrong pages are being pushed out of the caches and need to be fetched back in. At 40MB of free memory both patched and unpatched kernels show similar profiles with most time spent scheduling and waiting for I/O requests – but the slowness of the patched kernel shows that this has to be done many more times there.

There is much more in the report itself – including an examination of Denning’s formulation of the space-time product – I conclude it is flawed (update: in fairness to Peter Denning, who has pointed this out to me, this is as regards his approximation of the space-time product: Denning’s modelling in the 70s also accounted for the additional time that was required to manage the working set) as it disregards the time required to handle page replacement – and the above is all a (necessary) simplification of what is in the report – so if you are interested please read that.

Applying working set heuristics to the Linux kernel

# Hard faults and soft faults in the real world

I ran Audacity under valext and here is the graph of real memory use:

And here is the soft and hard fault count:

My surmise as to what you can see here? Lots of initialising – with memory use shooting up and down – though the low level of hard faults suggests much of this is from libraries already loaded in the system. Then pages getting swapped out as nothing happens – Audacity did not actually display a window – not sure why – I killed it after it had been running for about two and a half minutes of virtual time (around 24 hours of wall clock time) as that was more than enough time to produce something on screen!

Still, as a first test of the tool, that was not bad.

# Counting soft and hard faults

When a running program references a page of memory that is not mapped into its address space the operating system throws a “page fault” – calling some kernel code to ensure that the page is loaded and mapped, or if the address referenced is not legal, that an appropriate error (a seg fault on x86) is signalled and the program’s execution stopped.

If the address is ‘legal’ then two types of fault exist – a ‘hard’ fault where the missing memory (eg some code) has to be loaded from disk or a ‘soft’ fault where the missing page is already in memory (typically because it is in a shared library and being used elsewhere) and so all that has to happen is for the page to be mapped into the address space of the executing program.

(The above is all a simplification, but I hope it is clear enough.)

Soft faults, as you might expect, are handled much faster than hard faults – as disk access is generally many orders of magnitude slower than memory access.

Memory management and paging policies are generally designed to minimise the number of faults, especially hard faults.

So – what is the ratio of hard and soft faults? I have further extended the valext program I wrote for my MSc project to count just that – and it seems that on a loaded Linux system soft faults are generally an order of magnitude more common than hard faults even when launching a new program from ‘scratch’ (eg I am seeking to run an instance of ‘audacity’ under valext – and after executing 326,000 instructions there have been 274 soft faults and 37 hard faults).

That is good, of course, because it makes for faster, more efficient computing. But it also means that further optimising the paging policy of Linux is tough – hard faults take so much time that you can afford to run a lot of optimising code and still hope for better performance if you cut the number of hard faults even slightly. But if soft faults outnumber hard faults by 10 to 1 then running a lot of extra code to cut the number of faults may not be so beneficial.
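
If you just want a quick feel for the split on your own system there is no need for anything as heavyweight as valext: getrusage() reports both counters for the calling process. A minimal, standalone example (the malloc-and-touch loop is only there to generate some faults worth counting):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void)
{
	const size_t size = 64 * 1024 * 1024;
	char *buf = malloc(size);
	if (!buf)
		return 1;

	/* Touch each page once so the process takes roughly one fault per page. */
	for (size_t i = 0; i < size; i += 4096)
		buf[i] = 1;

	struct rusage usage;
	if (getrusage(RUSAGE_SELF, &usage) != 0) {
		perror("getrusage");
		return 1;
	}

	printf("soft (minor) faults: %ld\n", usage.ru_minflt);
	printf("hard (major) faults: %ld\n", usage.ru_majflt);

	free(buf);
	return 0;
}
```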

(You can download valext at github – here NB: right now this extra feature is in the ‘faultcount’ branch – it will be merged into master in due course.)

# Writing more code to avoid writing any of the report?

I have managed to churn out 160 lines of working C today – which I think is quite good going, though, according to this, maybe I could have churned out 400 of C++ or even 960 of Perl (I love Perl but the mind boggles).

My program will tell you how many pages are present in memory or how many have been swapped out (it has the ability to do a bit more too, but I have not exploited that even though it is essentially there) – you can fetch it from github here: valext (the name reflects the idea that this was going to be an extension to Valgrind but then I discovered/realised that was never going to work).
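
For the curious, the kernel exposes exactly this per-page information through /proc/&lt;pid&gt;/pagemap, which holds one 64-bit entry per virtual page (bit 63 set means the page is present in RAM, bit 62 that it has been swapped out). A stripped-down, standalone sketch of the idea, checking a single page of the calling process:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *buf = malloc(page_size);
	if (!buf)
		return 1;
	buf[0] = 1;			/* make sure the page is actually backed */

	int fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* One 64-bit entry per virtual page, indexed by virtual page number. */
	uint64_t entry;
	off_t offset = ((uintptr_t)buf / page_size) * sizeof(entry);
	if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
		perror("pread");
		return 1;
	}

	printf("present: %d  swapped: %d\n",
	       (int)((entry >> 63) & 1), (int)((entry >> 62) & 1));

	close(fd);
	free(buf);
	return 0;
}
```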

Anyway, what I now have to face up to is whether I am churning out code to avoid actually writing any of my MSc project report – I have a feeling that some of that is going on and think I need to start rereading Writing for Computer Science – which helped me greatly with the proposal.

# Performance collapse in the Open JVM

Unfortunately, I do not have time to investigate this further myself, but others may do.

But yesterday I had a serious performance issue with the (open) JVM – though I was able to solve it with an algorithm change – swapping the problematic (integer) code for a lot of floating point maths: not the usual way to fix a performance issue but one that works.

My original code (in Groovy) appended many millions of integers to a list and then, once a loop was complete, calculated the average for the list (calculating the average working set size for a running process). When I was dealing with 2 – 3 million integers it worked well and performance, if not exactly zipping along, was good. Push that up to 10 – 11 million and on the first couple of times through the loop CPU utilisation dropped precipitously (this was multithreaded – with runs through the loop operating in parallel) though the code was still visibly working; after that the intervals between loop completions grew to the point that the code seemed to have failed.

Even when I pre-allocated 0x1000000 items in what I now explicitly declared as an ArrayList the performance was little better – the first couple of iterations seemed a bit faster but performance then died.

I do not know what is going on – though excessive memory fragmentation perhaps coupled with poor garbage collection seem like the obvious answers: seems there is probably a brick wall for ArrayList size that sees whatever memory allocation algorithm operates inside the JVM fall over.

How did I fix it? Update the average in real time – in pseudo code below:

```
average = 0
previousInstructions = 0
currentInstructions = 0
loop [0, maxInstructions - 1) {
    currentInstructions++
    if (change_in_working_set_size) {
        average = ((average * previousInstructions) +
                   workingSetSize() * (currentInstructions - previousInstructions)) / currentInstructions
        previousInstructions = currentInstructions
    }
}
```

Like I say, floating point, but it works.

# Itsy-OS kernel: who needs memory protection anyway?

I have to say I love this little piece of code – a 380 byte “operating system kernel” for the Intel 8086. Actually just a task switcher and close to the simplest memory management imaginable. But if you had a device that you would be happy just to switch off and on again if and when something crashes, this could even be useful.