Version 3.5 of the Linux kernel has been released.
Among the changes is the removal of the "swap token" code. There are various technical reasons offered for the removal – on which I am not qualified to comment – but the bottom line is that it was broken in any case, so the removal seems to make sense.
What does slightly disturb me, though, is the comment that Rik van Riel, the key figure in kernel memory management code, makes:
The days of sub-1G memory systems with heavy use of swap are over.
If we ever need thrashing reducing code in the future, we will have to
implement something that does scale.
I think the days of sub-1G systems are far from over. In fact I suspect there are more of them, and more of them running Linux, than ever before and that trend is not going to stop.
He’s right, of course, about the need to find code that does scale – my own efforts (in my MSc report) didn’t crack this problem, but I do think there is more that can be done.
- Linux Kernel 3.4.6 Is Available for Download (news.softpedia.com)
- Microsoft Apologizes For Inserting Naughty Phrase Into Linux Kernel (linux.slashdot.org)
- Download Linux Kernel 3.5 Release Candidate 7 (news.softpedia.com)
- Vincent Sanders: Linux kernel presentation (vincentsanders.blogspot.com)
The “punchcard” tool shows at what times commits are made. And here it is for the Linux kernel:
With more and more additions to the kernel coming straight off git pulls, this pattern must reflect rather more than Linus Torvalds’ own office habits.
It looks like the image of kernel hackers as nerds pulling all-nighters with the help of “rotary debuggers” (see Hackers: Heroes of the Computer Revolution for more of that) is well past its use-by date: building Linux is just a job.
It’s like the curse of the software demonstration: it doesn’t break until then.
As soon as I posted that I was ready to (try to) push the VMUFAT stuff up to mainline, I discovered there was a bug in the software.
Very large VMUFAT volumes were not being properly handled. But I think I have fixed that now. Some more testing is due, though, before I proclaim victory a second time!
- VMUFAT: almost done (I hope) (cartesianproduct.wordpress.com)
A few attempts to get this into the kernel mainline then followed. It was a bruising experience and unsuccessful. But I am about to try again.
I am a bit more confident this time – not least because I have written some userland code which will allow anyone to test the filesystem out, whether they have a VMU or not: mkfs.vmufat is now available at GitHub – https://github.com/mcmenaminadrian/mkfs.vmufat/blob/master/mkfs.vmufat.c
Secondly I do think I am a better coder thanks to the MSc and have put some effort into fixing the filesystem code itself.
But we’ll see, hopefully tomorrow, how it goes down.
- One of those debugging nights (cartesianproduct.wordpress.com)
- Filesystem works (cartesianproduct.wordpress.com)
- Fan Made Sega Dreamcast Revival Has Me Wishing Sega Still Made Consoles (devicemag.com)
- Fedora puts back Btrfs deployment yet again (h-online.com)
- Maintenance of Linux kernel 2.6.32 is slowing down (h-online.com)
My MSc project was titled “Applying Working Set Heuristics to the Linux Kernel” and my aim was to test some local page replacement policies in Linux, which uses a global page replacement algorithm, based on the “2Q” principle.
There is a precedent for this: the so-called “swap token” is a local page replacement policy that has been used in the Linux kernel for some years.
My aim was to see if grafting on a local replacement policy could help tackle “thrashing” (when a computer spends so much time trying to manage memory resources – generally swapping pages back and forth to disk – that it makes little or no progress with the task itself).
The full report (uncorrected – the typos have made me shudder all the same) is linked at the end; what follows is a relatively brief and simplified summary.
Fundamentally I tried two approaches: acting on large processes when the number of free pages fell to one of the watermark levels used in the kernel, and acting on the process last run or most likely to run next.
For the first, my thinking – backed by some empirical evidence – was that the largest process tended to consume much more memory than even the second largest. For the second, the thought was that making the process next to run more memory efficient would make the system as a whole run faster, and that, in any case, the process next to run was also quite likely (again with some empirical support) to be the biggest consumer of memory in the system.
To begin I reviewed the theory that underlies the claims for the superiority of the working set approach to memory management – particularly that it can run optimally with lower resource use than an LRU (least recently used) policy.
Peter Denning, the discoverer of the “working set” method and its chief promoter, argued that programs in execution do not smoothly and slowly change their fields of locality, but transition from region to region rapidly and frequently.
The evidence I collected – using the Valgrind program and some software I wrote to interpret its output – showed that Denning’s arguments appear valid for today’s programs.
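As a very rough illustration of the kind of measurement involved (a sketch, not the tool I actually used): the fragment below reads a Valgrind “lackey --trace-mem=yes” style trace and prints, for each fixed window of references, the number of distinct 4K pages touched – a crude proxy for working set size.

    /* Sketch only: count distinct 4K pages per window of memory references
     * read from a lackey-style trace (lines such as " L 04e2b6e0,8").
     * Assumes fewer than HASHSIZE distinct pages per window. */
    #include <stdio.h>
    #include <string.h>

    #define WINDOW   100000      /* references per measurement window */
    #define HASHSIZE 65536

    static unsigned long pages[HASHSIZE];   /* open-addressed set of page numbers */

    static int record_page(unsigned long page)
    {
        unsigned long i = page % HASHSIZE;

        while (pages[i] && pages[i] != page)
            i = (i + 1) % HASHSIZE;
        if (pages[i] == page)
            return 0;                        /* already seen in this window */
        pages[i] = page;
        return 1;                            /* a new page for this window */
    }

    int main(void)
    {
        char line[128];
        unsigned long refs = 0, distinct = 0;

        while (fgets(line, sizeof(line), stdin)) {
            unsigned long addr;
            char kind;

            if (sscanf(line, " %c %lx", &kind, &addr) != 2)
                continue;
            if (kind != 'I' && kind != 'L' && kind != 'S' && kind != 'M')
                continue;                    /* not a memory reference line */
            distinct += record_page(addr >> 12);     /* 4K pages */
            if (++refs % WINDOW == 0) {
                printf("%lu\n", distinct);   /* crude working set size for this window */
                distinct = 0;
                memset(pages, 0, sizeof(pages));
            }
        }
        return 0;
    }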
Here, for instance, is the memory access pattern of Mozilla Firefox:
Working set size can therefore vary rapidly, as this graph shows:
Denning’s argument is that the local policy suggested by the working set method allows for this rapid change of locality – as the memory space allocated to a given program is free to go up and down (subject to the overall constraint on resources, of course).
He also argued that the working set method will – at least in theory – deliver a better space-time product (a measure of overall memory use) than a local LRU policy. Again my results confirmed his earlier findings: they showed that, for a given average size of a set of pages in memory, the working set method will ensure longer times between page faults than a local LRU policy – as shown in this graph:
Here the red line marks the theoretical performance of a working set replacement policy and the blue line that of a local LRU policy. The y-axis marks the average number of instructions executed between page faults, the x-axis the average resident set size. The working set method clearly outperforms the LRU policy at low resident set values.
The ‘knee’ in either plot – the point where the ratio of time between faults to resident set size is maximised – is also the point of lowest space-time product, and this occurs at a much lower resident set value for the working set method than for local LRU.
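To put the ‘knee’ claim in symbols (a rough statement of the standard working set theory in my own notation, not something lifted from the report): write $L(x)$ for the mean number of instructions executed between faults at mean resident set size $x$, and $D$ for the delay in servicing a fault, measured in the same units. The space-time accumulated per fault is then roughly

    $ST_{fault}(x) \approx x \, (L(x) + D)$

so the space-time per instruction of progress is about $x \, (1 + D/L(x))$; when $D$ dominates, this is smallest where $L(x)/x$ is greatest – the knee of the lifetime curve.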
So, if Denning’s claims for the working set method are valid, why is it that no mainstream operating system uses it? VMS and Windows NT (which share a common heritage) use a local page replacement policy, but both are closer to the page-fault-frequency replacement algorithm – which varies fixed allocations based on fault counts – than a true working set-based replacement policy.
The working set method is just too difficult to implement – pages need to be marked for the time they are used and to really secure the space-time product benefit claimed, they also need to be evicted from memory at a specified time. Doing any of that would require specialised hardware or complex software or both, so approximations must be used.
For my experiments I concentrated on manipulating the “CLOCK” element of the page replacement algorithm: this removes or downgrades pages if they have not been accessed in the time between alternate sweeps of an imaginary second hand of an equally imaginary clock. “Clock pressure” could be increased – i.e., pages made more vulnerable to eviction – by systematically marking them as unaccessed, while pages could be preserved in memory by marking them all as having been accessed.
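For illustration (a userspace toy, not kernel code), the CLOCK idea and the two kinds of “pressure” can be sketched like this:

    #include <stdbool.h>
    #include <stddef.h>

    #define NFRAMES 64

    /* Toy second-chance ("CLOCK") replacement: each frame has a referenced
     * bit; the hand sweeps round, evicting the first frame whose bit is
     * clear and clearing bits as it passes. */
    struct frame {
        unsigned long page;      /* which page occupies this frame */
        bool referenced;         /* set when the page is accessed */
    };

    static struct frame frames[NFRAMES];
    static size_t hand;          /* the clock hand */

    static size_t clock_evict(void)
    {
        for (;;) {
            struct frame *f = &frames[hand];
            size_t victim = hand;

            hand = (hand + 1) % NFRAMES;
            if (!f->referenced)
                return victim;           /* not used since the last sweep: evict */
            f->referenced = false;       /* give it a second chance */
        }
    }

    /* "Increase clock pressure": make frames eviction candidates sooner
     * by clearing their referenced bits. */
    static void increase_clock_pressure(void)
    {
        for (size_t i = 0; i < NFRAMES; i++)
            frames[i].referenced = false;
    }

    /* "Decrease clock pressure": protect frames for a sweep by marking
     * them all as recently used. */
    static void decrease_clock_pressure(void)
    {
        for (size_t i = 0; i < NFRAMES; i++)
            frames[i].referenced = true;
    }

In the real kernel the equivalent “referenced” information lives in the hardware-set accessed bit of the page table entry and in the PG_referenced page flag, rather than in a neat array like this.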
The test environment was compiling the Linux kernel – and I showed that the time taken for this was highly dependent on the memory available in a system:
The red line suggests that, for all but the lowest memory, the compile time is proportional to an inverse power of M, where M is the system memory. I don’t claim this is a fundamental relationship, merely what was observed in this particular set-up (I have a gut feeling it is related to the number of active threads – this kernel was built using the -j3 switch, and at the low-memory end the swapper was probably more active than the build, but again I have not explored this).
The first set of patches I tried were based on waiting for free memory in the system to sink to one of the “watermarks” the kernel uses to trigger page replacement. My patches looked for the largest process and then either sought to increase clock pressure – i.e., make the pages from this large process more likely to be removed – or to decrease it, i.e., to make it more likely these pages would be preserved in memory.
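As a rough sketch only – this is not the patch code itself – the “find the largest process” step can be done much as the kernel’s OOM killer surveys tasks, something like:

    /* Illustrative only: locate the task with the largest resident set.
     * for_each_process(), get_mm_rss() and get_task_struct() are real
     * kernel helpers (circa the 3.0 kernels used here); the surrounding
     * function is a sketch, not the actual patch, and a robust version
     * would also need to pin p->mm while reading it. */
    #include <linux/sched.h>
    #include <linux/mm.h>

    static struct task_struct *find_largest_task(void)
    {
        struct task_struct *p, *biggest = NULL;
        unsigned long max_rss = 0;

        rcu_read_lock();
        for_each_process(p) {
            unsigned long rss;

            if (!p->mm)                  /* skip kernel threads */
                continue;
            rss = get_mm_rss(p->mm);
            if (rss > max_rss) {
                max_rss = rss;
                biggest = p;
            }
        }
        if (biggest)
            get_task_struct(biggest);    /* caller must put_task_struct() */
        rcu_read_unlock();

        return biggest;
    }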
In fact the result in either case was similar – at higher memory values there seemed to be a small but noticeable decline in performance, but at low memory values performance declined sharply – possibly because moving pages from one of the “queues” of cached pages involves locking (though, as later results showed, also likely because the procedure simply does not interact optimally with the existing mechanisms to keep or evict pages).
The graph below shows a typical result of an attempt to increase clock pressure – patched times are marked with a blue cross.
The second set of patches instead acted on the process most (or least) likely to run next, as ordered by the kernel’s Completely Fair Scheduler (CFS). The CFS orders processes in a red-black tree (a semi-balanced tree): the rightmost node is the process least likely to run next and the leftmost the process most likely to run next (as it has run for the shortest amount of virtual time).
As before the idea was to either free memory (increase clock pressure) or hold needed pages in memory (decrease clock pressure). The flowchart below illustrates the mechanism used for the leftmost process (and decreasing clock pressure):
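In C-style pseudocode the decreasing-pressure path looks roughly like this (an illustration of the mechanism, not the patch itself; leftmost_cfs_task() and mark_mm_pages_accessed() are hypothetical helpers standing in for walking the CFS red-black tree with rb_first() and for walking the chosen task’s page tables setting the referenced/young bits):

    /* Pseudocode sketch of the flowchart's mechanism. */
    static void relieve_clock_pressure_on_next_task(void)
    {
        struct task_struct *next = leftmost_cfs_task();   /* hypothetical helper */

        if (!next || !next->mm)
            return;

        /* Mark the task's resident pages as recently used so the
         * CLOCK-style scan is less likely to evict them. */
        mark_mm_pages_accessed(next->mm);                 /* hypothetical helper */
    }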
But again the results were generally similar – a general decline, and a sharp decline at low memory values.
(In fact, locking the leftmost process’s pages in memory actually had little effect – as shown below:)
But when the same approach was taken to the rightmost process – i.e., the process that has run for the longest time (and presumably may also run for a long time in the future) – the result was a catastrophic decline in performance at small memory values:
And what is behind the slowdown? Profiling suggests the biggest reason is that the wrong pages are being pushed out of the caches and have to be fetched back in. At 40MB of free memory both patched and unpatched kernels show similar profiles, with most time spent scheduling and waiting for I/O requests – but the slowness of the patched kernel shows that this has to be done many more times there.
There is much more in the report itself – including an examination of Denning’s formulation of the space-time product, which I conclude is flawed because it disregards the time required to handle page replacement (update: in fairness to Peter Denning, who has pointed this out to me, this applies to his approximation of the space-time product; his modelling in the 1970s also accounted for the additional time required to manage the working set). The above is a necessarily simplified account of what is in the report, so if you are interested please read that.
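A rough way to write that distinction (my own notation, not the report’s or Denning’s): suppose a program makes $K$ memory references, suffers $F$ page faults each costing a delay $D$ (in reference-time units), and holds $\bar{s}$ pages on average. The approximation I took issue with is

    $ST \approx \bar{s} \, K$

whereas counting the time spent waiting on page replacement gives

    $ST \approx \bar{s} \, (K + F D)$

and the two diverge exactly when faults are frequent – which is the thrashing regime the work was aimed at.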
- Best book on Linux kernel internals (cartesianproduct.wordpress.com)
- Is the time pages are in the working set stochastic? (cartesianproduct.wordpress.com)
- Counting soft and hard faults (cartesianproduct.wordpress.com)
- Building The Linux Kernel In 60 Seconds (phoronix.com)
- Update Your Linux Kernel Without Reboot (hackstips.wordpress.com)
Reading about the Monte Carlo method has set me thinking about this and how, if at all, it might be applied to page reclaim in the Linux kernel.
In my MSc report I show that working set size is not normally distributed – despite occasional claims to the contrary in computer science textbooks. But it is possible that a series of normal distributions is overlaid – see the graphic below:
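As a toy illustration of that possibility (the numbers are made up, not taken from the report’s data), the fragment below draws samples from an overlay of two normal distributions; the combined output is clearly not normal even though each component is:

    /* Toy mixture-of-normals sampler using the Box-Muller transform. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PI 3.14159265358979323846

    static double normal_sample(double mean, double sd)
    {
        double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);
        double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);

        return mean + sd * sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
    }

    int main(void)
    {
        for (int i = 0; i < 10000; i++) {
            /* say 70% of samples from one (small) phase of locality
             * and 30% from another, larger one - invented figures */
            double ws = (rand() % 10 < 7) ? normal_sample(200.0, 20.0)
                                          : normal_sample(800.0, 60.0);
            printf("%f\n", ws);
        }
        return 0;
    }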
(I may find out how I have done in the degree in the next week or so – wish me luck)
- A. Papoulis, Probability, Random Variables and Stochastic Processes (renormalizationgroupllc.net)
- Done and dusted (cartesianproduct.wordpress.com)
- Mathematician, Architect But Essentially Revolutionizing Music: Iannis Xenakis (mehmetokonsar.wordpress.com)
- Fast Stochastics vs Slow Stochastics (managemenresiko.wordpress.com)
- Best book on Linux kernel internals (cartesianproduct.wordpress.com)
- Data Management: Stochastic Simulation Utilization (bjconquest.com)
- Stochastic process etc. (kourelarios.wordpress.com)
- Linux Kernel Vulnerability Discovered on Ubuntu 11.10 (pitchengine.com)
For the last six years my job situation has made me wary of commenting on the politics of the free software movement and its enemies, but I have just changed jobs (it has been a busy week round here) and now I feel I can comment freely on what every free software advocate has always known as Public Enemy Number One: Microsoft.
Except, now the time has come, I don’t really feel they are any more. Indeed I feel free software is more threatened by one of the teams that is meant to be on our side – Google, which has embraced the world of software patents and user abandonment with enthusiasm – and by another, Apple, which uses Unix but seems oblivious to the idea of user freedom.
Now, I am not saying that, because Microsoft have found themselves in recent months as one of the biggest contributors to the Linux kernel, and marked the 20th birthday of Linus’s famous usenet announcement of a new Unix-like kernel with an appeal for more co-operation, they really are our friends. But we should recognise that the change – even if it did come because they were dragged through the US and EU courts – is a real one. They are at least recognising that we are here to stay and that the server room of the future will be running multiple OSes on virtualised machines of different flavours.
On the desktop our side is still nowhere – perhaps still under 1% globally – and the boys and girls in Seattle might quickly turn nasty again if we ever did start to crack that nut, but in the meantime we maybe should be testing just how sincere that offer of co-operation really is… after all, they are not offering to work with us because they think we are weak!
Update: Thanks to the retweeters. I think I should point out that my proposal of a compromesso storico (historic compromise) with Microsoft is a minority view – as the links below probably suggest.
- As Attachmate Distances Itself From SUSE, Commitment to UNIX Copyrights Doubted (techrights.org)
- Microsoft Charm (techcrunch.com)
- Can We Trust Microsoft? (ostatic.com)
- Linux at 20: New Challenges, New Opportunities (pcworld.com)
- Microsoft Plays Dirty PR Games to Daemonise Sceptics of Convicted Criminals (techrights.org)
- A Brief History of Linux (techebook.wordpress.com)
I submitted my MSc project report yesterday, so that is it, at least for now, as a computer science student.
The report was on “applying working set heuristics to the Linux kernel”: essentially testing to see if there were ways to overlay some elements of local page replacement on the kernel’s global page replacement policy that would speed turnaround times.
The answer to that appears to be ‘no’ – at least not in the ways I attempted, though I think there may be some ways to improve performance if some serious studies of phases of locality in programs gave us a better understanding of ways to spot the end of one phase and the beginning of another.
But, generally speaking, my work showed the global LRU policy of the kernel was pretty robust.
I find out how I did in November.
Need to find some other programming task now. Mad bit of me suggests getting engaged with GNU Hurd. Though mucking about with Android also has an appeal.
As one last thing for my report I want to profile the kernel in a specific configuration, and so thought I would try oprofile instead of the cruder profile=X command line option.
Essentially I could not get it to run under KVM at all. KVM hides many hardware details from the profiler, and setting it up is notoriously difficult (as I now know). Apparently this is fixable, but not simply, and there is very little information out there about how to do it. If I had days to spend learning it maybe I would persevere, but I don’t.
But I did come across one feature/bug in oprofile that I will document a fix for in the hope it proves useful to someone.
To start oprofile off (to profile the kernel), one has to specify where a vmlinux file (note: not a compressed bzImage or the like) can be found. Mine were of the format vmlinux-3.0.0-sched+, but oprofile consistently failed to accept that: again I did not have time to go into the details, but it was clear the + was the issue. I renamed the file and all was fine.
- Linux takes a page from Solaris… pmap and available on Linux. (glennfawcett.wordpress.com)
- Linux Performance Tips From MySql Conf2010 (alanristic.com)
- Linux is now hosted on GitHub (github.com)
- imabonehead: Raymond’s Blog: HOWTO: Using Microsoft Kinect on Tegra Ventana (Android 3.0) (raymondlo84.blogspot.com)
- imabonehead: Compiling Linux kernel for QEMU ARM emulator ” Balau (balau82.wordpress.com)
I have three books on the kernel – there are obviously others, but I think two of these three will be familiar to most kernel hackers – but it is the third that I rate most highly.
Understanding the Linux Kernel: for many this must feel like the standard text on the kernel – it’s published by O’Reilly, so (my experience with their XML Pocket Reference notwithstanding) it will be good, it’s well printed and readable, and it is, after all, the third edition of a book your forefathers used. But the problem is, it is also now six years old and a lot has happened since then. I well remember going into Foyles in the autumn of 2005 and seeing it newly minted on the shelves. For a long time it could hide behind the fact that the kernel was still in the 2.6 series, but even that protection is gone. Verdict: venerable but getting close to past it.
Linux Kernel Development: the previous, second, edition of this book was a fantastic introduction to how the kernel worked and was written in a slightly whimsical tone which made it easier to read. It is rare that one can read a computer book like a novel, starting from the first page and going on to the end, but you could with that. Sadly someone seems to have got to Robert Love (who, from personal experience I know to be a great guy) and presumably told him he had to be more serious if he wanted his book to be a set text for CS courses. The problem is that the book now falls between two stools – somewhere between a solid but broad introduction to how the kernel works and a guide to writing kernel code. Unfortunately, it does not quite hit either target. That said, it is still worth having. Verdict: Good, but where did the magic go?
Professional Linux Kernel Architecture: unfortunately, everything about the Wrox brand suggests “cheap and nasty”, which is a real pity, as this book is the best of the bunch. Admittedly it would not, unlike Robert Love’s book, be at all suitable as a primer for operating system study – it is far too big and detailed for that. But if you are looking for an up-to-date (or reasonably so, anyway) helpmate for actually patching the kernel, then this seems to be the best choice. Sadly the cheap and nasty side does creep through on occasion with bad editing/translation, but it’s not enough to stop me from recommending it. Verdict: this one’s a keeper, for now anyway.
- Linux OS developers breached by trojan (go.theregister.com)
- * Urgent Linux Kernal/ internal for Aricent-Chennai (referaljobs.wordpress.com)