# Time to write a signal handler?

I am trying to execute some self-written pieces of software that require a lot of wall clock time – around three weeks.

I run them on the University of York‘s compute server which is rebooted on the first Tuesday of every month, so the window for the software is limited. I have until about 7 am on 5 August before the next reboot.

To add to the complication the server runs Kerberos which does not seem to play well with the screen/NFS combination I am using.

And – I keep killing the applications in error – this time, just half an hour ago I assume I was on a dead terminal session (ie an ssh login which had long since expired) and pressed ctrl-C, only to my horror to discover it was a live screen (it had not responded to ctrl-A, ctrl-A for whatever reason).

Time to add a signal handler to catch ctrl-C to at least give me the option of changing my mind!

# Reasoning about a NoC

Thinking about a Network-on-Chip system and what its system software needs to do…

• Parallelisation is essential to efficiency – in a NoC there are a multitude of cores, but each core has only the fraction of the computational power a “traditional” unicore might be expected to have – therefore it is essential that, where possible, code is parallelised across as many cores as possible;
• Each core needs to be able to access operating system services (via system calls or some other mechanism), but it is not necessarily the case that each core has to run a full or even a partial operating system – thus RPC or some other mechanism can be used to ‘remotely’ provide system services;
• Application programmers want, above all, a single address space.

# Twenty years of Windows NT

Today is the twentieth anniversary of the launch of Windows NT. In fact I have been using it (I still have to – in the sense that Windows XP/7/8 are NT – on occasion) for a bit longer than that as I was a member of the NT beta programme – I even managed to write some software for it using the 32 bit SDK before it was officially released (a not very good version of exie-ohsies/naughts-and-crosses/tic-tac-toe – so poor was the coding that you could beat the computer every time if you knew its weakness).

NT would not run on my Amstrad 386 and in the end I bought a cheap 486 machine to match the software – it was a lot of fun in the early days – though I shudder to think of trying to get by, day by day, on a machine with such a weak command line.

One thing I remember was the late, great, Byte magazine running an issue in the summer of 1992 predicting that NT would be the death of Unix. I even thought that was right – Unix was for expensive workstations and now us users of commodity hardware were to get a proper multi-tasking 32 bit operating system – who needed time-sharing when we could all have PCs of our own?

# Computer science in the UK: in the wrong direction?

Two big thoughts strike me as a result of the literature review I have just completed for my PhD:

• Linux is not the centre of the universe, in fact it is a bit of an intellectual backwater;
• The UK may have played as big a role in the invention of the electronic computer as the US, but these days it is hardly even in the game in many areas of computing research.

On the first point I am in danger of sounding like Andy “Linux is obsolete” Tanenbaum – but it is certainly the case that Linux is far from the cutting edge in operating system research. If massively parallel systems do break through to the desktop it is difficult to imagine they will be running Linux (or any monolithic operating system).

In fact the first generation may do – because nobody has anything else right now – but Linux will be a kludge in that case.

Doing my MSc which did focus on a Linux related problem, it seemed to me that we had come close to “the end of history” in  operating system research – ie the issue now was fine tuning the models we had settled on. The big issues had been dealt with in the late 60s, the 70s and the 80s.

Now I know different. Operating systems research is very vibrant and there are lots of new models competing for attention.

But along the way the UK dropped out of the picture. Read some papers on the development of virtual memory and it will not be long before you come across the seminal experiment conducted on EMAS – the Edinburgh University Multiple Access System – which was still in use when I was there in 1984. Now you will struggle to find any UK university – with the limited exceptions of Cambridge and York (for real-time) – making any impact (at least that’s how it seems to me).

It’s not that the Americans run the show either – German, Swiss and Italian universities are leading centres of research into systems software.

I am not sure how or why the UK slipped behind, but it feels like a mistake to me – especially as I think new hardware models are going to drive a lot of software innovation in the next decade (well, I would say that, wouldn’t I?)

# A Raspberry Pi project?

My Raspberry Pis are likely to be dispatched tomorrow.

Wondering if I should use the little Plan 9 installation I plan to build as a testbed for a parallel filesystem (that I write).

As preparation for next week’s literature review seminar I read about the Vesta filesystem and find it absolutely fascinating. Could we build a new filesystem – Pallas? – which combines the Vesta approach with support for Unix semantics?

# Plan9 on the Raspberry Pi

Plan 9 from Bell Labs” was meant to be the successor system to Unix and like the original was designed and built by AT&Ts engineers at Bell Labs(the title is, of course, a skit on what is supposedly the best worst-ever film – “Plan 9 from Outer Space”).

Plan 9 never really made it. Linux came along and gave us Unix for the masses on cheap hardware for free and the world moved on. (Though some of the ideas in Plan 9 were retro-fitted into Linux and other Unix-like systems.)

The increased speed of commodity computers – latterly sustained via SMP – meant that computing power that once seemed only available to the elite could be found on the High Street and easy to use and install clustering software meant scientists and others could build super-computers using cheap hardware and free software. The multi-computer idea at the heart of Plan 9 seemed to have been passed-by as we screamed along the Moore’s Law Highway.

But now Moore’s Law is breaking down – or rather we are discovering that while the Law continues to apply – in other words we can still double the number of transistors on silicon every 18 – 24 months – other factors (heat dissipation essentially) mean we cannot translate a doubling of transistors into a computer that runs twice as fast. And so the multi-computer idea is of interest once more.

Plan 9 is not likely to be the operating system of the future. But as an actually existing multi-computer operating system it could still have a lot to teach us.

Now it has been ported to run on the Raspberry Pi single board computer I have decided to order another three of these things (I already have one running as a proxy server) and use them as Plan 9 nodes. The boards should be here in about three weeks (I hope), meaning I will have them as a Christmas present to myself.

# OS/2: killed by Bill?

Here is a fascinating account of the rise and fall of OS/2, the operating system that was supposed to seal IBM’s (and Microsoft’s) global domination. Instead it flopped, being beaten by a poorer quality alternative in the form of Windows 3.0/3.1 after Microsoft pulled out.

I remember when Windows NT was launched in 1993 – one of its selling points was its ability to run OS/2 1.0 software natively via a dedicated subsystem (strange to remember, but then Microsoft went heavy on the modular nature of NT and its ability to run on non-Intel hardware and to support different software on top of the microkernel).

I could only ever find one free piece of native OS/2 software to run – a hex editor. A fundamentally vital workhorse for any programmer yet good implementations always seem to be in short supply (even now – last month I seriously considered writing my own so fed up was I with the basic offerings that come with Linux flavours). This one – its name escapes me – was a goodie though and I was a bit cheesed off when an NT upgrade (to 3.5) broke it. By then Microsoft plainly did not care much for software compatibility (or for NT’s ability to run on different platforms – that was scrapped too).

Still, OS/2 had its fans. As a journalist reporting on housing I went to see a public sector housing manager in rural Somerset at about this time: he was pioneering a new software system for his district housing offices and OS/2, with its revolutionary object-orientated desktop (which is what right clicking is all about) was to be at the core of that – with housing officers treating the desktop like various forms on which they could order repairs and so on. It was difficult not to share his enthusiasm because the idea, now a commonplace, that objects on the desktop could be more than just program icons was so new and exciting.

The article lists the ways in which Microsoft went all out to kill OS/2 and, in every practical sense, they succeeded. Those who doubt the need for free-as-in-freedom software should consider that. But it also lists various places where OS/2 is still in use (in the US). Anyone know of similar examples in the UK?

# Computer scientists’ lousy citation style

I am reading this book: Soft Real-Time Systems: Predictability vs. Efficiency, and I am struck, once again, by the truly lousy style of publication reference that seem to be preferred by so many

computer scientists,

The style used in the book appears to be that favoured by the American Mathematical Society – the so-called “authorship trigraph” – with references made up of letters from the author’s name followed by the last two figures of the year of original publication eg., [Bak91] which references in the bibliography:

[Bak91]        T.P. Baker. Stack-based scheduling of real-time processes. Journal of Real Time Systems, 3, 1991.

Now it is possible, if I were an expert in the field that I might recognise this reference, but it is far from helpful. When referencing papers written by multiple authors the system is hopeless – using the first letters of the first three authors and ignoring the rest, eg., $[DGK^+02]$ is a real reference in the book to a paper with eight authors. I really doubt many people would get that straight away.

But at least this reference system contains slightly more than the IEEE‘s citation system, which demands papers are merely referenced by a bracketed number in the text, eg., [1].

These reference systems are so widely used that I worried that my own use of the Chicago system – which specifies author and year, eg., (Denning, 1970), would be frowned upon in Birkbeck – but a re read of the regulations showed their demand was for a consistent and well-recognised system.

The ACM, too, promote a sensible citation format eg., [Denning 1970].

Does this matter? Yes. I am sure many readers of these books and papers are students who are conducting literature reviews or similar exercises. Reading the original reference may often be important and having to flick back and forth to a bibliography to check the meaning of an incomprehensible reference is not calaculated to add to the sum of human happiness.

(I don’t have any real complaints about the book though – except that the translation is plainly a bit stilted – for instance, the first sentence of the book refers to real time systems being investigated “in the last years” – a fairly common mistake in syntax from non-English speakers and one that the editors really ought to have corrected. But the occasional infelicity of language does not detract from the book’s overall appeal.)

# Working set heuristics and the Linux kernel: my MSc report

My MSc project was titled “Applying Working Set Heuristics to the Linux Kernel” and my aim was to test some local page replacement policies in Linux, which uses a global page replacement algorithm, based on the “2Q” principle.

There is a precedent for this: the so-called “swap token” is a local page replacement policy that has been used in the Linux kernel for some years.

My aim was to see if a local replacement policy graft could help tackle “thrashing” (when a computer spends so much time trying to manage memory resources – generally swapping pages back and forth to disk – it makes little or no progress with the task itself).

The full report (uncorrected – the typos have made me shudder all the same) is linked at the end, what follows is a relatively brief and simplified summary.

Fundamentally I tried two approaches: acting on large processes when the number of free pages fell to one of the watermark levels used in the kernel and acting on the process last run or most likely to run next.

For the first my thinking – backed by some empirical evidence – was that the largest process tended to consume much more memory than even the second largest. For the second the thought was that make the process next to run more memory efficient would make the system as a whole run faster and that, in any case the process next to run was also quite likely (and again some empirical evidence supported this) to be the biggest consumer of memory in the system.

To begin I reviewed the theory that underlies the claims for the superiority of the working set approach to memory management – particularly that it can run optimally with lower resource use than an LRU (least recently used) policy.

Peter Denning, the discoverer of the “working set” method and its chief promoter, argued that programs in execution do not smoothly and slowly change their fields of locality, but transition from region to region rapidly and frequently.

The evidence I collected – using the Valgrind program and some software I wrote to interpret its output, showed that Denning’s arguments appear valid for today’s programs.

Here, for instance is the memory access pattern of Mozilla Firefox:

Working set size can therefore vary rapidly, as this graph shows:

It can be seen that peaks of working set size often occur at the point of phase transition – as the process will be accessing memory from the two phases at the same time or in rapid succession.

Denning’s argument is that the local policy suggested by the working set method allows for this rapid change of locality – as the memory space allocated to a given program is free to go up and down (subject to the overall constraint on resources, of course).

He also argued that the working set method will – at least in theory – deliver a better space time product (a measure of overall memory use) than a local LRU policy. Again my results confirmed his earlier findings in that they showed that, for a given average size of a set of pages in memory, the working set method will ensure longer times between page faults, compared to a local LRU policy – as shown in this graph:

Here the red line marks the theoretical performance of a working set replacement policy and the blue line that of a local LRU policy. The y-axis marks the average number of instructions executed between page faults, the x-axis the average resident set size. The working set method clearly outperforms the LRU policy at low resident set values.

The ‘knee’ in either plot where $\frac{dy}{dx}$ is maximised is also the point of lowest space time product – at this occurs at a much lower value for the working set method than for local LRU.

So, if Denning’s claims for the working set method are valid, why is it that no mainstream operating system uses it? VMS and Windows NT (which share a common heritage) use a local page replacement policy, but both are closer to the page-fault-frequency replacement algorithm – which varies fixed allocations based on fault counts – than a true working set-based replacement policy.

The working set method is just too difficult to implement – pages need to be marked for the time they are used and to really secure the space-time product benefit claimed, they also need to be evicted from memory at a specified time. Doing any of that would require specialised hardware or complex software or both, so approximations must be used.

“Clock pressure”

For my experiments I concentrated on manipulating the “CLOCK” element of the page replacement algorithm: this removes or downgrades pages if they have not been accessed in the time been alternate sweeps of an imaginary second hand of an equally imaginary clock. “Clock pressure” could be increased – ie., pages made more vulnerable to eviction – by systematically marking them as unaccessed, while pages could be preserved in memory by marking them all as having been accessed.

The test environment was compiling the Linux kernel – and I showed that the time taken for this was highly dependent on the memory available in a system:

The red line suggests that, for all but the lowest memory, the compile time is proportional to $M^{-4}$ where $M$ is the system memory. I don’t claim this a fundamental relationship, merely what was observed in this particular set up (I have a gut feeling it is related to the number of active threads – this kernel was built using the -j3 switch and at the low memory end the swapper was probably more active than the build, but again I have not explored this).

Watermarks

The first set of patches I tried were based on waiting for free memory in the system to sink to one of the “watermarks” the kernel uses to trigger page replacement. My patches looked for the largest process then either looked to increase clock pressure – ie., make the pages from this large process more likely to be removed – or to decrease it, ie., to make it more likely these pages would be preserved in memory.

In fact the result in either case was similar – at higher memory values there seemed to be a small but noticeable decline in performance but at low memory values performance declined sharply – possibly because moving pages from one of the “queues” of cached pages involves locking (though, as later results showed also likely because the process simply is not optimal in its interaction with the existing mechanisms to keep or evict pages).

The graph below shows a typical result of an attempt to increase clock pressure – patched times are marked with a blue cross.

The second approach was to interact with the “completely fair scheduler” (CFS) and increase or decrease clock pressure on the process lease likely to run or most likely to run.

The CFS orders processes in a red-black tree (a semi-balanced tree) and the rightmost node is the process least likely to run next and the leftmost the process most likely to run next (as it has run for the shortest amount of virtual time).

As before the idea was to either free memory (increase clock pressure) or hold needed pages in memory (decrease clock pressure). The flowchart below illustrates the mechanism used for the leftmost process (and decreasing clock pressure):

But again the results were generally similar – a general decline, and a sharp decline at low memory values.

(In fact, locking in memory of the leftmost process actually had little effect – as shown below:)

But when the same approach was taken to the rightmost process – ie the process that has run for the longest time (and presumably may also run for a long time in the future), the result was a catastrophic decline in performance at small memory values:

And what is behind the slowdown? Using profiling tools the biggest reason seems to be that the wrong pages are being pushed out of the caches and  need to be fetched back in. At 40MB of free memory both patched and unpatched kernels show similar profiles with most time spent scheduling and waiting for I/O requests – but the slowness of the patched kernel shows that this has to be done many more times there.

There is much more in the report itself – including an examination of Denning’s formulation of the space-time product  – I conclude his is flawed (update: in fairness to Peter Denning, who has pointed this out to me, this is as regards his approximation of the space-time product: Denning’s modelling in the 70s also accounted for the additional time that was required to manage the working set) as it disregards the time required to handle page replacement – and the above is all a (necessary) simplification of what is in the report – so if you are interested please read that.

Applying working set heuristics to the Linux kernel

# The binomial distribution, part 1

I think there are now going to be a few posts here which essentially are about me rediscovering some A level maths probability theory and writing it down as an aid to memory.

All of this is related as to whether the length of time pages are part of the working set is governed by a stochastic (probabilistic) process or a deterministic process. Why does it matter? Well, if the process was stochastic then in low memory situations a first-in, first-out approach, or simple single queue LRU approach to page replacement might work well in comparison to the 2Q LRU approach currently in use. It is an idea that is worth a little exploring, anyway.

So, now the first maths aide memoire – simple random/probabilistic processes are binomial – something happens or it does not. If the probability of it happening in a unit time period is $p$ (update: is this showing up as ‘nm’? It’s meant to be ‘p’!) then the probability it will not happen is $1 - p = q$.  For instance this might be the probability that an atom of Uranium 235 shows $\alpha$particle decay (the probability that one U 235 atom will decay is given by its half-life of 700 million years ie., $2.21\times10^{16}$ seconds, or a probability, if my maths is correct, of a given individual atom decaying in any particular second of approximately $4.4\times10^{-16}$.

(In operating systems terms my thinking is that if the time pages spent in a working set were governed by similar processes then there will be a half life for every page that is read in. If we discarded pages after they were in the system after such a half life, or better yet some multiple of the half life, then we could have a simpler page replacement system – we would not need to use a CLOCK algorithm, just record the time a page entered the system and stick it in a FIFO queue and discard it when the entry time was more than a half life ago.

An even simpler case might be to just discard pages once the stored number reached above a certain ‘half life’ limit. Crude, certainly, but maybe the simplicity might compensate for the lack of sophistication.

Such a system would not work very well for a general/desktop operating system – as the graph for the MySQL daemon referred to in the previous blog shows, even one application could seem to show different distributions of working set sizes. But what if you had a specialist system where the OS only ran one application – then tuning might work: perhaps that could even apply to mass electionics devices, such as Android phones – after all the Android (Dalvik) VM is what is being run each time.)