Update: I have truncated this article for now (20 December) as there was an error in my LRU software that made LRU look like a much better performer than it really was. I’ll update this with the correct data shortly….
In 1966 László Bélády published "A study of replacement algorithms for virtual storage computers", one of the truly epoch-making papers of operating system science: the first comprehensive examination of page replacement strategies for virtual memory computers.
These days all but the simplest embedded computing devices use some sort of virtual memory system, because it allows them to (relatively) seamlessly load pieces of computer programs in and out of memory as needed. The programs see a faked – virtual – address, so the chunks can be loaded in and out of whatever piece of physical memory is available, without having to land in exactly the same place every time.
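The mechanism can be sketched in a few lines: a virtual address splits into a page number and an offset, and a page table maps page numbers to wherever the page happens to sit in physical memory. The page size and table contents below are invented purely for illustration.

```python
PAGE_SIZE = 4096  # a common page size, in bytes

# A toy page table: virtual page number -> physical frame number.
# In a real system this would be maintained by the OS and walked by the MMU.
page_table = {0: 7, 1: 3, 2: 12}

def translate(virtual_addr):
    """Split a virtual address into page number and offset, then look the
    page up; a missing entry is a page fault, which would normally trigger
    the replacement machinery discussed below."""
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    if vpn not in page_table:
        raise KeyError(f"page fault: virtual page {vpn} not resident")
    return page_table[vpn] * PAGE_SIZE + offset

print(translate(4100))  # virtual page 1, offset 4 -> frame 3, i.e. 12292
```

The point is that the program only ever sees `virtual_addr`; the frame number on the right-hand side of the table can change every time the page is loaded.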
But in 1966 virtual memory was a new and essentially experimental technology, and so Bélády's examination of the different strategies for deciding which chunk (page) of memory should be kept or replaced when new pages had to be loaded is the foundation stone of all the approaches that followed.
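One of the things that paper gave us is the optimal (MIN) policy: evict the resident page whose next use lies farthest in the future. It needs perfect knowledge of the future, so it is unusable in practice, but it sets the lower bound against which real policies are judged. A minimal sketch, using a made-up reference string of page numbers:

```python
def opt_faults(refs, frames):
    """Belady's optimal (MIN) policy: on a fault with no free frame,
    evict the resident page whose next use lies farthest ahead."""
    resident = set()
    faults = 0
    for i, page in enumerate(refs):
        if page in resident:
            continue
        faults += 1
        if len(resident) == frames:
            def next_use(p):
                # Pages never referenced again sort past the end of the trace.
                try:
                    return refs.index(p, i + 1)
                except ValueError:
                    return len(refs)
            resident.remove(max(resident, key=next_use))
        resident.add(page)
    return faults

# A classic textbook reference string, three frames.
print(opt_faults([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], 3))
```

Any real policy replayed over the same trace must fault at least as often as this.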
This last couple of weeks I have found myself walking in the steps of Bélády as I built software to examine the different performance characteristics of potential page replacement policies in a network-on-chip computer.
I have about 220GB of XML data recording the memory accesses of an 18-threaded video processing application – and using that data I can test computer system performance under various different policies.
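The core of that kind of trace-driven test is small, whatever the policy. Here is a minimal sketch using LRU – the trace is an invented list of page numbers rather than my XML data, and the real software obviously has to parse the trace and model the threads, but the fault-counting loop is essentially this:

```python
from collections import OrderedDict

def lru_faults(refs, frames):
    """Replay a trace of page references through an LRU cache of
    `frames` slots and count the page faults."""
    cache = OrderedDict()  # keys ordered oldest -> most recently used
    faults = 0
    for page in refs:
        if page in cache:
            cache.move_to_end(page)  # hit: mark as most recently used
        else:
            faults += 1
            if len(cache) == frames:
                cache.popitem(last=False)  # evict least recently used
            cache[page] = None
    return faults

print(lru_faults([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], 3))
```

Running different policies over the same trace and comparing fault counts is exactly how I discovered the bug mentioned at the top – the LRU numbers were too good to be true.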
This is going to be one of those blog posts where I attempt to clarify my thoughts by writing them down … which also means I might change my mind as I go along.
My problem is this: as part of my PhD, I have elected to explore the prospect of building a virtual memory system for Network-on-Chip (NoC) computers. NoCs have multiple processors – perhaps 16, or 64, or (in the near future) many more – all on one piece of silicon and all connected by a packet-switched network. The numbers are important, because having that many processors (certainly a number greater than 16) means that the, so far more typical, bus-based interconnects do not work, and that in turn means the different processors cannot easily be told which other processor is trying to access the same sliver of off-chip memory that they are after.
As a result, instead of increasing computing speed by seeing more processors crunch a problem in parallel, the danger is that computing efficiency falls off because either (A) each processor is confined to a very small patch of memory to ensure it does not interfere with other processors’ memory accesses, or (B) some very complicated and expensive (in time) logic is applied to ensure that each processor does know what accesses are being made, or (C) some combination of the above e.g., a private area which the processor can access freely and a shared area where some logic in software polices accesses.
None are perfect – (A) could limit processor numbers, (B) could be slow, while (C) could be slow and also not work so well, again limiting processor numbers. So (C) is the worst of both worlds? Well, (C) is also, sort of, my area of exploration!
Other researchers have already built a virtual memory system for another NoC, the Intel 48 core SCC. I don't want simply to repeat their work in any case (I doubt that would impress my examiners either), so here are, roughly, my thoughts:
There is a choice between a page-based VM and one that manages objects. As an experimental idea, managing objects quite appeals – but it also seems difficult to build a system that is efficient and manages objects without that sitting on top of some sort of page-based system.
What is the priority for a VMM? To provide a shared space for the operating system and its code (too easy?), or to deliver memory to applications? Should this then be a virtual machine layer underneath the operating system? (This is what the SCC folk did – RockyVisor).
Given that message passing seems a better fit for NoCs than shared memory in any case, how should message passing systems integrate with a VMM? Should we go down the route advocated by the builders of the Barrelfish operating system and absolutely rule out shared memory as a basis of processor co-operation – using the VMM as a means of allocating memory rather than anything else? (I think, yes, probably.)
But if the answer to the above is ‘yes’ are we sacrificing efficiency for anti-shared memory dogma? I worry we may be.
Two big thoughts strike me as a result of the literature review I have just completed for my PhD:
Linux is not the centre of the universe, in fact it is a bit of an intellectual backwater;
The UK may have played as big a role in the invention of the electronic computer as the US, but these days it is hardly even in the game in many areas of computing research.
On the first point I am in danger of sounding like Andy “Linux is obsolete” Tanenbaum – but it is certainly the case that Linux is far from the cutting edge in operating system research. If massively parallel systems do break through to the desktop it is difficult to imagine they will be running Linux (or any monolithic operating system).
In fact the first generation may do – because nobody has anything else right now – but Linux will be a kludge in that case.
Doing my MSc, which focused on a Linux-related problem, it seemed to me that we had come close to "the end of history" in operating system research – ie the issue now was fine-tuning the models we had settled on. The big issues had been dealt with in the late 60s, the 70s and the 80s.
Now I know different. Operating systems research is very vibrant and there are lots of new models competing for attention.
But along the way the UK dropped out of the picture. Read some papers on the development of virtual memory and it will not be long before you come across the seminal experiment conducted on EMAS – the Edinburgh University Multiple Access System – which was still in use when I was there in 1984. Now you will struggle to find any UK university – with the limited exceptions of Cambridge and York (for real-time) – making any impact (at least that’s how it seems to me).
It’s not that the Americans run the show either – German, Swiss and Italian universities are leading centres of research into systems software.
I am not sure how or why the UK slipped behind, but it feels like a mistake to me – especially as I think new hardware models are going to drive a lot of software innovation in the next decade (well, I would say that, wouldn’t I?)
By 2020 home computing devices in the UK will be consuming around 7 terawatt-hours (TWh) of electricity every year: it was just 1 TWh in 1990.
Consumer electronic devices, all of which will be running some software and many of which will have what can loosely be described as an operating system, will be eating a massive 22 TWh, almost double their 1990 level.
Essentially this rise of the computing machines more than matches the falls in electricity use that come from technological improvements in domestic lighting and refrigeration over this time.
Operating systems research has been seriously neglected in our universities in recent years (and I do not just mean in the UK): maybe that ought to be reconsidered and urgently.
How systems order their storage accesses, how they handle virtual memory, sequence their access to the network, and many more questions besides have a big impact on computing power use. And, at 29 TWh, just a 1% saving would lighten domestic bills by about £30 million. And that excludes the positive impact on greenhouse gas emissions.
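The arithmetic behind that figure is worth showing – the tariff below is my own round-number assumption of roughly 10p per kWh, not a quoted rate:

```python
total_twh = 29           # projected home computing + consumer electronics use
saving_fraction = 0.01   # a 1% efficiency gain
price_per_kwh = 0.10     # assumed domestic tariff, in pounds (my assumption)

saved_kwh = total_twh * 1e9 * saving_fraction  # 1 TWh = 1e9 kWh
print(f"~£{saved_kwh * price_per_kwh / 1e6:.0f} million a year")
```

Which comes out at about £29 million a year – hence "about £30 million" above.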
(There is a Guardian article about this but I cannot see it on their website yet – when I can I'll link to it.)
The graph above shows the number of (4k) pages required by xterm when running on a 32 bit Linux system – the x axis measures the lifetime of the process (as determined by instructions executed), the y axis the number of pages.
So, I fixed the problem I was having with ranges by reducing the number of times the range was being accessed by a factor of something like 1,000, by pushing the calculation to much later on in the program (ie instead of checking the range on every calculation, check it only at the end when some other filtering has already happened).
And so I can now show the world a graph that illustrates how xterm's initialisation does indeed show different phases of locality – ie that within a small enough time frame memory accesses show spatial locality, but over a bigger time span that locality will shift. This is important because it indicates that a good virtual memory manager would be able to flush the pages of the old locality out of the system:
Plainly I need to do a bit more work on the graphing software – but the yellow lines are the axes and the green dots indicate where memory is being accessed. Time (or rather its analogue in terms of instructions executed) runs from left to right and (process) memory address gets greater as you move from the bottom up.
In fact this graph is only looking at the bottom 10% of the range of memory accessed by the application (and we are ignoring the instructions in favour of explicit heap loads, stores and modifications) – but the problem is that for most of that range nothing is accessed. Again, I need to adjust the Groovy script to handle this better.
None of this is new – this experiment was first conducted in the early 1970s – though I am not aware of anyone having done it for Linux – but I suspect they have.
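The locality shift the graph shows can also be detected without plotting anything: split the trace into fixed windows and measure how much the set of pages touched in one window overlaps with the next. A drop in overlap marks a phase change. A sketch with a synthetic trace standing in for the real data:

```python
def page_sets(trace, window, page_size=4096):
    """Split a trace of addresses into fixed windows and return the
    set of distinct pages touched in each window."""
    return [
        {addr // page_size for addr in trace[i:i + window]}
        for i in range(0, len(trace), window)
    ]

# Synthetic trace: 100 accesses around one region, then a jump to another,
# mimicking the shift between phases of locality described above.
trace = [a % 5 * 4096 for a in range(100)] + \
        [1_000_000 + a % 5 * 4096 for a in range(100)]

sets = page_sets(trace, 50)
overlaps = [len(a & b) for a, b in zip(sets, sets[1:])]
print(overlaps)  # a drop to zero marks the locality shift
```

On the real xterm trace the same calculation would show the working set migrating during initialisation – which is exactly what a replacement policy can exploit.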
Art was the only subject in which I failed a school exam – getting a low 30-something in 1980's end-of-year tests. Not that I cared much. But as the years have gone by I have on more than one occasion wished I was rather better at it. Even now, trying to write a scientific paper – and for me, at least back then, art was always the opposite pole to science – my art skills are rather letting me down.
This graphic shows you why – now I have reduced it in size it looks passable (as a hugely simplified explanation of paging and virtual memory), but the arrows are still very ragged. Still, this is better than I would have managed even a year ago.