I have had a frustrating few days trying to get to grips with two new pieces of the technology: the OVP simulator and the Microblaze processor.
Finally I think the fog is beginning to clear. But that also reveals just what a task I have in front of me: namely to write some kernel code that will boot the Microblaze, establish a virtual memory system and then hand over control to user code, which will have to trap memory faults and pass control back to the privileged kernel.
It is not quite writing an operating system, even a simple one, but it is actually undertaking to write what would be at the core of an OS.
Of course, there are lots of places to borrow ideas from – not least the Linux kernel – but it’s a bit daunting, if also reasonably exciting.
Preciously little books about to help – I shelled out to buy this (having borrowed it from the York Uni library and found it to be an excellent general introduction to the area) – but it’s not a guide to OVP never mind to the Microblaze. If anyone does know of a book that does either I’d be very grateful (maybe it’s my age but electronic books are very much second best to me – you just cannot flick from page to page looking for that key element you read the other day and so on.)
Paging and virtual memory is at the heart of just about any computing device – more complex than a DVD player – we use everyday.
Paging is the memory management system based on the idea that we can divide the real memory of our computer into a sequence of smallish (typically 4,096 bytes) of “page frames” and then load the bits of data and program in and out of those frames (in “pages”) as we need them.
So, you can have pages from various different running programs in the page frames at any given time and then use a “virtual memory” system to map the pages placed in an arbitrary frame to the memory address the program thinks the page should be resident in.
It is not the only system we could use – “segments”, which involve moving large chunks (as opposed to small pages) of memory about is one approach, while “overlays” – using part of the memory space as sort of scratchpad working area – is another. More recently, with bigger “traditional” computers very large pages have been used as a way of making, at least in theory, more efficient use of memory now measured in billions (as opposed to a few tens) of bytes.
But paging is easily the most widely used approach and has been integral to the development of “multitasking” and similar shared resources approaches to computing – because paging allows us to just have the useful bits of a program and its data in memory we can have many more programs “running” at a given time.
But my PhD research is pointing me towards some of the weaknesses of the paging approach.
At the heart of the case for paging is the idea of “locality” in a computer’s use of resources: if you use one memory address at one instant there is a high probability you will use a nearby address very soon: think of any sort of sequential document or record and you can see why that idea is grounded in very many use cases of computing devices.
Locality means that it ought to make sense to read in memory in blocks and not just one little drop at a time.
But this principle may be in opposition to efficient usage of memory when competition for space in fierce: such as for the limited local memory resources we have on a Network-on-Chip computer.
Right now I am collecting data to measure the efficiency of 4k pages on such (simulated) devices. With 16 simulated cores trying to handle up to 18 threads of execution competition for pages is intense and the evidence suggests that they are resident, in some cases at least, for many fewer “ticks” than it takes to load them from the next lowest level in the memory hierarchy. On top of that many pages show that the principle of locality can be quite a weak one – pages of code are, in general, quite likely to demonstrate high locality (especially in loops) but pages of read-write memory may not.
I don’t have all the data to hand – essentially I am transforming one 200GB XML file into another XML file which will likely be around the same size and that takes time, even on quite a high spec computer (especially when you have to share resources with other researchers). But I expect some interesting results.
Update: I have truncated this article for now (20 December) as there was an error in my LRU software that made LRU look like a much better performer than it really was. I’ll update this with the correct data shortly….
In 1966 László Bélády published “A study of replacement algorithms for virtual storage computers”, one of the truly epoch making papers for operating system science – the first comprehensive examination of page replacement strategies for virtual memory computers.
These days all but the simplest embedded computing devices will use some sort of virtual memory system because it allows computing devices to (relatively) seamlessly load bits of computer programs in and out of memory as needed – the programs see a faked – virtual – address and so the chunks can be loaded in an out of whatever piece of memory is available without worrying about having to get the chunks into exactly the same place every time.
But in 1966 virtual memory was a new and essentially experimental technology and so Belady’s examination of the different strategies for deciding which chunk (page) of memory was kept or replaced when new pages were required to be loaded is the foundation stone of all the approaches that followed.
This last couple of weeks I have found myself walking in the steps of Bélády as I built software to examine the different performance characteristics of potential page replacement policies in a network-on-chip computer.
I have about 220GB of XML data which represents a record of the memory accesses of an 18 threaded video processing application – and using that data I can test computer system performance using various different policies.
This is going to be one of those blog posts where I attempt to clarify my thoughts by writing them down … which also means I might change my mind as I go along.
My problem is this: I have elected to, as part of my PhD, explore the prospect of building a virtual memory system for Network-on-Chip (NoC) computers. NoCs have multiple processors – perhaps 16, or 64 or (in the near future) many more, all on one piece of silicon and all connected by a packet-switched network. The numbers are important – because having that many processors (certainly a number greater than 16) means that the, so far more typical, bus-based interconnects do not work and that also means that the different processors cannot easily be told which other processor is trying to access the same slither of off-chip memory that they are after.
As a result, instead of increasing computing speed by seeing more processors crunch a problem in parallel, the danger is that computing efficiency falls off because either (A) each processor is confined to a very small patch of memory to ensure it does not interfere with other processors’ memory accesses, or (B) some very complicated and expensive (in time) logic is applied to ensure that each processor does know what accesses are being made, or (C) some combination of the above e.g., a private area which the processor can access freely and a shared area where some logic in software polices accesses.
None are perfect – (A) could limit processor numbers, (B) could be slow while (C) could be slow and also not work so well, so limiting processor numbers. So (C) is the worst of both worlds? Well, (C) is also, sort-of, my area of exploration!
Other researchers have already built a virtual memory system for another NoC, the Intel 48 core SCC. I don’t want to just repeat their work (I doubt that would impress my examiners either) in any case, so here are, roughly my thoughts:
There is a choice between a page-based VM and one that manages objects. As an experimental idea the choice of managing objects quite appeals – but it also seems difficult to have a system that was efficient and managed objects without that being on top of some sort of page-based system.
What is the priority for a VMM? To provide a shared space for the operating system and its code (too easy?), or to deliver memory to applications? Should this then be a virtual machine layer underneath the operating system? (This is what the SCC folk did – RockyVisor).
Given that message passing seems a better fit for NoCs than shared memory in any case – how should message passing systems integrate with a VMM? Should we go down the route advocated by the builders of the Barrelfish operating system and absolutely rule out shared memory as a basis of processor interco-operation – just using the VMM as a means of allocating memory rather than anything else? (I think, yes, probably)
But if the answer to the above is ‘yes’ are we sacrificing efficiency for anti-shared memory dogma? I worry we may be.
Two big thoughts strike me as a result of the literature review I have just completed for my PhD:
Linux is not the centre of the universe, in fact it is a bit of an intellectual backwater;
The UK may have played as big a role in the invention of the electronic computer as the US, but these days it is hardly even in the game in many areas of computing research.
On the first point I am in danger of sounding like Andy “Linux is obsolete” Tanenbaum – but it is certainly the case that Linux is far from the cutting edge in operating system research. If massively parallel systems do break through to the desktop it is difficult to imagine they will be running Linux (or any monolithic operating system).
In fact the first generation may do – because nobody has anything else right now – but Linux will be a kludge in that case.
Doing my MSc which did focus on a Linux related problem, it seemed to me that we had come close to “the end of history” in operating system research – ie the issue now was fine tuning the models we had settled on. The big issues had been dealt with in the late 60s, the 70s and the 80s.
Now I know different. Operating systems research is very vibrant and there are lots of new models competing for attention.
But along the way the UK dropped out of the picture. Read some papers on the development of virtual memory and it will not be long before you come across the seminal experiment conducted on EMAS – the Edinburgh University Multiple Access System – which was still in use when I was there in 1984. Now you will struggle to find any UK university – with the limited exceptions of Cambridge and York (for real-time) – making any impact (at least that’s how it seems to me).
It’s not that the Americans run the show either – German, Swiss and Italian universities are leading centres of research into systems software.
I am not sure how or why the UK slipped behind, but it feels like a mistake to me – especially as I think new hardware models are going to drive a lot of software innovation in the next decade (well, I would say that, wouldn’t I?)
By 2020 home computing devices in the UK will be consuming around 7 Terra Watt Hours (TWh) of electricity every year: it was just 1 TWh in 1990.
Consumer electronic devices, all of which will be running some software and many of which will have what can loosely be described as an operating system, will be eating a massive 22 TWh, almost double where they were in 1990.
Essentially this rise of the computing machines more than matches the falls in electricity use that come from technological improvements in domestic lighting and refrigeration over this time.
Operating systems research has been seriously neglected in our universities in recent years (and I do not just mean in the UK): maybe that ought to be reconsidered and urgently.
How systems order their storage accesses, how they handle virtual memory, sequence their access to the network, and many more questions besides have a big impact on computing power use. And, at 29 TWh, just a 1% saving would lighten domestic bills by about £30 million. And that excludes the positive impact on greenhouse gas emissions.
(There is a Guadian article about this but I cannot see it on their website yet – when I can I’ll link to it.)
The graph above shows the number of (4k) pages required by xterm when running on a 32 bit Linux system – the x axis measures the lifetime of the process (as determined by instructions executed), the y axis the number of pages.