Real-time devices

A very first draft…


Our proposal is to meet head on the need to use a memory space much greater than the locally available low-latency storage. The problems faced by programmers of today’s many- and multi-core devices are, in a sense, far from new: in the 1960s an earlier generation of programmers, experimenting with virtual memory and time-sharing computers and running programs generated by the first generation of high-level languages, was struck down by the phenomenon of thrashing, where computing time was lost to excessive memory management [23]. The solutions we propose draw on the lessons learned at that time, but significantly adapt the methods to account for the realities of multicore devices with very limited local storage.

We propose to enable virtual memory [24] for such systems, using a novel form of paging which, instead of transferring whole pages across the memory hierarchy, concentrates on a minimal transfer of partial pages [50]. Hard faults still impose a significant penalty in lost cycles, but the cost of paging in and out of memory is amortised across multiple memory reads and writes. This avoids unnecessary transfer and limits congestion in the memory interconnect, which we find to be a significant determinant of program performance.
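
As a rough illustration of the idea – not the mechanism evaluated later; the sub-block size, the bitmap layout and the function names are assumptions made for this sketch – a partial-page scheme might track which fragments of a 4KB page are locally resident and copy across only the fragment a faulting access actually needs:

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE       4096u
#define BLOCK_SIZE      128u                     /* transfer unit within a page */
#define BLOCKS_PER_PAGE (PAGE_SIZE / BLOCK_SIZE) /* 32 blocks, one bit each     */

typedef struct {
    uint8_t        data[PAGE_SIZE];   /* local low-latency copy of the page     */
    uint32_t       valid;             /* bitmap: one bit per 128-byte block     */
    const uint8_t *backing;           /* the page image held in global memory   */
} partial_page;

/* Read one byte, fetching only the 128-byte block that contains it. */
static uint8_t partial_read(partial_page *p, uint32_t offset)
{
    uint32_t block = offset / BLOCK_SIZE;
    if (!(p->valid & (1u << block))) {
        /* hard fault on this block only: transfer 128 bytes, not 4,096 */
        memcpy(p->data + block * BLOCK_SIZE,
               p->backing + block * BLOCK_SIZE,
               BLOCK_SIZE);
        p->valid |= 1u << block;
    }
    return p->data[offset];
}

The point is that the interconnect carries a 128-byte burst rather than a full 4,096-byte page, so the cost of a fault is paid only for data that is actually touched.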

One of the lessons of the 1960s was that maintaining a program’s working set in low-latency memory was the most effective way of limiting paging and thrashing [22]. Much research at that time and into the 1970s concentrated on effective ways to maintain the working set of pages in fast memory, and innovation continued into the 1990s and beyond, for instance with the 2Q algorithm in Linux [38]. Here, though, we establish that better performance in embedded and real-time manycore systems is unlikely to come from a better page replacement algorithm but from a reorientation towards better management of the internals of the pages.

General-purpose page replacement algorithms are generally robust [49], and with the decreasing cost of memory much recent research has concentrated on large page sizes, not least to minimise TLB misses [9]. We return instead to earlier research findings which emphasise the efficiency of small pages on small systems, and we show that combining these with a FIFO page replacement policy may, perhaps counter-intuitively, deliver more time-predictable or even faster performance than attempts to use some version of a least-recently-used policy. We also show, though, that the choice of page replacement algorithm is likely to be strongly related to the variation in working set size over a program’s lifecycle.
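
For concreteness, here is a minimal sketch of the FIFO policy referred to above (the frame count and the data structures are illustrative assumptions, not the evaluated configuration):

#include <stdint.h>

#define NUM_FRAMES 16                    /* locally resident page frames  */

static uint32_t frames[NUM_FRAMES];      /* virtual page number per frame */
static int      used;                    /* frames occupied so far        */
static int      next_victim;             /* head of the FIFO queue        */

/* Return the frame holding 'page', loading and evicting in arrival order. */
static int fifo_place(uint32_t page)
{
    for (int i = 0; i < used; i++)
        if (frames[i] == page)
            return i;                    /* hit: FIFO order is untouched    */

    int frame;
    if (used < NUM_FRAMES) {
        frame = used++;                  /* a free frame is still available */
    } else {
        frame = next_victim;             /* evict the longest-resident page */
        next_victim = (next_victim + 1) % NUM_FRAMES;
    }
    frames[frame] = page;                /* the page would be loaded here   */
    return frame;
}

Its attraction for real-time work is that the eviction decision is a constant-time pointer bump and hits do not perturb the queue, so residency times are easier to reason about than under an LRU-style policy.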

Concerns about power have always been important for embedded devices, but changing characteristics of hardware mean new issues such as “dark silicon” [28] will be important as we move deeper into the coming manycore era. We demonstrate here that constraints on power may be, to some extent, mitigated by a loosening of bottlenecks elsewhere in a real-time system, particularly in the memory interconnect: in a system where power is no constraint, congestion here dominates performance; where power concerns limit processor activity, they may also lessen that congestion.

Further, we present results showing that some frequency scaling in manycore systems may not limit performance if it contributes to lessening congestion in the memory interconnect: results that suggest the dark silicon issue may not be as big a barrier to performance in real-world applications as might be expected.


Simulating global and local memory on OVP


The Simons’ BASIC start-up screen (Photo credit: Wikipedia)

Had a good meeting with my PhD supervisor today: he was in London – I didn’t have to make a flying visit to York.

So the next step with my OVPsim Microblaze code is to model global and local memory – by default OVPsim treats all memory as local, mapping the full 32-bit address space and sparsely filling it as needed. I have imposed an additional constraint of only mapping a few pages, but these are still all “local”: so while the code takes time to execute what is, in effect, a FIFO page replacement algorithm, there is no time cost for page loads.

The way round this seems to be to build my global memory as a memory-mapped peripheral device – I can then impose time delays on reads and writes.
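
Something like the following is what I have in mind – a sketch of the idea in plain C only, not the OVPsim peripheral API (the callback names, the latency figures and the tick counter are all assumptions):

#include <stdint.h>
#include <string.h>

#define GLOBAL_MEM_SIZE   (16u * 1024u * 1024u)  /* size of the "global" region   */
#define GLOBAL_READ_COST  100u                   /* assumed ticks per global read  */
#define GLOBAL_WRITE_COST 100u                   /* assumed ticks per global write */

static uint8_t  global_mem[GLOBAL_MEM_SIZE];
static uint64_t stall_ticks;                     /* time charged to global traffic */

/* Hypothetical read callback: charge the latency, then return the word. */
static uint32_t global_read(uint32_t addr)
{
    uint32_t off = (addr & (GLOBAL_MEM_SIZE - 1u)) & ~3u; /* word-aligned, in range */
    uint32_t word;
    stall_ticks += GLOBAL_READ_COST;
    memcpy(&word, &global_mem[off], sizeof word);
    return word;
}

/* Hypothetical write callback: charge the latency, then store the word. */
static void global_write(uint32_t addr, uint32_t value)
{
    uint32_t off = (addr & (GLOBAL_MEM_SIZE - 1u)) & ~3u;
    stall_ticks += GLOBAL_WRITE_COST;
    memcpy(&global_mem[off], &value, sizeof value);
}

The real version would hang callbacks like these off the memory-mapped device, so that every access to the “global” region pays a latency the local memory does not.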

But I suppose I am writing this blog instead of writing that code…

Why I owe @slugonamission a (or several) large drink(s)


gdb icon, created for the Open Icon Library (Photo credit: Wikipedia)

I have been struggling with a problem with my Microblaze simulation on OVPsim all week.

I have been trying to make a start on implementing a simple demand paging system and to trigger the hardware exception that the Microblaze CPU manual says should happen when you try to read from a page that is mapped but not in the TLB (i.e. a “minor” page fault) or from one that has not been mapped at all (a “major” page fault).

My problem was that, despite having (I thought) specified all the correct parameters for the Microblaze, and knowing that my virtual memory mappings worked well, it simply was not happening for me. Instead of the expected exception handling, the simulation reported an error and exited.

But Jamie Garside from the University of York’s computer science department saved my life by (correctly) pointing out that what I also needed to do was turn on ICM_ATTR_SIMEX. Otherwise the simulation will always exit on an exception (or halt in GDB).

I can see why that might make sense – at least the halting in the debugger bit: if an exception is about to be fired you can halt the machine while it is still executing in virtual, as opposed to real, mode and see what has caused it.

It was also a reminder to RTFM – in this case not just the Microblaze manual, but also the OVPsim manual. I cannot rely on Jamie doing that for me every time.

So to fix my problem I ORed in the missing attribute:

#define SIM_ATTRS (ICM_ATTR_DEFAULT|ICM_ATTR_SIMEX) /* simulate exceptions instead of exiting */

Latest Microblaze puzzle


Having successfully written a couple of simple virtual memory mappings I now want to write the exception handling code that will allow me to swap mappings in and out – so I did this:

	ori	r20, r0, 0		/* r20 = 0: the value to store */
	ori	r12, r0, 0x1000000	/* r12 = 0x1000000: an address expected to miss in the TLB */
	/* next line should break it */
	sw	r20, r12, r0		/* store word r20 at address r12 + r0 */

I expected that the marked line would raise a hardware exception as a TLB miss and I’d be taken to the appropriate vector etc. Instead it just tells me I’ve had a SIGSEGV.

Hmmm.

Of course turning on the MMU breaks your code!


Schematic illustration of how an MMU works (Photo credit: Wikipedia)

Tonight I finally managed to get the code and the sequencing right to not just boot the Microblaze simulation, but to turn on the MMU.

But I was puzzled as to why, as soon as I had done that, the execution broke with a segmentation fault and appeared to be executing code I had not written. And then it dawned on me – all my addresses had immediately become virtual addresses.

Two steps forward, one step back…

Scale of the task


Messages from the Linux kernel 3.0.0 booting, from Debian sid i386 (Photo credit: Wikipedia)

I have had a frustrating few days trying to get to grips with two new pieces of technology: the OVP simulator and the Microblaze processor.

Finally I think the fog is beginning to clear. But that also reveals just what a task I have in front of me: namely to write some kernel code that will boot the Microblaze, establish a virtual memory system and then hand over control to user code, which will have to trap memory faults and pass control back to the privileged kernel.

It is not quite writing an operating system, even a simple one, but it is actually undertaking to write what would be at the core of an OS.

Of course, there are lots of places to borrow ideas from – not least the Linux kernel – but it’s a bit daunting, if also reasonably exciting.

There are precious few books around to help – I shelled out to buy this (having borrowed it from the York Uni library and found it to be an excellent general introduction to the area), but it’s not a guide to OVP, never mind to the Microblaze. If anyone does know of a book that covers either I’d be very grateful (maybe it’s my age, but electronic books are very much second best to me – you just cannot flick from page to page looking for that key element you read the other day, and so on).

After paging?


Diagram of the relationship between the virtual and physical address spaces (Photo credit: Wikipedia)

Paging and virtual memory are at the heart of just about any computing device more complex than a DVD player that we use every day.

Paging is the memory management system based on the idea that we can divide the real memory of our computer into a sequence of smallish (typically 4,096-byte) “page frames” and then load the bits of data and program in and out of those frames (in “pages”) as we need them.

So, you can have pages from various different running programs in the page frames at any given time and then use a “virtual memory” system to map the pages placed in arbitrary frames back to the memory addresses the program thinks they are resident at.
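
In its simplest form the mapping is just a table indexed by virtual page number. Here is a minimal sketch of the lookup – a flat, single-level table, which is only one of several possible layouts and not, for instance, how the Microblaze TLB is organised:

#include <stdint.h>

#define PAGE_SHIFT 12u                            /* 4,096-byte pages          */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)
#define NUM_PAGES  (1u << (32u - PAGE_SHIFT))     /* 2^20 virtual pages        */

typedef struct {
    uint32_t frame;                               /* physical frame number     */
    uint8_t  present;                             /* is the page in a frame?   */
} pte;

static pte page_table[NUM_PAGES];

/* Translate a virtual address, or return -1 to signal a page fault. */
static int64_t translate(uint32_t vaddr)
{
    pte *entry = &page_table[vaddr >> PAGE_SHIFT];
    if (!entry->present)
        return -1;                                /* fault: load the page first */
    return (int64_t)(((uint64_t)entry->frame << PAGE_SHIFT) | (vaddr & PAGE_MASK));
}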

It is not the only system we could use – “segmentation”, which involves moving large chunks of memory about (as opposed to small pages), is one approach, while “overlays” – using part of the memory space as a sort of scratchpad working area – is another. More recently, with bigger “traditional” computers, very large pages have been used as a way of making, at least in theory, more efficient use of memory now measured in billions (as opposed to a few tens) of bytes.

But paging is easily the most widely used approach and has been integral to the development of “multitasking” and similar shared resources approaches to computing – because paging allows us to just have the useful bits of a program and its data in memory we can have many more programs “running” at a given time.

But my PhD research is pointing me towards some of the weaknesses of the paging approach.

At the heart of the case for paging is the idea of “locality” in a computer’s use of resources: if you use one memory address at one instant there is a high probability you will use a nearby address very soon: think of any sort of sequential document or record and you can see why that idea is grounded in very many use cases of computing devices.

Locality means that it ought to make sense to read in memory in blocks and not just one little drop at a time.

But this principle may be in opposition to efficient use of memory when competition for space is fierce: such as for the limited local memory resources we have on a Network-on-Chip computer.

Right now I am collecting data to measure the efficiency of 4k pages on such (simulated) devices. With 16 simulated cores trying to handle up to 18 threads of execution, competition for pages is intense, and the evidence suggests that pages are resident, in some cases at least, for many fewer “ticks” than it takes to load them from the next lowest level in the memory hierarchy. On top of that, many pages show that the principle of locality can be quite weak: pages of code are, in general, quite likely to demonstrate high locality (especially in loops), but pages of read-write data may not.
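
The measurement itself is conceptually simple. Here is a sketch of the kind of pass involved, assuming the trace has already been boiled down to (tick, page, event) records and picking an arbitrary figure for the load cost – both assumptions, since the real input is the raw XML:

#include <stdint.h>
#include <stdio.h>

#define LOAD_COST_TICKS 100u     /* assumed cost of pulling in one 4KB page   */

typedef struct {
    uint64_t tick;               /* simulation time of the event              */
    uint32_t page;               /* virtual page number (assumed below 2^20)  */
    char     event;              /* 'L' = loaded into a frame, 'E' = evicted  */
} trace_record;

/* Count residencies shorter than the cost of the load that created them. */
static void count_short_residencies(const trace_record *recs, size_t n)
{
    static uint64_t load_tick[1u << 20];   /* indexed by page number; a hash
                                              map in anything non-trivial     */
    uint64_t short_lived = 0, total = 0;

    for (size_t i = 0; i < n; i++) {
        if (recs[i].event == 'L') {
            load_tick[recs[i].page] = recs[i].tick;
        } else {                           /* 'E': the page has been evicted  */
            uint64_t resident = recs[i].tick - load_tick[recs[i].page];
            total++;
            if (resident < LOAD_COST_TICKS)
                short_lived++;
        }
    }
    printf("%llu of %llu residencies were shorter than the load cost\n",
           (unsigned long long)short_lived, (unsigned long long)total);
}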

I don’t have all the data to hand – essentially I am transforming one 200GB XML file into another XML file, which will likely be around the same size, and that takes time, even on quite a high-spec computer (especially when you have to share resources with other researchers). But I expect some interesting results.


In the steps of László Bélády


Update: I have truncated this article for now (20 December) as there was an error in my LRU software that made LRU look like a much better performer than it really was. I’ll update this with the correct data shortly….

 

In 1966 László Bélády published “A study of replacement algorithms for virtual storage computers”, one of the truly epoch-making papers of operating system science – the first comprehensive examination of page replacement strategies for virtual memory computers.

These days all but the simplest embedded computing devices will use some sort of virtual memory system, because it allows them to (relatively) seamlessly load bits of computer programs in and out of memory as needed – the programs see a faked – virtual – address, and so the chunks can be loaded in and out of whatever piece of memory is available without worrying about having to get them into exactly the same place every time.

But in 1966 virtual memory was a new and essentially experimental technology, and so Bélády’s examination of the different strategies for deciding which chunk (page) of memory was kept or replaced when new pages needed to be loaded is the foundation stone of all the approaches that followed.

This last couple of weeks I have found myself walking in the steps of Bélády as I built software to examine the different performance characteristics of potential page replacement policies in a network-on-chip computer.

I have about 220GB of XML data which represents a record of the memory accesses of an 18-threaded video processing application – and using that data I can test system performance under various different policies.

Virtual memory and a new operating system


Block diagrams of a single AsAP processor and the 6×6 AsAP 1.0 chip (Photo credit: Wikipedia)

This is going to be one of those blog posts where I attempt to clarify my thoughts by writing them down … which also means I might change my mind as I go along.

My problem is this: I have elected, as part of my PhD, to explore the prospect of building a virtual memory system for Network-on-Chip (NoC) computers. NoCs have multiple processors – perhaps 16, or 64, or (in the near future) many more – all on one piece of silicon and all connected by a packet-switched network. The numbers are important, because having that many processors (certainly a number greater than 16) means that the, so far more typical, bus-based interconnects do not work, and it also means that the different processors cannot easily be told which other processor is trying to access the same sliver of off-chip memory they are after.

As a result, instead of increasing computing speed by having more processors crunch a problem in parallel, the danger is that computing efficiency falls off because either (A) each processor is confined to a very small patch of memory to ensure it does not interfere with other processors’ memory accesses, or (B) some very complicated and expensive (in time) logic is applied to ensure that each processor does know what accesses are being made, or (C) some combination of the above, e.g. a private area which the processor can access freely and a shared area where some logic in software polices accesses.

None is perfect – (A) could limit processor numbers, (B) could be slow, while (C) could be slow and also not work so well, again limiting processor numbers. So (C) is the worst of both worlds? Well, (C) is also, sort of, my area of exploration!

Other researchers have already built a virtual memory system for one NoC, the Intel 48-core SCC. In any case I don’t want simply to repeat their work (I doubt that would impress my examiners either), so here are, roughly, my thoughts:

  • There is a choice between a page-based VM and one that manages objects. As an experimental idea the choice of managing objects quite appeals – but it also seems difficult to have a system that is efficient and manages objects without that sitting on top of some sort of page-based system.
  • What is the priority for a VMM? To provide a shared space for the operating system and its code (too easy?), or to deliver memory to applications? Should this then be a virtual machine layer underneath the operating system? (This is what the SCC folk did – RockyVisor).
  • Given that message passing seems a better fit for NoCs than shared memory in any case – how should message-passing systems integrate with a VMM? Should we go down the route advocated by the builders of the Barrelfish operating system and absolutely rule out shared memory as a basis of inter-processor co-operation – just using the VMM as a means of allocating memory rather than anything else? (I think, yes, probably.)
  • But if the answer to the above is ‘yes’ are we sacrificing efficiency for anti-shared memory dogma? I worry we may be.

Any thoughts would be very welcome.

(I found a good – and reasonably priced – book that describes a working paging system along the way – What Makes It Page?: The Windows 7 (x64) Virtual Memory Manager).

Computer science in the UK: in the wrong direction?


Servers designed for Linux (Photo credit: Wikipedia)

Two big thoughts strike me as a result of the literature review I have just completed for my PhD:

  • Linux is not the centre of the universe, in fact it is a bit of an intellectual backwater;
  • The UK may have played as big a role in the invention of the electronic computer as the US, but these days it is hardly even in the game in many areas of computing research.

On the first point I am in danger of sounding like Andy “Linux is obsolete” Tanenbaum – but it is certainly the case that Linux is far from the cutting edge in operating system research. If massively parallel systems do break through to the desktop it is difficult to imagine they will be running Linux (or any monolithic operating system).

In fact the first generation may do – because nobody has anything else right now – but Linux will be a kludge in that case.

When I did my MSc, which focused on a Linux-related problem, it seemed to me that we had come close to “the end of history” in operating system research – i.e. the issue now was fine-tuning the models we had settled on. The big issues had been dealt with in the late 60s, the 70s and the 80s.

Now I know different. Operating systems research is very vibrant and there are lots of new models competing for attention.

But along the way the UK dropped out of the picture. Read some papers on the development of virtual memory and it will not be long before you come across the seminal experiment conducted on EMAS – the Edinburgh Multi-Access System – which was still in use when I was there in 1984. Now you will struggle to find any UK university – with the limited exceptions of Cambridge and York (for real-time) – making any impact (at least that’s how it seems to me).

It’s not that the Americans run the show either – German, Swiss and Italian universities are leading centres of research into systems software.

I am not sure how or why the UK slipped behind, but it feels like a mistake to me – especially as I think new hardware models are going to drive a lot of software innovation in the next decade (well, I would say that, wouldn’t I?)