Updating the five minute and the five byte rules

It’s fair to say I made quite a lot of mistakes in this article – so don’t read what follows (though the comments are of interest) – but try this fixed version.

(So I had quite a lot of errors in this blog. I hope I’ve fixed them now and apologies for the mistakes.)

This paper – The 5 Minute Rule for Trading Memory for Disc Accesses and the 5 Byte Rule for Trading Memory for CPU Time – looked at the economics of memory and disk space on big iron database systems in the early/mid 1980s and concluded that the financial trade-off between disk space (then costing about $20000 per 540MB or 3.7 cents or there abouts per KB) and (volatile) memory (about $5 per kilobyte) favoured having enough memory to keep data you need to access around once every five minutes in memory.

It then compared the cost of computing power (which was estimated to cost about $50,000 per MIPS) and memory – eg if you compressed data to save on memory space you will have to use additional computing power to access the data. Here the trade off is calculated to be about 5 bytes of memory per instruction per second.

The trade off between memory and disk

Now the paper explicitly ruled out applying these sort of comparisons to PCs – citing limited flexibility in system design options and different economics. But we won’t be so cautious.

So what are the trade offs today? Well let us consider a case with 2TB SSD disks. These cost about £500 (and probably about $500, we are approximating) and let’s say we are going with a RAID 5 arrangement – so actually we need 4 ‘disks per disk’ and the cost is then 0.0001 penny per kilobyte (about 4 orders of magnitude less than 35 years ago).

And memory – for simplicity we are going to say 128GB of DRAM costs us £1000 (an over-estimate but fine for this sort of calculation). That means memory costs about 0.001p per KB – so the cost of both media has fallen hugely.

To make the calculation we need to look at access times. We’ll assume that we can access (read) a 4KB page on our SSD in 100 microseconds, so can handle 10,000 such reads a second. Four KB of memory costs us 0.004p and our disk costs us 20p/a/s. So following the logic in the original paper – if we stored a page in memory it has cost us 0.004p but saved 20p in a second, so if we save a page every 5000 seconds we have cost 20p – the break even point.

Alternatively we can think of this as meaning our caching or virtual memory schemes should target those pages that are accessed every 5000 seconds – about an hour and a half! The problem with that is that it implies a memory system of roughly 16TB to be truly efficient.

The tradeoff between memory and computing power

As mentioned above, back in 1985, computing power was estimated to cost about $50,000 per Million Instructions Per Second (MIPS). These days single core designs are essentially obsolete and so it’s harder to put a price on MIPS – good parallel software will drive much better performance from a set of 3GHz cores than hoping a single core’s 5GHz burst speed will get you there. But, as we are making estimates here we will opt for 3000 MIPS costing you £500 and so a single MIPS costing £0.17, and a single instruction per second costing (again approximating) 0.00002 pence.

Toughly speaking we have a byte of memory costing 0.000001p – maybe 20 times cheaper than an instruction. This suggests that there isn’t much to be gained in data compression at all.

But there are some big assumptions here – again that we have all the memory we need and that there is no instruction cost in having lots of memory.

A very first draft…

Our proposal here is to meet this need to use a memory space much greater than the locally available low-latency storage head on. The problems faced by programmers of today’s many- and multi-core devices are, in a sense, far from new: in the 1960s an earlier generation of programmers, experimenting with virtual memory and time-sharing computers, and using programs generated by the first generation of high-level languages were struck down by the phenomenon of thrashing, where computing time was lost to excessive memory management [23]. The solutions we propose here draw on the lessons learned at that time, but significantly adapt the methods to account for the realities of using multicore devices with very limited local storage.

We propose to enable virtual memory [24] for such systems, using a novel form of paging which, instead of transferring whole pages across the memory hierarchy, concentrates on a minimal transfer of partial pages [50]. Here hard faults still impose a significant penalty of lost cycles but the cost of paging in and out of memory is amortised across multiple memory reads and writes. This avoids unnecessary transfer and limits congestion in the memory interconnect, which we find to be a significant determinant of program performance.

One of the lessons of the 1960s was that maintaining a program’s working set in low latency memory was the most effective way of limiting paging and thrashing [22]. Much research at that time and into the 1970s and later concentrated on effective ways to maintain the working set of pages in fast memory and innovation continued into the 1990s (and beyond), for instance with the 2Q algorithm in Linux [38]. Here, though, we establish that better performance in embedded and real-time manycore systems is unlikely to come from a better page replacement algorithm but from a reorientation towards better management of the internals of the pages.

General computing page replacement algorithms are generally robust [49] and with the decreasing cost of memory, much recent research has concentrated on using large page sizes, not least to minimise TLB misses [9], we return to earlier research findings which emphasise the efficiency of small pages on small systems and we show that combining this with a FIFO page replacement policy may, perhaps counter-intuitively, deliver more time predictable or even faster performance than attempts to use some version of a least-recently used policy, though we also show that choice of page replacement algorithm is likely to be strongly related to the variation in working set size over a program’s lifecycle.

Concerns about power have always been important for embedded devices, but changing characteristics of hardware mean new issues such as “dark silicon” [28] will be important as we move deeper into the coming manycore era. We demonstrate here that constraints in power may be, to some extent, mitigated by a loosening of bottlenecks elsewhere in a realtime system, particularly in the memory interconnect: in a system where power is no constraint then congestion here dominates performance, where power concerns limit processor activities they may also lessen congestion.

Further we demonstrate results that some frequency scaling in manycore systems may not limit performance if it contributes to lessening congestion in the memory interconnect: results that suggest that the dark silicon issue may not be as big a barrier to performance in real-world applications as might be expected.

Simulating global and local memory on OVP

The Simons' BASIC start-up screen
The Simons’ BASIC start-up screen (Photo credit: Wikipedia)

Had a good meeting with my PhD supervisor today: he was in London – I didn’t have to make a flying visit to York.

So the next steps with my OVPsim Microblaze code is to model global and local memory – by default OVPsim treats all memory as local, mapping the full 32-bit address space and sparsely filling that as needed. I have imposed an additional constraint of only mapping a few pages, but these are still all “local”: so while the code takes time to execute, what is in effect, a FIFO page replacement algorithm, there is no time for page loads.

The way round this seems to be to build my global memory as a memory-mapped peripheral device – I can then impose time delays on reads and writes.

But I suppose I am writing this blog instead of writing that code…

Why I owe @slugonamission a (or several) large drink(s)

gdb icon, created for the Open Icon Library
gdb icon, created for the Open Icon Library (Photo credit: Wikipedia)

I have been struggling with a problem with my Microblaze simulation on OVPsim all week.

I have been trying to make a start on implementing a simple demand paging system and to trigger the hardware exception that the Microblaze CPU manual says should happen when you either try to read from a mapped page that is not in the TLB (ie, a “minor” page fault) or has not been mapped at all (a “major” page fault).

My problem was that – despite having, I thought, specified all the correct parameters for the Microblaze and knowing that I had virtual memory mappings work well, it simply was not happening for me. Instead of the expected exception handling, the simulation reported an error and exited.

But Jamie Garside from the University of York’s computer science department saved my life by (correctly) pointing out that what I also needed to do was turn on ICM_ATTR_SIMEX. Otherwise the simulation will always exit on an exception (or halt in GDB).

I can see why that might make sense – at least the halting in the debugger bit: if an exception is about to be fired you can halt the machine while it is still executing in virtual, as opposed to real, mode and see what has caused it.

It was also a reminder to RTFM – in this case not just the Microblaze manual, but also the OVPsim manual. I cannot rely on Jamie doing that for me everytime.

So to fix my problem I ORed in the missing attribute:


Latest Microblaze puzzle

Having successfully written a couple of simple virtual memory mappings I now want to write the exception handling code that will allow me to swap mappings in and out – so I did this:

	ori	r20, r0, 0
	ori	r12, r0, 0x1000000
	/* next line should break it */
	sw	r20, r12, r0

I expected that the marked line would raise a hardware exception as a TLB miss and I’d be taken to the appropriate vector etc. Instead it just tells me I’ve had a SIGSEGV.


Of course turning on the MMU breaks your code!

Prinzipdarstellung der Arbeitsweise einer MMU
(Photo credit: Wikipedia)

Tonight I finally managed to get the code and the sequencing right to not just boot the Microblaze simulation, but to turn on the MMU.

But I was puzzled as to why, as soon as I had done that, the execution breaks with a segmentation fault and appeared to be executing code I had not written. And then it dawned on me – all my addresses had immediately become virtual addresses.

Two steps forward, one step back…

Scale of the task

English: Messages from the Linux kernel 3.0.0 ...
English: Messages from the Linux kernel 3.0.0 booting, from Debian sid i386. (Photo credit: Wikipedia)

I have had a frustrating few days trying to get to grips with two new pieces of the technology: the OVP simulator and the Microblaze processor.

Finally I think the fog is beginning to clear. But that also reveals just what a task I have in front of me: namely to write some kernel code that will boot the Microblaze, establish a virtual memory system and then hand over control to user code, which will have to trap memory faults and pass control back to the privileged kernel.

It is not quite writing an operating system, even a simple one, but it is actually undertaking to write what would be at the core of an OS.

Of course, there are lots of places to borrow ideas from – not least the Linux kernel – but it’s a bit daunting, if also reasonably exciting.

Preciously little books about to help – I shelled out to buy this (having borrowed it from the York Uni library and found it to be an excellent general introduction to the area) – but it’s not a guide to OVP never mind to the Microblaze. If anyone does know of a book that does either I’d be very grateful (maybe it’s my age but electronic books are very much second best to me – you just cannot flick from page to page looking for that key element you read the other day and so on.)

After paging?

Diagram of relationship between the virtual an...
Diagram of relationship between the virtual and physical address spaces. (Photo credit: Wikipedia)

Paging and virtual memory is at the heart of just about any computing device – more complex than a DVD player – we use everyday.

Paging is the memory management system based on the idea that we can divide the real memory of our computer into a sequence of smallish (typically 4,096 bytes) of “page frames” and then load the bits of data and program in and out of those frames (in “pages”) as we need them.

So, you can have pages from various different running programs in the page frames at any given time and then use a “virtual memory” system to map the pages placed in an arbitrary frame to the memory address the program thinks the page should be resident in.

It is not the only system we could use – “segments”, which involve moving large chunks (as opposed to small pages) of memory about is one approach, while “overlays” – using part of the memory space as sort of scratchpad working area – is another. More recently, with bigger “traditional” computers very large pages have been used as a way of making, at least in theory, more efficient use of memory now measured in billions (as opposed to a few tens) of bytes.

But paging is easily the most widely used approach and has been integral to the development of “multitasking” and similar shared resources approaches to computing – because paging allows us to just have the useful bits of a program and its data in memory we can have many more programs “running” at a given time.

But my PhD research is pointing me towards some of the weaknesses of the paging approach.

At the heart of the case for paging is the idea of “locality” in a computer’s use of resources: if you use one memory address at one instant there is a high probability you will use a nearby address very soon: think of any sort of sequential document or record and you can see why that idea is grounded in very many use cases of computing devices.

Locality means that it ought to make sense to read in memory in blocks and not just one little drop at a time.

But this principle may be in opposition to efficient usage of memory when competition for space in fierce: such as for the limited local memory resources we have on a Network-on-Chip computer.

Right now I am collecting data to measure the efficiency of 4k pages on such (simulated) devices. With 16 simulated cores trying to handle up to 18 threads of execution competition for pages is intense and the evidence suggests that they are resident, in some cases at least, for many fewer “ticks” than it takes to load them from the next lowest level in the memory hierarchy. On top of that many pages show that the principle of locality can be quite a weak one – pages of code are, in general, quite likely to demonstrate high locality (especially in loops) but pages of read-write memory may not.

I don’t have all the data to hand – essentially I am transforming one 200GB XML file into another XML file which will likely be around the same size and that takes time, even on quite a high spec computer (especially when you have to share resources with other researchers). But I expect some interesting results.

Enhanced by Zemanta

In the steps of László Bélády

Update: I have truncated this article for now (20 December) as there was an error in my LRU software that made LRU look like a much better performer than it really was. I’ll update this with the correct data shortly….


In 1966 László Bélády published “A study of replacement algorithms for virtual storage computers”, one of the truly epoch making papers for operating system science – the first comprehensive examination of page replacement strategies for virtual memory computers.

These days all but the simplest embedded computing devices will use some sort of virtual memory system because it allows computing devices to (relatively) seamlessly load bits of computer programs in and out of memory as needed – the programs see a faked – virtual – address and so the chunks can be loaded in an out of whatever piece of memory is available without worrying about having to get the chunks into exactly the same place every time.

But in 1966 virtual memory was a new and essentially experimental technology and so Belady’s examination of the different strategies for deciding which chunk (page) of memory was kept or replaced when new pages were required to be loaded is the foundation stone of all the approaches that followed.

This last couple of weeks I have found myself walking in the steps of Bélády as I built software to examine the different performance characteristics of potential page replacement policies in a network-on-chip computer.

I have about 220GB of XML data which represents a record of the memory accesses of an 18 threaded video processing application – and using that data I can test computer system performance using various different policies.