The five minute rule etc., redux


So I made a bit of a mess with this the first time round and, when it ended up on Slashdot (I had assumed that wasn’t going to happen, as the delay between posting and it appearing was a few days), I was left rather embarrassed.

With thanks to all those who commented and pointed out the errors, here’s my attempt to do it properly. I haven’t agreed with every comment – eg it was pointed out that RAID wasn’t an option in the mid 80s and so shouldn’t be part of a comparison now – but I think this is about real-world choices, so RAID would probably feature, and so on.

This paper – The 5 Minute Rule for Trading Memory for Disc Accesses and the 10 Byte Rule for Trading Memory for CPU Time – looked at the economics of memory and disk space on big iron database systems in the early/mid 1980s. It concluded that the financial trade-off between disk space (then costing about $20,000 for a 540MB disk, or roughly 3.7 cents per KB) and (volatile) memory (about $5 per kilobyte) favoured having enough memory to keep any data you need to access around once every five minutes resident in memory.

It then compared the cost of computing power (estimated at about $50,000 per MIPS) and memory – eg if you compressed data to save on memory space you would have to spend additional computing power to access the data. With one instruction per second costing about 5 cents and a byte of memory about half a cent, the trade-off works out at about 10 bytes of memory per instruction per second.
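For anyone who wants to check the arithmetic, here is the 1985 sum in a few lines of Python. The disk access rate (and treating the unit of data as 1KB) are my assumptions – they are not given above – everything else comes from the figures quoted.

```python
# 1985 figures as quoted above (all approximate, in dollars).
disk_price = 20_000.0            # a 540MB drive
disk_accesses_per_sec = 15       # assumed: a plausible mid-80s drive rate
memory_price_per_kb = 5.0
cpu_price_per_mips = 50_000.0

# Five minute rule: break-even access rate for keeping 1KB of data in memory.
cost_per_access_per_sec = disk_price / disk_accesses_per_sec
break_even_rate = memory_price_per_kb / cost_per_access_per_sec
print(f"break even at {break_even_rate:.4f} accesses/s, "
      f"i.e. one access every {1 / break_even_rate / 60:.1f} minutes")

# Byte rule: how many bytes of memory cost the same as one instruction/second.
cost_per_ips = cpu_price_per_mips / 1_000_000
memory_price_per_byte = memory_price_per_kb / 1024
print(f"{cost_per_ips / memory_price_per_byte:.0f} bytes per instruction/second")
```

Which gives roughly four and a half minutes and ten bytes – close enough to the rules of thumb in the title.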

What do these comparisons look like now? The original paper explicitly ruled out applying this sort of comparison to PCs – citing limited flexibility in system design options and different economics. But we won’t be so cautious.

The five minute rule updated

Let us consider a case with 2TB SSD disks. These cost about £500 (and probably about $500, we are approximating) and let’s say we are going with a RAID 5 arrangement – so in effect we need four ‘disks per disk’ (£2000) – and the cost is then 0.0001 pence per kilobyte (about 4 orders of magnitude less than 35 years ago).

As discussed above I am using the RAID figure even though RAID wasn’t a practical option in 1985 because I think this is about real world choices not theoretical limitations.

And memory – for simplicity we are going to say 128GB of DRAM costs us £1000 (an over-estimate but fine for this sort of calculation). That means memory costs about 0.001p per KB – a fall of around 6 orders of magnitude.

Using the same approach as the original paper – but considering 4KB pages – and assuming the disk supports 10,000 accesses per second, the cost for a disk works out at about 5p per access per second (four or so orders of magnitude less than in 1985, assuming a mid-80s drive managed something like 15 accesses a second). We ignore the cost of a disk controller here, but conceivably we might want to add another few pence to that figure.

If making one 4KB page resident in memory saves one access per second, the disk cost saved is 5p for a memory cost of 0.004p. Saving 0.1 accesses per second saves 0.5p, again at a cost of 0.004p, and the break-even point is roughly 0.0008 accesses per second – or, put another way, it pays to hold a page in memory if it is accessed at least once every 1/0.0008 seconds, which is 1250 seconds or about 20 minutes.
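The same sum with the updated numbers (this is just the arithmetic above, in pence):

```python
# Updated figures as used above (pence, very approximate).
disk_cost_per_access_per_sec = 5.0    # £500 SSD / 10,000 accesses per second
memory_cost_per_kb = 0.001            # roughly £1000 for 128GB
page_kb = 4

page_cost = page_kb * memory_cost_per_kb                      # ~0.004p per page
break_even_rate = page_cost / disk_cost_per_access_per_sec    # accesses per second
print(f"break even at {break_even_rate:.4f} accesses/s, "
      f"i.e. one access every {1 / break_even_rate:.0f}s "
      f"(about {1 / break_even_rate / 60:.1f} minutes)")
```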

In other words, caching systems should aim to hold items that are accessed at least once every 20 minutes or so.

And the trade-off between computing power and memory…

As mentioned above, back in 1985 computing power was estimated to cost about $50,000 per million instructions per second (MIPS). These days single-core designs are essentially obsolete, so it’s harder to put a price on a MIPS: good parallel software will drive much better performance from a set of 3GHz cores than hoping a single core’s 5GHz burst speed will get you there. But, as we are making estimates here, we will say 3000 MIPS costs £500, so a single MIPS costs about 17p and a single instruction per second costs (again approximating) 0.00002 pence. (Contrast this with a low-end microcontroller, which might give you an instruction per second for about 0.00003p or a bit less.)

Computing power has thus become about 5 orders of magnitude cheaper – but as we note above memory prices have fallen at an even faster rate.

Now an instruction per second costs about 0.00002p and a byte of memory about 0.000001p, so spending an additional instruction per second is only worthwhile if it saves about 20 bytes of memory – double the 10 bytes of the original paper – meaning that cost-efficient compression of frequently accessed in-memory data is harder to justify than it was in 1985.
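Again the arithmetic, this time using the raw prices rather than the rounded ones (which is why it comes out in the high teens rather than bang on 20):

```python
instruction_per_sec_cost = 500.0 * 100 / 3_000_000_000   # pence: £500 buys 3000 MIPS
byte_cost = 0.001 / 1024                                  # pence: 0.001p per KB of DRAM
print(f"{instruction_per_sec_cost:.7f}p per instruction/s, "
      f"{byte_cost:.7f}p per byte: "
      f"{instruction_per_sec_cost / byte_cost:.0f} bytes per instruction/second")
```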

None of the easy things work


This is a post about my PhD research: in fact it is a sort of public rumination, an attempt to clarify my thoughts in writing before I take the next step.

It’s also possibly an exercise in procrastination: a decision to write about what I might do next, rather than to get on with doing it, but I am going to suppress that thought for now.

I am looking for ways to make “network on chip” systems more viable as general use (or any use, I suppose) computing platforms. These systems are a hardware response to the hardware problem that is causing such difficulties for big software and hardware manufacturers alike: namely that we cannot seem to make faster computers any more.

The problem we have is that while we can still get more transistors on a chip (i.e., that “Moore’s Law” still applies), we cannot keep operating them at ever higher speeds (i.e., “Dennard Scaling” has broken down) because they get too hot.

In response we can either build better small devices (mobile phones, tablets) or try to build faster parallel computing devices (so instead of one very fast chip we have several moderately fast chips and try to have better software that makes good use of their ability to compute things in parallel).

Network-on-chip (NoC) processors are a major step along the road of having parallel processors – we put more processing units on a single piece of silicon rather than have them co-operate via external hardware. But the software has not caught up and we just cannot keep these chips busy enough to get the benefit their parallelism might offer.

That is where I hope to make a difference, even if just at the margins. Can I find a way to make the NoC chips busier, specifically by keeping them fed with data and code from the computer memory fast enough?

I have tried the obvious and simple methods – essentially adaptations of approaches that have been used for most of the last 50 years in conventional serial computers – and the answer is ‘no’ if that is all that is on offer.

Messing about with the standard algorithms used to feed code and data to the processors hits a solid brick wall: the chips have a limited amount of ‘fast’ local memory and the time it takes to keep that refreshed with up-to-date code and data places a fundamental limit on performance.

Many computer science students will be familiar with “Amdahl’s Law”, which stipulates that, for parallel code, the elements that have to run serially (even if just setting up the parallel section) place a fundamental limit on how much extra performance we can squeeze out by throwing more and more processors at the problem. We have a different, if related, problem here: we can apply more and more parallel processors to the problem but performance stays flat, because even though we are running parallel code we are limited by memory performance.
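To make the contrast concrete, here is a toy comparison (the numbers are illustrative, not taken from my simulations): Amdahl’s Law flattens out gradually as the serial fraction bites, whereas a memory-bound system simply stops improving once the memory system is saturated.

```python
def amdahl_speedup(n_procs, parallel_fraction):
    """Amdahl's Law: the serial fraction caps the achievable speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_procs)

def memory_bound_throughput(n_procs, per_core_rate, memory_cap):
    """Crude model of the NoC case: extra cores help only until the shared
    memory system saturates, after which throughput is flat."""
    return min(n_procs * per_core_rate, memory_cap)

for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} cores: Amdahl (95% parallel) x{amdahl_speedup(n, 0.95):5.2f}, "
          f"memory-bound throughput {memory_bound_throughput(n, 1.0, 6.0):4.1f}")
```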

This limit – which implies that as we use more processors they become individually less efficient – even hits the so-called “clairvoyant” or optimal (OPT) memory management/page replacement algorithm: OPT knows which memory page it is most efficient to replace but is still limited by the fundamental barrier of limited on-chip memory.
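For reference, OPT – Belady’s algorithm – is easy to sketch if you have the whole access trace in advance (which is exactly why it is unrealisable in practice). This is a generic illustration, not my simulator; the point is that even with perfect foresight it still has to evict something when the local memory is full.

```python
def opt_misses(trace, frames):
    """Belady's OPT: on a miss with a full cache, evict the page whose
    next use lies furthest in the future (or never comes)."""
    cache, misses = set(), 0
    for i, page in enumerate(trace):
        if page in cache:
            continue
        misses += 1
        if len(cache) >= frames:
            def next_use(p):
                for j in range(i + 1, len(trace)):
                    if trace[j] == p:
                        return j
                return float("inf")          # never needed again: ideal victim
            cache.remove(max(cache, key=next_use))
        cache.add(page)
    return misses

trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(opt_misses(trace, frames=3))           # 7 misses even with perfect foresight
```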

The limit is manifest in the straight lines we can see in the plot here – the steeper slope of OPT means it runs faster, but once the first few processors have been brought to bear on the problem (the number of processors in use climbs over the first few billion instructions) the rate of instructions executed per ‘tick’ (an analogue of time) is constant.

OPT and LRU compared – both fundamentally stymied by memory shortage

Getting NoCs to run faster, and so releasing the benefits of the potentially massive parallelism they could bring, depends on beating this memory barrier (and lots of other things too, but one at a time!). So, what are the options?

Well, one thing I can rule out is trying to cache a particular piece of a memory page (in traditional operating systems memory is shifted about the system in blocks called pages – typically 4096 bytes long). Caches typically store memory in 16-byte “lines” – in most cases the hardware reads from the backing memory store in 16-byte blocks – and so I tested to see whether there was a pattern in which 16-byte line was most likely to be used (see previous blog post). My instinct from looking at the plot is that this will not work.
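The test itself is straightforward: given a trace of accessed byte addresses (the trace format here is hypothetical, just a list of integers), bucket each access by which 16-byte line of its 4KB page it falls in and look for skew. A flat histogram means no line position is a better caching bet than any other.

```python
from collections import Counter

PAGE_SIZE = 4096
LINE_SIZE = 16

def line_histogram(addresses):
    """Count accesses per 16-byte line position within a 4KB page."""
    return Counter((addr % PAGE_SIZE) // LINE_SIZE for addr in addresses)

# A hypothetical handful of addresses just to show the shape of the output.
trace = [0x1000, 0x1008, 0x1ABC, 0x2010, 0x2FF0]
print(line_histogram(trace).most_common(3))
```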

Similarly, a look at which pages were being used doesn’t reveal any immediately obvious pattern – some pages are used heavily by code, some are not – nothing surprising there.

So, the easy things do not work. Now I need to look at the hard things.

I think I need to escape from the page paradigm – one thing to look at is the size of the memory objects that are accessed. 4k pages are simple to handle – load a block in, push it out – but they could be (probably are) very inefficient. Might it be better to base our memory caching system on object sizes? That’s what I plan to check.


The cost of soft/minor faults


Here is a complete graph of the memory use and fault count for a run of ls -lia.

ls -lia memory use

Fault count for ls -lia

As you can see there are only soft/minor faults here – as one would expect on a machine with plenty of memory, since the ls binary and the C library it uses will already be in memory (presumably the filesystem information was also cached).

But there are a lot of soft faults – and these too have a cost, even if nothing like the cost of a hard fault. For a start, each soft page fault almost certainly indicates a miss in the processor cache.
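The raw counts are easy to reproduce on a Linux box, for what it’s worth; this sketch (assuming Python is available) runs the command as a child process and prints the minor and major fault totals the kernel recorded for it:

```python
import resource
import subprocess

# Run the command, then read the fault counters the kernel keeps for children.
# (getrusage(RUSAGE_CHILDREN) is cumulative over all waited-for children, so
# run this in a fresh interpreter for a clean measurement.)
subprocess.run(["ls", "-lia"], stdout=subprocess.DEVNULL)
usage = resource.getrusage(resource.RUSAGE_CHILDREN)
print(f"minor (soft) faults: {usage.ru_minflt}")
print(f"major (hard) faults: {usage.ru_majflt}")
```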

The paper linked here also gives a lot of information about Linux’s generation of soft/minor faults and their performance impact – it seems that the kernel is designed to deliver system-wide flexibility at the expense of performance.

Some thoughts on the Linux page cache


I am still waiting for the result of the Birkbeck exam board to find out whether I have been awarded the MSc, but I think I have waited long enough and can now discuss my thoughts on some of the issues around page caching that my project raised.

Linux‘s page caching is based on the “2Q” concept: pages in use by the system are stored in one of two queues (hence “2Q”) and, within each queue, are in one of two states – see the diagram below, which also includes states for pages being flushed from the cache:

Page lifecycle

Typically a page enters the cache in the INACTIVE UNREF state. If the cache needs to be trimmed, pages in this state are the most likely to go (I am simplifying this for ease of exposition). On reference the page is promoted to the INACTIVE REF state, and on a further reference it can join the ACTIVE queue – starting there in the ACTIVE UNREF state.

A “clock” mechanism with two sweeping hands controls the demotion of pages: the first hand marks any page in the REF state as UNREF, and if a subsequent reference has not pushed the page back to the REF state by the time the second hand arrives, the page is either demoted from the ACTIVE queue or, if it is in the INACTIVE queue, removed from the cache completely. (This is actually a bit more complicated: for instance, pages which are “dirty”, ie requiring write-back to disk, are less likely to be demoted than clean pages, which can simply be dropped, and so on.)

The 2Q mechanism is designed to stop pages which are only referenced once (eg a page required on initialisation) from being kept in the cache in preference to a page that might be referenced more than once. Essentially, frequently referenced pages get bumped up to the ACTIVE queue and so are less likely to be pushed out of the cache – eg a page in the ACTIVE REF state would face three calls to shrink_xxxx_list before being removed, and if another reference happens in that time (crudely, three sweeps of the clock) it will be preserved in the cache.
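To make that concrete, here is a toy 2Q-style cache – my own much-simplified sketch, not the kernel’s code – with the two queues, a referenced bit per page, and a shrink pass that gives referenced pages a second chance:

```python
from collections import OrderedDict

class TwoQCache:
    """A crude 2Q-style cache: an INACTIVE and an ACTIVE list, each entry
    carrying a referenced bit, with eviction from the cold end of the
    INACTIVE list first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inactive = OrderedDict()   # page -> referenced bit
        self.active = OrderedDict()

    def access(self, page):
        if page in self.active:
            self.active[page] = True            # ACTIVE UNREF -> ACTIVE REF
        elif page in self.inactive:
            if self.inactive[page]:             # second reference: promote
                del self.inactive[page]
                self.active[page] = False       # joins ACTIVE as UNREF
            else:
                self.inactive[page] = True      # INACTIVE UNREF -> INACTIVE REF
        else:
            self.inactive[page] = False         # new pages enter INACTIVE UNREF
            self._shrink()

    def _shrink(self):
        # A rough stand-in for the clock hands: clear reference bits and
        # evict or demote from the cold (oldest) end until under capacity.
        while len(self.inactive) + len(self.active) > self.capacity:
            if self.inactive:
                page, referenced = self.inactive.popitem(last=False)
                if referenced:
                    self.inactive[page] = False     # second chance, bit cleared
                # else: evicted from the cache entirely
            else:
                page, referenced = self.active.popitem(last=False)
                if referenced:
                    self.active[page] = False       # second chance on ACTIVE
                else:
                    self.inactive[page] = False     # demoted to INACTIVE

cache = TwoQCache(capacity=4)
for page in [1, 2, 1, 1, 3, 4, 5, 6]:   # page 1 is hot, the rest touched once
    cache.access(page)
print("active:", list(cache.active), "inactive:", list(cache.inactive))
# -> active: [1] inactive: [4, 5, 6]  (the once-used pages 2 and 3 were evicted)
```

Feeding it a pattern where one page is touched repeatedly and the rest only once leaves the hot page on the ACTIVE list while the earliest one-shot pages are the ones evicted – the behaviour described above.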

This behaviour is generally thought to be a strength, but what if it were also a weakness?

The graphic below shows the memory access patterns of a simple Firefox open and close:

Firefox memory access mapping

It can immediately be seen that memory access falls into clear phases. The problem with 2Q is that it works well within those phases, but as the application moves from one phase of memory access to another, 2Q risks holding pages from the old phase in memory for too long – indeed, if memory pressure were high, pages from the new phase could even face eviction in order to preserve pages that made it into the ACTIVE list during the old phase. Those old pages have a very high or even infinite reuse distance, but are privileged by 2Q.
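(Reuse distance here means the number of distinct pages touched between two accesses to the same page.) A quick way to see the phase effect from a trace – the trace format below is hypothetical, just a list of page numbers – is to compute those distances directly:

```python
def reuse_distances(trace):
    """For each access, the number of distinct pages touched since the previous
    access to the same page (None for a first access). A page that is never
    touched again effectively has an infinite reuse distance."""
    last_seen = {}
    distances = []
    for i, page in enumerate(trace):
        if page in last_seen:
            distances.append(len(set(trace[last_seen[page] + 1 : i])))
        else:
            distances.append(None)
        last_seen[page] = i
    return distances

# Two phases: pages 1-3, then pages 10-12. After the switch the phase-one
# pages are never reused, yet a 2Q-style cache may still be protecting them.
print(reuse_distances([1, 2, 3, 1, 2, 3, 10, 11, 12, 10, 11, 12]))
```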

How can this be fixed? That’s not so easy to answer – especially as the pages from the old phase may be in use elsewhere in the system (for instance if they came from a shared library). But I am convinced that finding an answer to this could have a big impact on the performance of low memory systems.