How much memory do you need?

What’s the best way to speed up your computing experience? As many know the most cost-effective way is often not to buy a new computer with a faster processor but to add more memory to an existing system.

The plot below, based on some results from my PhD research shows just how this works…

timingsIn my case I am measuring how long it takes a processor of a fixed speed to execute a given program while I vary the amount of memory available.

My research centres on “network-on-chip” systems so the issue here is how much memory is available locally (i.e., on the chip). Because if instructions and data are not locally available they have to be fetched in from some more “distant” store (whether that is system global memory or even a disk system). And if space on the chip is limited then you generally have to evict some other piece of memory to make way for the piece needed immediately. Yet if you need to evicted piece again in the future you have to take the time needed to fetch it back in again and so on.

In this case we are using 4096 byte (4K) blocks of memory – i.e., 4K pages. So when it says 10 pages on the x-axis that means there is 40K available locally and so on.

I am testing all this on the OVPsim emulator and I have no independent means of accurately timing how long the process takes – but I can count the number of instructions it takes to complete the task.

Factors that affect the number of instructions taken are – in order of likely importance:

  • the time taken to fetch whole pages in and out of memory – a “hard fault” occurs when a page has to be evicted and a new page brought in instead (the earliest hard faults are not as slow as the later ones as until the full number of pages are used no eviction is required);
  • the time it takes to ensure that pages that are in memory but which are not immediately available for execution (because the number of translation mappings that tell the central processor where they are is limited – to 8 in this test case – and so for each fault we first have to test if the page is available but unmapped). If an unmapped page is available we have a “soft fault” which is much quicker to process than a hard fault as no eviction etc is required;
  • picking the next page that will be evicted if have a hard fault – in this case we aim to pick the page that was least recently used, but even then we have to use an approximation. This process is not the same as evicting the page – it merely indicates which one is the best candidate for eviction should that become necessary.

The number of hard faults is reduced by increasing the amount of memory available (until, of course, you have so much memory you never need evict a page). But as you increase the amount of memory you also make checking for soft faults and picking the next candidate for eviction slower – because there are more pages to check.

And this is what we see in the plot. When memory is in short supply adding even a single extra page can make a big difference. There is then a long range where the decrease is not so great but constant. This reflects the fact that memory addresses are not needed at random and that programs show ‘locality of preference’ (ie if you need one address at one moment you are quite likely to need a nearby address at the next). This locality means that adding extra pages can have a limited impact once you have enough to cover the range of likely picks in a reasonable time frame. Then adding extra memory means that the changes between the different phases of locality becomes smoother, so we still see a reduction in time but not anywhere near as pronounced as before.

Then we get to the point – about 30 pages in this case – where we are close to covering the full range of addresses used, even when we switch between phases. In this case we see a sudden fall again until – at about 33 – 35 pages – we seem to have covered every single address our program will ever use.

After that having more memory makes the system slightly slower (NB in most real-world desk and laptop systems adding more memory does not slow down the system – so don’t worry about that!).

The lesson: if your computer is really slow and you don’t want to replace it, add more memory. But if you computer already has so much memory that it doesn’t know what to do with it and it is slow, you have a bigger problem!

What that working set comparison graph should have looked like

Working sets for Xterm

The graphs look similar but the differences are important – this one (the correct one), appears to confirm that Peter Denning‘s findings about the working set model versus LRU still hold good, at least in broad terms – though this still suggests LRU has better performance characteristics than might be expected.

But it’s late now and I am going to bed – perhaps more later.

The graph is wrong

Once I published the graph on the previous blog entry I more or less immediately realised it was wrong – it’s not that the curves are wrong: it’s that they are plotted on different scales.

The red line plots lifetime curve using the working set of the process based on pages accessed in a certain time frame – this gives an average working set size (\theta ) which is plotted along the x axis.

The blue line is the lifetime curve with a maximum working set of a fixed size (ie it is a simple LRU type page replacement simulation). But it is not scaled correctly against the x axis. Something of a waste of 31 hours of wallclock time!

Happily my 12 core box is now online and so I should have a replot done shortly – my best guess is that it may not change too much, things will be interesting if it does.

Better algorithm found

An animation of the quicksort algorithm sortin...
Image via Wikipedia

Late last night, in a state of some despair, I pleaded for computing power help with my MSc project. Today, having thought about it long and hard I found a better algorithm and speeded the whole thing up by a factor of 50.

Like, I am sure, many people previously in my position, my inspiration was the classic Programming Pearls.

This is a Birkbeck set text but it’s also one that I did not read when I should have last year – or rather I just read a small part of it. Luckily I am systematically going through it now and was yesterday reading about how much of a difference a better algorithm can make – not that I did not know this, but it helped provide the determination to find one.

So what did I do. Firstly I realised that there was a specialist implementation of the Java Map interface that, as the Javadoc file explicitly says, is good for LRU caches (essentially what I am seeking to model): LinkedHashMap.

This sets up a last-accessed order and so means that it is no longer necessary to search through the whole Map to find the out-of-date pages. Using an Iterator and an EntrySet I only need to get to the first page that is not out-of-date and stop.

When I was checking that worked I noticed that temporal locality meant that in many cases the LRU page was still inside the time frame I was seeking to check – in other words literally billions of checks for outdated pages were taking place needlessly. As pages in the cache cannot get “older” (ie there reference time cannot go backwards), at time \tau + \delta the oldest page cannot be any older than it was at \tau – hence if we do not check until we have reached the point here we need to expire pages with time \tau we will not miss any needed evictions.

The result of these two changes is a massive speed up in the code – by a factor of 40 – 50.