Hard fault count drives performance


Just to emphasise how much hard faults determine performance – here is a plot of hard faults versus page count for the same application mentioned in the previous post.

[Plot: hard faults versus page count]

The pattern is very similar, though it should be noted that increasing the page count does still keep the fault count coming down at the far end – but not by enough to outweigh the negative effects of having larger page tables to search through when checking for faults, looking for the least recently used page and the like.

Which bit of a 4k page does your program use?


This follows on from the previous post – here are the plots.

These are based on a run of the PARSEC benchmark suite x264 program – encoding 32 frames of video using 18 threads, 16 of them worker threads. The plots show how often each 16 byte “line” inside a 4k page is used – whether an instruction is read from it or the memory is used for read-write storage. Sixteen bytes is both the size of a typical cache line and of a single read from DDR memory.

[Plots: all accesses, code, memory only]

 

The code plot might suggest there is some pattern – between about segment 100 (offset 0x640 inside the page) and segment 200 (offset 0xC80) there is an increased hit rate – but my guess is that this is an artefact of the particular code being used here rather than a general issue (a possible explanation would be a particular library function being very heavily used), though conceivably it could be an issue with the GCC compiler or linker.

That might be worth further exploration, but not for me. From visual inspection I am concluding that the distribution of accesses inside a 4k page doesn’t justify trying to pre-cache particular 16 byte “lines”.
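
For what it is worth, the per-segment counting behind these plots can be reproduced with very little code. The sketch below is mine, not the actual analysis tool, and the trace record layout is an assumption: each access is reduced to its offset within the 4k page and bucketed into one of 256 sixteen-byte segments, with separate counts for instruction fetches and data accesses.

    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Hypothetical trace record: an address plus whether the access was an
    // instruction fetch or a data (read/write) access.
    struct Access {
        uint64_t address;
        bool isInstruction;
    };

    int main()
    {
        // 4096 / 16 = 256 sixteen-byte "lines" (segments) per page.
        std::array<uint64_t, 256> codeHits{};
        std::array<uint64_t, 256> dataHits{};

        // In the real analysis these would come from the XML trace; a tiny
        // made-up vector stands in for it here.
        std::vector<Access> trace = {
            {0x7f3a12640, true}, {0x7f3a12c80, false}, {0x7f3a12644, true}
        };

        for (const Access& a : trace) {
            uint64_t offsetInPage = a.address & 0xFFF; // offset within the 4k page
            uint64_t segment = offsetInPage >> 4;      // which 16-byte line (0-255)
            if (a.isInstruction)
                ++codeHits[segment];
            else
                ++dataHits[segment];
        }

        // One line per segment: segment number, code hits, data hits.
        for (std::size_t s = 0; s < codeHits.size(); ++s)
            std::printf("%zu %llu %llu\n", s,
                        (unsigned long long)codeHits[s],
                        (unsigned long long)dataHits[s]);
        return 0;
    }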


Deductive debugging


An illustration of multithreading where the master thread forks off a number of threads which execute blocks of code in parallel. (Photo credit: Wikipedia)

This is not a story of a great debugging triumph – but it is one that points to a great debugging truth: studying the bug before you start to pore over your code is likely to get you to a solution faster.

My example is my code to simulate Belady’s “OPT” page replacement algorithm for a multithreaded environment.

In OPT the idea is that, when we need to make room in main memory for a new page, we select for removal the page with the longest “reuse distance” – in other words, the one we will have to wait the longest (perhaps forever) before needing to use again. This algorithm is sometimes called the “clairvoyant” algorithm because it requires foreknowledge of which memory page will be used when. That does not happen very often in general-purpose computing, but it can often be the case in embedded computing, where the code does exactly the same thing over and over again.

In my case I am using a memory trace from a transcoding of 32 frames of video – a small (in terms of time on a real computing device) example of the sort of parallel task you might see in embedded devices. In the real world this runs for a few seconds – but it also generates 220GB of XML trace records spread across 18 threads.

With a single thread it’s easy to work out the reuse distance – you just look at how long it will be before a page gets referenced again: you could even do this statically ahead of runtime and just look the result up if you wanted.
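
As a rough sketch of that single-threaded case (the function and variable names are mine, purely for illustration): a single backwards pass over the trace is enough to record, for each access, the index at which the same page is next referenced, which is all OPT needs to look up later.

    #include <cstddef>
    #include <cstdint>
    #include <limits>
    #include <unordered_map>
    #include <vector>

    // For a single-threaded trace the next use of every page can be
    // precomputed: scan the trace backwards, remembering the most recent
    // (i.e. next, in forward order) position at which each page was seen.
    std::vector<std::size_t> nextUse(const std::vector<uint64_t>& pageTrace)
    {
        const std::size_t NEVER = std::numeric_limits<std::size_t>::max();
        std::vector<std::size_t> next(pageTrace.size(), NEVER);
        std::unordered_map<uint64_t, std::size_t> seenAt; // page -> index of its next reference

        for (std::size_t i = pageTrace.size(); i-- > 0; ) {
            auto it = seenAt.find(pageTrace[i]);
            if (it != seenAt.end())
                next[i] = it->second;   // the page is referenced again at this index
            seenAt[pageTrace[i]] = i;   // this access becomes the "next use" for earlier ones
        }
        return next; // next[i] == NEVER means the page is never used again
    }

The reuse distance at access i is then just next[i] - i, or infinite when the page never recurs.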

That is not true in multithreaded code – one thread might run fast or slow (e.g. while waiting for IO) and so the only way to do it is to measure the reuse distances for each page and for every thread:

  • For each thread calculate the minimum reuse distance of each page;
  • Then pick the page with the maximum minimum reuse distance across all threads.
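
A sketch of that selection step (the data structures here are my own invention for illustration, not the simulator’s):

    #include <cstdint>
    #include <limits>
    #include <unordered_map>
    #include <vector>

    // Hypothetical per-thread data: a map from page number to that thread's
    // reuse distance for the page (how long until the thread next uses it).
    // Pages a thread never touches again simply have no entry.
    using ReuseMap = std::unordered_map<uint64_t, uint64_t>;

    // Choose the victim among the resident pages (assumed non-empty): the
    // page whose minimum reuse distance across *all* threads is the largest.
    uint64_t chooseVictim(const std::vector<uint64_t>& residentPages,
                          const std::vector<ReuseMap>& perThreadReuse)
    {
        const uint64_t INFINITE = std::numeric_limits<uint64_t>::max();
        uint64_t victim = residentPages.front();
        uint64_t bestMin = 0;

        for (uint64_t page : residentPages) {
            uint64_t minAcrossThreads = INFINITE; // soonest any thread needs this page
            for (const ReuseMap& reuse : perThreadReuse) {
                auto it = reuse.find(page);
                if (it != reuse.end() && it->second < minAcrossThreads)
                    minAcrossThreads = it->second;
            }
            if (minAcrossThreads >= bestMin) {    // a page no thread needs again always wins
                bestMin = minAcrossThreads;
                victim = page;
            }
        }
        return victim;
    }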

I wrote some code that appeared to do this, and on my testing subset of the 220GB XML it seemed to deliver good results. But whenever I ran it against the full trace it started brightly and then performance – by which I mean how fast the simulator ran through the traces in terms of the synthetic ticks it generated, or bluntly the simulated performance – just got worse and worse.

In fact, the longer a simulated thread had been running, the worse its performance got; the “fastest” thread was always the most recently spawned one; and the more threads that ran, the worse the problem became.

Now, with severely limited memory (in this case we were simulating a 16 core NoC with 32Kb of local memory per core, pooled into one flat 512Kb space), performance can go downhill quite badly as the thread count climbs – but this fall-off was catastrophic, and it was plain that OPT was going to come out worse than “least recently used” (LRU) – and that just cannot be correct! I have not sat down to write a mathematical proof of that, but instinctively I know it to be true…

Reading through the code did not draw my attention to any obvious flaws, so I had to sit down and think about what the bug showed me: it worked well on short runs, the most recent thread seemed to do well, and recent threads in general did better than longer-established threads.

Even writing this down now makes it seem obvious – my code was in some way biased towards more recently created threads. So instead of searching through all the code looking for errors, I could home in on those parts that scanned through each thread.

I found such an error quite quickly but testing again showed that while the problem was diminished, it was not by much – I still had not found what was really to blame.

Another scan through the key sections of the code revealed the real error: when a thread tried to page in memory it only examined the reuse distances of itself and those threads created after it.

Thread data was stored in a linked list, but instead of starting the scan at the head of the list, the scan began at the pointer to the current thread. The result was that the most recent thread had a “perfect” OPT experience – on every page-in its reuse distances were fully considered – but at the other extreme the first thread’s reuse distances were only considered when it came to page in memory itself. The pages it used that were not used by any other thread were fair game: they appeared to have an infinite reuse distance and so were almost certainly the ones chosen for replacement more or less every time.

Changing the code so that the scan began at the head of the linked list, rather than at the local pointer, fixed the problem, and the OPT simulator is now running – my guess is that it will show OPT to be two to three times more efficient than LRU.
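
In schematic terms (the type and variable names below are hypothetical, not taken from the simulator), the difference between the broken and the fixed scan is just the starting point of the loop:

    // Simplified illustration of the bug and the fix.
    struct ThreadRecord {
        // per-thread reuse-distance data lives here
        ThreadRecord* next;
    };

    ThreadRecord* listHead = nullptr; // first thread ever created

    // Buggy: starting at the current thread means only it and the threads
    // created after it ever have their reuse distances consulted.
    void scanBuggy(ThreadRecord* currentThread)
    {
        for (ThreadRecord* t = currentThread; t != nullptr; t = t->next) {
            // ... consider t's reuse distances ...
        }
    }

    // Fixed: start from the head of the list so every thread, however old,
    // is taken into account.
    void scanFixed()
    {
        for (ThreadRecord* t = listHead; t != nullptr; t = t->next) {
            // ... consider t's reuse distances ...
        }
    }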

 
