# Chase down that bug

If there are rules for software development, one of them should be never let a bug go unsquashed.

This has been demonstrated to me again this week – when I had a bug in some Microblaze interrupt code. I realised I no longer needed to use the code and then puzzled over whether to find out what was wrong anyway.

And I am very glad I acted on it – not least because it seems to me my problem was actually caused by a small bug in the OVP model for the Microblaze (the simulator model allows interrupts to be executed while an exception is in progress, but a real processor would not) and, hopefully, tracking down the bug there will benefits others in the future.

Last October I ran in the Royal Parks Half Marathon – with the aim of finishing in under two hours. By the time I got to the final mile it was obvious I was not going to make that, but I was actually doing quite well – and then my legs buckled and I sort of walked, ran and hobbled my way to a 2 hours 12 minutes finish (I was at 20k at about 2 hours and 1 minute).

My injuries weren’t that serious but I did have problems walking without painful calves for more than a few weeks, and my Achilles’s Tendons were sore for longer. All-in-all a bit of a disaster. Running all but stopped and even though I was still going to the gym I managed to put on about 7kg in weight (not helped by Christmas excesses – and US portions).

I did not really help myself – I refused to rest the injury properly and kept trying to run – which usually resulted in poor performance and renewed inflammation.

But by the turn of this year I thought I was in a good enough place to really try again. And I am back running…

https://www.strava.com/athletes/7518427/activity-summary/0c7bb9834dcc655d26e07924eb96cf477b5d3614

….but I am still struggling with it.

Last spring, in the run up to my first half marathon I was running close to 50k a week and I remember thinking I could manage running 10k every day. All that feels like a very long time ago.

Since 1 January I have gone out a few times with the intention of running 10k but have yet to manage it – the latest failure was today when a blister (wrong socks) combined with willpower failure to make me stop at 7k (if I had worn the correct socks maybe I would have made it – and stopping was the right thing to do, but it also felt like a good excuse).

To be sure, the cold weather (this winter is colder than last, though today, at last, had an air of spring about it) doesn’t help, but getting that extra 1, 2 or 3 k in seems like a very tall order just now.

I am turning in decent pace, though – at least for training runs (I’m never going to trouble any leader board). But I haven’t made much progress in shifting the weight – I’ve knocked about 2 kilos off from the very worst, but I am still a fair bit heavier than a year ago.

It’s all a bit disappointing really. Perhaps I was too optimistic about how quickly I could recover and should just keep plugging away?

# Hard fault count drives performance

Just to emphasise how hard faults are determining for performance – here is a plot of hard faults versus page count for the same application mentioned in the previous post.

The pattern is very similar, though it should be noted that increasing the page count does still keep the fault count coming down at the far end – but not by enough to outweigh the negative effects of having larger page tables to search through when checking for faults and looking for the least recently used page and the like.

# How much memory do you need?

What’s the best way to speed up your computing experience? As many know the most cost-effective way is often not to buy a new computer with a faster processor but to add more memory to an existing system.

The plot below, based on some results from my PhD research shows just how this works…

In my case I am measuring how long it takes a processor of a fixed speed to execute a given program while I vary the amount of memory available.

My research centres on “network-on-chip” systems so the issue here is how much memory is available locally (i.e., on the chip). Because if instructions and data are not locally available they have to be fetched in from some more “distant” store (whether that is system global memory or even a disk system). And if space on the chip is limited then you generally have to evict some other piece of memory to make way for the piece needed immediately. Yet if you need to evicted piece again in the future you have to take the time needed to fetch it back in again and so on.

In this case we are using 4096 byte (4K) blocks of memory – i.e., 4K pages. So when it says 10 pages on the x-axis that means there is 40K available locally and so on.

I am testing all this on the OVPsim emulator and I have no independent means of accurately timing how long the process takes – but I can count the number of instructions it takes to complete the task.

Factors that affect the number of instructions taken are – in order of likely importance:

• the time taken to fetch whole pages in and out of memory – a “hard fault” occurs when a page has to be evicted and a new page brought in instead (the earliest hard faults are not as slow as the later ones as until the full number of pages are used no eviction is required);
• the time it takes to ensure that pages that are in memory but which are not immediately available for execution (because the number of translation mappings that tell the central processor where they are is limited – to 8 in this test case – and so for each fault we first have to test if the page is available but unmapped). If an unmapped page is available we have a “soft fault” which is much quicker to process than a hard fault as no eviction etc is required;
• picking the next page that will be evicted if have a hard fault – in this case we aim to pick the page that was least recently used, but even then we have to use an approximation. This process is not the same as evicting the page – it merely indicates which one is the best candidate for eviction should that become necessary.

The number of hard faults is reduced by increasing the amount of memory available (until, of course, you have so much memory you never need evict a page). But as you increase the amount of memory you also make checking for soft faults and picking the next candidate for eviction slower – because there are more pages to check.

And this is what we see in the plot. When memory is in short supply adding even a single extra page can make a big difference. There is then a long range where the decrease is not so great but constant. This reflects the fact that memory addresses are not needed at random and that programs show ‘locality of preference’ (ie if you need one address at one moment you are quite likely to need a nearby address at the next). This locality means that adding extra pages can have a limited impact once you have enough to cover the range of likely picks in a reasonable time frame. Then adding extra memory means that the changes between the different phases of locality becomes smoother, so we still see a reduction in time but not anywhere near as pronounced as before.

Then we get to the point – about 30 pages in this case – where we are close to covering the full range of addresses used, even when we switch between phases. In this case we see a sudden fall again until – at about 33 – 35 pages – we seem to have covered every single address our program will ever use.

After that having more memory makes the system slightly slower (NB in most real-world desk and laptop systems adding more memory does not slow down the system – so don’t worry about that!).

The lesson: if your computer is really slow and you don’t want to replace it, add more memory. But if you computer already has so much memory that it doesn’t know what to do with it and it is slow, you have a bigger problem!

# Why stop at twelve?

``` View image | gettyimages.com ```

There is an election coming in Britain and so politicians are setting out their stalls.

Education Secretary Nicky Morgan has now said that her priority will be a “war on innumeracy” and an insistence that children who leave primary schools should know their “twelve times tables”.

But why 12? Until 1971 British (and Irish) coinage was divided into shillings of 12 pence (and 20 shillings made a pound). Then twelve times tables were of a practical use – and naturally enough that carried on for some time afterwards (I was taught my “times tables” in P3 in 1972/73 and 12 was very much part of the programme).

Latterly though, as I understand it, teaching has concentrated on 1 – 10 for what ought to be obvious reasons. So going back to 12 is plainly nothing more than a conservative politician’s attempt to appeal to nostalgic and reactionary notions that the world was better in the past and everything has gone downhill.

I can see no reason at all why learning 11 and 12 times tables actually improves mathematical understanding, as opposed to gaining an additional, but of limited use, practical skill. And if the idea is to be in the “top 5 for maths” then I’d suggest increasing children’s understanding of maths matters a bit more than their ability to know from memory what 11 x 12 is.

Of course, before someone objects, knowing your 12 times tables is useful: but then so is knowing the 13 times tables or the 60 times tables – so why stop at 12?

Indeed, in this age, if we are going to stop at somewhere above 10, shouldn’t it be a power of 2? Knowing the 16 times tables would be a big advantage for any would-be programmer.

It’s a slightly ridiculous idea to demand that our children are taught the 16 times tables, but no more or less than the 12 times tables – so I am putting it out there!

# Squashing the LRU bug

Just for once I did not rush to an online forum and say I had found a bug in a product – and I was right not too.

Having tried three different cross compiler toolchains I convinced myself that the issue was plainly neither compiler (or, to be more accurate in this case, assembler) output but some process in my code that was causing corruption. And sure enough I found that I was mangling the physical addresses of my page frames.

Thanks to the way OVPsim operates – by default it provides physical mappings on demand for the full 4GB address space of a 32 bit register – this mangling did not generate a page fault, but it did mean that for certain sequences of instructions – particularly those where lots of page faults were likely to occur, memory was being corrupted.

Changing one line of assembly – so that virtual address output was written to virtual address slot and not the physical address slot in my simple list of page table entries fixed that.

So now the code works – at least I think it does!

# Curiouser and curiouser – the case of the LRU bug

My LRU queue bug is continuing to puzzle me – and it’s not as simple as a data misalignment. In fact it does not appear to be a data misalignment issue at all: before I was trapping a lot of hardware exceptions under that header because it was a common fault when I got the code wrong, but a closer examination showed it to be an illegal opcode exception.

How that could be caused by the size of the local memory we were simulating was beyond me – but perhaps some code was being pushed out of alignment and an illegal instruction created, I thought.

But that does not appear to be the issue at all – in fact the really puzzling thing is that the exact same string of opcodes at the same addresses runs without a problem in the version with the functional memory sizes as with the “broken” memory sizes.

The only difference seems to be that when the broken code (ie the setup with the non $mod 4$ number of 4k memory pages) raises an illegal opcode exception, the good code raises a page fault.

It looks like it might be a bug in the simulator itself – and having written that I am hoping that the bad workman’s curse now befalls me and I quickly find it was all my fault to begin with. But as of now I am drawing a blank.