## The agony and the ecstasy of debugging

If you have ever written a computer program with any degree of seriousness then you will know the feeling: your heart sinking as you realise what you thought was a perfectly good piece of code has a bug somewhere less than obvious.

In my case this has happened twice in a week, and both times it has meant the work I had done as part of my PhD has had to start again (not all of it, obviously, but this most recent bit). Yesterday evening’s realisation was particularly annoying because it came after I had sent my supervisor an email suggesting I had some quite interesting and counter-intuitive results to share.

Since then I have spent quite a few hours trying to work out what on Earth was wrong. Debugging assembly is not complex in the sense that most instructions do simple things, but it does remind you of the essential state-machine nature of a general computing device: there are lots of things to track.

Of course, that also brings pleasure – there is no point in denying that solving these problems is one of the truly engaging things about computing.

The job is done now and I am once again collecting results, hoping that I do not spot another flaw.

## Chase down that bug

If there are rules for software development, one of them should be never let a bug go unsquashed.

This has been demonstrated to me again this week – when I had a bug in some Microblaze interrupt code. I realised I no longer needed to use the code and then puzzled over whether to find out what was wrong anyway.

And I am very glad I acted on it – not least because it seems to me my problem was actually caused by a small bug in the OVP model for the Microblaze (the simulator model allows interrupts to be executed while an exception is in progress, but a real processor would not) and, hopefully, tracking down the bug there will benefit others in the future.

## Hard fault count drives performance

Just to emphasise how much hard faults determine performance – here is a plot of hard faults versus page count for the same application mentioned in the previous post.

The pattern is very similar, though it should be noted that increasing the page count does still keep the fault count coming down at the far end – just not by enough to outweigh the cost of having larger page tables to search through when checking for faults, finding the least recently used page and the like.

## Squashing the LRU bug

Just for once I did not rush to an online forum and say I had found a bug in a product – and I was right not to.

Having tried three different cross-compiler toolchains, I convinced myself that the issue was plainly not compiler (or, to be more accurate in this case, assembler) output but some process in my code that was causing corruption. And sure enough, I found that I was mangling the physical addresses of my page frames.

Thanks to the way OVPsim operates – by default it provides physical mappings on demand for the full 4GB address space of a 32-bit system – this mangling did not generate a page fault, but it did mean that for certain sequences of instructions – particularly those where lots of page faults were likely to occur – memory was being corrupted.

Changing one line of assembly – so that the virtual address was written to the virtual-address slot and not the physical-address slot in my simple list of page table entries – fixed that.

So now the code works – at least I think it does!

## Curiouser and curiouser – the case of the LRU bug

My LRU queue bug is continuing to puzzle me – and it’s not as simple as a data misalignment. In fact it does not appear to be a data misalignment issue at all: previously I had been trapping a lot of hardware exceptions under that heading because it was a common fault when I got the code wrong, but a closer examination showed it to be an illegal opcode exception.

How that could be caused by the size of the local memory we were simulating was beyond me – but perhaps some code was being pushed out of alignment and an illegal instruction created, I thought.

But that does not appear to be the issue at all – in fact the really puzzling thing is that the exact same string of opcodes, at the same addresses, runs without a problem in the version with the working memory sizes yet fails in the version with the “broken” memory sizes.

The only difference seems to be that where the broken code (i.e. the setup with a number of 4k memory pages that is not a multiple of 4) raises an illegal opcode exception, the good code raises a page fault.

It looks like it might be a bug in the simulator itself – and having written that I am hoping that the bad workman’s curse now befalls me and I quickly find it was all my fault to begin with. But as of now I am drawing a blank.

## LRU queue strangeness

For the last week or so I have been writing and then debugging (and so mainly debugging) a least-recently-used (LRU) page replacement system on my Microblaze simulation.

Perhaps I shouldn’t have bothered – I had a working first-in-first-out (FIFO) system after all. But no one seriously uses FIFO, so I had to write some LRU code.

I thought I had it working tonight: it ran through the target exercise in about 6 million instructions when the system had 32 4k local pages, and in about 10.5 million instructions when it had 24 4k pages available locally (as the MMU in a Microblaze is crude, memory gets loaded in and out ‘by hand’, so the instruction count measures time/efficiency). All seemed fine: fewer pages means more time is required.

But then things started to get weird – testing the system with 28 pages took about 9 million instructions, but when I tried to use 26 pages I had to break the code after it had executed 14 trillion instructions.

Indeed it seems to work for only a very limited number of page counts. How odd – though a typical debugger’s story: a week in, finally thinking I’d cracked it, when some really strange behaviour manifests itself.

Update: It appears to be an unaligned data exception issue. Somewhere along the line a piece of code relies on the LRU queue being a multiple of 4 in length would be my guess…

## Simulating global and local memory on OVP

Had a good meeting with my PhD supervisor today: he was in London – I didn’t have to make a flying visit to York.

So the next step with my OVPsim Microblaze code is to model global and local memory – by default OVPsim treats all memory as local, mapping the full 32-bit address space and sparsely filling it as needed. I have imposed an additional constraint of only mapping a few pages, but these are still all “local”: so while the code takes time to execute what is, in effect, a FIFO page replacement algorithm, there is no time cost for page loads.

The way round this seems to be to build my global memory as a memory-mapped peripheral device – I can then impose time delays on reads and writes.

## Picking which page to remove

I have now written some very basic but functional paging code for the Microblaze but have realised I am, essentially, going to have to start a lot of it again.

The Microblaze puts page tables firmly under software control and its memory management unit (MMU) concentrates on managing page references through its translation lookaside buffers (TLBs) instead. So a reference to a virtual address that is mapped in a TLB is handled invisibly to the user or programmer, but a reference to a memory address that may be mapped in a page table but not in the TLB generates an exception and everything is dumped into the lap of the programmer – this is what is often known as a “soft fault” (a “hard fault” being when there is no version of the page being referenced available in physical memory at all).

The number of TLBs is quite limited – just 64 – so to handle (or in my case, simulate), say, 512kB of physical memory through 4kB page frames you need to expect to handle quite a few soft as well as hard faults and so you need to take a view on which mappings get evicted from the TLB and – if you are referencing more than 512kB of program and data – which pages get evicted from the page table and, hence, physical memory.

This is where the difficulties start, because there is no simple way to know when a read through a TLB mapping takes place – the very point is that it is invisible (and hence quick) to the programmer or any software she or he may have written. The problem with that is that – for most situations – the only guide you have to future behaviour in terms of memory references is past behaviour: so it would be very helpful to know whether a page has been recently accessed (on the basis that if it was accessed recently then it will be accessed again soon).

The standard response to this sort of problem is some form of “CLOCK” algorithm – which was first implemented, I believe, for the Multics operating system.

Multics can be thought of as the estranged (and now late) father of Unix – the “Uni” being a play on the “Multi” of the earlier system – and both directly and through its offspring its influence on computers has been profound. One of our inheritances is CLOCK, some version of which is almost certainly running in the computer, phone or tablet on which you are reading this.

The principle of CLOCK is simple. A “clock hand” sweeps regularly through the page mappings marking them invalid; on a subsequent attempt to reuse a mapping the valid bit has to be reset (meaning the page has been used recently), while if a new mapping is needed we can throw out the first page in the list whose mapping is still marked invalid.

And this, or some version of it, is what I am now going to have to implement for the Microblaze. The obvious thing to do is to have some sort of timer interrupt drive the clock hand – though I am not even sure the Microblaze has a timer interrupt available (I’m guessing it doesn’t, as it would expect those to come from the board), so this could be tricky!

## Why I owe @slugonamission a (or several) large drink(s)

I have been struggling with a problem with my Microblaze simulation on OVPsim all week.

I have been trying to make a start on implementing a simple demand paging system and to trigger the hardware exception that the Microblaze CPU manual says should happen when you either try to read from a mapped page that is not in the TLB (ie, a “minor” page fault) or has not been mapped at all (a “major” page fault).

My problem was that – despite having, I thought, specified all the correct parameters for the Microblaze, and knowing that my virtual memory mappings worked well – it simply was not happening for me. Instead of the expected exception handling, the simulation reported an error and exited.

But Jamie Garside from the University of York’s computer science department saved my life by (correctly) pointing out that what I also needed to do was turn on `ICM_ATTR_SIMEX`. Otherwise the simulation will always exit on an exception (or halt in GDB).

I can see why that might make sense – at least the halting in the debugger bit: if an exception is about to be fired you can halt the machine while it is still executing in virtual, as opposed to real, mode and see what has caused it.

It was also a reminder to RTFM – in this case not just the Microblaze manual, but also the OVPsim manual. I cannot rely on Jamie doing that for me every time.

So to fix my problem I ORed in the missing attribute:

`#define SIM_ATTRS (ICM_ATTR_DEFAULT|ICM_ATTR_SIMEX)`

## Latest Microblaze puzzle

Having successfully written a couple of simple virtual memory mappings I now want to write the exception handling code that will allow me to swap mappings in and out – so I did this:

```
ori	r20, r0, 0
ori	r12, r0, 0x1000000
/* next line should break it */
sw	r20, r12, r0
```

I expected that the marked line would raise a hardware exception as a TLB miss and I’d be taken to the appropriate vector etc. Instead it just tells me I’ve had a SIGSEGV.

Hmmm.