The agony and the ecstasy of debugging


If you have ever written a computer program with any degree of seriousness, then you will know the feeling: your heart sinking as you realise that what you thought was a perfectly good piece of code has a bug somewhere less than obvious.

In my case this has happened twice in a week, and both times it has meant that the work I had done as part of my PhD has had to start again (not all of it, obviously, but this most recent bit). Yesterday evening’s realisation was particularly annoying because it came after I had sent my supervisor an email suggesting I had some quite interesting and counter-intuitive results to share.

Since then I have spent quite a few hours trying to work out what on Earth was wrong. Debugging assembly is not complex in the sense that most instructions do simple things, but it does remind you of the essential state-machine nature of a general computing device: there are a lot of things to track.

Of course, that also brings pleasure – there is no point in denying that solving these problems is one of the truly engaging things about computing.

The job is done now and I am once again collecting results, hoping that I do not spot another flaw.

Squashing the LRU bug


Typical actions taken upon a virtual-physical address translation (Photo credit: Wikipedia)

Just for once I did not rush to an online forum and say I had found a bug in a product – and I was right not to.

Having tried three different cross-compiler toolchains, I convinced myself that the issue was plainly not the compiler (or, to be more accurate in this case, assembler) output but some process in my code that was causing corruption. And sure enough, I found that I was mangling the physical addresses of my page frames.

Thanks to the way OVPsim operates – by default it provides physical mappings on demand for the full 4GB address space of a 32-bit register – this mangling did not generate a page fault, but it did mean that for certain sequences of instructions – particularly those where lots of page faults were likely to occur – memory was being corrupted.

Changing one line of assembly – so that the virtual address was written to the virtual address slot, and not the physical address slot, in my simple list of page table entries – fixed that.
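
A minimal sketch of the kind of structure involved (purely illustrative – the real code is assembly, and the field and function names here are mine):

#include <stdint.h>

/* One entry in a simple, flat list of page table entries. */
struct pte {
        uint32_t vaddr;  /* virtual page address */
        uint32_t paddr;  /* physical frame address */
};

static void record_mapping(struct pte *entry, uint32_t vaddr, uint32_t paddr)
{
        /* The buggy line did the equivalent of entry->paddr = vaddr,
         * silently corrupting the physical addresses. */
        entry->vaddr = vaddr;
        entry->paddr = paddr;
}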

So now the code works – at least I think it does!

Curses on ncurses


gdb icon, created for the Open Icon Library (Photo credit: Wikipedia)

Every programmer will be familiar with something like this…

A little while back I wrote a program that simulates – crudely but effectively – a multicore NoC (network-on-chip) device. I use it to model the execution times of different page replacement algorithms.

The input is XML generated via a step-by-step trace of a working program. The actual instructions being traced do not matter – what I care about are the memory access patterns.

To allow me to test more models more quickly I have now written some R code that generates a semi-random access pattern based, very loosely indeed, on the patterns seen in the real program. The advantage is I can test against a set number of memory accesses but with a range of pseudo-random access patterns, so although I am not running models against the “real” access pattern, neither am I taking three weeks per experiment.
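
The generator itself is R code and is not reproduced here, but the idea, in rough outline, is something like this (a sketch in C with invented parameters and an invented record format – mostly sequential accesses, with the occasional jump to another page):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        unsigned int addr = 0x400000;   /* arbitrary starting address */
        long i;

        srand(42);                      /* fixed seed so a run is repeatable */
        for (i = 0; i < 1000000; i++) {
                if (rand() % 100 < 5)   /* a few per cent of accesses jump to a random page */
                        addr = (unsigned int)(rand() % 4096) * 4096u;
                else
                        addr += 4;      /* otherwise walk through memory sequentially */
                printf("<access address=\"0x%08x\" />\n", addr);
        }
        return 0;
}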

But when I used the artificially generated access patterns, my program crashed with a segfault. Even more confusingly, when I ran the code in GDB, the GNU debugger, it worked if I stepped through it, but if I simply ran it inside the debugger it crashed just as it did without the debugger.

After a few hours I realised why – in my artificial patterns, the first thing the first thread does is spawn all the other threads to be used. In real-world code, of course, these spawns take place after quite some code has been executed.

Every thread spawn causes the ncurses code I am using to update the screen. When using ‘real’ access patterns these updates take place comfortably after the ncurses environment has been set up (by a separate thread), but with the artificial patterns the thread updates are the first thing posted to the screen, before ncurses has even been set up – hence the crash.

If I step through the code then the ncurses thread runs ahead and sets up the screen before I hit the thread update code and again it works.

The solution? Use a condition variable and a mutex to ensure that nothing executes before the ncurses environment is fully established.
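
In outline the fix looks something like this (a sketch, not the simulator’s actual code – the flag and function names are mine):

#include <ncurses.h>
#include <pthread.h>

static pthread_mutex_t ui_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ui_ready_cond = PTHREAD_COND_INITIALIZER;
static int ui_ready = 0;

/* The thread that owns the screen: set up ncurses, then tell everyone else. */
static void *ui_thread(void *arg)
{
        (void)arg;
        initscr();
        cbreak();
        noecho();

        pthread_mutex_lock(&ui_lock);
        ui_ready = 1;
        pthread_cond_broadcast(&ui_ready_cond);
        pthread_mutex_unlock(&ui_lock);

        /* ... normal display loop goes here ... */
        return NULL;
}

/* Called by every worker thread before its first screen update. */
static void wait_for_ui(void)
{
        pthread_mutex_lock(&ui_lock);
        while (!ui_ready)               /* loop guards against spurious wake-ups */
                pthread_cond_wait(&ui_ready_cond, &ui_lock);
        pthread_mutex_unlock(&ui_lock);
}

static void *worker_thread(void *arg)
{
        (void)arg;
        wait_for_ui();                  /* block until the screen exists */
        /* ... only now is it safe to post ncurses updates ... */
        return NULL;
}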

Not a big deal – but perhaps, at some point in the future, someone struggling to understand why their code – which previously worked so well – has stopped processing what seems to be well-formed input will stumble across this. Hope it helps!

Trying to stay calm


MDB debugger screen shot (Photo credit: Wikipedia)

Debugging can be deeply frustrating. Not knowing how the debugger works can make it worse.

Still, there are always books (ordered today) and the deeper assurance that, in computing, nothing is truly random – your code breaks for a reason: find the reason and you fix the code.

At least that is what I am telling myself.


Going on a bug(?) hunt


I now have some code that is meant to parse an XML file of approximately 5 billion lines.

Flow of data in a typical parser (Photo credit: Wikipedia)

Unfortunately it fails, every time (it seems), on line 4,295,025,275.

This is something of a nightmare to debug – but it looks like an overflow bug (in the xerces-c parser) of some sort.
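
For what it is worth, 4,295,025,275 is only some 58,000 past 2^32 = 4,294,967,296, which is at least consistent with a 32-bit counter somewhere having wrapped around. The effect, in miniature:

#include <stdio.h>

int main(void)
{
        /* On a platform where unsigned int is 32 bits wide... */
        unsigned int counter = 4294967295u;     /* UINT_MAX */

        printf("before: %u\n", counter);
        counter++;                              /* wraps around to 0 */
        printf("after:  %u\n", counter);
        return 0;
}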

Do I try to find it by inspection or by repeating the runs (it takes about 4 – 5 hours to get to the bug point)?

Finding it by inspection is probably quite difficult, but simple to organise. Repeating the runs is (relatively) easier – just step through the code – but perhaps impossible to organise: how many weeks of wall-clock time in a debugger before we get to that instance?

How to get a job as a developer



Last night I went to a Birkbeck training session for prospective mentors. I did not realise before I turned up that all, or almost all, the would-be mentors would be MSc Computer Science graduates.

In the end that fact alone turned what could have been a pretty dull way to spend a Friday night into something quite interesting – I don’t get to talk to developers very often at all, and now I was in a room full of them.

And one of them – a chief executive of a start-up with a fascinating back-story (but he didn’t say ‘put that on your blog’, so I won’t) – told me what he regards as the best way for a would-be developer to get their breakthrough job: go to GitHub, find a high-profile project from a commercial outfit (he suggested the Chrome browser from Google) and fix a few bugs.

His claim was that he knew several people – including two with jobs at Google – who had got work in this way. I have no reason to think he was doing anything other than telling the truth.

Interestingly, he was pretty surprised when I talked about the poor employment record of computer science graduates – there plainly is some sort of disconnect between the firms recruiting (who say they struggle to fill jobs) and the graduates (who struggle to get recruited).

Taking down the bug


Finally nailed the bug in my readdir function.

The code was stuck in an endless loop because it did not return an empty dirent when it reached the end of the directory – instead returning the list of files in the directory over and over again.

To fix this I made sure that any subsequent scan of the directory began where the previous one left off, so that a scan past the last entry returns an empty dirent. Problem solved.
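
In user-space terms the idea looks something like this (a sketch only, not the kernel code itself – the names are mine):

#include <stddef.h>

#define MAX_RECORDS 200                 /* a standard VMU indexes up to 200 files */

struct vmu_dir {
        const char *names[MAX_RECORDS]; /* NULL marks an unused slot */
        size_t pos;                     /* where the previous scan stopped */
};

/* Hand back one entry per call, resuming from the stored position;
 * return NULL - the "empty" result - once the end is reached,
 * rather than starting again from the top. */
static const char *next_entry(struct vmu_dir *dir)
{
        while (dir->pos < MAX_RECORDS) {
                const char *name = dir->names[dir->pos++];
                if (name)
                        return name;
        }
        return NULL;
}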

A question for a C guru


The debugging continues, but unfortunately I have not yet been able to identify what is fundamentally broken with my file system code.

But along the way I have spotted and fixed various bugs. Here is one – I am not even sure why it compiled in the first place, so maybe a C guru could tell me…

Was:

if le16_to_cpu(((u16 *) bh->b_data)[j * VMU_DIR_RECORD_LEN16 + VMUFAT_FIRSTBLOCK_OFFSET16] == ino)

Now:

if (le16_to_cpu(((u16 *) bh->b_data)[j * VMU_DIR_RECORD_LEN16 + VMUFAT_FIRSTBLOCK_OFFSET16]) == ino)
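
My best guess (and it is a guess): le16_to_cpu is a function-like macro, so once it is expanded its own parentheses are what the if statement sees – the broken line is syntactically valid, it just applies the conversion to the 0-or-1 result of the comparison rather than to the data being compared. A stripped-down illustration, with a stand-in macro rather than the real kernel definition:

#include <stdio.h>

/* Stand-in only: on a little-endian build the real le16_to_cpu is,
 * in effect, also just a parenthesised cast. */
#define le16_to_cpu(x) ((unsigned short)(x))

int main(void)
{
        unsigned short raw = 5;         /* pretend this came from bh->b_data */
        unsigned long ino = 5;

        /* The "broken" form: the macro's parentheses double as the
         * parentheses the if statement needs, so it compiles - but the
         * conversion is applied to the result of the comparison. */
        if le16_to_cpu(raw == ino)
                puts("compiles, converts the comparison result");

        /* The intended form: convert the data, then compare. */
        if (le16_to_cpu(raw) == ino)
                puts("converts the data, then compares");

        return 0;
}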

One of those debugging nights


An NTSC Sega Dreamcast console (Image via Wikipedia)

It must be a situation every programmer is familiar with – you write some code, you are pretty pleased with it, but it just doesn’t work, and while you know it is only a small thing that is stopping it from functioning properly, you just cannot track it down.

That is exactly where I am with my code for the “VMUFAT” file system – the file system for the Sega Dreamcast VMU. After a decade of trying (in fits and starts), I now have a code base which I think is good enough, or almost good enough, to post to LKML without me being burnt to a crisp by the subsequent flames.

But it just doesn’t work! Trying to list the files on the device gets a list of 200 copies of the same file (200 is the maximum number of files that a standard VMU can index in its directory blocks, and the code throws an error when you exceed that).

The thing is, the old, ugly code worked at this. But I cannot see what I am doing wrong with the new, clean code. Indeed, it seems to replicate exactly the same functionality. Last night I quickly picked up a couple of bugs in the code but, whilst they addressed one error, they did not fix the big one.

Time for a rotary debugger? Well, it will be a roti debugger in this house, this Friday evening.