The agony and the ecstasy of debugging

If you have ever written a computer program with any degree of seriousness then you will know the feeling: your heart sinking as you realise what you thought was a perfectly good piece of code has a bug somewhere less than obvious.

In my case this has happened twice in a week and both times has meant the work I had done as part of my PhD has had to start again (not all of it, obviously, but this, most recent bit). Yesterday evening’s realisation was particularly annoying because it came after I had sent my supervisor an email suggesting I had some quite interesting and counter-intuitive results to share.

Since then I had spent quite a few hours trying to work out what on Earth was wrong – debugging assembly is not complex in the sense that most instructions do simple things – but it also reminds you of the essential state-machine-nature of a general computing device: there are lots of things to track.

Of course, that also brings pleasure – there is no point in denying that solving these problem is one of the truly engaging things about computing.

Job is done now and I am once again collecting results and hoping that I do not spot another flaw.

Chase down that bug

Français :
(Photo credit: Wikipedia)

If there are rules for software development, one of them should be never let a bug go unsquashed.

This has been demonstrated to me again this week – when I had a bug in some Microblaze interrupt code. I realised I no longer needed to use the code and then puzzled over whether to find out what was wrong anyway.

I got some sound advice:

And I am very glad I acted on it – not least because it seems to me my problem was actually caused by a small bug in the OVP model for the Microblaze (the simulator model allows interrupts to be executed while an exception is in progress, but a real processor would not) and, hopefully, tracking down the bug there will benefits others in the future.

LRU queue strangeness

Prinzipdarstellung der Arbeitsweise einer MMU
(Photo credit: Wikipedia)

For the last week or so I have been writing and then debugging (and so mainly debugging) a least-recently-used (LRU) page replacement system on my Microblaze simulation.

Perhaps I shouldn’t have bothered – I had a working first-in-first-out (FIFO) system after all. But no one seriously uses FIFO, so I had to write some LRU code.

I thought I had it working tonight – it ran through the target exercise in about 6 million instructions (as the MMU in a Microblaze is crude, memory gets loaded in and out ‘by hand’ and so the instruction count measures time/efficiency) when the system had 32 4k local pages and in about 10.5 million instructions when it had 24 4k pages available locally – all seems fine: less pages means more time is required.

But then things started to get weird – testing the system with 28 pages took about 9 million instructions, but when I tried to use 26 pages I had to break the code after it had executed 14 trillion instructions.

Indeed it seems to only work for a very limited number of page counts. How odd – though a typically debuggers story. A week in, finally thinking I’d cracked it when some really strange behaviour manifests itself.

Update: It appears to be an unaligned data exeception issue. Somewhere along the line a piece of code relies on the LRU queue to be a multiple of 4 in length would be my guess…

Sometimes, admitting defeat is the very best thing you can do

English: Screenshot of GDB, the debugger of th...
English: Screenshot of GDB, the debugger of the GNU Project. . (Photo credit: Wikipedia)

I have spent the last month and half or so writing a particular computer program to model how some code would run on a 16 core “network-on-chip” device.

There were a lot of problems to get over – for although I had already written a Groovy/Java program to do the same thing, that code just was not up to handling the additional level of complexity I wanted for this iteration, so the whole thing had to be redone in C/C++.

About three weeks ago I got to the point where the code compiled and it should have been doing what I wanted, but it was crashing with memory bugs, a lot.

Debugging multi-threaded Posix Threads code is not meant to be easy but I was able – after teaching myself the basics of the GNU debugger (GDB) – to make a lot of progress (having realised that the old programmer’s saw of “if God had wanted us to use debuggers, she wouldn’t have given us printf” does not raise many laughs when you are dealing with multi-threaded code).

I discovered that the bugs were in my four year old red-black tree code. I had used it before in many situations and never had a problem, but also realised that where I used it before (and where I am still using it now), I was continually adding nodes to the tree and not deleting them. The deletion code was wrong.

Most people write their red-black tree code having read Introduction to Algorithms – but here, at least, I was not like most people. I had written the code myself more or less from scratch, piecing together the various things I needed to do from a lot of different online sources. And I was very proud of that too.

But that also meant that, when I tried to fix the code by looking at the canonical description in “Introduction…” the match was poor and my model was, at a quite fundamental level, different (I used null pointers for the leaves, not the “guard node” approach advocated in the book.)

I thought I could get around the issue by not doing any rebalancing after a deletion – after all I was still inserting nodes even as I was deleting them and my insertion code seems (and still seems) quite robust.

But while that allowed by code to run quite a lot longer before falling over, it still fell over. I had to face the facts and use the Standard Template Library (STL) classes – map and set.

I started work on that on Friday on a long train journey. I thought it would be weeks of further chopping and changing. In fact it took little more than 48 hours and I am wondering why I put myself through such misery when robust and well-tested solutions were always to hand (I kept my code for one tree I used where I am only doing insertions – as converting that would have taken a further day, but that was the margin I was operating in – having contemplated weeks, I now worried it would take more than a few hours).

The new code seems pretty robust, but I am struggling to get it on the university’s compute server as that is now under such heavy load, and so I am thinking of adding an nCurses interface while I wait.

Admitting defeat was the quickest way, it seems, to victory. Something worth remembering, perhaps, when your software project is in trouble.

Enhanced by Zemanta

“Dreaming in Code” – a review

I did not actually read Dreaming in Code: Two Dozen Programmers, Three Years, 4,732 Bugs, and One Quest for Transcendent Software – I listened to it as I pounded treadmills and pulled cross-trainers and so on in the gym.

1-2-3 boingmash... Mark Frauenfelder, Xeni Jar...
1-2-3 boingmash… Mark Frauenfelder, Xeni Jardin, Cory Doctorow and Mitch Kapor. (Photo credit: Wikipedia)

That ought to be a giveaway that it doesn’t actually contain any code or maths or anything else that might require some dedicated concentration, but that does not mean it is not worth reading (or listening to) if you are a programmer, manage programmers or in some way are responsible for the development or purchase of software (it is plain that few or no people at the DWP have read this book – given their travails over the “Universal Credit” project – someone should stick a copy in each minister’s red box pronto).

I have never worked as a professional software developer – though I have written code for money – but still I found this book had a lot of insight, and even manages to explain things such as the halting problem and infinite recursion in a way that non computer scientists are likely to grasp without boring those of us who know what these are.

The book is incomplete, though in that it was written before what looks like the final collapse of the project it describes – the Chandler PIM – in 2008/9 when founder Mitch Kapor withdrew. Chandler sounded like a great idea (like Universal Credit?) but, as the project drags on and on, one begins to wonder what on earth the development team were up to for most of the time they worked on it.

Well worth reading.

One problem about the audio though – I know the American fashion is to mispronounce French words and I don’t want to sound like a European prig (after all this book is about a vital technology in which the US leads the world) – but it goes too far when Albert Camus‘s name is repeatedly made to rhyme with bus!

How to get a job as a developer

Usage share of web browsers (excluding IE) acc...

Last night I went to a Birkbeck training session for prospective mentors. I did not realise before I turned up that all, or almost all, the would-be mentors would be MSc Computer Science graduates.

In the end that fact alone turned what could have been a pretty dull way to spend a Friday night into something quite interesting – I don’t get to talk to developers very often at all, and now I was in a room full of them.

And one of them – a chief executive of a start-up with a fascinating back-story (but he didn’t say ‘put that on your blog’, so I won’t) – told me what he regards as the best way for a would be developer to get their breakthrough job: go to github, find a high profile project from a commercial outfit (he suggested the Chrome browser from Google) and fix a few bugs.

His claim was that he knew several people – including two with jobs at Google – who had got work in this way. I have no reason to think he was doing anything other than telling the truth.

Interestingly, he was pretty surprised when I talked about the poor employment record of computer science graduates – there plainly is some sort of disconnect between the firms recruiting (who say they struggle to fill jobs) and the graduates (who struggle to get recruited).

Making a hash of universal credit

A hash algorithm, for computer scientists, is a way of turning one long string (some words, a number etc) into a shorter “hash code“.

Hash function
Hash function (Photo credit: Wikipedia)

Hashing is used in multiple ways – for instance to check that a file you have downloaded matches the one on the server (you hash the downloaded file data and check it matches the advertised hash on the server). This is much quicker than comparing the numbers byte by byte and, if a good hashing algorithm is chosen, the chances of the “collision” (in other words the false matching of two hashes) are low, especially for the common types of error, so you can be pretty sure a matched hash means the download is a good one.

It seems that the UK’s Department for Work and PensionsUniversal Credit” project relies extensively on hashing to check that their data on people’s incomes are correct. And the failure to match hashes is very high – suggesting some sort of fundamental failure in the system.

I have written about the enormous risk that the Universal Credit project represents – at a time of dwindling resource budgets the government is seeking to deliver an IT project that has a direct bearing on the lives of millions of the most financially vulnerable people to a very tight deadline and using a development method – “agile” – that has been sold as the answer to all the problems of government IT despite the fact most software development textbooks will tell you this is not what you should be using “agile” for.

It might still come off – certainly the department are refusing to admit that there is any possibility of failure – but I think the evidence that it is not ready is beginning to amass.

At least, though, the latest story, even if it is not encouraging in terms of the project’s overall chances of success, does show that the DWP are at least carrying out part of the “agile” function – testing the software. Previously there was next to no sign of it. But “agile” is also supposed to mean being open about progress, getting “stakeholder buy-in” (yes, I know it’s a horrible phrase, but it communicates what this is about) and getting lots of user feedback as the development process goes on. Where is any of that? The less we see of it, the more it looks like the department have got something to hide.

And, one final comment. Ruth Owen from HMRC appears to quote a much lower hash match failure rate – 5% instead of 25% – but 5% in the real world would mean literally millions of failures every year. A 5% failure rate would be as broken as a 25% failure rate.