I have spent the last month and half or so writing a particular computer program to model how some code would run on a 16 core “network-on-chip” device.
There were a lot of problems to get over – for although I had already written a Groovy/Java program to do the same thing, that code just was not up to handling the additional level of complexity I wanted for this iteration, so the whole thing had to be redone in C/C++.
About three weeks ago I got to the point where the code compiled and it should have been doing what I wanted, but it was crashing with memory bugs, a lot.
Debugging multi-threaded Posix Threads code is not meant to be easy but I was able – after teaching myself the basics of the GNU debugger (GDB) – to make a lot of progress (having realised that the old programmer’s saw of “if God had wanted us to use debuggers, she wouldn’t have given us printf” does not raise many laughs when you are dealing with multi-threaded code).
I discovered that the bugs were in my four year old red-black tree code. I had used it before in many situations and never had a problem, but also realised that where I used it before (and where I am still using it now), I was continually adding nodes to the tree and not deleting them. The deletion code was wrong.
Most people write their red-black tree code having read Introduction to Algorithms – but here, at least, I was not like most people. I had written the code myself more or less from scratch, piecing together the various things I needed to do from a lot of different online sources. And I was very proud of that too.
But that also meant that, when I tried to fix the code by looking at the canonical description in “Introduction…” the match was poor and my model was, at a quite fundamental level, different (I used null pointers for the leaves, not the “guard node” approach advocated in the book.)
I thought I could get around the issue by not doing any rebalancing after a deletion – after all I was still inserting nodes even as I was deleting them and my insertion code seems (and still seems) quite robust.
But while that allowed by code to run quite a lot longer before falling over, it still fell over. I had to face the facts and use the Standard Template Library (STL) classes – map and set.
I started work on that on Friday on a long train journey. I thought it would be weeks of further chopping and changing. In fact it took little more than 48 hours and I am wondering why I put myself through such misery when robust and well-tested solutions were always to hand (I kept my code for one tree I used where I am only doing insertions – as converting that would have taken a further day, but that was the margin I was operating in – having contemplated weeks, I now worried it would take more than a few hours).
The new code seems pretty robust, but I am struggling to get it on the university’s compute server as that is now under such heavy load, and so I am thinking of adding an nCurses interface while I wait.
Admitting defeat was the quickest way, it seems, to victory. Something worth remembering, perhaps, when your software project is in trouble.