Flirting with disaster

A week ago, in the middle of the Christmas holidays, I contemplated how I could tie all the bits I had written for my PhD thesis into a coherent, seamlessly argued first draft.

I even bought a book about completing a PhD. In the end I decided I needed to add one more set of experimental data – this time examining how highly parallel code might work in a many-core environment.

The results I had were over two years old and the software simulation I had used didn’t quite match the later code (which was a development of the earlier version), but I thought that if I made that clear there wouldn’t be a problem.

But a quick test of the model showed that what I had produced before was just too flaky, especially once I decided that what I really had to do was compare the potential results of a system with 256 cores against a system with, say, 64 cores. The system with the lower number of cores cannot tackle the whole problem at once, but maybe the combination of faster cores, shorter queues for memory and less idle time (as processors will have to be set to tackle the unsolved parts of the problem) might still deliver better performance. I don’t know the answer – yet – to this “Amdahl’s Law” related question.
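For anyone unfamiliar with the reference, Amdahl’s Law says the speedup on n cores is 1 / ((1 - p) + p/n), where p is the fraction of the work that can run in parallel. The little sketch below just evaluates that formula at 64 and 256 cores for a few illustrative values of p – the numbers are purely hypothetical and have nothing to do with the simulator’s data.

```c
#include <stdio.h>

/* Classical Amdahl's Law: speedup on n cores when a fraction p of the
   work is parallelisable and the remaining (1 - p) is strictly serial.
   Purely illustrative; none of this comes from the thesis simulator. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void)
{
    const double fractions[] = { 0.90, 0.95, 0.99 };   /* hypothetical parallel fractions */
    for (size_t i = 0; i < sizeof fractions / sizeof fractions[0]; i++) {
        double p = fractions[i];
        printf("p = %.2f:  64 cores -> %5.1fx,  256 cores -> %5.1fx\n",
               p, amdahl_speedup(p, 64), amdahl_speedup(p, 256));
    }
    return 0;
}
```

Of course this simple formula leaves out exactly the effects the experiments are meant to measure – memory queueing and idle time – which is why the question still needs real data.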

So work was needed on the code, but try as I might I couldn’t get it to work.

Eventually I tracked the issue down to pages of memory being incorrectly marked as “read only” when they should have been “read-write”. Read-only pages stored locally don’t get written back to main memory, and when you are using shared memory to communicate that’s a problem – you write your messages locally but they never get delivered.
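To make that concrete, here is a minimal sketch of the kind of write-back logic involved. The names and structures are invented for illustration – they are not taken from the simulator’s source.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PTE_PRESENT   0x1u
#define PTE_READWRITE 0x2u            /* bit clear means the page is read-only */

typedef struct {
    uint32_t flags;                   /* permission bits for the page */
    char     local[64];               /* locally cached copy of the page */
} pte_t;

/* Only writable pages are ever flushed back to main memory. If a page
   is wrongly marked read-only, anything written into the local copy,
   e.g. a message meant for another core, is silently lost. */
void write_back(const pte_t *pte, char *main_memory)
{
    if (pte->flags & PTE_READWRITE)
        memcpy(main_memory, pte->local, sizeof pte->local);
}

int main(void)
{
    char main_memory[64] = "no message";
    pte_t pte = { .flags = PTE_PRESENT, .local = "hello, core 1" };   /* wrongly read-only */

    write_back(&pte, main_memory);
    printf("main memory sees: \"%s\"\n", main_memory);   /* message never arrives */

    pte.flags |= PTE_READWRITE;                          /* correctly marked read-write */
    write_back(&pte, main_memory);
    printf("main memory sees: \"%s\"\n", main_memory);   /* now it is delivered */
    return 0;
}
```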

I thought I had found the bad code – a wayward #define – and a small test seemed to show that fixing it did indeed solve the problem.

Except it didn’t, and I realised last night that another piece of code was at fault: it was not creating “read only” page table entries in error (as the first piece of bad code did) but was converting “read write” page table entries to read only. It was meant to do the opposite – that is, if you wrote to a read-only page, the page table entry should have been updated to read-write – but sloppy coding meant it was routinely converting “read write” entries to “read only” instead.
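Purely as an illustration (the function and flag names below are invented, not the simulator’s), the two behaviours look something like this – the intended handler sets the write-permission bit on a write fault, while the sloppy version clears it:

```c
#include <stdint.h>
#include <stdio.h>

#define PTE_READWRITE 0x2u        /* invented flag name, for illustration only */

/* What the handler was meant to do: a write to a read-only page should
   upgrade the page table entry to read-write. */
uint32_t upgrade_on_write(uint32_t pte_flags)
{
    return pte_flags | PTE_READWRITE;      /* grant write permission */
}

/* The sloppy version: instead of setting the bit it clears it, quietly
   downgrading perfectly good read-write entries to read-only. */
uint32_t upgrade_on_write_buggy(uint32_t pte_flags)
{
    return pte_flags & ~PTE_READWRITE;     /* revokes write permission instead */
}

int main(void)
{
    uint32_t rw_entry = PTE_READWRITE;     /* start with a read-write entry */
    printf("intended handler leaves entry writable: %s\n",
           (upgrade_on_write(rw_entry) & PTE_READWRITE) ? "yes" : "no");
    printf("sloppy handler leaves entry writable:   %s\n",
           (upgrade_on_write_buggy(rw_entry) & PTE_READWRITE) ? "yes" : "no");
    return 0;
}
```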

And the really awful thing was that – unlike the bad #define – this was a piece of code that had been inherited by every subsequent iteration of my simulator.

In other words, an awful lot of results I had thought were definitive were wrong. Not wrong at such a fundamental level that I had just discovered I’d wasted several years of my life (I recently read a horror story about a chemistry PhD who wrote a whole thesis only to discover that his or her “interesting result” was caused by accessing unallocated memory in Fortran 77) – but enough to mean the numbers are wrong.

My initial reaction was panic. Today it’s more stoical – I now have 11 simulations running to get new data, and I probably need about the same number again to fill in all the gaps.

The only positive is that it has all come so late in the process that I am confident in handling and interpreting the data that will come out of the fixed experiments.

But the downside is that I am running out of time, and these experiments quite literally take weeks to run – and if there is a hitch with the university IT setup I go straight back to square one.

Wish me luck.


One comment

  1. You have my sympathies. Been there, done that (although not with the stakes quite so high). Hopefully time will dilate as needed.
