Command and Control is not a piece of light reading – in any sense. But it is an absolutely essential book.
It tells the story of the United States' nuclear weapons programme from the Manhattan Project to the present day, with an emphasis on safety management (foregrounding the story of a particular accident in a Titan II missile silo in 1980).
Finishing it, you are left wondering why you are here at all – because it is surely more by luck than design that civilisation has managed to survive the nuclear age, particularly the forty-five years of the Cold War when, more or less, fundamentally unsafe weapons were handed out willy-nilly to military personnel who were not even vetted for mental illness.
We read of how politicians – Eisenhower, Kennedy, Nixon, Carter – all tried (to varying degrees – Eisenhower comes off worst, as a fundamentally weak man) to get some sort of grip on the nuclear colossus, and all essentially capitulated to a military more interested in ensuring its weapons would work when needed than in ensuring they were safe when not.
The good news is that the book has a relatively happy ending: the end of the Cold War, and the persistent efforts of a few scientists and engineers deep within the US nuclear weapons programme, eventually led to safety being given a greater priority. The chance of an accidental nuclear war is probably lower now than it has ever been – but it is not zero.
The book, perforce, does not give us much insight into the Soviet (or Chinese, or indeed French, British, Indian, Israeli or Pakistani) nuclear programmes – was the Soviet programme safer because state control was so much stricter (the fear of Bonapartism), or more dangerous because the Soviets were always running to catch up? The book suggests both at different points.
It’s brilliantly written too – so if you want a bit of chill to match the summer sun in your holiday reading, I do recommend it.
Every programmer will be familiar with something like this…
A little while back I wrote a program that simulates – crudely but effectively – a multicore NoC (Network-on-Chip) device. I use it to model the execution times of different page replacement algorithms.
The input is XML generated via a step-by-step trace of a working program. The actual instructions being traced do not matter – what I care about are the memory access patterns.
To allow me to test more models more quickly I have now written some R code that generates a semi-random access pattern based, very loosely indeed, on the patterns seen in the real program. The advantage is I can test against a set number of memory accesses but with a range of pseudo-random access patterns, so although I am not running models against the “real” access pattern, neither am I taking three weeks per experiment.
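For illustration, here is a minimal sketch of the kind of semi-random generator I mean – in Python rather than the R I actually used, and with invented parameter names: with some probability each reference revisits a recently used page, otherwise it jumps somewhere new.

```python
import random

def synthetic_trace(n_refs, n_pages, locality=0.9, seed=1):
    """Generate n_refs page references: with probability `locality`
    re-reference a recently used page, otherwise jump at random.
    (Illustrative only -- the real generator is tuned, very loosely,
    to the patterns observed in the benchmark.)"""
    rng = random.Random(seed)            # seeded, so runs are repeatable
    recent = [rng.randrange(n_pages)]
    trace = []
    for _ in range(n_refs):
        if rng.random() < locality:
            page = rng.choice(recent[-8:])   # revisit the working set
        else:
            page = rng.randrange(n_pages)    # random jump elsewhere
        trace.append(page)
        recent.append(page)
    return trace
```

Fixing the seed gives a repeatable experiment, while varying it gives the range of pseudo-random patterns described above.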
But when I used the artificially generated access patterns, my program crashed with a segfault. Even more confusingly, when I ran the code in GDB, the GNU Debugger, it worked if I stepped through it – but if I simply ran it under the debugger, it crashed just as it did outside the debugger.
After a few hours I realised why – in my artificial patterns, the first thing the first thread does is spawn all the other threads to be used. In real world code, of course, these spawns take place after quite some code has been executed.
Every thread spawn causes the ncurses code I am using to update the screen. When using ‘real’ access patterns these updates take place comfortably after the ncurses environment has been set up (by a separate thread), but with the artificial patterns the thread updates are the first thing posted to the screen, even before ncurses has been set up – hence the crash.
If I step through the code then the ncurses thread runs ahead and sets up the screen before I hit the thread update code and again it works.
The solution? Use a condition variable and a mutex to ensure that nothing executes before the ncurses environment is fully established.
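The same pattern in miniature – sketched here with Python's threading module rather than the C/pthreads of the real program, and with a stand-in for the ncurses setup: the spawned threads block on a condition variable until the setup thread signals that the screen exists.

```python
import threading

ready = False                       # has the "screen" been set up yet?
cond = threading.Condition()
screen_lines = []                   # stand-in for ncurses screen updates

def setup_screen():
    global ready
    # ... the real code initialises the ncurses environment here ...
    with cond:
        ready = True
        cond.notify_all()           # wake every thread waiting to draw

def worker(n):
    with cond:
        while not ready:            # loop guards against spurious wake-ups
            cond.wait()             # releases the lock while sleeping
        screen_lines.append(f"thread {n} spawned")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()                       # workers now block on the condition
setup_screen()                      # only after this do updates happen
for t in threads:
    t.join()
print(len(screen_lines))
```

The while-loop around `wait()` matters: condition variables (in pthreads just as here) may wake spuriously, so the predicate must be re-checked under the lock.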
Not a big deal – but perhaps, at some point in the future, someone will be struggling to understand why their code – which previously worked so well – has stopped processing what seems to be well-formed input. Hope this helps!
I have just managed to write my first working program (as opposed to a few fragments of code to produce some charts) in R.
It turned out to be harder than I expected: years with BASIC, C, C++, Perl etc. meant I made assumptions about the atomic nature of numbers that R’s vector model choked on, and it took me some time to work that out.
I am trying to execute some self-written pieces of software that require a lot of wall clock time – around three weeks.
I run them on the University of York’s compute server, which is rebooted on the first Tuesday of every month, so the window for the software is limited: I have until about 7am on 5 August before the next reboot.
To add to the complication, the server runs Kerberos, which does not seem to play well with the screen/NFS combination I am using.
And I keep killing the applications in error. This time, just half an hour ago, I assumed I was on a dead terminal session (i.e. an ssh login which had long since expired) and pressed ctrl-C, only to discover, to my horror, that it was a live screen (it had not responded to ctrl-A, ctrl-A for whatever reason).
Time to add a signal handler to catch ctrl-C to at least give me the option of changing my mind!
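The shape of that handler – sketched in Python's signal module rather than the C of the real program, and with the interactive "are you sure?" prompt replaced by a simple record so the sketch is self-contained:

```python
import os
import signal

interrupted = []

def on_sigint(signum, frame):
    # The real handler would ask "really quit? (y/n)" before exiting;
    # here we just note the interrupt and carry on running.
    interrupted.append(signum)

signal.signal(signal.SIGINT, on_sigint)

# Simulate a stray ctrl-C arriving at the process.
os.kill(os.getpid(), signal.SIGINT)
print("still running after ctrl-C")
```

The point is simply that once SIGINT has a handler installed, a stray ctrl-C no longer kills the process outright – the program gets a chance to ask before dying.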
This is a blog post where I am going to write about things as a way of clarifying, in my own mind, what the best way of tackling a problem is.
So far, in research for my PhD, I have concentrated on establishing some base points for potential performance of Network-on-Chip systems running multithreaded code.
Nearly nine months ago I hacked at Valgrind’s Lackey tool to ensure it produced XML output recording every memory reference made by a piece of code running under it. This was really basic stuff – Lackey recognises four primitives: code for code-in-execution, and load, store and modify (a combined load and store) for read-write memory. So typically you get blocks of code followed by some read-write records and then some more code. I don’t know what the operands are, just the primitive type, the address and the size of the piece of memory being used (whether for code or read-write operations).
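For concreteness, a stream of such records might look something like this – the element and attribute names here are invented for illustration, but each record carries exactly this information:

```xml
<!-- hypothetical illustration: names are mine, not the tool's -->
<code   address="0x4000b2f0" size="3"/>
<load   address="0x7ff21a40" size="8"/>
<store  address="0x7ff21a48" size="8"/>
<modify address="0x7ff21a50" size="4"/>
```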
I then used that to record the output of one of the Parsec parallel benchmarks – a 16-thread (in fact it executes 18 threads) piece of video transcoding. In the real world this ran in seconds; under Lackey it took a few days and output about 200GB of XML.
That XML can then be broken down into thread-specific strands of execution – 18 of these in all, of various sizes, but all of the order of several GB at least.
These are then plugged into some sort of simulator. The basic hardware model being simulated has remained the same throughout (mostly – I did fiddle with it a bit a while back but decided that wasn’t worth it). So we have 16 cores sharing a flat 512kB memory space (this is very loosely based on the Parallella system, but it is not meant to be a simulation of it). There is no caching and no sense that any part of the memory is further from one core than another.
What does alter is the page replacement algorithm used. I first tried FIFO: the code ran for many weeks and completed in about 60 billion simulated ticks. If a memory reference is to a page already in the 512kB it is deemed to take 1 tick to complete; if the reference is to a page not in memory (a 4kB page size has been used thus far), it takes 100 ticks per 16-byte line to load (25,600 ticks for a whole 4kB page) – and plainly we have to decide which page gets evicted if our 512kB store is already full.
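As a concrete sketch of that cost model – in Python for brevity (the real simulator is a separate program, and the function and constant names here are mine), with the constants taken from the figures above:

```python
from collections import deque

PAGE = 4096                       # 4kB pages, as in the experiments
FRAMES = 512 * 1024 // PAGE       # 128 frames fit in the 512kB space
HIT_COST = 1                      # 1 tick for a resident page
MISS_COST = 100 * (PAGE // 16)    # 100 ticks per 16-byte line = 25,600

def fifo_cost(addresses):
    """Total simulated ticks for a trace under FIFO replacement."""
    resident = set()
    queue = deque()               # arrival order, oldest page at the left
    ticks = 0
    for addr in addresses:
        page = addr // PAGE
        if page in resident:
            ticks += HIT_COST
        else:
            ticks += MISS_COST
            if len(resident) == FRAMES:
                resident.discard(queue.popleft())  # evict the oldest
            resident.add(page)
            queue.append(page)
    return ticks

# Two references to the same page: one fault plus one hit.
print(fifo_cost([0, 8]))  # 25600 + 1 = 25601
```

A single fault costing 25,600 hits makes it obvious why the fault count, rather than the hit count, dominates the simulated run time.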
Messing about with various LRU models showed that a two-queue LRU gave slightly better performance than a simple single LRU queue, completing in around 50 billion ticks (and two weeks or so of running).
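There are several ways to build a "two-queue" LRU; one common shape – a sketch of the idea only, and the variant I used may differ in detail – puts new pages in a FIFO "probation" queue and promotes a page to a protected LRU queue on a second hit, so one-touch pages cannot flush the hot set:

```python
from collections import OrderedDict

def two_queue_faults(trace, frames):
    """Fault count under a 2Q-style policy (an illustrative sketch)."""
    probation = OrderedDict()   # FIFO: insertion order, oldest first
    protected = OrderedDict()   # LRU: most recently used at the end
    faults = 0
    for page in trace:
        if page in protected:
            protected.move_to_end(page)        # refresh recency
        elif page in probation:
            del probation[page]                # second hit: promote
            protected[page] = True
        else:
            faults += 1
            if len(probation) + len(protected) == frames:
                if probation:
                    probation.popitem(last=False)   # oldest probationer
                else:
                    protected.popitem(last=False)   # least recently used
            probation[page] = True
    return faults

print(two_queue_faults([1, 2, 1, 3], 2))  # 3 faults
```

Here page 1 survives the arrival of page 3 because its second reference promoted it to the protected queue, which is exactly the behaviour a single LRU queue cannot guarantee under a stream of one-touch pages.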
I then built – more or less starting from scratch – a version of the simulator that modelled Belady’s OPT. That required some very large binary trees to be used – along with 250GB of RAM – and completed in about 22 billion ticks (and about three weeks in wall clock time).
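OPT is unrealisable in a real memory manager (it needs to see the future) but is straightforward offline: evict the resident page whose next use lies farthest ahead, or that is never used again. A minimal sketch of that rule – ignoring the tick accounting and the large-scale data structures the real run needed:

```python
def opt_faults(trace, frames):
    """Fault count under Belady's OPT (offline, illustrative sketch)."""
    # Precompute, for each position, where that page is next referenced.
    next_use = [0] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float('inf'))
        last_seen[trace[i]] = i
    resident = {}           # page -> position of its next reference
    faults = 0
    for i, page in enumerate(trace):
        if page in resident:
            resident[page] = next_use[i]
        else:
            faults += 1
            if len(resident) == frames:
                victim = max(resident, key=resident.get)  # farthest ahead
                del resident[victim]
            resident[page] = next_use[i]
    return faults
```

Because no policy can fault less than OPT, its 22 billion ticks serve as the lower bound against which FIFO's 60 billion and the two-queue LRU's 50 billion can be judged.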
All these models showed one piece of common behaviour – thrashing, as demonstrated by the fact that adding cores to the execution did not increase the amount of code being executed: instead each individual core had to handle more page faults as the cores competed for the small pool of memory.
I now have two pieces of code running which aim to measure the (in)efficiency of these “traditional” paging approaches – come back in a few weeks to see what they show.
So, while they run I need to get on to the next stage, which is testing some alternative approaches. But I have a problem – I cannot wait three weeks for each experiment to run. There simply is not any time for that.
The alternatives boil down to chopping up sections of my current XML from the benchmark, or writing a traffic generator.
The traffic generator idea has a lot to be said for it – my supervisor is certainly in favour – but it is not without weaknesses: the degree of locality between the different threads executing really matters. A lot of locality and the fault count falls and code gets executed fast; poor locality and the fault count rockets.
But how do I model that? I am not sure there is any literature out there that discusses this problem in much detail – multithreading is hard, and for that reason rational people avoid it!
But using the chopped-up XML is not without issues either – it’s inflexible, and it elevates one instance of executing code into a model for all executing code, so it is just as vulnerable to the question of locality.