Progress is not the only option


The global pandemic of covid-19 is, in its way, a triumph for the scientific method: scientists warned for a long time of the danger of a pandemic caused by a novel virus and so it has come to pass.

But in the crisis we shouldn’t forget all the other issues science warns us about – and here’s something else to cheer you up: even a ‘limited’ nuclear war in parts of the world that are (for Europeans and Americans) far off could cause a decade of starvation.

The concept of a nuclear winter isn’t a new one – and if you’ve ever watched Threads you are unlikely to be under any illusions about just how devastating the climate collapse that would follow a full-scale nuclear exchange would be.

But even a ‘limited’ nuclear exchange between India and Pakistan – two countries which have fought full-scale wars three times since independence in 1947 and where incidents of military conflict are frequent – would be devastating to global food supplies, according to a new study published in the Proceedings of the National Academy of Sciences in the US.

“A regional nuclear conflict would compromise global food security” is based on a scenario of 100 strikes of 15 kilotonnes each (i.e. similar in yield and number to two British Vanguard-class submarines firing off all of their missiles). The authors estimate that the soot from the resulting fires would lower the global temperature by 1.8°C, and that this would do much more damage than 1.8°C of warming caused by carbon dioxide, because the carbon dioxide would also encourage plant growth.

Their abstract reads:

A limited nuclear war between India and Pakistan could ignite fires large enough to emit more than 5 Tg of soot into the stratosphere. Climate model simulations have shown severe resulting climate perturbations with declines in global mean temperature by 1.8 °C and precipitation by 8%, for at least 5 y. Here we evaluate impacts for the global food system. Six harmonized state-of-the-art crop models show that global caloric production from maize, wheat, rice, and soybean falls by 13 (±1)%, 11 (±8)%, 3 (±5)%, and 17 (±2)% over 5 y. Total single-year losses of 12 (±4)% quadruple the largest observed historical anomaly and exceed impacts caused by historic droughts and volcanic eruptions. Colder temperatures drive losses more than changes in precipitation and solar radiation, leading to strongest impacts in temperate regions poleward of 30°N, including the United States, Europe, and China for 10 to 15 y. Integrated food trade network analyses show that domestic reserves and global trade can largely buffer the production anomaly in the first year. Persistent multiyear losses, however, would constrain domestic food availability and propagate to the Global South, especially to food-insecure countries. By year 5, maize and wheat availability would decrease by 13% globally and by more than 20% in 71 countries with a cumulative population of 1.3 billion people. In view of increasing instability in South Asia, this study shows that a regional conflict using <1% of the worldwide nuclear arsenal could have adverse consequences for global food security unmatched in modern history.

The impact would be global:

Impacts on global maize production

Why bring it up now, just as we are facing another crisis of deep and lasting significance? Because nothing breeds conflict more than internal stress in a state. The impact of covid-19 on India or Pakistan will certainly not be positive and if it pushes either state towards conflict that matters for all of us.

More than that, the pandemic should be an opportunity to drive home the point that we need to solve conflicts and problems, not just hope they will go away if we ignore them.

Traffic generation options


English: Read Only Memory (Photo credit: Wikipedia)

This is a blog post where I am going to write about things as a way of clarifying, in my own mind, what the best way of tackling a problem is.

So far, in research for my PhD, I have concentrated on establishing some base points for potential performance of Network-on-Chip systems running multithreaded code.

Nearly nine months ago I hacked at Valgrind’s Lackey tool to ensure it produced XML output recording every memory reference made by a piece of code running under it. This was really basic stuff – Lackey recognises four primitives – an instruction primitive for code in execution, and load, store and modify (a combined load and store) for read-write memory. So typically you get blocks of code followed by some read-write records and then some more code. I don’t know what the operands are, just the primitive type, the address and the size of the piece of memory being used (whether for code or read-write operations).
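For illustration only – the element and attribute names below are my guesses at what such a hacked-up schema might look like, not the format my modified Lackey actually emits – a record of this kind can be captured in a very small structure:

```cpp
// Minimal sketch of one way to represent the four Lackey primitives.
// The XML element and attribute names here are assumptions for
// illustration - the real hacked-up schema may differ.
#include <cstdint>
#include <iostream>
#include <string>

enum class RefType { Instruction, Load, Store, Modify };

struct MemRef {
    RefType  type;     // I, L, S or M in Lackey's terms
    uint64_t address;  // virtual address referenced
    uint32_t size;     // bytes touched
};

// Parse a line such as: <ref type="L" address="0x4005d0" size="8"/>
// Returns false if the line is not a reference record.
bool parseRef(const std::string& line, MemRef& out)
{
    auto grab = [&line](const std::string& key) -> std::string {
        auto pos = line.find(key + "=\"");
        if (pos == std::string::npos) return "";
        pos += key.size() + 2;
        return line.substr(pos, line.find('"', pos) - pos);
    };
    std::string t = grab("type");
    if (t.empty()) return false;
    out.type = (t == "I") ? RefType::Instruction
             : (t == "L") ? RefType::Load
             : (t == "S") ? RefType::Store
             :              RefType::Modify;
    out.address = std::stoull(grab("address"), nullptr, 16);
    out.size    = std::stoul(grab("size"));
    return true;
}

int main()
{
    MemRef r;
    if (parseRef("<ref type=\"L\" address=\"0x4005d0\" size=\"8\"/>", r))
        std::cout << "load of " << r.size << " bytes at 0x"
                  << std::hex << r.address << "\n";
}
```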

I then used that to record the output of one of the Parsec parallel benchmarks – a 16 thread (actually it executes 18 threads) piece of video transcoding. In the real world this ran in seconds, under Lackey it took a few days and output about 200GB of XML.

That XML can then be broken down into thread-specific strands of execution – 18 of these in all, of various sizes, but all of the order of several GB at least.

These are then plugged into a simulator. The basic hardware model being simulated has remained the same throughout (mostly – I did fiddle with it a bit a while back but decided that wasn’t worth it). So we have 16 cores sharing a flat 512kB memory space (this is very loosely based on the Parallella system, but it is not meant to be a simulation of it). There is no caching and no sense that any part of the memory is further from one core than another.

What does change is the page replacement algorithm used. I first tried FIFO: the code ran for many weeks and completed in about 60 billion simulated ticks. If a memory reference is to a page already in the 512kB pool it is deemed to take 1 tick to complete; if it is to a page that has to be faulted in (a 4kB page size has been used thus far), it takes 100 ticks per 16-byte line to load, or 25,600 ticks for a whole 4kB page – and plainly we have to decide which page gets evicted if our 512kB store is already full.
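As a rough sketch of that cost model – the constants mirror the figures above, but the structure is purely illustrative and not the simulator’s actual code – FIFO eviction plus the tick accounting looks something like this:

```cpp
// Sketch of the tick accounting and FIFO eviction described above.
// Constants mirror the text (512kB pool, 4kB pages, 100 ticks per
// 16-byte line); everything else is illustrative.
#include <cstdint>
#include <deque>
#include <iostream>
#include <unordered_set>

constexpr uint64_t PAGE_SIZE  = 4096;
constexpr uint64_t POOL_PAGES = (512 * 1024) / PAGE_SIZE;   // 128 frames
constexpr uint64_t HIT_COST   = 1;
constexpr uint64_t FAULT_COST = (PAGE_SIZE / 16) * 100;     // 25,600 ticks

struct FifoMemory {
    std::deque<uint64_t>         order;    // pages in arrival order
    std::unordered_set<uint64_t> resident; // pages currently in the pool
    uint64_t ticks = 0;

    void reference(uint64_t address)
    {
        uint64_t page = address / PAGE_SIZE;
        if (resident.count(page)) {
            ticks += HIT_COST;
            return;
        }
        ticks += FAULT_COST;
        if (resident.size() >= POOL_PAGES) {        // evict the oldest page
            resident.erase(order.front());
            order.pop_front();
        }
        resident.insert(page);
        order.push_back(page);
    }
};

int main()
{
    FifoMemory mem;
    mem.reference(0x1000);   // fault: whole page comes in
    mem.reference(0x1008);   // hit: same page
    std::cout << "ticks so far: " << mem.ticks << "\n";  // 25601
}
```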

Messing about with various LRU models showed that a two-queue LRU gave a little better performance than a simple single-queue LRU, completing in around 50 billion ticks (and two weeks or so of running).
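By “two-queue LRU” I mean something along these lines – a probationary queue for pages referenced once and a protected queue for pages hit again – though this is a sketch of the general idea rather than the code actually used:

```cpp
// Sketch of one possible "two queue" LRU: a probationary queue for
// pages seen once and a protected queue for pages hit again. This is
// an illustration of the general scheme, not the simulator's code.
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

class TwoQueueLru {
    std::list<uint64_t> probation;   // pages referenced once, LRU order
    std::list<uint64_t> protected_;  // pages referenced again, LRU order
    struct Where { bool prot; std::list<uint64_t>::iterator it; };
    std::unordered_map<uint64_t, Where> index;  // page -> which list, where
    std::size_t capacity;

public:
    explicit TwoQueueLru(std::size_t frames) : capacity(frames) {}

    // Returns true on a hit, false on a fault (after making room).
    bool reference(uint64_t page)
    {
        auto found = index.find(page);
        if (found != index.end()) {              // hit: promote to protected
            auto& w = found->second;
            (w.prot ? protected_ : probation).erase(w.it);
            protected_.push_front(page);
            w = {true, protected_.begin()};
            return true;
        }
        if (index.size() >= capacity) {          // evict: probation first
            auto& victimList = probation.empty() ? protected_ : probation;
            index.erase(victimList.back());
            victimList.pop_back();
        }
        probation.push_front(page);
        index[page] = {false, probation.begin()};
        return false;
    }
};

int main()
{
    TwoQueueLru lru(2);
    lru.reference(1);   // fault
    lru.reference(1);   // hit: page 1 promoted to the protected queue
    lru.reference(2);   // fault
    lru.reference(3);   // fault: evicts page 2 from probation, keeps page 1
}
```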

I then built – more or less starting from scratch – a version of the simulator that modelled Belady’s OPT. That required some very large binary trees to be used – along with 250GB of RAM – and completed the code in about 22 billion ticks (and about three weeks in wall clock time).

All these models showed one piece of common behaviour – thrashing, as demonstrated by the fact that adding additional cores to the execution did not increase the amount of code being executed: instead each individual core had to handle more page faults as the cores competed for the small pool of memory.

I now have two pieces of code running which aim to measure the (in)efficiency of these “traditional” paging approaches – come back in a few weeks to see what they show.

So, while they run I need to get on to the next stage, which is testing some alternative approaches. But I have a problem – I cannot wait three weeks for each experiment to run. There simply is not time for that.

The alternatives boil down to chopping up sections of my current XML from the benchmark, or writing a traffic generator.

The traffic generator idea has a lot to be said for it – my supervisor is certainly in favour – but it is not without weaknesses: the degree of locality between the different executing threads matters a great deal – with a lot of locality the fault count falls and code gets executed quickly; with poor locality the fault count rockets.

But how do I model that? I am not sure there is any literature out there that discusses this problem in much detail – multithreading is hard, and for that reason rational people avoid it!
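One crude starting point – purely illustrative, and not drawn from any literature – would be a generator with a single locality knob: with probability p a thread re-references a page from a small recent window, otherwise it touches a fresh page. Sharing the recent window between per-thread generators would then be one way of dialling cross-thread locality up and down.

```cpp
// Purely illustrative: a generator where a single parameter controls
// how often a "thread" re-references a recently used page rather than
// touching a fresh one. Everything here is a made-up sketch.
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

class ToyTrafficGenerator {
    std::mt19937_64 rng;
    std::vector<uint64_t> recent;       // small window of recent pages
    double locality;                    // 0.0 = no reuse, 1.0 = always reuse
    uint64_t nextFreshPage = 0;

public:
    ToyTrafficGenerator(double p, uint64_t seed) : rng(seed), locality(p) {}

    uint64_t nextPage()
    {
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        if (!recent.empty() && coin(rng) < locality) {
            // reuse: pick uniformly from the recent window
            std::uniform_int_distribution<std::size_t> pick(0, recent.size() - 1);
            return recent[pick(rng)];
        }
        // fresh page; remember it in a bounded window
        uint64_t page = nextFreshPage++;
        recent.push_back(page);
        if (recent.size() > 64) recent.erase(recent.begin());
        return page;
    }
};

int main()
{
    ToyTrafficGenerator gen(0.9, 42);       // high-locality stream
    for (int i = 0; i < 10; ++i)
        std::cout << gen.nextPage() << " ";
    std::cout << "\n";
}
```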

But using the chopped up XML is not without issues either – it’s inflexible, it elevates one instance of executing code to be a model for all executing code and so is just as vulnerable to the question of locality.

OPT and LRU compared for multithreaded code


Here is the fruit of many months’ labour:

Interestingly, the two series are very nearly isomorphic (i.e., they have essentially the same shape if one scales the x-axis appropriately).

This is based on running a short piece of code to decode 32 frames of video – the memory model used was 4kB pages with a pooled 512kB on chip (each core can access any part of the pool at a cost of 1 tick) and unlimited memory 100 ticks away per 128 bits (16 bytes) read in (so for a 4kB page it takes 25,600 ticks to fault the page in).

This is a bit of a crude simplification of the memory model seen on, say, a Tilera TILE64 system, but it’s close enough (though this simulation had 16 cores).


Deductive debugging


An illustration of multithreading where the master thread forks off a number of threads which execute blocks of code in parallel. (Photo credit: Wikipedia)

This is not a story of a great debugging triumph – but it is one that points to a great debugging truth – study of the bug before you start to pore over your code is more likely to get you to a solution faster.

My example is my code to simulate Belady’s “OPT” page replacement algorithm for a multithreaded environment.

In OPT the idea is that, when we need to make room in main memory for a new page, we select for removal the page with the longest “reuse distance” – in other words, the one we will have to wait the longest (perhaps forever) before needing to use again. This algorithm is sometimes called the “clairvoyant” algorithm because it requires foreknowledge of which memory page will get used when. That does not happen very often in general-purpose computing, but can often be the case in embedded computing, where the code does exactly the same thing over and over again.

In my case I am using a memory trace from a transcoding of 32 frames of video – a small (in terms of time on a real computing device) example of the sort of parallel task you might see in embedded devices. In the real world this runs for a few seconds – but it also generates 220GB of XML trace records spread across 18 threads.

With a single thread it is easy to work out the reuse distance – you just look at how long it will be before a page gets referenced again: you could even do this statically, ahead of runtime, and just look the result up if you wanted.
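That static, single-thread calculation amounts to one backward pass over the page trace – something like this sketch, where the trace is simplified to bare page numbers:

```cpp
// Sketch of the static, single-thread case: one backward pass over the
// page trace gives, for every position, how far away the next use of
// that page is ("never" standing for a page not used again).
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::size_t> nextUseDistances(const std::vector<uint64_t>& pages)
{
    const std::size_t never = std::numeric_limits<std::size_t>::max();
    std::vector<std::size_t> dist(pages.size(), never);
    std::unordered_map<uint64_t, std::size_t> nextUse;  // page -> next index

    for (std::size_t i = pages.size(); i-- > 0; ) {
        auto it = nextUse.find(pages[i]);
        if (it != nextUse.end())
            dist[i] = it->second - i;      // steps until the page recurs
        nextUse[pages[i]] = i;
    }
    return dist;
}

int main()
{
    std::vector<uint64_t> trace{1, 2, 1, 3, 2, 1};
    for (std::size_t d : nextUseDistances(trace))
        std::cout << (d == std::numeric_limits<std::size_t>::max()
                          ? std::string("never")
                          : std::to_string(d)) << " ";
    std::cout << "\n";   // prints: 2 3 3 never never never
}
```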

That is not true of multithreaded code – one thread might run fast or slow (e.g. while waiting for IO) – so the only way to do it is to measure the reuse distances for each page and for every thread:

  • For each page, calculate its reuse distance in every thread and take the minimum across threads;
  • Then evict the page whose minimum reuse distance is the greatest (a sketch of this selection rule follows the list).
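Here is a minimal sketch of that selection rule, assuming the per-thread distances have already been measured; the data layout is illustrative only, with “infinity” standing for a page a thread never touches again:

```cpp
// Minimal sketch of the eviction rule above: for each resident page we
// hold each thread's distance to its next use of that page (INF if that
// thread never touches it again); take the minimum across threads per
// page and evict the page whose minimum is largest.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <map>
#include <vector>

constexpr uint64_t INF = std::numeric_limits<uint64_t>::max();

// page -> per-thread reuse distances for that page (one entry per thread)
using ReuseTable = std::map<uint64_t, std::vector<uint64_t>>;

uint64_t chooseVictim(const ReuseTable& table)
{
    uint64_t victim = 0;
    uint64_t bestMin = 0;
    for (const auto& [page, perThread] : table) {
        uint64_t soonest = *std::min_element(perThread.begin(), perThread.end());
        if (soonest >= bestMin) {       // furthest "soonest next use" wins
            bestMin = soonest;
            victim = page;
        }
    }
    return victim;
}

int main()
{
    ReuseTable table{
        {0xA, {3, 7, INF}},   // page 0xA: thread 0 needs it again in 3 ticks
        {0xB, {12, 9, 15}},   // page 0xB: soonest next use is 9 ticks away
        {0xC, {INF, 2, 4}},   // page 0xC: thread 1 needs it again very soon
    };
    std::cout << std::hex << "evict page 0x" << chooseVictim(table) << "\n"; // 0xB
}
```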

I wrote some code that seemed to do this, and on my testing subset of the 220GB XML it seemed to deliver good results. But whenever I ran it against the full trace it started brightly and then performance – by which I mean how fast the simulator ran through the traces in terms of the synthetic ticks it generated, or bluntly the simulated performance – just got worse and worse.

In fact, the longer a simulated thread had been running, the worse its performance got; the “fastest” thread was always the most recently spawned one, and the more threads that ran the worse this problem became.

Now, with the combination of severely limited memory (in this case we were simulating a 16-core NoC with 32kB of local memory per core, pooled into one flat 512kB space), performance can go downhill quite badly as the thread count climbs – but this fall-off was catastrophic, and it was plain that OPT was going to turn out worse than “least recently used” (LRU) – and that just cannot be correct! I have not sat down to write a mathematical proof of that, but instinctively I know it to be true…

Reading through the code did not draw my attention to any obvious flaws, so I had to sit down and think about what the bug showed me – the code worked well on short runs, the most recent thread seemed to do well, and recent threads in general did better than longer-established threads.

Even writing this down now makes it seem obvious – my code was in some way biased towards more recently created threads. And so, instead of searching through all the code looking for errors, I could home in on those parts of the code that scanned through each thread.

I found such an error quite quickly but testing again showed that while the problem was diminished, it was not by much – I still had not found what was really to blame.

Another scan through the key sections of the code revealed the real error: when a thread tried to page in memory it only examined the reuse distances of itself and those threads created after it.

Thread data was stored in a linked list, but instead of starting the scan at the head of the list, the scan began at the pointer to the current thread. The result was that the most recent thread had a “perfect” OPT experience – on every page-in its reuse distances were fully considered – while at the other extreme the first thread’s reuse distances were only considered when it was itself paging in memory. So the pages it used that were not used by any other thread were fair game: they appeared to have an infinite reuse distance and so were almost certainly the ones chosen for replacement more or less every time.

Fixing the code so that the scan began with the head of the linked list and not just the local pointer fixed the problem and the OPT simulator is now running – my guess is that it is going to show OPT to be two to three times more efficient than LRU.
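In miniature – with made-up structures standing in for the simulator’s real thread bookkeeping – the difference between the buggy and the fixed scan looks like this:

```cpp
// Illustrative reconstruction of the bug, not the actual simulator code.
// Threads live on a singly linked list; the victim scan is supposed to
// consult every thread's reuse distances, not just those of the caller
// and the threads created after it.
#include <iostream>

struct ThreadRecord {
    int           id;
    ThreadRecord* next;   // threads are appended in creation order
};

// Stand-in for "walk the list and consult each thread's reuse distances".
int threadsConsulted(ThreadRecord* start)
{
    int n = 0;
    for (ThreadRecord* t = start; t != nullptr; t = t->next)
        ++n;
    return n;
}

int main()
{
    ThreadRecord t2{2, nullptr}, t1{1, &t2}, t0{0, &t1};
    ThreadRecord* head = &t0;
    ThreadRecord* current = &t1;                    // thread doing the page-in

    std::cout << threadsConsulted(current) << "\n"; // buggy: 2 (skips t0)
    std::cout << threadsConsulted(head)    << "\n"; // fixed: 3 (whole list)
}
```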

 


Sometimes, admitting defeat is the very best thing you can do


English: Screenshot of GDB, the debugger of the GNU Project. (Photo credit: Wikipedia)

I have spent the last month and a half or so writing a particular computer program to model how some code would run on a 16-core “network-on-chip” device.

There were a lot of problems to get over – for although I had already written a Groovy/Java program to do the same thing, that code just was not up to handling the additional level of complexity I wanted for this iteration, so the whole thing had to be redone in C/C++.

About three weeks ago I got to the point where the code compiled and it should have been doing what I wanted, but it was crashing with memory bugs, a lot.

Debugging multi-threaded POSIX threads code is not meant to be easy, but I was able – after teaching myself the basics of the GNU debugger (GDB) – to make a lot of progress (having realised that the old programmer’s saw that “if God had wanted us to use debuggers, she wouldn’t have given us printf” does not raise many laughs when you are dealing with multi-threaded code).

I discovered that the bugs were in my four-year-old red-black tree code. I had used it before in many situations and never had a problem, but I also realised that where I had used it before (and where I am still using it now), I was continually adding nodes to the tree and never deleting them. The deletion code was wrong.

Most people write their red-black tree code having read Introduction to Algorithms – but here, at least, I was not like most people. I had written the code myself more or less from scratch, piecing together the various things I needed to do from a lot of different online sources. And I was very proud of that too.

But that also meant that, when I tried to fix the code by looking at the canonical description in “Introduction…”, the match was poor and my model was, at quite a fundamental level, different (I used null pointers for the leaves, not the “guard node” sentinel approach advocated in the book).

I thought I could get around the issue by not doing any rebalancing after a deletion – after all, I was still inserting nodes even as I was deleting them, and my insertion code seemed (and still seems) quite robust.

But while that allowed my code to run quite a lot longer before falling over, it still fell over. I had to face the facts and use the Standard Template Library (STL) classes – map and set.

I started work on that on Friday, on a long train journey. I thought it would mean weeks of further chopping and changing. In fact it took little more than 48 hours, and I am left wondering why I put myself through such misery when robust and well-tested solutions were always to hand. (I kept my own code for one tree where I am only doing insertions – converting that would have taken a further day, and that was the margin I was operating in: having contemplated weeks, I now worried about anything that would take more than a few hours.)
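For anyone tempted down the same path, the swap really is as mundane as it sounds – std::map (itself typically implemented as a red-black tree) gives insertion, lookup, deletion and in-order traversal already debugged. The payload type and field names below are made up for illustration:

```cpp
// Sketch of the kind of swap involved: a hand-rolled red-black tree keyed
// by page number gives way to std::map. The PageInfo payload and its
// fields are invented purely for illustration.
#include <cstdint>
#include <iostream>
#include <map>

struct PageInfo {
    uint64_t lastUsedTick;
    bool     dirty;
};

int main()
{
    std::map<uint64_t, PageInfo> pages;          // page number -> bookkeeping

    pages[0x10] = {100, false};                  // insert or overwrite
    pages[0x11] = {150, true};

    if (auto it = pages.find(0x10); it != pages.end())
        it->second.lastUsedTick = 200;           // update in place

    pages.erase(0x11);                           // deletion "just works"

    for (const auto& [page, info] : pages)       // in-order traversal for free
        std::cout << std::hex << page << " last used at tick "
                  << std::dec << info.lastUsedTick << "\n";
}
```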

The new code seems pretty robust, but I am struggling to get it onto the university’s compute server, which is now under heavy load, and so I am thinking of adding an ncurses interface while I wait.

Admitting defeat was the quickest way, it seems, to victory. Something worth remembering, perhaps, when your software project is in trouble.
