Deductive debugging


An illustration of multithreading where the master thread forks off a number of threads which execute blocks of code in parallel. (Photo credit: Wikipedia)

This is not a story of a great debugging triumph – but it does point to a great debugging truth: studying the bug's behaviour before you start to pore over your code is likely to get you to a solution faster.

My example is my code to simulate Belady’s “OPT” page replacement algorithm for a multithreaded environment.

In OPT the idea is that, when we need to make room in main memory for a new page, we select for removal the page with the longest “reuse distance” – in other words, the one we will have to wait the longest (perhaps forever) before needing to use again. This algorithm is sometimes called the “clairvoyant” algorithm because it requires foreknowledge of which memory page will get used when. That does not happen very often in general-purpose computing, but it is often the case in embedded computing, where the code does exactly the same thing over and over again.

In my case I am using a memory trace from a transcoding of 32 frames of video – a small (in terms of time on a real computing device) example of the sort of parallel task you might see in embedded devices. In the real world this runs for a few seconds – but it also generates 220GB of XML trace records spread across 18 threads.

With a single thread it’s easy to work out the reuse distance – you just look at how long it will be before a page gets referenced: you could even do this statically ahead of runtime and just look the result up if you wanted.
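For the single-threaded case, that static precomputation might look like this (a minimal sketch of my own, assuming the trace is just a sequence of page IDs):

```python
# Sketch: with one thread, every reference's reuse distance can be
# precomputed by a single backwards pass over the trace. next_use[i]
# is the index at which the page referenced at position i is next
# referenced (None if it never is). Illustrative code only.

def precompute_next_use(trace):
    next_use = [None] * len(trace)
    last_seen = {}  # page -> most recent index seen (scanning backwards)
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i])
        last_seen[trace[i]] = i
    return next_use

table = precompute_next_use(["A", "B", "A", "C", "B"])
# "A" at index 0 is next used at index 2; "B" at index 1 at index 4
```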

That is not true of multithreaded code – one thread might run fast or slow (e.g., while waiting for IO) – so the only way to do it is to measure the reuse distances for each page and for every thread:

  • For each thread calculate the minimum reuse distance of each page;
  • Then pick the page with the maximum minimum reuse distance across all threads.
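A minimal sketch of that selection rule (my own illustrative code, not the simulator's – I assume a table mapping each resident page to each thread's next-use time):

```python
# Hypothetical sketch of multithreaded OPT victim selection:
# for each page take the minimum reuse distance over all threads,
# then evict the page whose minimum is largest.

INF = float("inf")

def pick_victim(next_use):
    """next_use maps page -> {thread_id: ticks until that thread next
    references the page}. An empty dict means no thread will ever
    touch the page again, i.e. an infinite reuse distance."""
    def min_reuse(page):
        # minimum reuse distance of this page across every thread
        return min(next_use[page].values(), default=INF)
    # evict the page with the maximum minimum reuse distance
    return max(next_use, key=min_reuse)

victim = pick_victim({
    "A": {1: 5, 2: 100},  # thread 1 needs A again in only 5 ticks
    "B": {1: 50, 2: 40},
    "C": {1: 3},
})
# minimum reuse distances: A -> 5, B -> 40, C -> 3, so B is evicted
```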

I wrote some code that seemed to do this, and on my testing subset of the 220GB XML it seemed to deliver good results. But whenever I ran it against the full trace it started brightly and then performance – by which I mean how fast the simulator ran through the traces in terms of the synthetic ticks it generated, or bluntly the simulated performance – just got worse and worse.

In fact, the longer a simulated thread had been running, the worse its performance got; the “fastest” thread was always the most recently spawned one, and the more threads that ran, the worse the problem became.

Now, with severely limited memory (in this case we were simulating a 16-core NoC with 32KB of local memory per core, pooled into one flat 512KB space), performance can go downhill quite badly as the thread count climbs – but this fall-off was catastrophic, and it was plain that OPT was going to do worse than “least recently used” (LRU) – and that just cannot be correct! I have not sat down to write a mathematical proof of that, but instinctively I know it to be true…

Reading through the code did not draw my attention to any obvious flaws, so I had to sit down and think about what the bug showed me – it worked well on short runs, the most recent thread seemed to do well and the recent threads in general did better than the longer established threads.

Even writing this down now makes it seem obvious – my code was in some way biased towards more recently created threads. So instead of searching through all the code looking for errors, I could home in on those parts that scanned through each thread.

I found such an error quite quickly but testing again showed that while the problem was diminished, it was not by much – I still had not found what was really to blame.

Another scan through the key sections of the code revealed the real error: when a thread tried to page in memory it only examined the reuse distances of itself and those threads created after it.

Thread data was stored in a linked list, but instead of starting the scan at the head of the list, the scan began at the pointer to the current thread. The result was that the most recent thread had a “perfect” OPT experience – on every page-in its reuse distances were fully considered – but at the other extreme the first thread’s reuse distances were only considered when it was itself paging in memory. So the pages it used that were not used by any other thread were fair game: they appeared to have an infinite reuse distance and so were almost certainly the ones chosen for replacement more or less every time.

Changing the scan to begin at the head of the linked list rather than the local pointer fixed the problem, and the OPT simulator is now running – my guess is that it will show OPT to be two to three times more efficient than LRU.
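The shape of the bug can be sketched in a few lines (a hypothetical Python stand-in for the simulator's thread list; the names are mine):

```python
# Thread records are chained head -> ... -> tail in creation order.
# Starting the reuse-distance scan at the current thread, rather than
# the head, silently ignores every earlier-created thread.

class ThreadRecord:
    def __init__(self, tid):
        self.tid = tid
        self.next = None

def visible_threads(start):
    """Collect every thread reachable from 'start' via next pointers."""
    seen, node = [], start
    while node is not None:
        seen.append(node.tid)
        node = node.next
    return seen

# Build head -> t1 -> t2 -> t3 in creation order.
head = t1 = ThreadRecord(1)
t2 = ThreadRecord(2); t1.next = t2
t3 = ThreadRecord(3); t2.next = t3

buggy = visible_threads(t2)    # scan starts at the current thread
fixed = visible_threads(head)  # scan starts at the head of the list
# the buggy scan from t2 never sees thread 1's reuse distances
```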

 


Amazon getting their act together on Kindle quality?


Just received this email:

Hello,

We are happy to announce that an updated version of your past Kindle purchase of Making Embedded Systems: Design Patterns for Great Software by Elecia White is now available. The version you received had the following issues that have been corrected:

Significant editorial issues were present.

Before you update to the new version, check to see that all devices that you have used to read this book are connected to a network and that their Annotations Backup settings are turned on. This will ensure that your notes, highlights, bookmarks and furthest reading location are retained in the new version. For help with modifying settings, please visit http://www.amazon.co.uk/kindlesupport

You can get this new version by going to the Manage Your Kindle page at http://amazon.co.uk/mykupdate . Find the book in your Kindle Library and click on the “update available” link next to the book’s title. Within 5 minutes, any of your devices that have the eBook currently downloaded and have an active wireless connection will be updated automatically.

Alternatively, you can reply to this email with the word “Yes” in the first line of your response. Your e-mail response must come from the e-mail address associated with your Amazon account. We will update your book within 2 hours of receiving your email.

We thank you for your business with Amazon.
Sincerely,
Customer Service Department

I am away from my Kindle at the moment, so cannot verify how good the changes have been, but it is at least a positive sign. I still would not recommend using a Kindle for maths, science and computer science books though: this is just one update… and there are all the other, DRM-related, issues too.

From a real “product description” on Amazon


The invention of wheel was a major break throw of its scientific age. Like wise we can consider internet is the one of the most influential Scientifics break throw of current age, which was only possible because of advance ever growing communication technologies. Communication between any two sources protocols can be deemed as backbone of the whole setup. Currently several protocol exist and are employed in the cutting age of the communication set up. Different protocols are invented because of different scenario of communication. UDP protocol is actively used in today’s worlds of communication, in real time with data transfer which can tolerate little loss of data and accuracy. Because of ever increasing need and significance of communication system,in this book investigated the UDP software for Ethernet Lite in the embedded system by evaluated of behaviour of some important protocols UDP, IP, ICMP and ARP. One most important thing which we have done in this book that programme memory small as we can .The memory is 10KB with the UDP/IP.

This is the (publisher’s? author’s?) official product description of UDP/IP For Embedded System: Methods, Implementation, Benchmarks, Programming,FPGA, Embedded system.

I don’t think I will be buying it.

An NP-complete problem from the world of embedded computing


English: Euler diagram for P, NP, NP-Complete, and NP-Hard set of problems. (Photo credit: Wikipedia)

First of all, a quick explanation of P and NP. The class of problems known as ‘P’ – for polynomial, as in they can be solved in a time that is a polynomial function of the size of the input – are those for which an efficient algorithm – a sequence of mathematical steps – is known to exist. For instance, solving x – 2 = 0 (i.e., finding the value of x where the formula evaluates to zero) is a P class problem. NP – ‘nondeterministic polynomial’, not, as is sometimes said, ‘not P’ – problems are much more interesting: these are problems where a proposed solution, once found, can be verified ‘simply’ (i.e., in polynomial time), but for which no polynomial-time solving algorithm is known – in the worst case the only known way to solve the problem may be to try essentially every candidate answer. It is not known whether NP problems (such as drawing up school timetables or cracking internet public key encryption) are really P type problems after all and we just have not found the algorithms, or whether we are permanently excluded from ‘simple’ (P) solutions. A class of NP problems called ‘NP-complete’ are those for which, if any one were shown to really be a P class problem, it would follow that P=NP. Most, but not all, mathematicians think that, in fact, P!=NP.

So, here’s the problem. It sounds simple, but as it is NP, it’s not! (I got this from Making Embedded Systems: Design Patterns for Great Software)

You have a microcontroller with a timer driven by a fixed 4MHz clock and two 8-bit registers, (a) and (b): (a) counts ticks, and (b) is a match register that triggers an interrupt when the count register reaches the value stored in it. There is also a 16-bit prescaler that scales the ticks – e.g., if it is set to 2, twice as many clock ticks are required to trigger the interrupt.

So how can you set the match and prescaler to work for an arbitrary frequency? Sounds like it should be easily algorithmically managed, but it’s not.
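For this particular timer the search space is small enough to enumerate, so a brute-force sketch is feasible (illustrative Python of my own; I assume the match register holds values 1–255 and that an interrupt fires every prescaler × match clock ticks – the hard general problem is hitting an arbitrary frequency, or proving you can't, without exhaustive search):

```python
# Brute-force sketch: find the (prescaler, match) pair whose
# interrupt rate best approximates a target frequency, given the
# fixed 4MHz clock described above. Assumptions: 16-bit prescaler
# (1..65535), 8-bit match register (1..255), interrupt period of
# prescaler * match clock ticks.

CLOCK_HZ = 4_000_000

def best_settings(target_hz):
    """Return (prescaler, match, actual_hz) closest to target_hz."""
    best = None  # (error, prescaler, match)
    for prescaler in range(1, 1 << 16):
        # the ideal (possibly fractional) match value for this prescaler
        ideal = CLOCK_HZ / (prescaler * target_hz)
        for match in (int(ideal), int(ideal) + 1):
            if not 1 <= match <= 255:
                continue
            actual = CLOCK_HZ / (prescaler * match)
            err = abs(actual - target_hz)
            if best is None or err < best[0]:
                best = (err, prescaler, match)
    err, prescaler, match = best
    return prescaler, match, CLOCK_HZ / (prescaler * match)
```

For a friendly target like 1000Hz this finds an exact pair (prescaler × match = 4000 ticks), but for a frequency that does not divide 4MHz so neatly the best pair is only an approximation – which is where the problem starts to bite.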