## Sometimes, admitting defeat is the very best thing you can do

I have spent the last month and a half or so writing a particular computer program to model how some code would run on a 16-core “network-on-chip” device.

There were a lot of problems to get over – for although I had already written a Groovy/Java program to do the same thing, that code just was not up to handling the additional level of complexity I wanted for this iteration, so the whole thing had to be redone in C/C++.

About three weeks ago I got to the point where the code compiled and it should have been doing what I wanted, but it was crashing with memory bugs, a lot.

Debugging multi-threaded POSIX threads code is not meant to be easy, but after teaching myself the basics of the GNU debugger (GDB) I was able to make a lot of progress (having realised that the old programmer’s saw – “if God had wanted us to use debuggers, she wouldn’t have given us printf” – does not raise many laughs when you are dealing with multi-threaded code).

I discovered that the bugs were in my four-year-old red-black tree code. I had used it in many situations and never had a problem, but I also realised that everywhere I had used it before (and where I am still using it now) I was continually adding nodes to the tree and never deleting them. The deletion code was wrong.

Most people write their red-black tree code having read Introduction to Algorithms – but here, at least, I was not like most people. I had written the code myself more or less from scratch, piecing together the various things I needed to do from a lot of different online sources. And I was very proud of that too.

But that also meant that, when I tried to fix the code by looking at the canonical description in “Introduction…”, the match was poor and my model was, at a quite fundamental level, different (I used null pointers for the leaves, not the “guard node” approach advocated in the book).

I thought I could get around the issue by not doing any rebalancing after a deletion – after all, I was still inserting nodes even as I was deleting them, and my insertion code seemed (and still seems) quite robust.

But while that allowed my code to run quite a lot longer before falling over, it still fell over. I had to face the facts and use the Standard Template Library (STL) classes – map and set.

I started work on that on Friday, on a long train journey, expecting it would mean weeks of further chopping and changing. In fact it took little more than 48 hours, and I am left wondering why I put myself through such misery when robust and well-tested solutions were always to hand. (I kept my own code for one tree where I am only doing insertions – converting that would have taken a further day, and that was the margin I was now operating in: having contemplated weeks, I worried about anything that would take more than a few hours.)
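
For anyone facing the same switch, the heart of it is small: std::map and std::set are ordered associative containers – in practice almost always red-black trees underneath – so the insert-and-locate logic I had hand-rolled collapses into a handful of calls. A minimal sketch of the pattern, with purely illustrative names rather than my real variables:

#include <cstdio>
#include <list>
#include <map>

// Illustrative only: index a sparse set of pages, each carrying the list
// of instruction addresses that touched it, and let std::map keep the
// keys ordered and the underlying tree balanced.
typedef std::map<long, std::list<long> > pagemap;

void recordaccess(pagemap& pages, long page, long instruction)
{
    pages[page].push_back(instruction); // creates the entry if it is absent
}

void writeoutpages(const pagemap& pages, FILE* fout)
{
    pagemap::const_iterator it;
    for (it = pages.begin(); it != pages.end(); ++it) {
        fprintf(fout, "PAGE: %li ", it->first);
        std::list<long>::const_iterator pi;
        for (pi = it->second.begin(); pi != it->second.end(); ++pi)
            fprintf(fout, ",%li", *pi);
        fprintf(fout, "\n");
    }
}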

The new code seems pretty robust, but I am struggling to get it onto the university’s compute server, which is now under very heavy load, so I am thinking of adding an ncurses interface while I wait.

Admitting defeat was the quickest way, it seems, to victory. Something worth remembering, perhaps, when your software project is in trouble.

## From two years to two hundred minutes

I rewrote my parsing code in C/C++, and a task that was likely to take two years in Groovy/Java – running 18 threads on a multicore machine with a large amount of memory (I am not exaggerating about the turnaround time, either) – completed in not much more than three hours in a single thread on the same machine.

(The new code has a C main – but the bulk of the code is in C++ which is accessed through an extern “C” interface.)


#include <iostream>
#include <cstdio>
#include "redblack.hpp"

using namespace std;

//acm538@york.ac.uk
//C wrap around C++ redblack tree code
//Licensed under GNU GPL version 2
//or any later version at your discretion

class pageinst
{
private:
long instruction;
pageinst* next;

public:
pageinst(long inst);
long getinst(void);
pageinst* getnext(void);
void setnext(pageinst* newinst);
};

pageinst::pageinst(long inst)
{
instruction = inst;
next = NULL;
}

long pageinst::getinst(void)
{
return instruction;
}

pageinst* pageinst::getnext(void)
{
return next;
}

void pageinst::setnext(pageinst* newinst)
{
next = newinst;
}

class pagechain
{
friend ostream& operator<<(ostream& os, pagechain& pc);

private:
long page;
pageinst* tail;

public:
pagechain(long where);
long getpage(void);
bool operator==(pagechain&) const;
bool operator<(pagechain&) const;
pageinst* gettail(void);
void settail(pageinst* newtail);
};

ostream& operator<<(ostream& os, pagechain& pc)
{
os << pc.page;
return os;
}

long pagechain::getpage(void)
{
return page;
}

bool pagechain::operator==(pagechain& pc) const
{
return (page == pc.page);
}

bool pagechain::operator<(pagechain& pc) const
{
return (page < pc.page);
}

pagechain::pagechain(long where)
{
page = where;
tail = NULL;
}

pageinst* pagechain::gettail(void)
{
return tail;
}

void pagechain::settail(pageinst* newtail)
{
pageinst* oldtail = gettail();
tail = newtail;
if (oldtail)
oldtail->setnext(newtail);
}

void killchain(pageinst* pi)
{
if (pi == NULL)
return;
pageinst* next = pi->getnext();
delete pi;
killchain(next);
}

void killtree(redblacknode<pagechain>* node)
{
if (node == NULL)
return;
killtree(node->left);
killtree(node->right);
//free the chain of instructions attached to this node's pagechain
killchain(node->getvalue().gettail());
}

void writechain(pagechain* pc, FILE* fout)
{
if (pc == NULL)
return;
fprintf(fout, "PAGE: %li ", pc->getpage());
//walk the chain of instructions recorded against this page
pageinst* pi = pc->gettail();
while (pi) {
fprintf(fout, ",%li", pi->getinst());
pi = pi->getnext();
}
fprintf(fout, "\n");
}

void writeoutpages(redblacknode<pagechain>* node, FILE* fout)
{
if (node == NULL)
return;
writeoutpages(node->left, fout);
writechain(&node->getvalue(), fout);
writeoutpages(node->right, fout);
}

extern "C" {

void* createtree(void)
{
redblacktree<redblacknode<pagechain> >* tree;
tree = new redblacktree<redblacknode<pagechain> >();
return static_cast<void*>(tree);
}

void removetree(void* tree)
{
redblacktree<redblacknode<pagechain> >* rbtree;
rbtree = (static_cast<redblacktree<redblacknode<pagechain> >* >(tree));
killtree(rbtree->root);
delete rbtree;
tree = NULL;
}

void* getroot(void* tree)
{
redblacktree<redblacknode<pagechain> >* nodetree =
static_cast<redblacktree<redblacknode<pagechain> >*>(tree);
return static_cast<void*>(nodetree->root);
}

void insertinstruction(long pagenumber, long instruction,
void* tree, void* root)
{
redblacknode<pagechain> *rootnode, *pagenode, *dummynode;
redblacktree<redblacknode<pagechain> >* nodetree;

pagechain dummychain = pagechain(pagenumber);
rootnode = static_cast<redblacknode<pagechain>*>(root);
nodetree = static_cast<redblacktree<redblacknode<pagechain> >* >(tree);

dummynode = new redblacknode<pagechain>(dummychain);
pagenode = nodetree->locatenode(dummynode, rootnode);
pageinst* nextinstruction = new pageinst(instruction);

if (pagenode) {
pagenode->getvalue().settail(nextinstruction);
delete dummynode;
} else {
dummynode->getvalue().settail(nextinstruction);
nodetree->insertnode(dummynode, rootnode);
}
}

void writeinorder(void* tree, FILE* fileout)
{
redblacktree<redblacknode<pagechain> >* nodetree =
static_cast<redblacktree<redblacknode<pagechain> >*>
(tree);

writeoutpages(nodetree->root, fileout);
}

} //end extern "C"
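
And this is roughly how the C side drives that wrapper – a sketch rather than my real main(), with the prototypes declared directly here since the shared header is not shown above:

#include <stdio.h>

/* Sketch only: these declarations mirror the extern "C" functions above */
extern void* createtree(void);
extern void* getroot(void* tree);
extern void insertinstruction(long pagenumber, long instruction,
    void* tree, void* root);
extern void writeinorder(void* tree, FILE* fileout);
extern void removetree(void* tree);

int main(void)
{
    void* tree = createtree();
    FILE* fout = fopen("pages.out", "w");

    /* record, say, that instructions 0x1000 and 0x1004 touched page 3 */
    insertinstruction(3, 0x1000, tree, getroot(tree));
    insertinstruction(3, 0x1004, tree, getroot(tree));
    insertinstruction(7, 0x2000, tree, getroot(tree));

    writeinorder(tree, fout);   /* one line per page, in page order */

    fclose(fout);
    removetree(tree);
    return 0;
}

Note that getroot() is called afresh before each insertion, since rebalancing inside the tree may move the root between calls.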



The old code used Groovy’s Map class to index a sparse array; the new code uses a red-black tree to index the same data. The speed-up is not solely down to the tree’s superior performance, but that surely has a lot to do with it.

I now have to decide whether to re-write the whole of the testing code in C/C++ … I think I am going to have to.

## Things learned this week: 7 February 2014

1. BT’s broadband kit chokes on big downloads – but you can get round it

For a long time I have wanted to download OSX Mavericks, for testing, on to a machine at work. And I have not been able to do so: the download always froze at 995MB.

BT acknowledge that their equipment is crippled in this way – they claim it is a security feature and even told me that “two years ago downloads weren’t this big”. As a result they may (as in my case) simply refuse to replace the equipment and instead try to get you to reset the hardware on a per-download basis.

But you can get round this – download “Hotspot Shield” (a free VPN product) and use it to get you over the 995MB hump.

2. C++ templates are powerful

Four years ago I wrote some C++ code to generate a red-black tree using templates. The code has been mouldering on GitHub ever since, but now I have (hopefully) found a completely new use for it, because the template design allows me to repurpose it for something else. Code reuse in action!
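
The point is that redblacktree and redblacknode take the payload as a type parameter, so nothing in the tree code knows about pages or instructions: any class that provides the same comparison operators (and stream output operator) that pagechain does should drop straight in. A sketch – the task class here is made up purely to illustrate the idea:

#include <iostream>
#include "redblack.hpp"

using namespace std;

// Hypothetical payload: order tasks by priority rather than pages by number
class task
{
    friend ostream& operator<<(ostream& os, task& t);

private:
    long priority;

public:
    task(long p) { priority = p; }
    bool operator==(task& t) const { return priority == t.priority; }
    bool operator<(task& t) const { return priority < t.priority; }
};

ostream& operator<<(ostream& os, task& t)
{
    os << t.priority;
    return os;
}

int main()
{
    // the same template code, unchanged, now manages a tree of tasks
    redblacktree<redblacknode<task> >* tasktree =
        new redblacktree<redblacknode<task> >();

    task t(42);
    redblacknode<task>* node = new redblacknode<task>(t);
    tasktree->insertnode(node, tasktree->root);

    delete tasktree;
    return 0;
}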

## The Reingold-Tilford algorithm revisited

A while ago, as part of early research into what became my MSc project, I wrote code to create and then draw red-black trees using C++.

To draw the trees I used the venerable Reingold-Tilford algorithm, which is more or less the standard approach. I wrote some blogs about it and pages here seem to come pretty high up in Google searches for the algorithm, so I get passing traffic regularly as a result.

But idly chasing these links led me to the forthcoming Handbook of Graph Drawing and Visualization, edited by Roberto Tamassia, which has a chapter on tree drawing by Adrian Rusu containing bad news for us Reingold-Tilford fanboys, as this summary from the book of an experiment comparing algorithmic performance shows (emphasis added):

• The performance of a drawing algorithm on a tree-type is not a good predictor of the performance of the same algorithm on other tree-types: some of the algorithms perform best on a tree-type, and worst on other tree-types.
• **Reingold-Tilford algorithm [RT81] scores worse in comparison to the other chosen algorithms for almost all ten aesthetics considered.**
• The intuition that low average edge length and area go together is contradicted in only one case.
• The intuitions that average edge length and maximum edge length, uniform edge length and total edge length, and short maximum edge length and close farthest leaf go together are contradicted for unbalanced binary trees.
• With regards to area, of the four algorithms studied, three perform best on different types of trees.
• With regards to aspect ratio, of the four algorithms studied, three perform well on trees of different types and sizes.
• Not all algorithms studied perform best on complete binary trees even though they have one of the simplest tree structures.
• **The level-based algorithm of Reingold-Tilford [RT81] produces much worse aspect ratios than algorithms designed using other approaches.**
• The path-based algorithm of Chan et al. [CGKT02] tends to construct drawings with better area at the expense of worse aspect ratio.

## Working set heuristics and the Linux kernel: my MSc report

My MSc project was titled “Applying Working Set Heuristics to the Linux Kernel” and my aim was to test some local page replacement policies in Linux, which uses a global page replacement algorithm, based on the “2Q” principle.

There is a precedent for this: the so-called “swap token” is a local page replacement policy that has been used in the Linux kernel for some years.

My aim was to see if grafting on a local replacement policy could help tackle “thrashing” (when a computer spends so much time trying to manage memory resources – generally swapping pages back and forth to disk – that it makes little or no progress with the task itself).

The full report (uncorrected – the typos have made me shudder all the same) is linked at the end; what follows is a relatively brief and simplified summary.

Fundamentally I tried two approaches: acting on large processes when the number of free pages fell to one of the watermark levels used in the kernel, and acting on the process last run or most likely to run next.

For the first, my thinking – backed by some empirical evidence – was that the largest process tended to consume much more memory than even the second largest. For the second, the thought was that making the process next to run more memory-efficient would make the system as a whole run faster, and that, in any case, the process next to run was also quite likely (again, some empirical evidence supported this) to be the biggest consumer of memory in the system.

To begin I reviewed the theory that underlies the claims for the superiority of the working set approach to memory management – particularly that it can run optimally with lower resource use than an LRU (least recently used) policy.

Peter Denning, the discoverer of the “working set” method and its chief promoter, argued that programs in execution do not smoothly and slowly change their fields of locality, but transition from region to region rapidly and frequently.
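
(Formally, Denning’s working set at time $t$ with window $\tau$ is $W(t, \tau)$, the set of distinct pages referenced in the interval $(t-\tau, t]$; the working set size is then $w(t, \tau) = |W(t, \tau)|$, and a working set policy keeps exactly the pages of $W(t, \tau)$ resident.)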

The evidence I collected – using the Valgrind program and some software I wrote to interpret its output – showed that Denning’s arguments appear valid for today’s programs.

Here, for instance, is the memory access pattern of Mozilla Firefox:

Working set size can therefore vary rapidly, as this graph shows:

It can be seen that peaks of working set size often occur at the point of phase transition – as the process will be accessing memory from the two phases at the same time or in rapid succession.

Denning’s argument is that the local policy suggested by the working set method allows for this rapid change of locality – as the memory space allocated to a given program is free to go up and down (subject to the overall constraint on resources, of course).

He also argued that the working set method will – at least in theory – deliver a better space time product (a measure of overall memory use) than a local LRU policy. Again my results confirmed his earlier findings in that they showed that, for a given average size of a set of pages in memory, the working set method will ensure longer times between page faults, compared to a local LRU policy – as shown in this graph:

Here the red line marks the theoretical performance of a working set replacement policy and the blue line that of a local LRU policy. The y-axis marks the average number of instructions executed between page faults, the x-axis the average resident set size. The working set method clearly outperforms the LRU policy at low resident set values.

The ‘knee’ in either plot, where $\frac{dy}{dx}$ is maximised, is also the point of lowest space-time product – and this occurs at a much lower resident set value for the working set method than for local LRU.

So, if Denning’s claims for the working set method are valid, why is it that no mainstream operating system uses it? VMS and Windows NT (which share a common heritage) use a local page replacement policy, but both are closer to the page-fault-frequency replacement algorithm – which varies fixed allocations based on fault counts – than a true working set-based replacement policy.

The working set method is just too difficult to implement – pages need to be marked with the time they were last used and, to really secure the space-time product benefit claimed, they also need to be evicted from memory at a specified time. Doing any of that would require specialised hardware or complex software or both, so approximations must be used.

“Clock pressure”

For my experiments I concentrated on manipulating the “CLOCK” element of the page replacement algorithm: this removes or downgrades pages if they have not been accessed in the time between alternate sweeps of an imaginary second hand of an equally imaginary clock. “Clock pressure” could be increased – i.e., pages made more vulnerable to eviction – by systematically marking them as unaccessed, while pages could be preserved in memory by marking them all as having been accessed.
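
A toy model of the idea – far removed from the real kernel’s lists, locks and page flags, but it shows what “pressure” means here:

#include <cstddef>
#include <vector>

// Toy model of a clock sweep: each page has a "referenced" bit which is
// set when the page is accessed; the sweep clears it and evicts pages
// found still clear on the next pass. (Illustrative only - not kernel code.)
struct page {
    bool referenced;
    bool resident;
    long owner;     // pid of the owning process
};

// Increase "clock pressure" on one process: mark its pages unreferenced
// so the next sweep treats them as idle and so candidates for eviction
void increase_pressure(std::vector<page>& pages, long pid)
{
    for (std::size_t i = 0; i < pages.size(); ++i)
        if (pages[i].owner == pid)
            pages[i].referenced = false;
}

// Decrease pressure: mark the pages referenced so the sweep spares them
void decrease_pressure(std::vector<page>& pages, long pid)
{
    for (std::size_t i = 0; i < pages.size(); ++i)
        if (pages[i].owner == pid)
            pages[i].referenced = true;
}

// One sweep of the imaginary second hand
void clock_sweep(std::vector<page>& pages)
{
    for (std::size_t i = 0; i < pages.size(); ++i) {
        if (!pages[i].resident)
            continue;
        if (pages[i].referenced)
            pages[i].referenced = false;    // spare it for one more revolution
        else
            pages[i].resident = false;      // evict
    }
}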

The test environment was compiling the Linux kernel – and I showed that the time taken for this was highly dependent on the memory available in a system:

The red line suggests that, for all but the lowest memory, the compile time is proportional to $M^{-4}$ where $M$ is the system memory. I don’t claim this a fundamental relationship, merely what was observed in this particular set up (I have a gut feeling it is related to the number of active threads – this kernel was built using the -j3 switch and at the low memory end the swapper was probably more active than the build, but again I have not explored this).

Watermarks

The first set of patches I tried were based on waiting for free memory in the system to sink to one of the “watermarks” the kernel uses to trigger page replacement. My patches found the largest process and then either increased clock pressure – i.e., made the pages from this large process more likely to be removed – or decreased it, i.e., made it more likely these pages would be preserved in memory.

In fact the result in either case was similar – at higher memory values there seemed to be a small but noticeable decline in performance, but at low memory values performance declined sharply – possibly because moving pages from one of the “queues” of cached pages involves locking (though, as later results showed, also likely because the process simply is not optimal in its interaction with the existing mechanisms to keep or evict pages).

The graph below shows a typical result of an attempt to increase clock pressure – patched times are marked with a blue cross.

The second approach was to interact with the “completely fair scheduler” (CFS) and increase or decrease clock pressure on the process least likely to run next or most likely to run next.

The CFS orders processes in a red-black tree (a semi-balanced tree) and the rightmost node is the process least likely to run next and the leftmost the process most likely to run next (as it has run for the shortest amount of virtual time).
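
In tree terms the two candidates are simply the extreme nodes – something like this generic sketch (not the scheduler’s own code, which works on its rbtree structures):

#include <cstddef>

// Generic sketch: in a binary search tree the leftmost node holds the
// smallest key (least virtual runtime - the process due to run next) and
// the rightmost the largest (the process least likely to run soon).
template <typename node>
node* leftmost(node* root)
{
    if (root == NULL)
        return NULL;
    while (root->left != NULL)
        root = root->left;
    return root;
}

template <typename node>
node* rightmost(node* root)
{
    if (root == NULL)
        return NULL;
    while (root->right != NULL)
        root = root->right;
    return root;
}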

As before the idea was to either free memory (increase clock pressure) or hold needed pages in memory (decrease clock pressure). The flowchart below illustrates the mechanism used for the leftmost process (and decreasing clock pressure):

But again the results were generally similar – a general decline, and a sharp decline at low memory values.

(In fact, locking the leftmost process’s pages in memory actually had little effect – as shown below:)

But when the same approach was taken with the rightmost process – i.e. the process that has run for the longest (virtual) time (and presumably may also run for a long time in the future) – the result was a catastrophic decline in performance at small memory values:

And what is behind the slowdown? Using profiling tools, the biggest reason seems to be that the wrong pages are being pushed out of the caches and need to be fetched back in. At 40MB of free memory both patched and unpatched kernels show similar profiles, with most time spent scheduling and waiting for I/O requests – but the slowness of the patched kernel shows that this has to be done many more times there.

There is much more in the report itself – including an examination of Denning’s formulation of the space-time product, which I conclude is flawed as it disregards the time required to handle page replacement (update: in fairness to Peter Denning, who has pointed this out to me, this criticism applies to his approximation of the space-time product – his modelling in the 1970s did account for the additional time required to manage the working set). The above is all a (necessary) simplification of what is in the report, so if you are interested please read that.

Applying working set heuristics to the Linux kernel

## Red-Black tree generated by memball and treedraw

This is a (much reduced in scale) red-black tree of processes running on my main desktop, ordered by allocated memory (produced by my memball and treedraw programs – so the tree is structured using the Reingold-Tilford algorithm).

If you count, you will see that the path from the root (the little black ball at the top) to the leaves – even smaller black dots at the bottom – always passes through five intervening black balls.

The graphic is a PNG produced from the very large (in scale) SVG. This is the far (highest memory use) end:

And here are the smallest memory users:

## Red black tree resources and sources

My earlier article about red-black trees seems to have attracted some interest, so I thought I’d do another one just listing some resources.

The most cited text on this issue is Introduction to Algorithms, though you don’t need it to implement your own red-black tree. Some useful internet sources are:

This animation gives you a good idea about how your tree should work.

And this video (based on the book) is pretty essential viewing.

Lecture 10: Red-black Trees, Rotations, Insertions, Deletions – Erik Demaine

If you want sources:

My C++ implementation of a red-black tree (GPL licensed)

The C++ standard template library offers various red-black tree based container classes: explained in this Dr Dobb’s Journal article.

The Linux kernel has a C implementation of a red-black tree – read here for more about that. The interface is declared in the kernel sources at include/linux/rbtree.h, with the implementation in lib/rbtree.c.

## Red-black trees

Binary trees are seen and used frequently in computing science and in computing practice. They are a good abstraction for many naturally occurring relationships (most of our mathematics is based on binary operations, for instance) and searching them has O(log n) complexity (i.e., if you went from searching a tree of 1,000 elements to a tree of 100,000 elements the search should not take 100 times longer but only about 1.7 times longer – roughly 17 comparisons instead of 10).

Of course that log n goodness requires the tree to be “balanced”, i.e., for any given node there should be roughly equal numbers of nodes to the left and to the right. One way of doing this is through a “red-black tree” – here nodes in the tree are assigned a colour: the root is always black. The rule is that any traversal from the root to the leaves should always go through an equal number of black nodes, and, to ensure this is possible, red nodes may be inserted in the tree, but no red node may have another red node as an immediate descendant. (A full explanation is in Introduction to Algorithms, though one can also work from the various explanations on the internet, which tend to be less than complete.)
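
In code the colouring is just an extra field on each node; all the cleverness is in the rotations and recolourings that restore the rules after every insertion or deletion. A bare-bones sketch of what a node looks like (illustrative only, not my actual implementation):

#include <cstddef>

enum colour { RED, BLACK };

template <typename T>
struct rbnode {
    T value;
    colour c;           // the root is always BLACK, and no RED node
                        // may have a RED child
    rbnode* left;       // every path from the root to a leaf must pass
    rbnode* right;      // through the same number of BLACK nodes
    rbnode* parent;

    // new nodes start RED: inserting a RED node can only break the
    // red-red rule, which local rotations and recolouring can then fix
    rbnode(const T& v)
        : value(v), c(RED), left(NULL), right(NULL), parent(NULL) {}
};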

The Linux kernel natively implements a red-black tree (in C), and the first bit of work I did on my MSc project was to write my own, userland, implementation so I could see processes in the same way the kernel did.

As I had got a less than glorious mark (still a pass, so that’s what counts) in the C++ exam last year I also decided that I would write this in C++ using templates. A few days after I started I discovered that actually the writers of the STL had got there before me, but as this was an academic exercise I ploughed on with my own implementation.

Essentially this is what I did on my two-week summer holiday in Scotland last year! When I was there I also started (though only completed when I got home) a couple of helper applications: one to position the tree according to Reingold and Tilford’s algorithm for “better drawing of trees” (which I had to translate from PASCAL) and a Qt application to display it all.

In fact I had a nagging problem with the Reingold-Tilford algorithm which I finally got around to fixing last night.

(Interestingly, the code also allows you to use the Turing-complete capabilities of LaTeX by specifying a TeX output that uses LaTeX’s own positioning algorithm – something I picked up from The LaTeX Graphics Companion. That is what the example shown above uses, though unfortunately for even moderately loaded systems the LaTeX processor objects to the width of the output.)

Fancy trying it? I could do with someone giving it a bash on a BSD system – not needed for my course, but interesting nonetheless.

The code is all at GitHub (http://github.com/mcmenaminadrian): memball gives the basic GraphML or TeX or plaintext output, treedraw will convert the GraphML to an SVG or serialised stream using Reingold and Tilford’s algorithm, and treeqt will use Qt to display the tree using the serialised class. You may have to download various libraries to get it to work (certainly the libproc-dev package on Ubuntu/Debian) – I couldn’t get memball to work on a Windows machine using Cygwin, but maybe even that is fixable.

There is a script in the treeqt repo to make it easier: download the sources from all three repos, build them and then run:

./setup | ./treeqt --r 1