First year as a software engineer


Today marks the anniversary of me starting work as a software engineer. I love the job – despite some of the real challenges I’ve faced in a radical career change – and I do feel very lucky to have got it at an exceptionally difficult time.

Some of the changes are about how I see myself. After more than 30 years of working in communications and public policy I was (regardless of what others thought of me, or even whether I was right or wrong) generally very confident in my own judgment and ideas. I’d been around the track – more than once – and, whether you liked it or not, I had a view I’d generally express. Now I am the start-of-career, not-long-out-of-university beginner. It can be daunting at times and I get things wrong, though my colleagues are generally happy to help (even if everyone working remotely sometimes makes that a little harder).

Well, I’m not quite new to everything – I know how corporate things work and while I am nobody’s manager I do know what that is about too (or at least I think I do).

Secondly, I am very much working in an engineering environment rather than a computer science one. The differences are subtle and I cannot quite articulate them, but they are certainly real. “Scientists” (whether computing scientists – applied mathematicians really – or hard scientists) and engineers do tend to look at each other a little warily, and until now I had always been on the “other” side of that divide. But I am getting used to this too.

Above all it’s great to work somewhere where every day I am expected to think and apply that thought to solve novel and interesting problems.

The problem with parallelisation


Just over 15 years ago a big change in general computing occurred – computer hardware more or less stopped getting faster.

The previous three decades of ever-faster computer hardware had been supported by the ability to etch ever-smaller transistors on silicon. This, together with the ability to use bigger silicon wafers, led to what is known as Moore’s Law – a doubling of the number of transistors on a chip (latterly) every 24 months or so.

But as each transistor uses power, and as faster chips use more power at a given voltage, the magic of ever-faster computing was also dependent on Dennard Scaling – namely that as transistors got smaller, their energy use also fell. But Dennard Scaling more or less ground to a halt at the start of the century, and by 2003/4 chip manufacturers had to come to terms with the fact that while chips could still pack in more transistors, they couldn’t keep getting faster.

The implications of this have been profound. First of all, the previously highly buoyant desktop computing market more or less fell off a cliff: if computers no longer doubled in speed every 24–36 months there was no need to keep buying new ones. (Smaller, lighter computers did become more viable and cheaper though – this is why laptops were for show-offs in the 20th century but are commonplace today.)

Instead of chips getting ever faster, the extra transistors now appear as additional processors (cores) on the same chip: the idea being that if we can break computing tasks down into parallel sub-tasks, we can still get better performance from systems that use multiple cores running at slower speeds rather than one big core running at a super-fast speed. In any case we can’t actually have that super-fast core: it would use too much energy, get too hot and be in danger of ceasing to function properly at all.

But concurrency is hard – it’s very difficult to write efficient parallel code – especially because of what is known as Amdahl’s Law, which essentially says that it doesn’t matter how much parallel code you have if you keep having to execute significant bits of it on a single processor, and (this is the killer for multi-core systems) that single processor is now slower because of the multi-core design.
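In its classic form, with a fraction f of the work perfectly parallel and n identical processors, Amdahl’s Law puts the best possible speedup at:

speedup(n) = 1 / ((1 - f) + f/n)

so the serial fraction (1 - f) always sets a ceiling – and on a multi-core chip that serial fraction now runs on one of the slower cores.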

Parallel code needed for multi-core speedup

For my PhD I made a projection (using a formula found in this paper) of just how parallel code has to be to get better performance (see above, where f is the fraction of code that is parallel) and the results are sobering. For code that is 99.9% parallel, using 1000 processors (each of which is about 250 times slower than the one fast chip they collectively replace) we can, more or less, double the speed. But if the code is only 99% parallel then, far from doubling the effective speed of execution, we end up more than halving it. And 99% parallel code is very, very difficult to achieve – imagine running 1000 tasks in parallel while spending only 1% of all time co-ordinating the processes.
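To make that concrete, here is the arithmetic as a small Groovy script. This is only a sketch of the model – it assumes each of the n slow cores runs at a fraction r of the speed of the single fast core they replace and that a fraction f of the work is perfectly parallel; the formula in the referenced paper is more sophisticated, but the numbers come out in the same ballpark:

// Sketch of the speedup model: n slow cores, each running at a fraction r of the
// speed of the single fast core they replace, with a fraction f of the work
// perfectly parallel. An Amdahl-style approximation, not the exact formula
// from the referenced paper.
def speedup = { double f, int n, double r ->
    1 / ((1 - f) / r + f / (n * r))
}

println speedup(0.999, 1000, 1 / 250d)  // ~2.0  : roughly doubles the speed
println speedup(0.99,  1000, 1 / 250d)  // ~0.36 : well under half the speed
println speedup(1.0,   1024, 1 / 250d)  // ~4.1  : the ceiling even with 100% parallel code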

The difficulty is compounded by the fact that even though the cores have multiplied, the other systems on the computer have not – generally speaking we still have one memory and one storage device, so the cores have to queue to access them (this, essentially, is what my PhD was about).

When I started the PhD as a part-time student in 2012 the firm expectation in industry was that we would by now be well into the era of 1024-core chips. That simply hasn’t happened, at least in part because there is no firm commercial reason for it: even if processors with that many cores could be mass-produced (and they probably could), they would actually be slower in the real world than processors with far smaller numbers of cores in all but a few specialised domains.

(For those who are interested, I expect the PhD thesis to be online shortly; I’ll post a link when it is.)

Addendum: A few people have questioned where the figure of “250 times slower” comes from – someone even accused me of making it up! In the referenced paper there is a formula for the Pareto frontier for core performance – in other words, the limit of maximal efficiency. This is where the factor for the individual core slowdown comes from – in fact the 250 figure relates to 1024 cores, but the 1000-core figure would be similar. Simple maths shows that the maximum speed this assembly could manage is about 4 times that of a single high-speed core (1024 cores × 1/250 of the speed ≈ 4.1), but that depends on having 100% parallel code.

It’s also worth noting that this formula is based on a particular node of chip manufacture – the so-called 45nm (nanometre) node (the node number used to relate to the typical size of components etched on the silicon but is now just a generic term for a particular manufacturing/fabrication process). Leading-edge chips are now down to the 7nm node, so higher efficiencies are probably possible: but I don’t have a formula to calculate those, and the basic principle – that we need highly parallel code to get the most from many-core designs – doesn’t change.

I love binary trees


A few years ago I was very proud of myself for managing to translate the original Pascal of the Reingold-Tilford algorithm into C/C++, and ever since then I have had a fascination with drawing binary trees.

NoC with binary memory tree

The one drawn here was actually done “by hand” in Metapost – I realised only as I finished it that I could (or maybe should) have used an algorithmic approach rather than a heuristic (“does it look ok?”). Metapost is a pretty inflexible programming tool, though as this book shows it can produce some beautiful results, and I reckon a decent tree-drawing algorithm written in it ought to be worth at least an MSc.

A fractal in the real world


Of course, real-world fractals do exist, but it is interesting to appear to have “discovered” one – or at least a pattern which certainly shows many fractal-like properties.

I am plotting the relative efficiency of different page replacement algorithms – and here I am looking at Belady’s OPT (or MIN), or as it is sometimes called, the clairvoyant page replacement algorithm.

Here I am measuring the proportion of time a page stands idle (including its loading time). So a page with a ratio of 1.0 is loaded but never used, a page with a ratio of 0.999 is used one-thousandth of the time it is in the system and a page with a ratio of 0.99 is used one-hundredth of the time it’s in the system and so on.
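In code terms the ratio is simply the idle fraction of a page’s resident time – this little Groovy closure is a hypothetical illustration rather than the actual measurement script:

// Hypothetical illustration of the idle ratio described above (not the actual
// measurement code): the fraction of a page's resident time in which it sits unused.
def idleRatio = { long usedCycles, long residentCycles ->
    1 - usedCycles / (double) residentCycles
}

println idleRatio(0, 1000)    // 1.0    : loaded but never used
println idleRatio(1, 1000)    // ~0.999 : used one-thousandth of its resident time
println idleRatio(10, 1000)   // ~0.99  : used one-hundredth of its resident time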

So, here’s the “full” plot for my data:

Full OPT plot

Not much to see – except that most pages are idle most of the time: so we go closer…

90%

And closer…

0.999%

And closer…

0.9999%

That’s about the limit of the precision of the data, but it looks pretty fractal-like to me…

(I should add that the other page replacement policies studied don’t appear to show this fractal-like pattern, exhibiting only one peak/modal frequency.)

Procrastination time: what does this mean?


I am supposed to be writing some more stuff for my PhD thesis, but this feels like a more legitimate way to procrastinate than others…

I have plotted the performance of a model of a multi-core processor system and, because it was something to do, applied a Fourier Transform to the data (the black plots in the first chart):

2k LRU performance

So, my question is: what does the second chart mean? If it means anything, of course.

I get that the real component is plotted on the x-axis and the imaginary component on the y-axis, but is there any information conveyed in the chart?
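For what it’s worth, here is roughly how such a transform and the real/imaginary pairs can be produced – a sketch using Apache Commons Math with made-up sample data, not the code that generated the charts above:

// Illustrative sketch only: compute the DFT of a 1-D performance trace and pull out
// the real and imaginary components that get scattered against each other.
@Grab('org.apache.commons:commons-math3:3.6.1')
import org.apache.commons.math3.transform.*

// Made-up sample data: a sine wave plus a little noise; the length must be a
// power of two for this transformer.
def samples = (0..<1024).collect { Math.sin(it * 0.1) + Math.random() * 0.1 } as double[]

def fft = new FastFourierTransformer(DftNormalization.STANDARD)
def spectrum = fft.transform(samples, TransformType.FORWARD)   // array of Complex values

// The second chart is, in effect, a scatter of these two columns against each other.
spectrum.take(8).each { c ->
    println "re = ${c.real}, im = ${c.imaginary}"
}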

A very first draft…


Our proposal here is to meet head-on this need to use a memory space much greater than the locally available low-latency storage. The problems faced by programmers of today’s many- and multi-core devices are, in a sense, far from new: in the 1960s an earlier generation of programmers, experimenting with virtual memory and time-sharing computers and using programs generated by the first generation of high-level languages, were struck by the phenomenon of thrashing, where computing time was lost to excessive memory management [23]. The solutions we propose here draw on the lessons learned at that time, but significantly adapt those methods to account for the realities of using multicore devices with very limited local storage.

We propose to enable virtual memory [24] for such systems, using a novel form of paging which, instead of transferring whole pages across the memory hierarchy, concentrates on a minimal transfer of partial pages [50]. Here hard faults still impose a significant penalty of lost cycles but the cost of paging in and out of memory is amortised across multiple memory reads and writes. This avoids unnecessary transfer and limits congestion in the memory interconnect, which we find to be a significant determinant of program performance.

One of the lessons of the 1960s was that maintaining a program’s working set in low latency memory was the most effective way of limiting paging and thrashing [22]. Much research at that time and into the 1970s and later concentrated on effective ways to maintain the working set of pages in fast memory and innovation continued into the 1990s (and beyond), for instance with the 2Q algorithm in Linux [38]. Here, though, we establish that better performance in embedded and real-time manycore systems is unlikely to come from a better page replacement algorithm but from a reorientation towards better management of the internals of the pages.

Page replacement algorithms for general-purpose computing are generally robust [49] and, with the decreasing cost of memory, much recent research has concentrated on using large page sizes, not least to minimise TLB misses [9]. Here we return to earlier research findings which emphasise the efficiency of small pages on small systems, and we show that combining these with a FIFO page replacement policy may, perhaps counter-intuitively, deliver more time-predictable or even faster performance than attempts to use some version of a least-recently-used policy; we also show, though, that the choice of page replacement algorithm is likely to be strongly related to the variation in working set size over a program’s lifecycle.

Concerns about power have always been important for embedded devices, but the changing characteristics of hardware mean new issues such as “dark silicon” [28] will be important as we move deeper into the coming manycore era. We demonstrate here that constraints on power may be, to some extent, mitigated by a loosening of bottlenecks elsewhere in a real-time system, particularly in the memory interconnect: in a system where power is no constraint, congestion here dominates performance; where power concerns limit processor activity, they may also lessen that congestion.

Further, we present results showing that some frequency scaling in manycore systems may not limit performance if it contributes to lessening congestion in the memory interconnect: results that suggest the dark silicon issue may not be as big a barrier to performance in real-world applications as might be expected.
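(An aside for readers who don’t live and breathe paging terminology: the FIFO policy mentioned above simply evicts whichever page has been resident longest, regardless of how recently it was used. Below is a minimal Groovy sketch – purely for illustration, not the thesis code.)

// Minimal sketch of a FIFO page replacement policy (illustrative, not the thesis code):
// the frame that has been resident longest is evicted, regardless of recent use.
class FifoPager {
    final int frames
    final Deque<Long> resident = new ArrayDeque<Long>()   // oldest page at the head
    long faults = 0

    FifoPager(int frames) { this.frames = frames }

    void access(long page) {
        if (resident.contains(page)) return            // hit: FIFO order is untouched
        faults++
        if (resident.size() == frames) resident.poll() // evict the page loaded longest ago
        resident.add(page)
    }
}

def pager = new FifoPager(3)
[1L, 2L, 3L, 1L, 2L, 4L].each { pager.access(it) }
assert pager.faults == 4   // pages 1, 2, 3 and 4 fault; the re-references of 1 and 2 hit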

Reviving this blog … with a question about binary trees


Work changes and a determination to actually finish my PhD mean I really should make a bit of an effort here and so I will.

Here is a puzzle about binary trees that has been bothering me – it comes from my PhD research…

In that research I am investigating how to implement virtual memory on a many-core Network-on-Chip (NoC) system. Essentially I have been building and running emulators.

The core of the emulation is the “memory tree” – the connection between the cores of the NoC and the global memory controller. The “Bluetree” memory tree is a binary tree arrangement, with the root node (level 0) connecting to the memory controller and the leaves connecting to the NoC tiles (each tile having a single processor).

At every non-leaf level of the tree there are multiplexors (muxes), each with a buffer that can hold a single memory packet – so if two requests for global memory arrive at once and the buffer is free, there needs to be an arbitration process to decide which packet gets to occupy the buffer and which has to wait.

We have 128 leaf nodes – as in this tree…

Binary tree with 128 leaf nodes

With this many leaf nodes, the question of the arbitration policy of the muxes is important. A naïve approach would be to say: if two packets contest for the buffer, let the packet on the left win the first time and the packet on the right the next – after all, this guarantees no starvation.
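In code, that naïvely fair mux might look something like this (an illustrative sketch, not the Bluetree emulator itself):

// Illustrative sketch of the naïvely "fair" arbitration described above: when both
// inputs hold a packet, the winner alternates, with the left input winning the
// first contested round. Not the Bluetree emulator code.
class AlternatingMux {
    Deque<String> left = new ArrayDeque<String>()
    Deque<String> right = new ArrayDeque<String>()
    String buffer               // room for exactly one packet in flight
    boolean leftsTurn = true    // the left input wins the first contested round

    void arbitrate() {
        if (buffer != null) return                  // buffer occupied: both inputs wait
        if (left && right) {                        // contention: alternate the winner
            buffer = leftsTurn ? left.poll() : right.poll()
            leftsTurn = !leftsTurn
        } else if (left) {
            buffer = left.poll()
        } else if (right) {
            buffer = right.poll()
        }
    }
}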

But it is actually a hopeless policy… the leftmost leaf node (we shall number from zero) faces no opposition, while the rightmost (number 127) loses out to every other node.

But the leaf node’s number alone is not a sufficient way of describing the relative number of blocks – for instance, leaf node 16 appears to be in a better place than leaf node 15: going up the tree, leaf node 15 can be blocked at level 6 (the root node is at level 0), at level 5, level 4, level 3 and level 2, while leaf node 16 is only blocked at level 1.

Practical testing of this naïvely fair approach in the emulator gives results like the following… the right-hand part of the tree performs appallingly, but even there a lot of variation can be seen:

some test results

My question is this: what is the correct (i.e., correct mathematically) way of describing the order of the leaf nodes?

Your assembly is my domain-specific language


A few years ago I set out to write a clone of Sinclair BASIC as a domain-specific language (DSL) in Groovy. The end result was BINSIC, but it was much closer to an interpreter than a DSL: it turned out that the strange way Groovy handles capitalised function/closure names meant that I could not match keywords at all, and so interpretation became the only option (though there were DSL-like features lurking under the covers).

Now I have set out to write an interpreter for RISCV assembly and have written something which is much more like a DSL. My aim was to track the memory reference patterns of various real-time benchmarks running on RISCV systems – based on the disassembly generated by Spike, the RISCV emulator.

Getting that done requires tracking register state – because, in true RISC fashion, reads and writes are done not to immediate addresses but to addresses held in registers, with immediates used as offsets.

To make this happen every (or at least every used) RISCV instruction is mapped to a closure, with the operands treated as closure parameters. The closures all live inside the RegisterFile class so that they can all access the registers and keep the state updated. This makes it quite like a DSL, but I make no claims for purity: every statement is passed through a regular-expression-based interpreter to separate out the parameters – if nothing else that eliminates a lot of boilerplate code that would otherwise have to be stuck inside the closures.

Memory is treated as a sparse array through a Groovy Map instance – with an address (64-bit Long) mapped to an 8-bit Byte.

The process isn’t perfect – a disassembly doesn’t give you the contents of initialised data, so an attempt to access those addresses is quite likely to raise an exception, which needs to be trapped and the memory accounted for by simply assigning zero to the address: it’s a kludge but it seems to work.
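To give a flavour of the pattern, here is an illustrative sketch – not the real interpreter, which handles Spike’s register names, the regex parsing and many more instructions:

// Illustrative sketch of the closure-per-instruction pattern described above:
// not the real interpreter, just the shape of it.
class RegisterFile {
    long[] x = new long[32]                                // RISC-V integer registers
    Map<Long, Byte> memory = [:].withDefault { (byte) 0 }  // sparse memory: unknown addresses read as zero

    // each mnemonic maps to a closure; operands become closure parameters
    Map<String, Closure> ops = [
        'addi': { int rd, int rs1, long imm -> x[rd] = x[rs1] + imm },
        'ld'  : { int rd, int rs1, long imm -> x[rd] = load(x[rs1] + imm, 8) },
        'sd'  : { int rs2, int rs1, long imm -> store(x[rs2], x[rs1] + imm, 8) }
    ]

    long load(long addr, int bytes) {
        (0..<bytes).inject(0L) { acc, i -> acc | ((memory[addr + i] & 0xFFL) << (8 * i)) }
    }

    void store(long value, long addr, int bytes) {
        (0..<bytes).each { i -> memory[addr + i] = (byte) ((value >> (8 * i)) & 0xFF) }
    }

    void execute(String mnemonic, List operands) {
        def op = ops[mnemonic]
        if (op) op.call(*operands)
    }
}

def rf = new RegisterFile()
rf.execute('addi', [1, 0, 42])      // x1 = x0 + 42
rf.execute('sd',   [1, 0, 0x100])   // store x1 at address 0x100
rf.execute('ld',   [2, 0, 0x100])   // x2 = the value just stored
assert rf.x[2] == 42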

Delays in a binary memory tree queue


This is a bit niche, but I spent a fair bit of this weekend working out the algorithm to calculate how long a memory request packet takes to traverse a binary tree (from a leaf to the root). And now I have written a Groovy script (testable in your browser at https://groovyconsole.appspot.com/script/5109475228778496 – click on ‘edit’ and then ‘run’) to calculate the results.

(I am sure that this has been done before – many times – but I couldn’t find the algorithm on a quick search, and then the desire to solve it myself took over.)

The problem is this: memory request packets enter the tree at a leaf, they then have to cross a series of multiplexors until they reach the root (which is where the tree interfaces with the memory supply). Each multiplexor (mux) has two inputs and one output, so taken together the muxes form a binary tree.

As request packets can arrive at both inputs of a mux at the same time, there has to be a decision procedure to determine which packet progresses and which is delayed. The simple choice is to favour packets on either the left or the right input (in my case I went with the right). The question is then: what is the average and maximum delay for a request packet?

So, if you muck about with the script you’ll see that the absolute maximum delay on a 256-leaf tree (e.g., a Bluetree memory tree connected to a 256-tile NoC) is 495 cycles, while the average maximum delay (i.e., for a leaf in the middle of the tree) is 248 cycles.

In contrast, if the load is just 1% then these figures fall to about 3 and 1.5 cycles respectively.

There is a flaw here, though: this all assumes that no new packets enter the system while the original request packet is traversing it. In a dynamic system, even with a load as low as 1%, that is highly unlikely – packets that enter the tree “to the right” of the original request could add to the delay if they chain with packets already present.

Springer and Scala


I need to learn the Scala programming language. Unfortunately the University of York’s library does not seem to have an electronic copy of “Programming in Scala” (Amazon link), but it does have an electronic copy of Springer’s “A Beginner’s Guide to Scala” (Amazon link). This is a very poor book, referring back to examples that don’t exist, offering a very confusing explanation of what object-orientation is and, to cap it all, illustrating examples with pseudocode rather than Scala itself.

It’s not the worst example you’ll find from Springer – “Unleash the System on Chip using FPGAs…” is easily the worst book I have ever seen in print (see my review on Amazon here) – and, of course, they also publish many useful books, conference proceedings and so on. But they appear to have something close to a monopoly on many aspects of computer science publishing, and they use it ruthlessly.

If it wasn’t for other factors I’d suggest the European Commission needs to take a close look at them. Hardly worth it in the UK’s case these days though.