In praise of Roger Penrose

Roger Penrose has been awarded a share in the Nobel Prize for physics and I could not be more pleased.

It is not that I have met him or even attended a lecture by him and nor do we even see him much on TV – but I owe him a debt for his insight and also for his ability to educate and inform.

A few years ago I bought his “Fashion, Faith and Fantasy” – his stern critique of what he sees as the fashion of string theory and assorted similar ideas. I’m not going to pretend it’s an easy book, and much of it was well over my head – but it is still choc-full of insight. From it I went back to an older copy of “Cycles of Time”. This is a shorter book and was much more heavily marketed as for the “lay reader” but, actually, I found much of it much harder to get to grips with – if you don’t really understand conformal geometry (and I didn’t) it’s pretty hard to follow. But, and this is a big but, it has an absolutely brilliant explanation of entropy in its opening chapters and if, like me, you never really got to grips with this subject (because, I maintain, it was so poorly explained) as an undergraduate, this really cuts through the fog.

It mattered to me because its discussion of phase spaces gave me an insight into the stochastic nature of memory management systems and the way in which entropy can help us explain latency in computer systems. I wrote quite a bit about that in my PhD thesis and I am convinced it’s a subject that would repay a lot more investigation. Sir Roger collected an additional citation for that (not that he needed it of course).

Only last night I again made an attempt on the massive “The Road to Reality” – previously I’ve never managed to get past the introductory discussion of non-Euclidean geometry in chapter two, but I feel a bit more confident about that now, so giving it another go.

What’s the evidence your numbers are faked?

PhD poster

Currently reading: Ten Great Ideas About Chance.

It can be a bit of a frustrating read – a lot of the chapters about gambling and judgement appear to be poorly explained to me – especially if you’ve ever actually placed a bet on anything (I am far from a regular gambler – I’m not sure I’ve placed a bet on anything in the last decade but I still know how it works).

But it also has lots of exceptionally interesting stuff: not least the revelation that naturally occurring number sets tend to begin with 1 with a proportion of 0.3 whilst fraudsters (including election fixers and presumably scientific experiment fixers) tend to use such numbers 0.111… (1/9) of the time.

So, taking my life in my hands – what about results in my PhD – (read thesis here). I am actually calculating this as I go along – so I really do hope it doesn’t suggest fraud!

Anyway – first table, which has a calculated SPECmark figure based on other researchers’ work – so not really my findings, though the calculations are mine.

Numbers are: 35, 22.76, 13.41, 7.26, 3.77, 1.94, 1.01, 0.55, 0.31, 0.20 and 0.14

And (phew) that means 4 out of 11 (0.364) begin with 1 and so looks fraud free on this test.

But as I say, those are my calculations, but they are not my findings at root.

So moving on to some real findings.

First, the maximum total length of time it takes for a simulation to complete a set of benchmarks in simulated cycles: 43361267, 1428821, 1400909, 5760495, 11037274, 2072418, 145291356, 5012280.

Here a whole half are numbers that begin with 1. Hmmmm.

The mean time for the same: 42368816, 1416176, 1388897, 5646235, 10964717, 2026200, 143276995, 4733750.

Again, half begin with one.

What about the numbers within the results – in this case the amount of cycles lost due to waiting on other requests?

The numbers are: 35921194, 870627, 844281, 4364623, 1088954, 1446305, 110996151, 3420685

Now the proportion has fallen to 3/8 (0.375) and I can feel a bit more comfortable (not that the other results suggested I was following the fraudsters’ one-ninth pattern in any case.)

Later I produce numbers for an optimised system. How do they perform?

The maxima now become: 2514678, 1357224, 1316858, 3840749, 10929818, 1528350, 102077157, 3202193.

So the proportion beginning with 1 has actually risen to 5/8.

And the means show a similar pattern. Worse news with the blocks though. They now become: 730514, 775433, 735726, 2524815, 806768, 952774, 64982307, 1775537. So I am left with 1/8 (0.125) starting with 1 – something dangerously close to 1/9.

Can I save myself? I hope so … the figures above are for the first iteration of the system, but when we look at subsequent iterations a different pattern (for the blocks) emerges: 130736, 0, 0, 1612131, 97209, 232131, 64450433, 1117599.

This is now back to 3/8.

Of course, I didn’t cheat and I also suspect (I would, wouldn’t I?) that the block count is a better measure of whether I had or not – because the overall execution time of the benchmarks is, in some sense, a function of how long their execution path is and effectively that is determined – the blocking in the system is an emergent property and if I faked the whole thing I’d have less to go on and be much more likely to make it up.

Well, that’s my story and I’m sticking to it.

Flirting with disaster

A week ago, in the middle of the Christmas holidays, I contemplated how I could tie all the bits I had written for my PhD thesis into a coherently and seamlessly argued complete first draft.

I even bought a book about completing a PhD, In the end I decided I needed to add in one more set of experimental data – this time examining how highly parallel code might work in a many-core environment.

The results I had were over two years old and the software simulation I used didn’t quite match the code used later (the later code was a development of the earlier stuff) but I thought if I made that clear, there wouldn’t be a problem.

But a quick test of the model showed that what I had produced before was just too flakey, especially when I decided what I really had to do was compare the potential results of a system with 256 cores with a system with, say, 64 cores: the system with a lower number of cores cannot tackle the whole problem at once, but maybe the combination of faster cores, shorter queues for memory and less idle time (as processors will have to be set to tackle the unsolved problems) might still deliver better performance (I don’t know the answer – yet – to this “Amdahl’s Law” related question).

So work was needed on the code but try as I might I couldn’t get it to work.

Eventually I tracked the issue down to pages of memory being incorrectly marked as “read only” when they should be “read-write”. Read only pages stored locally don’t get written back to main memory and when you are using shared memory to communicate that’s a problem – you write your messages locally but they never get delivered.

I thought I found the bad code – a wayward #define – and a small test seemed to show that this did indeed fix the problem.

Except it didn’t and I realised last night that another piece of code was not creating “read only” page table entries in error (as the first piece of bad code did) but was converting “read write” page table entries to read only (it was meant to do the opposite – ie if you wrote to a read only page the page table entry was suitably updated, but sloppy coding meant instead it was routinely converting “read write” entries to “read only”.)

And the really awful thing was that – unlike the bad #define – this was a piece of code that had been inherited by every subsequent iteration of my simulator.

In other words – an awful lot of results I thought were definitive were wrong. Not actually wrong at such a fundamental level that I had just discovered I’d wasted several years of my life (I recently read a horror story about how a chemistry PhD wrote a whole thesis only to discover his or her “interesting result” was caused by accessing unallocated memory in Fortran 77) – but enough to mean the numbers are wrong.

Initial reaction was panic. Today it’s more stoical – I now have 11 simulations running to get new data and I probably need the same number or so again to fill in all the gaps.

The only positive is that it all has come so late in the process that I am confident in handling and interpreting the data that will come out of the fixed experiments.

But the downside is that I am running out of time and these experiments quite literally take weeks to run – and if there is a hitch with the university IT setup I go straight back to square one.

Wish me luck.

Reaching a decision

English: Distributed Memory
English: Distributed Memory (Photo credit: Wikipedia)

A week further on and not much C++ has been written – and now I think I need to make a new start.

Up to this point I have been trying to write a software model of the hardware and my thought was I could think put a software-modelling layer on top of that. But that simply is not going to work – it is just too complex.

Instead I am going to have to make some policy decisions in the software – essentially over how I model the local memory on the chip: each tile will process memory reads and writes and needs to know where that memory is – it could be in the global off-chip memory store or it could be on-chip.

The difference matters because, at least in theory, the on-chip memory is speedily accessible, while the off-chip memory is 50 to 100 to 500 times “further away”. ¬†Because memory accesses exhibit locality it makes sense to ship blocks of addressed memory from the global to the local store – but doing so takes time and if there are a lot of memory movements then we get thrashing.

What I now have to do is think of what policy I will use to decide what memory gets stored locally (or, more likely, what policy I use to map the addresses). I’ll start by once again reviewing papers that propose some schemes for existing Networks-on-Chip.

In other news: I have had a paper (of which I am co-author and first named author) accepted by OSPERTS 15 – so I will be off to Sweden to get mauled by the audience there in early July. It will be an experience, and I am looking forward to it, but I also think it might be not so much a baptism, but a destruction by fire.


Die of an Intel 80486DX2 microprocessor (actua...
Die of an Intel 80486DX2 microprocessor (actual size: 12√ó6.75 mm) in its packaging. (Photo credit: Wikipedia)

Been a while since I’ve written here – been avoiding writing about politics, which has obviously not been so great for me in the last couple of weeks… but now I have something else to ruminate on.

I have reached a milestone, or perhaps basecamp, in my PhD research: having a model for memory management that needs further exploration. (Hopefully there may even be a paper on this soon.)

Some of that exploration will have to be in hardware, and that’s really beyond me but I can and should build a software model to test how a many core system built using this new model might operate.

So far I have been testing or proving concepts with OVPSim but it does not allow me to build a true asynchronous multi-core model, so I need to do that in software myself.

But where to begin – I have a list of classes that you might want to have in C++:

  • Computer – which would aggregate…
    • DRAM
    • Storage
    • NoC – which would aggregate…
      • Mesh
      • Tiles – which would aggregate…
        • CPU
        • Cache
        • Ports (to the Mesh)

I hope you can see how quickly this becomes complex – and all we are talking about here is a simple software framework to allow me to do the research (ie., delivering the software, complex as it is, is only the very start.)

I am struggling to know where to begin – C++ seems like the logical choice for this, but it’s not proving to be much fun. Particularly because my CPU class has to be able to “execute” some code – I thought about using a DSL but may stick to the XML output I got from my hacked Valgrind Lackey – as at least I can then use existing XML engines.

Should I build from the XML up – eg., get a CPU class that can hack the XML and pass the requests up the chain (eg via the cache up to the Mesh and up to the DRAM etc), or what?

Slogging it out

Wrote around 100 lines of C++ tonight, but don’t feel I am making much progress in building my project: not least because many of those lines were to replace something I’d written earlier in the week.

To make matters worse I keep looking at older code I’ve written and thinking how beautiful it is, rather than the stuff I am just slapping into Vim right now. I tell myself I probably thought the same then too, but right now none of this feels like fun.

The mess is all over on Github (look me up as mcmenaminadrian).

Building a model

SVG Graph Illustrating Amdahl's Law
SVG Graph Illustrating Amdahl’s Law (Photo credit: Wikipedia)

Had a good discussion with my supervisor today – he essentially said, in terms, “expect to produce nonsense for 18 months” (meaning experimental results which seem not so useful). This was helpful as it made me get my worries about the last two weeks of tinkering on the edges of building the first “experiment” – a logical model of a NoC – into perspective.

The work goes on.

(And a new book arrived – Using OpenMP: Portable Shared Memory Parallel Programming – having consciously avoided “middleware” when writing my QD I decided I did actually need to know about it after all.)

Books about PhDs

I was wondering whether anyone has any recommendations for books about getting/doing a PhD. I am now reading The Unwritten Rules of PhD Research and it is very good, so I’d happily recommend it – it is one of those books that helps you see the underlying order in what appears to be the chaos going on around you.

For my domain of study Writing for Computer Science: The Art of effective Communication is also essential reading.

A supercomputer on every desktop? Not yet, anyway

My PhD is about operating systems on Network-on-Chip (NoC) computers. I have not actually done any research yet, so don’t expect anything here – but I have been playing with some existing data and I think it gives some interesting results.

NoCs are part of the semiconductor industry’s response to the failure of “Dennard scaling”: Moore‘s law says we can double the number of transistors on a chip every 18 – 24 months and we are still able to do that. Dennard scaling was the thing that made that super useful – because it meant the power requirements for the processors stayed constant even as they acquired more transistors. Now it has broken down, building faster chips becomes that much more difficult because, bluntly, they would burn up unless we limited the power.

NoCs aim to get round this by replacing one fast and power hungry processor on a single chip with several less powerful processors on the same chip – the idea being if we can attack the problem with several slower processors we can get the job done more quickly than if we used just one faster processor.

But there’s a catch, a big one, as captured in Seymour Cray‘s question:

would you rather plough a field with two strong bulls or a thousand chickens?

NoCs do not replace one fast chip with a dozen not quite as fast chips – they parcel out the power eaten by that one chip to the whole network on the chip – it’s not quite as bad as dividing the computing power by the number of chips (for that was the case there would be no advantage at all), but it is not fantastically above that.

Using work published by Hadi Esmaeilzadeh from the University of Washington along with others from the University of Wisconsin – Madison, the University of Texas at Austin and Microsoft Research, my projection is that, if we took one of today’s fast chips and parcelled up the power, then we would see computing power decline like this:

  • One processor: 100% computing power
  • Two processors: 65% computing power each
  • Four processors: 38% computing power each
  • Eight processors: 21% computing power each
  • Sixteen processors: 11% computing power each
  • Thirty-two processors: 6% computing power each
  • Sixty-four processors: 3% computing power each

Now, 64 x 3 = 192, so that might look like quite a good deal – a 92% speed up. But it is not that simple because some part of the code, even if just the bit that starts and ends your process, can only run on one processor even if all the rest can be split into 64 equal parts. And the pieces that will only run on one processor are now 33 times slower than they were before. The key balance is this: how much code can you run at nearly twice the speed (92% speed up) versus how much do you have to run at 33 times slower than before?

The answer is that you have to run a lot of code in the fast zone before you really see a big advantage.

45nm NoC modelled

As the graph suggests you would need to have about 99.9% of your code capable of running in parallel before you saw a guaranteed speedup with 64 processors in your NoC. Plenty of such code exists – such as in image handling and so on – but you are not likely to be running too much of it on your desktop computer at any given time (except perhaps when you are running a streaming video application) and the big disadvantage is that when you are not running the parallel code you are stuck with the 3% performance.

(Actually, it’s not quite as simple as that, as you may have a faster processor equipped to do the single thread of execution stuff, but then your computer starts to become more expensive.)

In the future chips will get faster – but maybe not that much faster. In a decade’s time they could be between 400% and 34% faster than they are today, depending on whether you are optimistic or pessimistic (realistic?) about processor technologies. That will help, but still not enough to put this technology in your desktop – as opposed to your games console or your television set or maybe a specialised processor in a top of the range machine.

So don’t expect your personal supercomputer too soon.