My musical talent never really got beyond self-taught recorder about thirty-five years ago: I did teach myself to read music as a result and could play simple things like “Jingle Bells” on the piano, but I certainly never managed to get so far as writing anything.
That might have contributed to many years of misery as a teenager, and it might point to why so many of my contemporaries (then and now) seemed desperate to be in a band – because, it would appear, your musical abilities are directly linked to your sexual attractiveness.
Over 140 years ago Charles Darwin first argued that birdsong and human music, having no clear survival benefit, were obvious candidates for sexual selection. Whereas the first contention is now universally accepted, his theory that music is a product of sexual selection through mate choice has largely been neglected. Here, I provide the first, to my knowledge, empirical support for the sexual selection hypothesis of music evolution by showing that women have sexual preferences during peak conception times for men that are able to create more complex music. Two-alternative forced-choice experiments revealed that women only preferred composers of more complex music as short-term sexual partners when conception risk was highest. No preferences were displayed when women chose which composer they would prefer as a long-term partner in a committed relationship, and control experiments failed to reveal an effect of conception risk on women’s preferences for visual artists. These results suggest that women may acquire genetic benefits for offspring by selecting musicians able to create more complex music as sexual partners, and provide compelling support for Darwin’s assertion ‘that musical notes and rhythm were first acquired by the male or female progenitors of mankind for the sake of charming the opposite sex’.
This is a post about my PhD research: in fact it is a sort of public rumination, an attempt to clarify my thoughts in writing before I take the next step.
It’s also possibly an exercise in procrastination: a decision to write about what I might do next, rather than to get on with doing it, but I am going to suppress that thought for now.
I am looking for ways to make “network on chip” systems more viable as general use (or any use, I suppose) computing platforms. These systems are a hardware response to the hardware problem that is causing such difficulties for big software and hardware manufacturers alike: namely that we cannot seem to make faster computers any more.
The problem we have is that while we can still get more transistors on a chip (i.e., “Moore’s Law” still applies), we cannot keep operating them at faster speeds (i.e., “Dennard scaling” has broken down) because they get too hot.
In response we can either build better small devices (mobile phones, tablets) or try to build faster parallel computing devices (so instead of one very fast chip we have several moderately fast chips and try to have better software that makes good use of their ability to compute things in parallel).
Network-on-chip (NoC) processors are a major step along the road to parallel processing – we put more processing units on a single piece of silicon rather than having them co-operate via external hardware. But the software has not caught up and we just cannot keep these chips busy enough to get the benefit their parallelism might offer.
That is where I hope to make a difference, even if just at the margins. Can I find a way to make the NoC chips busier, specifically by keeping them fed with data and code from the computer memory fast enough?
I have tried the obvious and simple methods – essentially adaptations of approaches that have served conventional serial computers for most of the last 50 years – and the answer is ‘no’ if that is all that is on offer.
Messing about with the standard algorithms used to feed code and data to the processors hits a solid brick wall: the chips have a limited amount of ‘fast’ local memory and the time it takes to keep that refreshed with up-to-date code and data places a fundamental limit on performance.
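For reference, the “standard” approach amounts to something like least-recently-used (LRU) replacement: when the fast local memory is full, evict the page that has gone unused longest. A minimal Python sketch (a toy model for illustration, not the actual simulator code):

```python
from collections import OrderedDict

def simulate_lru(page_refs, frames):
    """Count page faults for an LRU cache with a fixed number of frames.

    A toy model of the standard replacement policy: the least
    recently used page is evicted when local memory is full.
    """
    cache = OrderedDict()  # insertion/move order tracks recency
    faults = 0
    for page in page_refs:
        if page in cache:
            cache.move_to_end(page)        # hit: mark as most recently used
        else:
            faults += 1                    # miss: the page must be fetched
            if len(cache) >= frames:
                cache.popitem(last=False)  # evict the least recently used
            cache[page] = True
    return faults

# A synthetic reference string with 3 frames of fast local memory
print(simulate_lru([0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2], 3))  # 8 faults
```

Every one of those misses is a page-load paid for at full memory latency, which is exactly where the brick wall comes from.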
Many computer science students will be familiar with “Amdahl’s Law”, which stipulates that, for parallel code, the elements that must run serially (even if just setting up the parallel section) place a fundamental limit on how much extra performance we can squeeze out by throwing more and more parallel processors at the problem. Here we have a different, if related, problem: we can apply more and more parallel processors, yet performance remains constant, because even though we are running parallel code we are limited by memory performance.
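The arithmetic behind Amdahl’s Law is easy to illustrate – a quick sketch:

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Amdahl's Law: with serial fraction s and n processors,
    speedup(n) = 1 / (s + (1 - s) / n), capped at 1/s as n grows."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# Even with only 5% serial code, 1024 processors give under 20x speedup:
print(round(amdahl_speedup(0.05, 1024), 1))   # -> 19.6
print(round(amdahl_speedup(0.05, 10**9), 1))  # the limit is 1/0.05 = 20.0
```

The memory-bound case is worse in a sense: the limiting term is not a fixed serial fraction of the code but the time spent refreshing the small local memories, so adding processors does not even approach a ceiling – it is already flat.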
This limit – which implies that as we use more processors they become individually less efficient – even hits the so-called “clairvoyant” or optimal (OPT) memory management/page replacement algorithm: OPT knows which memory page it is most efficient to replace but is still limited by the fundamental barrier of limited on-chip memory.
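Belady’s OPT can be sketched in the same toy style – this assumes the full future reference string is known in advance, which is exactly why OPT is “clairvoyant” and unrealisable in practice:

```python
def simulate_opt(page_refs, frames):
    """Belady's clairvoyant (OPT) policy: on a miss, evict the resident
    page whose next use lies furthest in the future (or never comes).
    A lower bound on faults for any policy -- but still bounded below
    by the limited amount of fast local memory."""
    cache = set()
    faults = 0
    for i, page in enumerate(page_refs):
        if page in cache:
            continue
        faults += 1
        if len(cache) >= frames:
            future = page_refs[i + 1:]
            # pick the cached page used furthest ahead, or never again
            victim = max(
                cache,
                key=lambda p: future.index(p) if p in future else len(future),
            )
            cache.discard(victim)
        cache.add(page)
    return faults

# Same reference string as the LRU example: OPT faults 6 times, LRU 8
print(simulate_opt([0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2], 3))  # -> 6
```

OPT fewer faults, yes – but each fault still costs the full trip to backing memory, which is the fundamental barrier described above.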
The limit is manifest in the straight lines we can see in the plot here – the steeper slope of OPT means it runs faster, but after the first few processors are brought to bear on the problem (the number of processors in use climbs for the first few billion instructions) the rate of instructions executed per ‘tick’ (an analogue of time) is constant.
Getting NoCs to run faster and so releasing the benefits from the potentially massive parallelism they could bring, depends on beating this memory barrier (and lots of other things too, but one at a time!). So, what are the options?
Well, one thing I can rule out is trying to cache a particular piece of a memory page (in traditional operating systems memory is shifted about the system in blocks called pages – typically 4096 bytes long). Caches typically store memory in 16 byte “lines” – hardware reads from the backing memory store in 16 byte blocks in most cases – and so I tested to see whether there was a pattern in which 16 byte line was most likely to be used (see previous blog post). My instinct from looking at the plot is that this will not work.
Similarly, a look at which pages were being used doesn’t reveal any immediately obvious pattern – some pages are used heavily by code, some are not – nothing surprising there.
So, the easy things do not work. Now I need to look at the hard things.
I think I need to escape from the page paradigm – one thing to look at is the size of the memory objects being accessed. 4k pages are simple to handle – load a block in, push a block out – but they could be (and probably are) very inefficient. Might it be better to base our memory caching system on object sizes? That’s what I plan to check.
This follows on from the previous post – here are the plots.
These are based on a run of the PARSEC benchmark suite x264 program – encoding 32 frames of video using 18 threads, 16 of them worker threads. The plots show how often each 16 byte “line” is used – whether an instruction is read from it or it is used for read-write storage. Sixteen bytes is both the size of a typical cache line and of a single read from DDR memory.
The code plot might suggest there is some pattern – between about segment 100 (offset 0x640 inside the page) and segment 200 (offset 0xC80) there is an increased hit rate – but my guess is that this is an artefact of the particular code being used here rather than a general issue (a possible explanation would be a particular library function being very heavily used), though conceivably it could be an issue with the GCC compiler or linker.
That might be worth further exploration, but not for me. From visual inspection I am concluding that the distribution of accesses inside a 4k page doesn’t justify trying to pre-cache particular 16 byte “lines”.
Now that I have mapped the speed of OPT and LRU using a traditional two-level memory hierarchy, my task is to find something better that might make NoC a more viable computing platform.
One thought that occurred to me was that the preference of C and similar languages for page-aligned memory in certain situations might make it feasible or reasonable to cache the first 128 bits (16 bytes) of each page in faster access memory.
So I wrote some code to test which 16 byte “segments” of a page were accessed by code.
The list that follows is for read/write accesses (i.e., not instructions). I will get around to plotting a graph, but it looks like that idea – which was always a bit of a straw-clutching exercise – is not going to fly.
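For what it’s worth, such a count reduces to a few lines – this sketch assumes a flat list of byte addresses from a trace, not the actual instrumentation I used:

```python
PAGE_SIZE = 4096
LINE_SIZE = 16
SEGMENTS = PAGE_SIZE // LINE_SIZE   # 256 16-byte "lines" per 4k page

def segment_histogram(addresses):
    """Count accesses per 16-byte segment within a 4k page.

    `addresses` is assumed to be a flat list of byte addresses from a
    memory trace; the page offset is the low 12 bits of the address,
    and dividing by 16 gives the segment (line) index within the page.
    """
    counts = [0] * SEGMENTS
    for addr in addresses:
        offset = addr & (PAGE_SIZE - 1)   # position inside the page
        counts[offset // LINE_SIZE] += 1
    return counts

# segment 100 corresponds to page offset 0x640, segment 200 to 0xC80
hist = segment_histogram([0x1640, 0x2C80, 0x3645])
print(hist[100], hist[200])   # -> 2 1
```

If the first-16-bytes idea had any legs, segment 0 would stand out in such a histogram; it doesn’t.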
Yesterday I set a new personal best (PB) for the Finsbury Park 5k Parkrun: 23 minutes and 50 seconds – 5 seconds faster than my previous PB, set on 22 February and a full 170 seconds faster than my first (completed – there was a previous failure!) run on 29 June last year.
To beat Mo Farah in a re-run Olympics I need to shave another 10 minutes and 9 seconds – 609 seconds – off my time. Easy – at a linear rate of improvement I will manage that in another 2 years and 10 months. Rio here I come!
But why stop there: you will have noticed I have also taken 3.4 seconds off my 100 metre time. To be the fastest man in the world and beat all comers with 9.5 seconds for 100 metres I need just wait another 5 and a bit years.
Of course, I’m never even going to come close. My rate of improvement is slowing rapidly. A full 57 seconds of that 170 seconds – in other words more than one-third of it – came in the week between 29 June and 6 July last year.
Even the current rate of improvement – 5 seconds over seven weeks – probably won’t be sustained for very much longer. And even at that rate I wouldn’t be beating Mo for another 16.4 years – by which time I’ll be 65.
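The extrapolation is just linear arithmetic – a throwaway sketch, using the times above:

```python
def weeks_until_target(current_s, target_s, gain_s, per_weeks):
    """Naive linear extrapolation: weeks until current_s reaches
    target_s, improving by gain_s seconds every per_weeks weeks."""
    return (current_s - target_s) / gain_s * per_weeks

# 23:50 down to Mo's 13:41, at 5 seconds per seven weeks
weeks = weeks_until_target(23 * 60 + 50, 13 * 60 + 41, 5, 7)
print(round(weeks / 52, 1))   # -> 16.4 years
```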
Interestingly, the two series are very close to isomorphic (i.e., they have essentially the same shape if one scales the x-axis appropriately).
This is based on running a short piece of code to decode 32 frames of video. The memory model used was 4k pages with a pooled 512k of on-chip memory (each core can access any part of the pool at a cost of 1 tick) and unlimited memory 100 ticks away per 128 bits read in (so for a 4k page it takes 25,600 ticks to fault the page in).
This is a bit of a crude simplification of the memory model seen on, say, a Tilera TILE64 system, but it’s close enough (though this simulation had 16 cores).
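The cost figures in that model reduce to a couple of lines – a sketch using only the numbers quoted above:

```python
PAGE_SIZE = 4096       # bytes per page
READ_UNIT = 16         # bytes per read (128 bits)
LOCAL_TICK = 1         # cost of an access to the on-chip pool
REMOTE_TICKS = 100     # cost per 128-bit read from distant memory

def fault_cost(page_size=PAGE_SIZE):
    """Ticks to fault a whole page in from the distant memory:
    one 100-tick read per 128-bit unit of the page."""
    return (page_size // READ_UNIT) * REMOTE_TICKS

print(fault_cost())   # -> 25600 ticks, as quoted in the post
```

That 25,600-tick fault cost against a 1-tick local access is the gap every replacement policy in the simulation is fighting.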