Looking for a mate? Better get musical then


Charles Darwin

My musical talent never really got beyond self-taught recorder about thirty-five years ago: I did teach myself to read music as a result and could play simple things like “Jingle Bells” on the piano, but I certainly never managed to get so far as writing anything.

That might have contributed to many years of misery as a teenager, and it points to why so many of my contemporaries (then and now) seemed desperate to be in a band – because it would appear your musical abilities are directly linked to your sexual attractiveness.

A study conducted by Benjamin Charlton of the University of Sussex shows that women are more interested in sex with skilled composers of music than with those without much musical talent. The abstract states:

Over 140 years ago Charles Darwin first argued that birdsong and human music, having no clear survival benefit, were obvious candidates for sexual selection. Whereas the first contention is now universally accepted, his theory that music is a product of sexual selection through mate choice has largely been neglected. Here, I provide the first, to my knowledge, empirical support for the sexual selection hypothesis of music evolution by showing that women have sexual preferences during peak conception times for men that are able to create more complex music. Two-alternative forced-choice experiments revealed that women only preferred composers of more complex music as short-term sexual partners when conception risk was highest. No preferences were displayed when women chose which composer they would prefer as a long-term partner in a committed relationship, and control experiments failed to reveal an effect of conception risk on women’s preferences for visual artists. These results suggest that women may acquire genetic benefits for offspring by selecting musicians able to create more complex music as sexual partners, and provide compelling support for Darwin’s assertion ‘that musical notes and rhythm were first acquired by the male or female progenitors of mankind for the sake of charming the opposite sex’.

 


None of the easy things work


This is a post about my PhD research: in fact it is a sort of public rumination, an attempt to clarify my thoughts in writing before I take the next step.

It’s also possibly an exercise in procrastination: a decision to write about what I might do next, rather than to get on with doing it, but I am going to suppress that thought for now.

I am looking for ways to make “network on chip” systems more viable as general use (or any use, I suppose) computing platforms. These systems are a hardware response to the hardware problem that is causing such difficulties for big software and hardware manufacturers alike: namely that we cannot seem to make faster computers any more.

The problem we have is that while we can still get more transistors on a chip (i.e., “Moore’s Law” still applies), we cannot keep running them at ever higher speeds (i.e., “Dennard Scaling” has broken down) because they get too hot.

In response we can either build better small devices (mobile phones, tablets) or try to build faster parallel computing devices (so instead of one very fast chip we have several moderately fast chips and try to have better software that makes good use of their ability to compute things in parallel).

Network-on-chip (NoC) processors are a major step along the road of having parallel processors – we put more processing units on a single piece of silicon rather than have them co-operate via external hardware. But the software has not caught up and we just cannot keep these chips busy enough to get the benefit their parallelism might offer.

That is where I hope to make a difference, even if just at the margins. Can I find a way to make the NoC chips busier, specifically by keeping them fed with data and code from the computer memory fast enough?

I have tried the obvious and simple methods – essentially adaptations of approaches that have been used for most of the last 50 years in conventional serial computers – and the answer is ‘no’ if that is all that is on offer.

Messing about with the standard algorithms used to feed code and data to the processors hits a solid brick wall: the chips have a limited amount of ‘fast’ local memory and the time it takes to keep that refreshed with up-to-date code and data places a fundamental limit on performance.

Many computer science students will be familiar with “Amdahl’s Law”, which stipulates that, for parallel code, the elements that have to run serially (even if that is just setting up the parallel section) place a fundamental limit on how much extra performance we can squeeze out by throwing more and more parallel processors at the problem. Here we have a different, if related, problem: we can apply more and more parallel processors, yet performance stays flat, because even though we are running parallel code we are limited by memory performance.
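For reference, Amdahl’s Law in its usual form: if a fraction p of a program’s work can be spread across N processors, the best overall speed-up is

S(N) = 1 / ((1 − p) + p/N)

which can never exceed 1/(1 − p) however large N becomes. The limit I am running into is not that serial fraction, though – it is the memory system.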

This limit – which implies that as we use more processors they become individually less efficient – even hits the so-called “clairvoyant” or optimal (OPT) memory management/page replacement algorithm: OPT always knows which memory page is the most efficient one to replace, but it is still constrained by the small amount of on-chip memory.
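For anyone who has not met the two policies, here is how LRU and OPT each pick a victim page – a minimal sketch in Python over a toy trace of page numbers, purely for illustration (it is not the simulator that produced the figures below):

# Toy illustration of LRU v OPT victim selection over a page-reference trace.
# Not the simulator used for the plots - just the shape of the two policies.

def lru_victim(frames, last_used):
    # Evict the page whose most recent use lies furthest in the past.
    return min(frames, key=lambda page: last_used[page])

def opt_victim(frames, trace, position):
    # Evict the page whose next use lies furthest in the future (or never).
    # OPT is "clairvoyant": it needs to see the whole future reference trace.
    def next_use(page):
        try:
            return trace.index(page, position + 1)
        except ValueError:
            return float("inf")   # never used again - the perfect victim
    return max(frames, key=next_use)

def count_faults(trace, num_frames, policy):
    frames, last_used, faults = set(), {}, 0
    for position, page in enumerate(trace):
        if page not in frames:
            faults += 1
            if len(frames) >= num_frames:
                if policy == "lru":
                    victim = lru_victim(frames, last_used)
                else:
                    victim = opt_victim(frames, trace, position)
                frames.remove(victim)
            frames.add(page)
        last_used[page] = position
    return faults

trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(count_faults(trace, 3, "lru"), count_faults(trace, 3, "opt"))  # 10 faults v 7

Even OPT, with its perfect knowledge of the future, still has to fault pages in and out of the small on-chip pool – and that is the wall both curves hit.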

The limit is manifest in the straight lines in the plot below: the steeper slope of OPT means it runs faster, but once the first few processors have been brought to bear on the problem (the number of processors in use climbs over the first few billion instructions) the rate of instructions executed per ‘tick’ (an analogue of time) becomes constant.

OPT and LRU compared – both fundamentally stymied by memory shortage

 

Getting NoCs to run faster, and so releasing the benefits of the potentially massive parallelism they could bring, depends on beating this memory barrier (and lots of other things too, but one at a time!). So, what are the options?

Well, one thing I can rule out is trying to cache a particular piece of a memory page (in traditional operating systems memory is shifted about the system in blocks called pages – typically 4096 bytes long). Caches typically store memory in 16-byte “lines” – in most cases hardware reads from the backing memory store in 16-byte blocks – and so I tested to see whether there was a pattern in which 16-byte line was most likely to be used (see the previous blog post). My instinct from looking at the plot is that this will not work.

Similarly, a look at which pages were being used doesn’t reveal any immediately obvious pattern – some pages are used heavily by code, some are not – nothing surprising there.

So, the easy things do not work. Now I need to look at the hard things.

I think I need to escape from the page paradigm – one thing to look at is the size of the memory objects that are accessed. 4k pages are simple to handle – load a block in, push it out: but they could be (probably are) very inefficient. Might it be better to base our memory caching system on object sizes? That’s what I plan to check.
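To make that slightly more concrete – and this is purely an illustration of the shape of the idea, not a design – imagine the fast on-chip memory managed as a pool of variable-sized objects rather than fixed 4k pages:

# Illustration only: a fast-memory cache whose unit is a variable-sized
# object rather than a fixed 4k page, evicting least-recently-used objects
# until a newly loaded one fits.

from collections import OrderedDict

class ObjectCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.objects = OrderedDict()   # (base_address, size) -> True, kept in LRU order

    def access(self, base_address, size):
        # Returns True on a hit, False on a miss (after "loading" the object).
        key = (base_address, size)
        if key in self.objects:
            self.objects.move_to_end(key)   # refresh its LRU position
            return True
        # Miss: evict least-recently-used objects until the new one fits.
        while self.used + size > self.capacity and self.objects:
            (evicted_base, evicted_size), _ = self.objects.popitem(last=False)
            self.used -= evicted_size
        self.objects[key] = True
        self.used += size
        return False

cache = ObjectCache(512 * 1024)      # 512k of fast on-chip memory
cache.access(0x400000, 24)           # a small object costs 24 bytes, not 4096

The attraction is obvious enough: a 24-byte object no longer drags a whole 4096-byte page into scarce on-chip memory. Whether that pays off in practice is the thing to check.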


Which bit of a 4k page does your program use?


This follows on from the previous post – here are the plots.

These are based on a run of the x264 program from the PARSEC benchmark suite – encoding 32 frames of video using 18 threads, 16 of them worker threads. The plots show how often each 16-byte “line” of a 4k page is used, whether because instructions are read from it or because it is used for read-write data storage. Sixteen bytes is both the size of a typical cache line and the size of a single read from DDR memory.

Plots: all accesses; code (instruction) accesses only; memory (read/write) accesses only
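Turning the per-line counts into a bar chart needs nothing exotic – a rough sketch using matplotlib, with placeholder data standing in for the real 256 per-segment counts from the trace:

import matplotlib.pyplot as plt

# Placeholder data: in practice this list holds the 256 per-segment counts
# gathered from the memory trace (see the previous post for the raw numbers).
counts = [1000 + (i % 16) * 50 for i in range(256)]

plt.bar(range(256), counts, width=1.0)
plt.xlabel("16-byte segment within the 4k page (0-255)")
plt.ylabel("Accesses")
plt.title("Accesses per 16-byte line of a 4k page")
plt.show()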

 

The code plot might suggest there is some pattern – between about segment 100 (offset 0x640 inside the page) and segment 200 (offset 0xC80) there is an increased hit rate – but my guess is that this is an artefact of the particular code being used here rather than a general issue (a possible explanation would be a particular library function being very heavily used), though conceivably it could be an issue with the GCC compiler or linker.

That might be worth further exploration, but not for me. From visual inspection I am concluding that the distribution of accesses inside a 4k page doesn’t justify trying to pre-cache particular 16 byte “lines”.


Looks like one hope/theory knocked on the head


Samsung DDR4 (Photo credit: samsungtomorrow)

Now that I have mapped the speed of OPT and LRU using a traditional two-level memory hierarchy, my task is to find something better that might make NoCs a more viable computing platform.

One thought that occurred to me was that the preference of C and similar languages for page-aligned memory in certain situations might make it feasible or reasonable to cache the first 128 bits (16 bytes) of each page in faster-access memory.

So I wrote some code to test which 16-byte “segments” of a page were being accessed by the program.
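In outline the counting is trivial – a minimal sketch, assuming a plain text trace of hexadecimal addresses, one per line (the trace file name here is made up for illustration):

# Count accesses per 16-byte "segment" of a 4k page from an address trace.
from collections import Counter

PAGE_SIZE = 4096
LINE_SIZE = 16   # one DDR read / a typical cache line

counts = Counter()
with open("memory_trace.txt") as trace:            # hypothetical trace file
    for line in trace:
        address = int(line.strip(), 16)
        segment = (address % PAGE_SIZE) // LINE_SIZE   # 0..255
        counts[segment] += 1

for segment in range(PAGE_SIZE // LINE_SIZE):
    print(f"Segment: {segment} Count: {counts[segment]}")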

The list that follows is for read/write accesses (i.e. not instructions). I will get around to plotting a graph, but it looks like that idea – which was always a bit of a straw clutching exercise – is not going to fly.

 

Segment: 0 Count: 3145700

Segment: 1 Count: 4892407

Segment: 2 Count: 3207293

Segment: 3 Count: 3790303

Segment: 4 Count: 5315932

Segment: 5 Count: 8143085

Segment: 6 Count: 6060159

Segment: 7 Count: 3900910

Segment: 8 Count: 10018940

Segment: 9 Count: 3738990

Segment: 10 Count: 3654813

Segment: 11 Count: 3887982

Segment: 12 Count: 4046926

Segment: 13 Count: 7366857

Segment: 14 Count: 5029997

Segment: 15 Count: 5555853

Segment: 16 Count: 5101695

Segment: 17 Count: 4919006

Segment: 18 Count: 5536030

Segment: 19 Count: 3679621

Segment: 20 Count: 3496416

Segment: 21 Count: 4175369

Segment: 22 Count: 4126270

Segment: 23 Count: 3264013

Segment: 24 Count: 4100255

Segment: 25 Count: 4322633

Segment: 26 Count: 9749020

Segment: 27 Count: 19622721

Segment: 28 Count: 11919679

Segment: 29 Count: 3679721

Segment: 30 Count: 8063852

Segment: 31 Count: 11752619

Segment: 32 Count: 23894391

Segment: 33 Count: 14935880

Segment: 34 Count: 17010081

Segment: 35 Count: 12806447

Segment: 36 Count: 11758764

Segment: 37 Count: 13757944

Segment: 38 Count: 5967876

Segment: 39 Count: 8410322

Segment: 40 Count: 9751409

Segment: 41 Count: 10568546

Segment: 42 Count: 8047474

Segment: 43 Count: 9064589

Segment: 44 Count: 9219878

Segment: 45 Count: 7913851

Segment: 46 Count: 4868922

Segment: 47 Count: 5057902

Segment: 48 Count: 6649602

Segment: 49 Count: 8562603

Segment: 50 Count: 3609239

Segment: 51 Count: 3965129

Segment: 52 Count: 4869654

Segment: 53 Count: 5149858

Segment: 54 Count: 2055722

Segment: 55 Count: 32499129

Segment: 56 Count: 2733892

Segment: 57 Count: 8887833

Segment: 58 Count: 2329809

Segment: 59 Count: 2946391

Segment: 60 Count: 2946919

Segment: 61 Count: 8444848

Segment: 62 Count: 2890405

Segment: 63 Count: 3106806

Segment: 64 Count: 3349164

Segment: 65 Count: 4892046

Segment: 66 Count: 4453014

Segment: 67 Count: 4943979

Segment: 68 Count: 4595481

Segment: 69 Count: 6552969

Segment: 70 Count: 15521259

Segment: 71 Count: 4037689

Segment: 72 Count: 3737825

Segment: 73 Count: 5438653

Segment: 74 Count: 5012634

Segment: 75 Count: 4229649

Segment: 76 Count: 11143742

Segment: 77 Count: 6171722

Segment: 78 Count: 11938365

Segment: 79 Count: 11041493

Segment: 80 Count: 6030829

Segment: 81 Count: 9557385

Segment: 82 Count: 23490594

Segment: 83 Count: 9314300

Segment: 84 Count: 11119997

Segment: 85 Count: 8870504

Segment: 86 Count: 9059886

Segment: 87 Count: 4581160

Segment: 88 Count: 13002081

Segment: 89 Count: 6050274

Segment: 90 Count: 4850064

Segment: 91 Count: 4495531

Segment: 92 Count: 6551494

Segment: 93 Count: 6294404

Segment: 94 Count: 5852555

Segment: 95 Count: 9759357

Segment: 96 Count: 5688209

Segment: 97 Count: 9991460

Segment: 98 Count: 5155181

Segment: 99 Count: 3923768

Segment: 100 Count: 3319723

Segment: 101 Count: 8659639

Segment: 102 Count: 4416737

Segment: 103 Count: 2534491

Segment: 104 Count: 2470280

Segment: 105 Count: 2709695

Segment: 106 Count: 2535015

Segment: 107 Count: 3050061

Segment: 108 Count: 2772341

Segment: 109 Count: 3621580

Segment: 110 Count: 3332878

Segment: 111 Count: 4358518

Segment: 112 Count: 4282092

Segment: 113 Count: 3523498

Segment: 114 Count: 3111625

Segment: 115 Count: 3474108

Segment: 116 Count: 2809375

Segment: 117 Count: 3592879

Segment: 118 Count: 2768848

Segment: 119 Count: 10525789

Segment: 120 Count: 4122297

Segment: 121 Count: 3660249

Segment: 122 Count: 3432971

Segment: 123 Count: 3671499

Segment: 124 Count: 3066933

Segment: 125 Count: 3271576

Segment: 126 Count: 7069611

Segment: 127 Count: 3774229

Segment: 128 Count: 3258290

Segment: 129 Count: 2416892

Segment: 130 Count: 3413264

Segment: 131 Count: 2789339

Segment: 132 Count: 7841394

Segment: 133 Count: 2992968

Segment: 134 Count: 3624867

Segment: 135 Count: 3304507

Segment: 136 Count: 3573405

Segment: 137 Count: 4226925

Segment: 138 Count: 4803447

Segment: 139 Count: 5630354

Segment: 140 Count: 6420305

Segment: 141 Count: 4721786

Segment: 142 Count: 4860223

Segment: 143 Count: 4183175

Segment: 144 Count: 3790705

Segment: 145 Count: 7974241

Segment: 146 Count: 4146253

Segment: 147 Count: 3063269

Segment: 148 Count: 3485084

Segment: 149 Count: 2923729

Segment: 150 Count: 2947715

Segment: 151 Count: 3818073

Segment: 152 Count: 2769076

Segment: 153 Count: 2645308

Segment: 154 Count: 4452525

Segment: 155 Count: 4793146

Segment: 156 Count: 2281903

Segment: 157 Count: 3256076

Segment: 158 Count: 2414992

Segment: 159 Count: 2951958

Segment: 160 Count: 1747487

Segment: 161 Count: 2385269

Segment: 162 Count: 4128250

Segment: 163 Count: 5101661

Segment: 164 Count: 3579155

Segment: 165 Count: 3746030

Segment: 166 Count: 3117725

Segment: 167 Count: 3516756

Segment: 168 Count: 6842484

Segment: 169 Count: 2145906

Segment: 170 Count: 2281955

Segment: 171 Count: 3043248

Segment: 172 Count: 2946803

Segment: 173 Count: 2638829

Segment: 174 Count: 3543883

Segment: 175 Count: 3509146

Segment: 176 Count: 7295889

Segment: 177 Count: 3061840

Segment: 178 Count: 4201176

Segment: 179 Count: 22324025

Segment: 180 Count: 7157776

Segment: 181 Count: 14223711

Segment: 182 Count: 5079461

Segment: 183 Count: 6497476

Segment: 184 Count: 3395786

Segment: 185 Count: 3594557

Segment: 186 Count: 4511617

Segment: 187 Count: 3377908

Segment: 188 Count: 4015556

Segment: 189 Count: 3884031

Segment: 190 Count: 4418048

Segment: 191 Count: 4425675

Segment: 192 Count: 4151888

Segment: 193 Count: 4073928

Segment: 194 Count: 7515643

Segment: 195 Count: 4260252

Segment: 196 Count: 15629731

Segment: 197 Count: 8020217

Segment: 198 Count: 8051874

Segment: 199 Count: 8100937

Segment: 200 Count: 4875180

Segment: 201 Count: 6493053

Segment: 202 Count: 5447678

Segment: 203 Count: 5587693

Segment: 204 Count: 4748458

Segment: 205 Count: 5004972

Segment: 206 Count: 6418567

Segment: 207 Count: 10145254

Segment: 208 Count: 5678528

Segment: 209 Count: 5601157

Segment: 210 Count: 5702977

Segment: 211 Count: 5590463

Segment: 212 Count: 6693545

Segment: 213 Count: 4030951

Segment: 214 Count: 5199543

Segment: 215 Count: 7942092

Segment: 216 Count: 6376629

Segment: 217 Count: 7480443

Segment: 218 Count: 5150624

Segment: 219 Count: 4318579

Segment: 220 Count: 4535156

Segment: 221 Count: 3908294

Segment: 222 Count: 4547756

Segment: 223 Count: 4509968

Segment: 224 Count: 2944729

Segment: 225 Count: 3301802

Segment: 226 Count: 3638158

Segment: 227 Count: 4838325

Segment: 228 Count: 9253359

Segment: 229 Count: 2775284

Segment: 230 Count: 3601569

Segment: 231 Count: 3482137

Segment: 232 Count: 3594477

Segment: 233 Count: 2952485

Segment: 234 Count: 3315834

Segment: 235 Count: 2438082

Segment: 236 Count: 4846832

Segment: 237 Count: 2711387

Segment: 238 Count: 5907452

Segment: 239 Count: 3551899

Segment: 240 Count: 3621086

Segment: 241 Count: 3032466

Segment: 242 Count: 3572000

Segment: 243 Count: 3020379

Segment: 244 Count: 3275738

Segment: 245 Count: 3398733

Segment: 246 Count: 4023338

Segment: 247 Count: 3839542

Segment: 248 Count: 6434160

Segment: 249 Count: 3999611

Segment: 250 Count: 4388716

Segment: 251 Count: 8178958

Segment: 252 Count: 3622946

Segment: 253 Count: 3564538

Segment: 254 Count: 3182242

Segment: 255 Count: 3133097


When I am going to beat Mo Farah and Usain Bolt: the weakness of linear projections


Usain Bolt in celebration about 1 or 2 seconds after his 100m victory at Beijing Olympics 2008, breaking the world record. (Photo credit: Wikipedia)

Yesterday I set a new personal best (PB) for the Finsbury Park 5k Parkrun: 23 minutes and 50 seconds – 5 seconds faster than my previous PB, set on 22 February, and a full 170 seconds faster than my first (completed – there was a previous failure!) run on 29 June last year.

To beat Mo Farah in a re-run Olympics I need to shave another 10 minutes and 9 seconds – 609 seconds – off my time. Easy – at a linear rate of improvement I will manage that in another 2 years and 10 months. Rio here I come!
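Spelled out, the projection runs: 170 seconds of improvement in roughly nine and a half months is about 18 seconds a month, and 609 ÷ 18 ≈ 34 months – call it 2 years and 10 months.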

But why stop there: you will have noticed I have also taken 3.4 seconds off my 100 metre time. To be the fastest man in the world and beat all comers with 9.5 seconds for 100 metres I need only wait another five and a bit years.

Of course, I’m never even going to come close. My rate of improvement is slowing rapidly. A full 57 seconds of that 170 seconds – in other words more than one-third of it – came in the week between 29 June and 6 July last year.

Even the current rate of improvement – 5 seconds over seven weeks – probably won’t be sustained for very much longer. And even at that rate I wouldn’t be beating Mo for another 16.4 years – by which time I’ll be 65.

The moral is: beware of linear projections.

But if I train a bit harder…


OPT and LRU compared for multithreaded code


Here is the fruit of many months’ labour:

OPT and LRU compared in simulation

Interestingly, the two series are very close to isomorphic (i.e., they have essentially the same shape if one scales the x-axis appropriately).

This is based on running a short piece of code that decodes 32 frames of video. The memory model used was 4k pages with a pooled 512k of on-chip memory (each core can access any part of the pool at a cost of 1 tick) and unlimited off-chip memory 100 ticks away per 128 bits read in (so faulting in a full 4k page takes 25,600 ticks).
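To spell out that last figure: 4096 bytes ÷ 16 bytes (128 bits) per transfer = 256 transfers, and 256 × 100 ticks = 25,600 ticks per page fault.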

This is a bit of a crude simplification of the memory model seen on, say, a Tilera TILE64 system, but it’s close enough (though this simulation had 16 cores).
