OPT and LRU compared for multithreaded code

Here is the fruit of many months' labour:

[Figure: simulation]

Interestingly, the two series are very close to isomorphic (i.e., they have essentially the same shape if one scales the x-axis appropriately).
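For anyone who wants a reminder of what the two policies actually do, here is a minimal sketch of fault counting under LRU and under OPT (Belady's algorithm, which evicts the page whose next use is furthest in the future). It is not the simulator behind the plot: the function names `lru_faults` and `opt_faults` and the single flat reference string are my own illustrative assumptions, and it ignores threads and timing entirely.

```python
from collections import OrderedDict


def lru_faults(refs, frames):
    """Count page faults under least-recently-used replacement."""
    resident = OrderedDict()  # keys ordered from least to most recently used
    faults = 0
    for page in refs:
        if page in resident:
            resident.move_to_end(page)  # a hit makes the page most recent
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict the least recently used
            resident[page] = True
    return faults


def opt_faults(refs, frames):
    """Count page faults under OPT: evict the page used furthest in the future."""
    never = float("inf")
    # next_use[i] is the index of the next reference to refs[i] after i.
    next_use = [never] * len(refs)
    seen_later = {}
    for i in range(len(refs) - 1, -1, -1):
        next_use[i] = seen_later.get(refs[i], never)
        seen_later[refs[i]] = i

    resident = {}  # page -> index of its next reference
    faults = 0
    for i, page in enumerate(refs):
        if page not in resident:
            faults += 1
            if len(resident) >= frames:
                victim = max(resident, key=resident.get)
                del resident[victim]
        resident[page] = next_use[i]  # refresh on hits and on faults
    return faults


if __name__ == "__main__":
    # A common textbook reference string, used here only as toy input.
    refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
    print("LRU faults:", lru_faults(refs, 3))  # 12
    print("OPT faults:", opt_faults(refs, 3))  # 9
```

OPT needs the whole future reference string, so it is only usable as a yardstick after the fact, while LRU works from past references alone.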

This is based on running a short piece of code to decode 32 frames of video. The memory model used 4k pages with a pooled 512k of on-chip memory (each core can access any part of the pool at a cost of 1 tick) and unlimited memory 100 ticks away per 128 bits read in. A 4k page is 32,768 bits, or 256 reads of 128 bits, so it takes 25,600 ticks to fault a page in.
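The arithmetic behind that fault cost can be checked in a few lines. This is only a sketch: the constant and function names are mine, "k" is taken to mean 1,024 bytes (which is what the 25,600 figure implies), and the frame count assumes the whole pool is available for pages.

```python
PAGE_BYTES = 4 * 1024      # 4k page, assuming k = 1,024 bytes
POOL_BYTES = 512 * 1024    # pooled on-chip memory
TRANSFER_BITS = 128        # bits brought in per off-chip read
TRANSFER_TICKS = 100       # ticks per off-chip read
POOL_ACCESS_TICKS = 1      # ticks for any access to the on-chip pool


def page_fault_ticks(page_bytes=PAGE_BYTES):
    """Ticks needed to read one whole page in from off-chip memory."""
    transfers = (page_bytes * 8) // TRANSFER_BITS
    return transfers * TRANSFER_TICKS


def pool_frames(pool_bytes=POOL_BYTES, page_bytes=PAGE_BYTES):
    """Page frames in the pool, assuming all of it is used for pages."""
    return pool_bytes // page_bytes


if __name__ == "__main__":
    print(page_fault_ticks())  # 25600
    print(pool_frames())       # 128
```

So a single fault costs as much as 25,600 accesses to the on-chip pool, which is why the replacement policy matters so much here.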

This is a bit of a crude simplification of the memory model seen on, say, a Tilera TILE64 system, but it’s close enough (though this simulation had 16 cores).
