Our proposal here is to meet this need to use a memory space much greater than the locally available low-latency storage head on. The problems faced by programmers of today’s many- and multi-core devices are, in a sense, far from new: in the 1960s an earlier generation of programmers, experimenting with virtual memory and time-sharing computers, and using programs generated by the first generation of high-level languages were struck down by the phenomenon of thrashing, where computing time was lost to excessive memory management . The solutions we propose here draw on the lessons learned at that time, but significantly adapt the methods to account for the realities of using multicore devices with very limited local storage.
We propose to enable virtual memory  for such systems, using a novel form of paging which, instead of transferring whole pages across the memory hierarchy, concentrates on a minimal transfer of partial pages . Here hard faults still impose a significant penalty of lost cycles but the cost of paging in and out of memory is amortised across multiple memory reads and writes. This avoids unnecessary transfer and limits congestion in the memory interconnect, which we find to be a significant determinant of program performance.
One of the lessons of the 1960s was that maintaining a program’s working set in low latency memory was the most effective way of limiting paging and thrashing . Much research at that time and into the 1970s and later concentrated on effective ways to maintain the working set of pages in fast memory and innovation continued into the 1990s (and beyond), for instance with the 2Q algorithm in Linux . Here, though, we establish that better performance in embedded and real-time manycore systems is unlikely to come from a better page replacement algorithm but from a reorientation towards better management of the internals of the pages.
General computing page replacement algorithms are generally robust  and with the decreasing cost of memory, much recent research has concentrated on using large page sizes, not least to minimise TLB misses , we return to earlier research findings which emphasise the efficiency of small pages on small systems and we show that combining this with a FIFO page replacement policy may, perhaps counter-intuitively, deliver more time predictable or even faster performance than attempts to use some version of a least-recently used policy, though we also show that choice of page replacement algorithm is likely to be strongly related to the variation in working set size over a program’s lifecycle.
Concerns about power have always been important for embedded devices, but changing characteristics of hardware mean new issues such as “dark silicon”  will be important as we move deeper into the coming manycore era. We demonstrate here that constraints in power may be, to some extent, mitigated by a loosening of bottlenecks elsewhere in a realtime system, particularly in the memory interconnect: in a system where power is no constraint then congestion here dominates performance, where power concerns limit processor activities they may also lessen congestion.
Further we demonstrate results that some frequency scaling in manycore systems may not limit performance if it contributes to lessening congestion in the memory interconnect: results that suggest that the dark silicon issue may not be as big a barrier to performance in real-world applications as might be expected.