Lying in bed this morning and puzzling over what to do …
At first I thought what I should do is copy one of the existing operating system models for NoCs, but that simply would not be flexible enough.
What I have to do is model the hardware (including the modifications to the MMU I want to see) as, essentially, some form of black box, and build other layers – including the memory tree – above that. That means I need to separate the CPU/tiles from global memory: sounds simple in theory but implementing this is going to be very far from easy.
I am writing this down partly as a way of seeing if it makes sense…
Minor faults we can live with: if we have to reaffix a mapping back into the TLB it takes a few cycles but that is not a heavy burden. Major faults, however, are a different matter: they take many thousands of processor cycles to fix and have to be avoided at all costs.
Yet we have a basic problem in that we cannot mark each and every page access with a timestamp, so we have to guess at which page was least recently used – and so typically we use the sweeping hand of a CLOCK algorithm to do this.
But I have an even bigger problem – I have no time source, and hence no timer interrupts, to use.
So instead I want to try this:
All entries in the page table (PT) will have two validity bits, one we call VALID (V) and another AVAILABLE (A);
When a page is first mapped in the PT, both V and A will be marked as on.
When we first run out of space in the PT we will start to advance through the PT on each access turning off a V bit. So, if we pin the first three entries in the PT (e.g. for ‘kernel’ code, the PT itself and a stack), the first time we go round we might mark V off for the 16 entries in the PT from 3 to 18 (counting from zero)
Then, when we are mapping a page for which no mapping exists, we take the first entry where V is off
But before that we check whether we have any entries where A is on that map our page, in which case we switch V on for that mapping and we are off
Having written that out I can see we only need the A bit if we think that a 0x00 <–> 0x00 mapping might be a valid one, but we don’t have to make that assumption and so we can restrict this to just the V bit.
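The V-bit-only version of the scheme can be sketched in a few lines of Python. This is a toy model, not the Microblaze code: the class name `PageTable`, the method names and the string results are all made up for illustration, and I am assuming the hand moves one entry per access once the table has filled for the first time.

```python
class PageTable:
    """Toy page table with a VALID bit and a hand that moves on each access."""

    def __init__(self, size, pinned_pages):
        assert size > len(pinned_pages)
        self.size = size
        self.first = len(pinned_pages)      # entries below this are pinned
        self.pages = list(pinned_pages) + [None] * (size - self.first)
        self.valid = [True] * self.first + [False] * (size - self.first)
        self.hand = self.first              # next entry the hand will visit
        self.sweeping = False               # starts once the table first fills

    def _advance_hand(self):
        # Turn off exactly one V bit, then step past it, skipping pinned entries.
        self.valid[self.hand] = False
        self.hand += 1
        if self.hand == self.size:
            self.hand = self.first

    def _first_invalid(self):
        for slot in range(self.first, self.size):
            if not self.valid[slot]:
                return slot
        raise RuntimeError("no entry with V off")

    def access(self, page):
        """Return 'hit', 'revalidated' or 'mapped' for a reference to page."""
        if self.sweeping:
            self._advance_hand()
        if page in self.pages:
            slot = self.pages.index(page)
            if self.valid[slot]:
                return "hit"
            self.valid[slot] = True         # still mapped: just switch V back on
            return "revalidated"
        if None in self.pages:              # table not yet full
            slot = self.pages.index(None)
        else:
            if not self.sweeping:           # first time we run out of space
                self.sweeping = True
                self._advance_hand()
            slot = self._first_invalid()    # victim: first entry with V off
        self.pages[slot] = page
        self.valid[slot] = True
        return "mapped"


pt = PageTable(5, ["kernel", "pt", "stack"])
results = [pt.access(p) for p in ["a", "b", "a", "c", "b", "b", "a"]]
```

With only two unpinned entries the table fills immediately, so the trace exercises all three outcomes: a plain hit, a mapping whose V bit is switched back on, and an eviction.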
I have now written some very basic but functional paging code for the Microblaze but have realised I am, essentially, going to have to start a lot of it again.
The Microblaze puts page tables firmly under software control and its memory management unit (MMU) concentrates on managing page references through its translation lookaside buffers (TLBs) instead. So a reference to a virtual address that is mapped in a TLB is handled invisibly to the user or programmer, but a reference to a memory address that may be mapped in a page table but not in the TLB generates an exception and everything is dumped into the lap of the programmer – this is what is often known as a “soft fault” (a “hard fault” being when there is no version of the page being referenced available in physical memory at all).
The number of TLB entries is quite limited – just 64 – so to handle (or in my case, simulate), say, 512kB of physical memory through 4kB page frames you need to expect to handle quite a few soft as well as hard faults. So you need to take a view on which mappings get evicted from the TLB and – if you are referencing more than 512kB of program and data – which pages get evicted from the page table and, hence, physical memory.
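To make the numbers concrete: 512kB in 4kB frames is 128 frames, twice what the 64 TLB entries can cover. Here is a rough Python model of the three outcomes. It is only a sketch: `ToyMmu` is a hypothetical name, and the first-in-first-out eviction is a placeholder I have assumed so the example runs, not a recommendation and not what the Microblaze does.

```python
from collections import OrderedDict

PAGE_SIZE = 4 * 1024
PHYS_MEM = 512 * 1024
TLB_ENTRIES = 64
FRAMES = PHYS_MEM // PAGE_SIZE          # 128 page frames


class ToyMmu:
    """Classifies each access as a TLB hit, a soft fault or a hard fault."""

    def __init__(self):
        self.tlb = OrderedDict()         # page -> frame, oldest mapping first
        self.page_table = OrderedDict()  # page -> frame, oldest mapping first

    def access(self, vaddr):
        page = vaddr // PAGE_SIZE
        if page in self.tlb:
            return "tlb hit"             # invisible to software
        if page in self.page_table:
            kind = "soft fault"          # in memory, but not in the TLB
        else:
            kind = "hard fault"          # not in physical memory at all
            if len(self.page_table) == FRAMES:
                # Placeholder policy: evict the oldest page and reuse its frame.
                victim, frame = self.page_table.popitem(last=False)
                self.tlb.pop(victim, None)
            else:
                frame = len(self.page_table)
            self.page_table[page] = frame
        if len(self.tlb) == TLB_ENTRIES:
            self.tlb.popitem(last=False)  # placeholder: drop the oldest entry
        self.tlb[page] = self.page_table[page]
        return kind


mmu = ToyMmu()
first_pass = [mmu.access(p * PAGE_SIZE) for p in range(FRAMES)]
```

Touching all 128 pages once gives 128 hard faults; going back to page 0 afterwards is a soft fault, because the page is still in memory but its mapping has been pushed out of the TLB.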
This is where the difficulties start, because there is no simple way to know when a read through a TLB mapping takes place – the very point is that it is invisible (and hence quick) to the programmer or any software she or he may have written. The problem with that is that – for most situations – the only guide you have to future behaviour in terms of memory references is past behaviour: so it would be very helpful to know whether a page has been recently accessed (on the basis that if it was accessed recently then it will be accessed again soon).
The standard response to this sort of problem is some form of “CLOCK” algorithm – which was first implemented, I believe, for the Multics operating system.
Multics can be thought of as the estranged (and now late) father of Unix – the “Uni” being a play on the “Multi” of the earlier system – and, both directly and through its offspring, its influence on computers has been profound. One of our inheritances is CLOCK, some version of which is almost certainly running in the computer, phone or tablet on which you are reading this.
The principle of CLOCK is simple. A “clock hand” sweeps regularly through the page mappings marking them invalid. On a subsequent attempt to reuse a page mapping the valid bit has to be reset (meaning the page has been used recently); alternatively, if a new mapping is needed, we can throw out the first page in the list of mappings where the mapping is marked as invalid.
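For comparison, here is the textbook “second chance” form of CLOCK in Python. It is a slightly different arrangement from the timer-driven sweep described above: the hand only moves when a victim is needed, and it clears a “referenced” bit rather than a valid bit. The class and attribute names are my own, chosen for illustration.

```python
class SecondChanceClock:
    """Textbook CLOCK: the hand clears referenced bits until it finds a victim."""

    def __init__(self, nframes):
        self.pages = [None] * nframes
        self.referenced = [False] * nframes
        self.hand = 0

    def access(self, page):
        """Return True if the access faults, False if the page is resident."""
        if page in self.pages:
            self.referenced[self.pages.index(page)] = True
            return False
        # Skip over recently used pages, taking away their second chance.
        while self.pages[self.hand] is not None and self.referenced[self.hand]:
            self.referenced[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        self.pages[self.hand] = page
        self.referenced[self.hand] = True
        self.hand = (self.hand + 1) % len(self.pages)
        return True


clock = SecondChanceClock(3)
faults = sum(clock.access(p) for p in [1, 2, 3, 4, 2, 5])
```

In this trace page 2 is touched again just before page 5 arrives, so the hand passes over it and evicts page 3 instead.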
And this, or some version of it, is what I am now going to have to implement for the Microblaze. The obvious thing to do is to have some sort of timer interrupt drive the clock hand – though I am not even sure the Microblaze has a timer interrupt available – I’m guessing it doesn’t – as it would expect those to come from the board, so this could be tricky!
Tonight I finally managed to get the code and the sequencing right to not just boot the Microblaze simulation, but to turn on the MMU.
But I was puzzled as to why, as soon as I had done that, the execution broke with a segmentation fault and appeared to be executing code I had not written. And then it dawned on me – all my addresses had immediately become virtual addresses.
Paging and virtual memory are at the heart of just about any computing device – more complex than a DVD player – we use every day.
Paging is the memory management system based on the idea that we can divide the real memory of our computer into a sequence of smallish (typically 4,096-byte) “page frames” and then load the bits of data and program in and out of those frames (in “pages”) as we need them.
So, you can have pages from various different running programs in the page frames at any given time and then use a “virtual memory” system to map the pages placed in an arbitrary frame to the memory address the program thinks the page should be resident in.
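The mapping itself is just arithmetic on the address. A minimal sketch in Python, assuming 4,096-byte pages and a page table that is simply a dictionary from page number to frame number (the function name and the example table are invented for illustration):

```python
PAGE_SIZE = 4096
OFFSET_BITS = 12                     # 2**12 == 4096


def translate(vaddr, page_table):
    """Map a virtual address to a physical one via a page -> frame table."""
    page = vaddr >> OFFSET_BITS          # which page the program thinks it is in
    offset = vaddr & (PAGE_SIZE - 1)     # position within that page
    frame = page_table[page]             # a KeyError here stands in for a fault
    return (frame << OFFSET_BITS) | offset


# Two neighbouring virtual pages living in unrelated frames.
table = {0x10: 0x3, 0x11: 0x7}
physical = translate(0x10ABC, table)
```

The offset within the page is untouched; only the upper bits are swapped, which is why the program never needs to know which frame its page actually landed in.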
It is not the only system we could use – “segments”, which involve moving large chunks (as opposed to small pages) of memory about, are one approach, while “overlays” – using part of the memory space as a sort of scratchpad working area – are another. More recently, with bigger “traditional” computers, very large pages have been used as a way of making, at least in theory, more efficient use of memory now measured in billions (as opposed to a few tens) of bytes.
But paging is easily the most widely used approach and has been integral to the development of “multitasking” and similar shared resources approaches to computing – because paging allows us to just have the useful bits of a program and its data in memory we can have many more programs “running” at a given time.
But my PhD research is pointing me towards some of the weaknesses of the paging approach.
At the heart of the case for paging is the idea of “locality” in a computer’s use of resources: if you use one memory address at one instant there is a high probability you will use a nearby address very soon: think of any sort of sequential document or record and you can see why that idea is grounded in very many use cases of computing devices.
Locality means that it ought to make sense to read in memory in blocks and not just one little drop at a time.
But this principle may be in opposition to efficient usage of memory when competition for space is fierce: such as for the limited local memory resources we have on a Network-on-Chip computer.
Right now I am collecting data to measure the efficiency of 4k pages on such (simulated) devices. With 16 simulated cores trying to handle up to 18 threads of execution, competition for pages is intense and the evidence suggests that they are resident, in some cases at least, for many fewer “ticks” than it takes to load them from the next lowest level in the memory hierarchy. On top of that, many pages show that the principle of locality can be quite a weak one – pages of code are, in general, quite likely to demonstrate high locality (especially in loops) but pages of read-write memory may not.
I don’t have all the data to hand – essentially I am transforming one 200GB XML file into another XML file which will likely be around the same size and that takes time, even on quite a high spec computer (especially when you have to share resources with other researchers). But I expect some interesting results.
I need to parse 220GB of XML to find an OPT reference string (i.e. to profile the memory references of a running application to create a page replacement algorithm that will know which is the most efficient page replacement decision.)
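The idea behind OPT is easy to show on a small scale: if you know the whole reference string in advance, always evict the page whose next use is furthest in the future. The Python below is an in-memory sketch of that rule only, with a function name of my choosing; it says nothing about how to stream 220GB of XML, which is the real difficulty.

```python
def opt_faults(refs, nframes):
    """Count page faults under OPT for a fully known reference string."""
    # Walk backwards once to find, for every reference, when that page is
    # next used (infinity if it is never used again).
    next_use = [0] * len(refs)
    seen_at = {}
    for i in range(len(refs) - 1, -1, -1):
        next_use[i] = seen_at.get(refs[i], float("inf"))
        seen_at[refs[i]] = i

    resident = {}                        # page -> index of its next use
    faults = 0
    for i, page in enumerate(refs):
        if page not in resident:
            faults += 1
            if len(resident) == nframes:
                # Evict the page that will not be needed for the longest time.
                victim = max(resident, key=resident.get)
                del resident[victim]
        resident[page] = next_use[i]
    return faults


refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
faults = opt_faults(refs, 3)
```

On this 20-reference string with three frames the function counts nine faults, the figure usually quoted for OPT on this much-used textbook example.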
I had one algorithm in mind for this and I wrote the Groovy code but it kept exhausting the memory pool of the VM (as it generated a lot of anonymous memory). Even whacking the VM up in size quite considerably didn’t seem to fix the problem.
So I used another algorithm – and it works. After a fashion. Because even with 18 threads running in parallel on the University of York’s big iron compute server I think it might take around two years to complete. And I don’t have two years.
So I need to find something better – something where I have much more control over the memory allocation code and where I can also be sure of performance.
C is the obvious answer. Or, in this case, C with an interface to some C++ code I wrote a few years ago to build a red-black tree.
The strange thing is that, despite Groovy’s claim that it removes the really tedious bits of code writing by doing away with boiler plate, this weekend I have found writing all that getter/setter code quite therapeutic: perhaps this is a programmer’s comfort blanket?
Thinking about a Network-on-Chip system and what its system software needs to do…
Parallelisation is essential to efficiency – in a NoC there are a multitude of cores, but each core has only a fraction of the computational power a “traditional” unicore might be expected to have – therefore it is essential that, where possible, code is parallelised across as many cores as possible;
Each core needs to be able to access operating system services (via system calls or some other mechanism), but it is not necessarily the case that each core has to run a full or even a partial operating system – thus RPC or some other mechanism can be used to ‘remotely’ provide system services;
Application programmers want, above all, a single address space.