A few years ago I set out to write a clone of Sinclair BASIC as a domain-specific language (DSL) in Groovy. The end result was BINSIC, but it was much closer to an interpreter than a DSL: it turned out that strange ways that Groovy handled capitalised function/closure names meant that I could not match keywords at all and so interpretation became the only option (though there were DSL-like features lurking under the covers).
Now I have set out to write an interpreter for RISCV assembly and have written something which is much more like a DSL. My aim was to track the memory reference patterns of various real-time benchmarks running on RISCV systems – based on the disassembly generated by Spike, the RISCV emulator.
Getting that done requires tracking register state – because in true RISC fashion reads and writes are done not to immediates, but to addresses held in registers with immediates used as offsets.
To make this happen every (or at least every used) RISCV instruction is mapped to a closure and the operands treated as closure parameters. The closures are all inside the RegisterFile class so all can access the registers to keep the state updated. This makes it quite like a DSL but I make no claims for purity: every statement is passed through a regular expression based interpreter to separate out the parameters: if nothing else that eliminates a lot of boilerplate code that would have to be stuck inside the closures.
Memory is treated as a sparse array through a Groovy Map instance – with an address (64-bit Long) mapped to an 8-bit Byte.
The process isn’t perfect – a disassembly doesn’t given you the contents of initialised data, so an attempt to access those addresses is quite likely to raise an exception which needs to be trapped and the memory accounted for – by simply assigning zero to the address: it’s a kludge but it seems to work.
This is a bit niche, but I spent a fair bit of this weekend working out what the algorithm to calculate how long a memory request packet would take to traverse a binary tree (from a leaf to the root) was. And now I have written a Groovy script (testable in your browser at https://groovyconsole.appspot.com/script/5109475228778496 – click on ‘edit’ and then ‘run’) to calculate the results.
(I am sure that this has been done before – many times – but I couldn’t find the algorithm on a quick search and then the desire to solve this myself took over).
The problem is this: memory request packets enter the tree at a leaf, they then have to cross a series of multiplexors until they reach the root (which is where the tree interfaces with the memory supply). Each multiplexor (mux) has two inputs and one output, so taken together the muxes form a binary tree.
As request packets could arrive at the leaves of a mux at the same time, there has to be a decision procedure about which packet progresses and which is delayed. The simple choice is to favour packets on either the left or the right leaf (in my case I went with the right). The question is then what is the average and maximum delay for a request packet.
So, if you muck about with the script you’ll see that the absolute maximum delay on a 256 leaf tree (eg., on a Bluetree memory tree connected to a 256 tile NoC) is 495 cycles, while the average maximum delay (ie for a leaf in the middle of the tree) is 248 cycles.
In contrast if the load is just 1% then these figures fall to about 3 and 1.5.
There is a flaw here though – because this all assumes that no new packets enter the system while the original request packet is traversing it – but in a dynamic system, even with a load as low as 1%, this is highly unlikely – packets that enter the tree “to the right” of the original request could potentially add to the delay if they chain with packets already present.
I need to learn the Scala programming language. Unfortunately the University of York’s library does not seem to have an electronic copy of “Programming in Scala” (Amazon link) but it did have an electronic copy of Springer’s “A Beginner’s Guide to Scala” (Amazon link). But this is a very poor book, referring back to examples that don’t exist, offering a very confusing explanation as to what object-orientation is and, to cap it all, illustrating examples with pseudo code rather than Scala itself.
Of course, it’s not the worst example you’ll find from Springer – “Unleash the System on Chip using FPGAs…” is easily the worst book I have ever seen in print (see my review on Amazon here) – and, of course they also publish many useful books and conference proceedings and so on. But they appear to have close to a monopoly on many aspects of computer science publications and use that ruthlessly.
If it wasn’t for other factors I’d suggest the European Commission needs to take a close look at them. Hardly worth it in the UK’s case these days though.
A week further on and not much C++ has been written – and now I think I need to make a new start.
Up to this point I have been trying to write a software model of the hardware and my thought was I could think put a software-modelling layer on top of that. But that simply is not going to work – it is just too complex.
Instead I am going to have to make some policy decisions in the software – essentially over how I model the local memory on the chip: each tile will process memory reads and writes and needs to know where that memory is – it could be in the global off-chip memory store or it could be on-chip.
The difference matters because, at least in theory, the on-chip memory is speedily accessible, while the off-chip memory is 50 to 100 to 500 times “further away”. Because memory accesses exhibit locality it makes sense to ship blocks of addressed memory from the global to the local store – but doing so takes time and if there are a lot of memory movements then we get thrashing.
What I now have to do is think of what policy I will use to decide what memory gets stored locally (or, more likely, what policy I use to map the addresses). I’ll start by once again reviewing papers that propose some schemes for existing Networks-on-Chip.
In other news: I have had a paper (of which I am co-author and first named author) accepted by OSPERTS 15 – so I will be off to Sweden to get mauled by the audience there in early July. It will be an experience, and I am looking forward to it, but I also think it might be not so much a baptism, but a destruction by fire.
Lying in bed this morning and puzzling over what to do …
At first I thought what I should do is copy one of the existing operating system models for NoCs, but that simply would not be flexible enough.
What I have to do is model the hardware (including the modifications to the MMU I want to see) as, essentially, some form of black box, and build other layers – including the memory tree – above that. That means I need to separate the CPU/tiles from global memory: sounds simple in theory but implementing this is going to be very far from easy.
I hope you can see how quickly this becomes complex – and all we are talking about here is a simple software framework to allow me to do the research (ie., delivering the software, complex as it is, is only the very start.)
I am struggling to know where to begin – C++ seems like the logical choice for this, but it’s not proving to be much fun. Particularly because my CPU class has to be able to “execute” some code – I thought about using a DSL but may stick to the XML output I got from my hacked Valgrind Lackey – as at least I can then use existing XML engines.
Should I build from the XML up – eg., get a CPU class that can hack the XML and pass the requests up the chain (eg via the cache up to the Mesh and up to the DRAM etc), or what?
Had a good meeting with my PhD supervisor today: he was in London – I didn’t have to make a flying visit to York.
So the next steps with my OVPsimMicroblaze code is to model global and local memory – by default OVPsim treats all memory as local, mapping the full 32-bit address space and sparsely filling that as needed. I have imposed an additional constraint of only mapping a few pages, but these are still all “local”: so while the code takes time to execute, what is in effect, a FIFOpage replacement algorithm, there is no time for page loads.
The way round this seems to be to build my global memory as a memory-mapped peripheral device – I can then impose time delays on reads and writes.
But I suppose I am writing this blog instead of writing that code…