Last night, as Leicester City played Liverpool, Mo Salah, current top scorer in the English Premiership, was taken down in the penalty box and then stepped up to take the (correctly-awarded) penalty.
Salah hardly ever misses a penalty. It’s about as nailed on as you can get that he will score. Except last night he didn’t. Something – a badly placed kick, a goalkeeper at the top of his game, a momentary lack of concentration, who knows – meant he kicked the ball into the keepers hands (and then failed to score on the rebound).
Well, it happens. Form is temporary, class is permanent as the saying goes. But what struck me was I knew, I just knew, he was going to miss. Something in the few seconds before he struck the ball made me think he wasn’t going to do it.
To be honest, this is quite a common feeling (for me, anyway). But not all the time. David O’Leary’s famous “the nation holds its breath” penalty never bothered me at all. Never in doubt.
But the thing that interests me here is that I am sure I am the only one who feels this way about penalties and furthermore my belief – untested – is that when I feel that someone is going to miss I am right more times than I am wrong.
An unverified search on the internet says that 70% of penalties awarded in the English top flight are scored, so testing for people’s ability to predict a miss ought not to be too hard – (at these odds a random miss predictor would be relatively easy to spot).
If people can do it through some sort of split-second multiply parallel assessment of the factors – the morale of the team, the state of the game, the look of the keeper and, I think above all, the look and feel of the man taking the kick – the question is then whether we could build a machine that did this too.
What would be the point of that? Well, part from the “it’s interesting” factor there is presumably an arbitrage opportunity out there somewhere that would allow those informed by the machine to bet on the outcome. That, of course, would be of no net social benefit to humanity as a whole and probably isn’t something to be encouraged, but still I do think it’s fascinating.
Earlier this week I wrote that there appeared to be no evolutionary biological reason to think the corona virus would evolve at present to be less deadly. It’s starting to look like I was wrong.
In fact it looks as though the vaccines have driven selection of changes in the virus’s spike (to evade vaccination-driven antibodies) that make it less likely to be able to inflict serious harm on humans. If so it’s an unexpected bonus from what is already the great scientific achievement of the century. Anti-vaxxers should (but are unlikely to) note that those of us who did our duty to society by being vaccinated have delivered this, not them.
But, of course, it’s all far from good news. The Omicron variant is clearly much more transmissible and is still a killer. Those who have not been vaccinated are at particular risk. If numbers hospitalised keep rising at the anything like the current rate then the health system could still be swamped even if the risks to any individual are lower.
If you can, get vaccinated and help give us the tools to finish the job.
This is a further post about optimising Riscyforth, my Forth language forRISC-V single board computers.
Riscyforth is a 64-bit Forth which runs on top of Linux – my test system is the Sipeed Nezha SBC – which runs a 1 GHz single core CPU (so 1 cycle is one nano-second). I’m testing it against GForth, the GNU Project’s Forth.
Whilst Riscyforth is written in RISC-V assembly, GForth is mainly Forth on top of C. For those familiar with Forth techniques – Riscyforth is a very traditional indirect threaded implementation, GForth uses a mixture of this and (faster) direct threaded code.
GForth also sticks to the Forth standard of 32-bit numbers as the default, Riscyforth uses 64 bit numbers throughout.
GForth is a much more sophisticated (and portable) environment than Riscyforth – and certainly my suggestion/intention has never been to claim I will write a better Forth than GForth – the GForth effort goes back to the 1990s, Riscyforth goes back to 28 December 2020!
The quick summary of the material those two posts is that I could make significant timing savings in my code with some optimisation steps that replaced correct but long-winded compile-time generated code that loaded registers with fixed (at compile-time) numbers via a sequence of adds and bit shifts, with code that loaded those values from a known memory location.. However those optimisations were not enough to close the gap on performance with the GNU Project’s GForth – which runs matrix.fth about 12% faster.
My code started out at taking about 9.5 seconds and I have got that down to about 8.5 seconds, whilst GForth is taking about 7.5 seconds.
Not satisfied with leaving it there I wrote some code to count cycles taken in executing Forth words or groups of words, and looked closer at what might be fixable.
CODEHEADER DEBUGIN, Q, 0x0
la t0, CYCLESTART
sd t1, 0(t0)
CODEHEADER DEBUGOUT, DEBUGIN, 0x0
la t1, CYCLESTART
ld t2, 0(t1)
bgt t0, t2, debugout_normal
li t3, -1
sub t4, t3, t2
add t4, t4, t0
sub t4, t0, t2
la t5, CYCLECOUNT
ld t0, 0(t5)
add t1, t0, t4
sd t1, 0(t5)
la t6, CYCLEPINGS
ld t2, 0(t6)
addi t2, t2, 1
sd t2, 0(t6)
CODEHEADER DEBUGRESULTS, DEBUGOUT, 0x0
#(-- n n n)
la t0, CYCLECOUNT
la t1, CYCLEPINGS
ld t2, 0(t0)
ld t3, 0(t1)
div t4, t2, t3
addi sp, sp, -24
sd t2, 0(sp)
sd t3, 8(sp)
sd t4, 16(sp)
sd zero, 0(t0)
sd zero, 0(t1)
la t2, CYCLESTART
sd zero, 0(t2)
DEBUGIN records the cycle count at the start of some block of code, whilst DEBUGOUT calculates the time since the previous call to DEBUGIN, adds that to a total of elapsed times and increments a counter. DEBUGRESULTS then places, on the stack, the average time per IN – OUT cycle, the total number of samples and the total elapsed cycles.
Looking at the Forth code, the main word is what we call to run the program:
It initialises two 300 x 300 matrices (each initialisation taking about 53 million cycles) and then executes a loop and a nested loop: each loop runs 300 times, so the inner loop is executed 90000 times. The outer loop is recorded as taking just over 26 million cycles per execution whilst each inner loop takes about 87000 cycles per iteration.
The code to measure the cycles has an effect on the measurement so we can only be approximate here – eg while the inner loop is measured as taking 87000 cycles per iteration, each execution of innerproduct is measured as requiring around 88000 cycles (ie, longer than the code that calls it – an obvious contradiction). The test code itself is measured as taking about 15 cycles per execution – not much in itself, but rising to 409 million total cycles when called 27 million times inside innerproduct.
In any case it is clear that innerproduct is where the real work is done:
Inside innerproduct lurks another loop, again called 300 times, meaning that the code inside this loop gets called 300 x 300 x 300 – 27 million – times: the graphic above shows the approximate (having accounted for the cost of calling the instrumentation code) total cycle count for each line of that loop.
Right at the start of this (see the first post linked above), the big cycle burner was the constant code: row-byte-size here, so it’s good that no longer seems to be the case.
It’s hard to be precise about this (a test of the row-size constant in main suggested, accounting for the cost of the instrumentation code, that it took about 269 cycles per execution but that looks to be a significant over-estimate as that would imply, if row-byte-size took the same time as the other constant (as it should) that 27 million executions of it alone should take about 7 billion cycles. A better guide seems to be that the nine words of the first line take a little bit less than twice the cycles/time of the five words of the second line – suggesting that each word takes on average a roughly similar time.
The final line in innerproduct is not inside the innermost loop and so is only called 90,000 times and takes, roughly, 8 million cycles.
It’s not possible to measure the cost of the loop control code using my simple instrumentation (as execution is not linear – all the work after the loop initialisation is done at the loop word) but as the total execution time of innerproduct (as measured inside innerproduct itself) is estimated at 7.6 billion cycles it seems unlikely the do ... loop code is particularly expensive.
This is all good news at one level: there appear to be no bits of code here chewing up cycles outrageously (each of the 14 words of the first two lines take roughly 19 cycles to execute, including the time taken to move from one word to another via the Forth threading).
But at level another it represents a bit of a dead end for efforts to optimise the code. All the words being executed here are pared down to the minimum and there is no room to squeeze anything out of them.
I covered that before in the initial posts but here’s a quick reminder/example:
While the s9 register is used to manage the return stack (ie as a stack pointer to that stack) – so you can see these words just transfer a 64 bit value to and from two different stacks and there is just nothing that can be done to save cycles here (or if there is, I’d very much welcome being told about it).
So, it seems, GForth is the winner here – a Forth based on optimised C is not just portable but would appear to be faster too.
But, actually, I have some counter evidence.
The Forth program factoriel.f calculates the numerals of factorials. In Riscyforth I can calculate the factorial of 10000 (ie, 10000!) in about 26 seconds:
But GForth cannot manage it at all – though if I double the number of cells allocated in GForth (which means both systems have the same amount of resources due to the difference between 64 and 32 bits) it will work – but it takes around 36 seconds.
GForth will calculate the factorial of smaller numbers faster than Riscyforth though: if we try 3200! (close to the limit GForth can manage with the original settings for cell numbers) then GForth takes about 7 seconds and Riscyforth takes about 8 seconds.
Taken together these results suggest to me that at least there is nothing broken in Riscyforth. It’s about as fast as it could be with 64 bit arithmetic and the indirect threaded model.
(I need to start this by saying I have no medical or epidemiological qualifications and I’m not a statistician, though I do have some knowledge there – so if anyone who is any of these things spots an error, let me know.)
Around this time last year I was telling people that I thought Covid-19 was a 50-year problem – as it would take that long (at least) for childhood-acquired immunity to be sufficiently widespread to allow humanity to relax.
By the summer that looked pessimistic. Yes, the Delta variant was more transmissible but we had the weapons to fight back and it was seemingly sufficiently genetically stable to present a hard target that the might of human science was bearing down on a truly impressive manner.
Now, sitting in London, quite possibly the world’s worst-hit city, with local infections (at ward level) heading past 2% of the population and the R number climbing to 1.4 or higher, that all feels pretty hubristic.
We still don’t know what is perhaps the most important thing about Omicron – what its hospitalisation (and ultimately, death) rates will be – the comparison with Guateng so beloved by those ideologically opposed to firm action seems pretty specious given the different histories of infection and population.
Perhaps the infection will be less severe – though there is zero evolutionary biological reason I can see for that: the virus will likely only evolve to be less dangerous if there is an advantage in that, but where is the evolutionary pressure to do that? The fact is that our weaponry (unavoidably) is likely pushing the virus in pretty much towards becoming more dangerous – to multiply faster to spread faster and to evade the vaccines by mutating the protein spike we are are targeting. It’s a war of technology and the virus isn’t suing for peace.
The prospect of our hospitals system collapsing under vast number of ill people (especially in places like London where relatively few have been jabbed) is therefore very real and we won’t know for a few weeks yet if the worst will happen.
But my question here is not about what we do in the next month, but what do we do in the next decade. so here are some thoughts:
Boosters are going to be here to stay, like they are for flu – we need to restructure our health services to handle the task of annually jabbing over 60 million people. And also create the expectation that people will take their booster – both through carrots and sticks.
Related to that we need to have a system that shares the risks and costs of the annual booster updates: given that it will effectively be a guaranteed income stream for the pharma companies the level of risk will have dramatically reduced, but we should be wary of thinking that means we can expect companies and shareholders to tie up capital for no return. As the pandemic has demonstrated that our health security truly has to be global, the expectation must be that the richer countries subsidise this for the less well-off. Nor can we tolerate – globally – health care systems that limit access to vaccines, even in rich countries, by ability to pay.
We need a network of international treaties that cover this – not least because international travel for business or tourism is going to broken for a long time without them (just another reason why Brexit was a stupid idea, but that’s not for this post).
We will need to restructure our economies – not just because a new balance between home and office working is likely to be permanent, but also because an economy that relies on a season of excess in the middle of winter is now broken. There isn’t going to be a “normal Christmas” anytime soon and governments need to face up to that.
And – and this is perhaps the most long-term issue. If vaccines were the Manhattan Project – a massive effort that delivered a working weapon in the face of a pressing need, we are now on to the search for the thermonuclear device, which in this case means something that deals with the cytokine storm that is still killing people even though the infection has been seen off.
That will require a long-term partnership between government and pharma to incentivise the (economically) highly-risky leading-edge research that is needed here. Yes, we could just rely on university medical schools to do this but why shouldn’t we instead ensure private capital is deployed? The benefits of new therapies here are likely to stretch well-beyond Covid-19 but we may not see them for decades.
I have spent much of the last week working on optimising my Forth code, or more accurately, optimising the RISC-V assembly which actually does the work of executing the Forth words. (See this for the initial discussion.)
I cannot properly profile the assembly itself – my efforts to compile the code using gcc -pg aren’t working (I’ll need to investigate if they can be made to work) because of linker issues, so I am stuck with “poor man’s profiling” – that is to say, just interrupting the code over repeated runs, noting where it is and building up a sample from that.
So with 90 samples from the completely unoptimised code the share of hit count looked like this:
We can see that “RUN” and “NEXT” (which aren’t actual Forth words but assembly labels) dominate but – as I discussed here, there really isn’t any room to optimise them – they are about as lean as we can get already.
What I did do was target my efforts on the code that handles constants (here marked as “ROW-BYTE-SIZE” and “ROW-SIZE”. The constant handling code used two generic routines to ensure that fixed sized blocks of code placed a fixed value on the stack and then passed control to “NEXT”.
The code was long-winded because it needed to handle any values without being tripped up by sign extension. For the jump code a number had to be loaded into a register and then a jump to the value of the register executed. The code was again designed to handle any value for the jump-to address – but that meant that, while NEXT was always at a 16 bit address, code to handle a full 64 bit address was being executed.
The simple thing to do was to place the needed values (both the value of the constant and the address of NEXT) into memory at a known address at compile time and then just load those addresses at run time.
And, in fact, this method could be used in some other places where I had relied on my hand crafted “build a 64 bit number through register adds” code – and it saved a lot of space and instructions.
So now (122 samples this time), the (crude, remember) profile count looks like this:
We can see that “NEXT” and “RUN” (NEXT is called by every compiled word at the end of execution and then passes control to RUN) are maybe even more dominant and ROW-BYTE-SIZE (which is called 90,000 27,000,000 times) has shrunken dramatically (and ROW-SIZE has gone completely).
The optimisation really does speed the code up too – execution time falls from about 9.5 seconds to about 8.5 speconds with all the various tweaks I’ve added.
But still the (32 bit) GForth version runs around 1 second faster than the (64 bit) Riscyforth version. Now, I didn’t start this project (Riscyforth that is) with the aim of writing the fastest-ever Forth but a 13% slow down does feel like a lot even if you are getting double the bits for it: so where can I go?
Some code is called a lot (eg., @/FETCH which has a body of just four instructions) but it is already (afaics) fully optimised:
I have implemented +LOOP in a non-standard way – assuming that a negative increment implies a countdown – that should probably go (and instead I should add a non-standard -LOOP word in the same way as GForth does) but at best that is going to save one comparison and one jump (though cache-miss effects may make the jump more costly than it might look on the surface? There is a decent amount of stuff in the function header which might cause some cache misses.).
But, in general, it looks as though it will be hard to squeeze much more out of this.
Lying in the bath this morning – relaxing into the Christmas holiday (despite the raging epidemic of omicron-variant Covid that is currently rampaging through London in an increasingly frightening manner), I had thoughts about several blog posts to write over the break, and this is the first…
It’s a book recommendation: Ian Stewart‘s Concepts of Modern Mathematics. It’s around 50 years since its first edition (and even this Dover edition is 26 years old) but – part from the disconcerting habit of assuming all humans are male – it doesn’t feel like a dated work at all.
In one other way, though, it is a product of its times – the controversy over “new mathematics” – the teaching methods adopted widely across the west in the 60s and 70s in an effort to improve mathematical understanding of students and move beyond rote-teaching. Interestingly, as the wikipedia article demonstrates, its adoption was driven, in part, by the post-Sputnik panic in US science and engineering.
It’s adoption was certainly controversial as Tom Lehrer makes clear here…
Now, having been educated in both Ireland (primary and secondary) and England (secondary) I have actually seen both approaches in operation at first hand – and its certainly the case that the traditional methods gave me a solid (algorithmic in the sense of using a simple and repetitive method) grounding in such things as long division which I think were missing in England, but it’s also not difficult to see why discussions of things like sets, modular arithmetic and matrices are pretty essential for anyone thinking of a career in science and technology.
That said the UK’s School Mathematics Project seems to have avoided what appears to be the excesses of “new math” in the US. I cannot produce any evidence for it but I am tempted to wonder if the UK’s 1980s (and continuing) software boom was in part driven by its innovation in maths teaching (read Francis Spufford’s excellent Backroom Boys for more on that).
Indeed these things are so fundamental to software development that it’s hard to imagine a case for not teaching them – but 50 years ago they were clearly seen as in some way esoteric or even fetishistic.
Professor Stewart’s book concentrates on understanding of the concepts and not rigorous proofs – which makes the book readable and digestible in settings like my bath. I am not going to say it could be mastered by anyone regardless of their current understanding of maths, but its not a traditional textbook and I would suggest it would be a good backgrounder if you had basic (O level/GCSE) maths qualifications and were interested in working as a software developer.
One of the points the author makes in his preface is that the UK has seen its science and technology bases erode as more and more graduates look for careers in law and financial services. I hope that will become the subject of another blog here shortly.
I discovered when I tried to write my optimised code from last night that there was a problem with it – the various register add instructions use a 5 bit number to number the registers being accessed – allowing access to registers 0 to 31.
But the program counter is register number 32, so just can’t be used in these commands. After a bit of head scratching I came up with a solution:
That aui t0, 0 adds 0 to the value in the pc and stores it in t0, so is the equivalent of mv t0, pc – and then by just taking an offset straight from t0 I also saved another instruction.
Does it work? Yes, it compiles and runs.
But does it work by saving time? Yes – knocks about half a second off the benchmark execution time: roughly a 6% saving, which isn’t bad. But my code is still rather adrift of GForth, so I need to look deeper.
A while ago I ran some very simple pieces of Forth – essentially nothing more than empty loops – and was pleased to see that Riscyforth – my assembly-based Forth for RISC-V ran about 10% faster than GForth – the GNU project’s C-based Forth.
But a more serious benchmark – an O(n**3) matrix multiplication program – (see below) – tells a different story.
\ .( Loading Matrix Multiplication benchmark...) cr
\ NOTE: This version needs 0.5MB data space
\ A classical benchmark of an O(n**3) algorithm; Matrix Multiplication
\ Part of the programs gathered by John Hennessy for the MIPS
\ RISC project at Stanford. Translated to forth by Marty Fraeman,
\ Johns Hopkins University/Applied Physics Laboratory.
\ MM forth2c doesn't have it !
: mybounds over + swap ;
\ : cell 8 ;
: initiate-seed ( -- ) 74755 seed ! ;
: random ( -- n ) seed @ 1309 * 13849 + 65535 and dup seed ! ;
300 constant row-size
row-size cells constant row-byte-size
row-size row-size * constant mat-size
mat-size cells constant mat-byte-size
align create ima mat-byte-size allot
align create imb mat-byte-size allot
align create imr mat-byte-size allot
: initiate-matrix ( m[row-size][row-size] -- )
mat-byte-size mybounds do
random dup 120 / 120 * - 60 - i !
: innerproduct ( a[row][*] b[*][column] -- int)
0 row-size 0 do
>r over @ over @ * r> + >r
swap cell+ swap row-byte-size +
>r 2drop r>
: main ( -- )
imr ima mat-byte-size mybounds do
imb row-byte-size mybounds do
j i innerproduct over ! cell+
row-size cells +loop
This code runs somewhat more than 10% more slowly on Riscyforth than on GForth.
So how do I optimise it?
Unfortunately using a profiling tool isn’t an option as, as far I am aware, there isn’t one. Instead I have to run the code and just hit break a few times and see where it stops.
This is what this gives me:
(Here I have used ‘CONSTANT’ for the the ‘ROW-BYTE-SIZE’ constant for ease of presentation.)
NEXT and RUN are the obvious candidates, you might think, but they are already thin offerings:
ld s8, 0(s7) #word address register takes content of next secondary
addi s7, s7, ADDRWIDTH #next secondary along
ld t0, 0(s8) #extract first instruction address of primative
addi s8, s8, ADDRWIDTH #increment WA
jalr zero, t0, 0 #run the code
Essentially I don’t see what I can shave off there – there are just five instructions between them and no bloat.
I did find a small change in LOOP:
Here I was able to shave a single instruction off – but that isn’t going to make a big difference. But the other functions mentioned above seem lean otherwise.
The real task is to optimise the CONSTANT code. The compilation process generates the constant code on the fly – as each constant needs its own code to return a number to the stack and to return to the normal stream of execution:
This is the code for ROW-BYTE-SIZE:
0x0000003ff7aa2418: li t0,0
0x0000003ff7aa241c: slli t0,t0,0x20
0x0000003ff7aa2420: lui t1,0x0
0x0000003ff7aa2424: slli t1,t1,0x20
0x0000003ff7aa2428: srli t1,t1,0x20
0x0000003ff7aa242c: li t2,1
0x0000003ff7aa2430: slli t2,t2,0xb
0x0000003ff7aa2434: add t0,t0,t2
0x0000003ff7aa2438: li t2,352
0x0000003ff7aa243c: add t0,t0,t2
0x0000003ff7aa2440: add t0,t0,t1
0x0000003ff7aa2444: addi sp,sp,-8
0x0000003ff7aa2448: sd t0,0(sp)
0x0000003ff7aa244c: li t0,0
0x0000003ff7aa2450: slli t0,t0,0x20
0x0000003ff7aa2454: lui t1,0x10
0x0000003ff7aa2458: slli t1,t1,0x20
0x0000003ff7aa245c: srli t1,t1,0x20
0x0000003ff7aa2460: li t2,1
0x0000003ff7aa2464: slli t2,t2,0xb
0x0000003ff7aa2468: add t0,t0,t2
0x0000003ff7aa246c: li t2,236
0x0000003ff7aa2470: add t0,t0,t2
0x0000003ff7aa2474: add t0,t0,t1
0x0000003ff7aa2478: jr t0
It’s long (96 bytes is long compared to most Forth primitives) because it’s designed to ensure it will work with all addresses and numbers so, eg at 0x3FF7AA2424 and 0x3FF7AA2428 numbers are shifted 20 bits to the left and then 20 bits to the right to avoid sign extension issues.
But there are ways to optimize this – the code is basically in two lumps: between 0x3FF7AA2418 – 0x3FF7AA2448 it reads a number into reigisters and stores it on the stack – this number is fixed at compile time.
Then between 0x3FF7AA244C and 0x3FF7AA2478 it first loads the address of the NEXT function into register t0 then jumps to that address. Again this address is known at compile time and is fixed.
So the obvious solution is to write both numbers as 64-bit constants into memory at compile time and then read them eg (I haven’t tested this code – just written it as I’ve gone along here).
If that works it would replace 25 instructions with just 8 – the next task is to see if it does!
If it does I will be a little sad – I worked quite hard to get that position/number independent code to work. But it also won’t be the first time on this project that I have found a simpler approach gives better and faster code.