I have spent much of the last week working on optimising my Forth code, or more accurately, optimising the RISC-V assembly which actually does the work of executing the Forth words. (See this for the initial discussion.)
I cannot properly profile the assembly itself – my efforts to compile the code using
gcc -pg aren’t working because of linker issues (I’ll need to investigate whether they can be made to work) – so I am stuck with “poor man’s profiling”: interrupting the code over repeated runs, noting where it is each time, and building up a sample from that.
So, with 90 samples from the completely unoptimised code, the share of hits looked like this:
We can see that “RUN” and “NEXT” (which aren’t actual Forth words but assembly labels) dominate, but – as I discussed here – there really isn’t any room to optimise them: they are about as lean as we can get already.
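For readers wondering why there is so little fat to trim: the sketch below is not Riscyforth’s actual NEXT (that code isn’t shown in this post), but a classic direct-threaded inner interpreter on RISC-V – assuming, from the listings later on, that s7 holds the Forth thread pointer – needs only three instructions:

```
# Illustrative sketch only, NOT Riscyforth's actual NEXT.
# Assumes s7 is the thread (Forth instruction) pointer.
NEXT:
    ld   t0, 0(s7)      # fetch the address of the next word's code
    addi s7, s7, 8      # advance the thread pointer one 64-bit cell
    jr   t0             # jump to that code
```

With a body that small, there is essentially nothing left to optimise away.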
What I did do was target my efforts on the code that handles constants (here marked as “ROW-BYTE-SIZE” and “ROW-SIZE”). The constant-handling code used two generic routines to ensure that fixed-size blocks of code placed a fixed value on the stack and then passed control to “NEXT”.
The code was long-winded because it needed to handle any value without being tripped up by sign extension. For the jump, a number had to be loaded into a register and then a jump made to the address in that register. This code was again designed to handle any jump-to address – which meant that, while NEXT always sat at a 16-bit address, code capable of handling a full 64-bit address was being executed.
The simple thing to do was to place the needed values (both the value of the constant and the address of NEXT) into memory at known addresses at compile time, and then just load from those addresses at run time.
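The change can be sketched like this – illustrative only, with labels, values and register choices of my own rather than Riscyforth’s actual routines:

```
# Before (sketch): synthesise an arbitrary 64-bit constant in registers.
# A fully general sequence needs several lui/addi/slli steps, roughly:
    lui   t0, 0x12345          # upper bits of the high word
    addi  t0, t0, 0x678
    slli  t0, t0, 32           # shift into the high half
    lui   t1, 0x9abcd
    addi  t1, t1, 0xef0        # careful: addi sign-extends its immediate
    ...                        # ...plus fix-ups for that sign extension
    add   t0, t0, t1

# After (sketch): plant the value in memory at assembly time and load it.
# The same trick works for the address of NEXT before an indirect jump.
    ld    t0, constant_cell    # assembler pseudo-instruction: auipc + ld
    ...
constant_cell:
    .dword 0x123456789abcdef0  # hypothetical value, fixed at compile time
```

The `ld rd, symbol` form is a standard GNU assembler pseudo-instruction that expands to an auipc/ld pair – two instructions regardless of how awkward the 64-bit value is.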
And, in fact, this method could be used in some other places where I had relied on my hand-crafted “build a 64-bit number through register adds” code – and it saved a lot of space and instructions.
So now (122 samples this time), the (crude, remember) profile count looks like this:
We can see that “NEXT” and “RUN” (NEXT is called by every compiled word at the end of its execution and then passes control to RUN) are maybe even more dominant, while ROW-BYTE-SIZE (which is called 27,000,000 times) has shrunk dramatically (and ROW-SIZE has gone completely).
The optimisation really does speed the code up too – execution time falls from about 9.5 seconds to about 8.5 seconds with all the various tweaks I’ve added.
But still the (32-bit) GForth version runs around 1 second faster than the (64-bit) Riscyforth version. Now, I didn’t start this project (Riscyforth, that is) with the aim of writing the fastest-ever Forth, but a 13% slowdown does feel like a lot even if you are getting double the bits for it: so where can I go?
Some code is called a lot (e.g., @/FETCH, which has a body of just four instructions) but is already (afaics) fully optimised:
CODEHEADERZ FETCH, @, BASE, 0x01
    ld t0, 0(sp)
    ld t1, 0(t0)
    sd t1, 0(sp)
    tail NEXT
PLUS/+ is bigger but it too looks as thin as it can be to me (it loads the top two numbers off the stack, adds them, stores the result back on the stack and shrinks the stack to match the loss of one value):
CODEHEADERZ PLUS, +, DOT, 0x01
    ld a0, 0(sp)
    ld a1, 8(sp)
    add t0, a0, a1
    sd t0, 8(sp)
    addi sp, sp, 8
    tail NEXT
In fact everything looks pretty lean, with the possible exception of the LOOP/+LOOP code, which shares common execution paths:
CODEHEADER _PLUSLOOP, _UNLOOP, 0x0
    ld t3, 0(sp)
    addi sp, sp, 8
    ld t0, 0(s10)
    add t0, t0, t3
    bgt t3, zero, loop_plusloop_entry  # decrement on this path
    ld t1, 8(s10)
    ld t2, 16(s10)
    bge t1, t0, loop_over              # reverse check
    sd t0, 0(s10)
    mv s7, t2
    tail NEXT
CODEHEADER _LOOP, _PLUSLOOP, 0x0
    ld t0, 0(s10)                      # current value
    addi t0, t0, 1
loop_plusloop_entry:
    ld t1, 8(s10)                      # limit
    bge t0, t1, loop_over
    ld s7, 16(s10)                     # return point
    sd t0, 0(s10)
    tail NEXT
loop_over:
    addi s10, s10, 24
    tail NEXT
I have implemented +LOOP in a non-standard way – assuming that a negative increment implies a countdown. That should probably go (and instead I should add a non-standard -LOOP word, in the same way as GForth does), but at best that is going to save one comparison and one jump (though cache-miss effects may make the jump more costly than it looks on the surface – there is a decent amount of stuff in the function header which might cause some cache misses).
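If I do go down the GForth route, a -LOOP along the lines of the existing _LOOP body might look something like this – a sketch only, reusing the register conventions above; the dictionary-header arguments are hypothetical, and the simple ble termination test is an approximation (GForth’s -LOOP actually tests for the index crossing the limit boundary):

```
# Hypothetical -LOOP sketch, not working Riscyforth code.
CODEHEADER _MINUSLOOP, _LOOP, 0x0      # linkage arguments are made up
    ld   t3, 0(sp)                     # decrement supplied on the stack
    addi sp, sp, 8
    ld   t0, 0(s10)                    # current index
    sub  t0, t0, t3                    # count down
    ld   t1, 8(s10)                    # limit
    ble  t0, t1, loop_over             # simplified termination test
    ld   s7, 16(s10)                   # return point
    sd   t0, 0(s10)
    tail NEXT
```

That would let +LOOP drop its sign check and branch entirely, at the cost of one more word in the dictionary.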
But, in general, it looks as though it will be hard to squeeze much more out of this.
Won’t hurt to try I suppose.