More on the Riscyforth versus Gforth shootout

This is a further post about optimising Riscyforth, my Forth language for RISC-V single board computers.

Riscyforth is a 64-bit Forth which runs on top of Linux – my test system is the Sipeed Nezha SBC – which runs a 1 GHz single core CPU (so 1 cycle is one nano-second). I’m testing it against GForth, the GNU Project’s Forth.

Whilst Riscyforth is written in RISC-V assembly, GForth is mainly Forth on top of C. For those familiar with Forth techniques – Riscyforth is a very traditional indirect threaded implementation, GForth uses a mixture of this and (faster) direct threaded code.

GForth also sticks to the Forth standard of 32-bit numbers as the default, Riscyforth uses 64 bit numbers throughout.

GForth is a much more sophisticated (and portable) environment than Riscyforth – and certainly my suggestion/intention has never been to claim I will write a better Forth than GForth – the GForth effort goes back to the 1990s, Riscyforth goes back to 28 December 2020!

The two key posts before this are:

The quick summary of the material those two posts is that I could make significant timing savings in my code with some optimisation steps that replaced correct but long-winded compile-time generated code that loaded registers with fixed (at compile-time) numbers via a sequence of adds and bit shifts, with code that loaded those values from a known memory location.. However those optimisations were not enough to close the gap on performance with the GNU Project’s GForth – which runs matrix.fth about 12% faster.

My code started out at taking about 9.5 seconds and I have got that down to about 8.5 seconds, whilst GForth is taking about 7.5 seconds.

Not satisfied with leaving it there I wrote some code to count cycles taken in executing Forth words or groups of words, and looked closer at what might be fixable.

		#debug word
		rdcycle t1
		sd t1, 0(t0)
		tail NEXT
		#debug word
		rdcycle t0
		ld t2, 0(t1)
		bgt t0, t2, debugout_normal
		li t3, -1
		sub t4, t3, t2
		add t4, t4, t0
		j debugout_sum
		sub t4, t0, t2
		ld t0, 0(t5)
		add t1, t0, t4
		sd t1, 0(t5)
		ld t2, 0(t6)
		addi t2, t2, 1
		sd t2, 0(t6)
		tail NEXT

		#(-- n n n)
		ld t2, 0(t0)
		ld t3, 0(t1)
		div t4, t2, t3
		addi sp, sp, -24
		sd t2, 0(sp)
		sd t3, 8(sp)
		sd t4, 16(sp)
		sd zero, 0(t0)
		sd zero, 0(t1)
		sd zero, 0(t2)
		tail NEXT

DEBUGIN records the cycle count at the start of some block of code, whilst DEBUGOUT calculates the time since the previous call to DEBUGIN, adds that to a total of elapsed times and increments a counter. DEBUGRESULTS then places, on the stack, the average time per IN – OUT cycle, the total number of samples and the total elapsed cycles.

Looking at the Forth code, the main word is what we call to run the program:

It initialises two 300 x 300 matrices (each initialisation taking about 53 million cycles) and then executes a loop and a nested loop: each loop runs 300 times, so the inner loop is executed 90000 times. The outer loop is recorded as taking just over 26 million cycles per execution whilst each inner loop takes about 87000 cycles per iteration.

The code to measure the cycles has an effect on the measurement so we can only be approximate here – eg while the inner loop is measured as taking 87000 cycles per iteration, each execution of innerproduct is measured as requiring around 88000 cycles (ie, longer than the code that calls it – an obvious contradiction). The test code itself is measured as taking about 15 cycles per execution – not much in itself, but rising to 409 million total cycles when called 27 million times inside innerproduct.

: innerproduct ( a[row][*] b[*][column] -- int)
  0 row-size 0 do
    >r over @ over @ * r> + >r
    swap cell+ swap row-byte-size +
  >r 2drop r>

In any case it is clear that innerproduct is where the real work is done:

Inside innerproduct lurks another loop, again called 300 times, meaning that the code inside this loop gets called 300 x 300 x 300 – 27 million – times: the graphic above shows the approximate (having accounted for the cost of calling the instrumentation code) total cycle count for each line of that loop.

Right at the start of this (see the first post linked above), the big cycle burner was the constant code: row-byte-size here, so it’s good that no longer seems to be the case.

It’s hard to be precise about this (a test of the row-size constant in main suggested, accounting for the cost of the instrumentation code, that it took about 269 cycles per execution but that looks to be a significant over-estimate as that would imply, if row-byte-size took the same time as the other constant (as it should) that 27 million executions of it alone should take about 7 billion cycles. A better guide seems to be that the nine words of the first line take a little bit less than twice the cycles/time of the five words of the second line – suggesting that each word takes on average a roughly similar time.

The final line in innerproduct is not inside the innermost loop and so is only called 90,000 times and takes, roughly, 8 million cycles.

It’s not possible to measure the cost of the loop control code using my simple instrumentation (as execution is not linear – all the work after the loop initialisation is done at the loop word) but as the total execution time of innerproduct (as measured inside innerproduct itself) is estimated at 7.6 billion cycles it seems unlikely the do ... loop code is particularly expensive.

This is all good news at one level: there appear to be no bits of code here chewing up cycles outrageously (each of the 14 words of the first two lines take roughly 19 cycles to execute, including the time taken to move from one word to another via the Forth threading).

But at level another it represents a bit of a dead end for efforts to optimise the code. All the words being executed here are pared down to the minimum and there is no room to squeeze anything out of them.

I covered that before in the initial posts but here’s a quick reminder/example:

		POP t0
		addi s9, s9, -STACKOFFSET
		sd t0, 0(s9)
		tail NEXT

		ld t0, 0(s9)
		PUSH t0
		addi s9, s9, STACKOFFSET
		tail NEXT

The >R and R> words are called repeatedly in the loop. They use the POP and PUSH macros which add or remove a 64 bit register from the stack:

.macro PUSH register
        addi sp, sp, -8
        sd  \register, 0(sp)

.macro POP register
        ld \register, 0(sp)
        addi sp, sp, 8

While the s9 register is used to manage the return stack (ie as a stack pointer to that stack) – so you can see these words just transfer a 64 bit value to and from two different stacks and there is just nothing that can be done to save cycles here (or if there is, I’d very much welcome being told about it).

So, it seems, GForth is the winner here – a Forth based on optimised C is not just portable but would appear to be faster too.

But, actually, I have some counter evidence.

The Forth program factoriel.f calculates the numerals of factorials. In Riscyforth I can calculate the factorial of 10000 (ie, 10000!) in about 26 seconds:

But GForth cannot manage it at all – though if I double the number of cells allocated in GForth (which means both systems have the same amount of resources due to the difference between 64 and 32 bits) it will work – but it takes around 36 seconds.

GForth will calculate the factorial of smaller numbers faster than Riscyforth though: if we try 3200! (close to the limit GForth can manage with the original settings for cell numbers) then GForth takes about 7 seconds and Riscyforth takes about 8 seconds.

Taken together these results suggest to me that at least there is nothing broken in Riscyforth. It’s about as fast as it could be with 64 bit arithmetic and the indirect threaded model.