How I actually extended Riscyforth


Previously I suggested I was going to add dynamic linking by loading object files directly, but that turned out to be pretty inflexible.

Then I thought I could pass the needed initialisation values via the gp register (using it as a stack pointer). That actually seemed to work, but it was pointed out to me that I could be taking a big gamble on what might happen when an interrupt arrived while I had a bogus value in that register.

The solution turned out to be quite simple – a very small shared library that is linked both to the main executable and to the modules – and I can even make it easy on myself and write it in C:

/** RISCYLIB                                            **/
/** Common code for Riscyforth and Riscyforth modules   **/
/** Copyright (C) Adrian McMenamin, 2022                **/
/** Licenced for reuse under the terms of v2 of GNU GPL **/

unsigned long nextAddress = 0;
unsigned long dictionaryAddress = 0;

unsigned long getNextAddress(void)
{
	return nextAddress;
}

void setNextAddress(unsigned long addressIn)
{
	nextAddress = addressIn;
}

unsigned long getDictionaryAddress(void)
{
	return dictionaryAddress;
}

void setDictionaryAddress(unsigned long addressIn)
{
	dictionaryAddress = addressIn;
}
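
(That gets built as a shared object in the usual way – something like gcc -fPIC -shared -o libriscy.so riscylib.c, though I’m guessing at the exact build line here.)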

This just provides getters and setters for two variables, which are used to pass around the address of the NEXT function (called by every Forth word at the end of execution) and to manage the end of the dictionary (as we load modules we add their word definitions to the end of the dictionary).
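
In use, the flow is something like this – a sketch only, as the real callers are in assembly, and the names riscyforth_startup, module_getnext and NEXT_entry are purely illustrative:

/* Sketch of how the shared variables get used - not the actual Riscyforth
   code; the real callers are in assembly and these names are invented */
extern unsigned long getNextAddress(void);
extern void setNextAddress(unsigned long addressIn);

extern void NEXT_entry(void);		/* stands in for the assembly NEXT routine */

void riscyforth_startup(void)		/* main executable, once, at startup */
{
	setNextAddress((unsigned long) &NEXT_entry);
}

unsigned long module_getnext(void)	/* a module, during initialisation */
{
	return getNextAddress();	/* where every word jumps when it is done */
}

The dictionary address works the same way: as each module’s words are linked in, the end of the dictionary moves on and setDictionaryAddress records the new value.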

Midnight Commander gives us some more information about the compiled libriscy.so:

Here the dictionaryAddress and nextAddress symbols are marked as ‘B’ – ie, in the BSS section – while the getter and setter functions are marked as ‘T’ – ie, in the “text”, or executable, section of the library.
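
For anyone without mc to hand, nm tells the same story – something like this (the addresses are illustrative and will differ from build to build; compiler-generated symbols trimmed):

$ nm libriscy.so
0000000000004030 B dictionaryAddress
00000000000005c0 T getDictionaryAddress
00000000000005a8 T getNextAddress
0000000000004028 B nextAddress
00000000000005d8 T setDictionaryAddress
00000000000005ec T setNextAddress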

Adding dynamic linking to Riscyforth


This is one of those blogs one writes to clarify inchoate thoughts, so it’s sorry-not-sorry if it starts to wander around…

Last week I got to an important point with Riscyforth, my Forth for RISC-V single board computers: I tagged my first release (v0.1).

I did that because I had, finally, managed to cover all of the core and extended core code of the Forth 2012 standard. Essentially that means something written using words from the core of the language should compile and run (and give correct results).

But merely covering the core definitions doesn’t deliver a system that many real-world users will want to play with, except out of curiosity. No floating point maths, no control over screen output beyond, essentially, printing line by line (though there are a few additions here, like coloured output) and certainly no graphics.

But I also don’t want to build an ever-larger monolithic system. Users should be able to pick and choose which parts of the extended language they get (as well, of course, being free to extend it in Forth itself).

That is going to require some form of dynamic loading and linking. Whilst all the tools are now available for anyone (with a screw loose) to write their own extended floating point words using the integer primitives, it’s much more sensible to provide words (written in assembly) that call the library code provided by the operating system.

My initial thought was that there should be a riscyforth.conf file in the Riscyforth distribution that would list which modules are loaded on startup. But I now think a more flexible way is to provide a new Forth word – LOADMODULE ("<name>" -- flag), say – that specifies a module to load. This word could be used at any time, but if we put it in a short Forth file/program – eg startup.fth – then Riscyforth could be made to execute it at start-up, and users would edit that file (with lines like LOADMODULE float – that module name is purely illustrative) to fix what they started with.

But how to implement it? The best way seems to be to supply object files (along with the source – this is a GPL-licenced project) and have mmap load them into memory. My code already uses mmap to create the space where new words can be written, so the theory seems sound enough, and I have found this on the web which says it can be done. It might also be a bit of a challenge – but then again, it wouldn’t be any fun if it wasn’t.
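
As a very rough sketch of the mmap half of that idea (this is not the Riscyforth loader – just an illustration, with a hypothetical helper name and minimal error handling):

/* Sketch only: map a module file into executable memory.
   Relocation/fix-up of the loaded code is the hard part and is not shown. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *load_module(const char *path)	/* hypothetical helper */
{
	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return NULL;
	struct stat st;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return NULL;
	}
	/* writable so the code can be fixed up, executable so words can run */
	void *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE | PROT_EXEC,
			  MAP_PRIVATE, fd, 0);
	close(fd);
	return (base == MAP_FAILED) ? NULL : base;
}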

More on the Riscyforth versus Gforth shootout


This is a further post about optimising Riscyforth, my Forth language for RISC-V single board computers.

Riscyforth is a 64-bit Forth which runs on top of Linux – my test system is the Sipeed Nezha SBC, which runs a 1 GHz single core CPU (so one cycle is one nanosecond). I’m testing it against GForth, the GNU Project’s Forth.

Whilst Riscyforth is written in RISC-V assembly, GForth is mainly Forth on top of C. For those familiar with Forth techniques – Riscyforth is a very traditional indirect threaded implementation, GForth uses a mixture of this and (faster) direct threaded code.

GForth also sticks to the traditional Forth default of 32-bit numbers; Riscyforth uses 64 bit numbers throughout.

GForth is a much more sophisticated (and portable) environment than Riscyforth – and certainly my suggestion/intention has never been to claim I will write a better Forth than GForth – the GForth effort goes back to the 1990s, Riscyforth goes back to 28 December 2020!

The two key posts before this are “Optimising my Forth code” and “The actual optimised code” (both reproduced below).

The quick summary of those two posts is that I made significant timing savings by replacing correct but long-winded compile-time-generated code – which loaded registers with values fixed at compile time via a sequence of adds and bit shifts – with code that loaded those values from a known memory location. However those optimisations were not enough to close the performance gap with the GNU Project’s GForth – which runs matrix.fth about 12% faster.

My code started out taking about 9.5 seconds and I have got that down to about 8.5 seconds, whilst GForth takes about 7.5 seconds.

Not satisfied with leaving it there I wrote some code to count cycles taken in executing Forth words or groups of words, and looked closer at what might be fixable.

		CODEHEADER DEBUGIN, Q, 0x0
		#(--)
		#debug word
		la t0, CYCLESTART
		rdcycle t1
		sd t1, 0(t0)
		tail NEXT
		
		CODEHEADER DEBUGOUT, DEBUGIN, 0x0
		#(--)
		#debug word
		rdcycle t0			#cycle count now
		la t1, CYCLESTART
		ld t2, 0(t1)			#cycle count recorded by DEBUGIN
		bgt t0, t2, debugout_normal
		#the cycle counter has wrapped since DEBUGIN
		li t3, -1
		sub t4, t3, t2			#max counter value minus the start reading
		add t4, t4, t0			#plus the cycles accumulated since the wrap
		j debugout_sum
  debugout_normal:
		sub t4, t0, t2			#elapsed cycles
  debugout_sum:
		la t5, CYCLECOUNT
		ld t0, 0(t5)
		add t1, t0, t4			#add the elapsed cycles to the running total
		sd t1, 0(t5)
		la t6, CYCLEPINGS
		ld t2, 0(t6)
		addi t2, t2, 1			#and increment the sample count
		sd t2, 0(t6)
		tail NEXT

		CODEHEADER DEBUGRESULTS, DEBUGOUT, 0x0
		#(-- n n n)
		la t0, CYCLECOUNT
		la t1, CYCLEPINGS
		ld t2, 0(t0)			#total elapsed cycles
		ld t3, 0(t1)			#number of samples
		div t4, t2, t3			#average cycles per sample
		addi sp, sp, -24		#push all three onto the data stack
		sd t2, 0(sp)
		sd t3, 8(sp)
		sd t4, 16(sp)
		sd zero, 0(t0)			#and zero everything for the next run
		sd zero, 0(t1)
		la t2, CYCLESTART
		sd zero, 0(t2)
		tail NEXT

DEBUGIN records the cycle count at the start of some block of code, whilst DEBUGOUT calculates the time since the previous call to DEBUGIN, adds that to a total of elapsed times and increments a counter. DEBUGRESULTS then places, on the stack, the average time per IN – OUT cycle, the total number of samples and the total elapsed cycles.
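
In C terms (a restatement only – the names here are invented), DEBUGOUT’s wraparound handling amounts to:

/* how DEBUGOUT computes elapsed cycles, restated in C; names invented */
unsigned long elapsed_cycles(unsigned long start, unsigned long now)
{
	if (now > start)
		return now - start;			/* the normal case */
	/* the cycle counter wrapped between DEBUGIN and DEBUGOUT */
	return (0xFFFFFFFFFFFFFFFFUL - start) + now;
}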

Looking at the Forth code, the main word is what we call to run the program:

It initialises two 300 x 300 matrices (each initialisation taking about 53 million cycles) and then executes a loop and a nested loop: each loop runs 300 times, so the inner loop is executed 90000 times. The outer loop is recorded as taking just over 26 million cycles per execution whilst each inner loop takes about 87000 cycles per iteration.

The code to measure the cycles has an effect on the measurement so we can only be approximate here – eg while the inner loop is measured as taking 87000 cycles per iteration, each execution of innerproduct is measured as requiring around 88000 cycles (ie, longer than the code that calls it – an obvious contradiction). The test code itself is measured as taking about 15 cycles per execution – not much in itself, but rising to 409 million total cycles when called 27 million times inside innerproduct.

: innerproduct ( a[row][*] b[*][column] -- int)
  0 row-size 0 do
    >r over @ over @ * r> + >r
    swap cell+ swap row-byte-size +
    r>
  loop
  >r 2drop r>
;

In any case it is clear that innerproduct is where the real work is done:

Inside innerproduct lurks another loop, again called 300 times, meaning that the code inside this loop gets called 300 x 300 x 300 – 27 million – times: the graphic above shows the approximate (having accounted for the cost of calling the instrumentation code) total cycle count for each line of that loop.

Right at the start of this (see the first of the posts linked above), the big cycle burner was the constant code – row-byte-size here – so it’s good that this no longer seems to be the case.

It’s hard to be precise about this. A test of the row-size constant in main suggested, after accounting for the cost of the instrumentation code, that it took about 269 cycles per execution – but that looks to be a significant over-estimate: if row-byte-size took the same time as the other constant (as it should), 27 million executions of it alone would take about 7 billion cycles. A better guide seems to be that the nine words of the first line take a little less than twice the cycles of the five words of the second line – suggesting that each word takes, on average, a roughly similar time.

The final line in innerproduct is not inside the innermost loop and so is only called 90,000 times and takes, roughly, 8 million cycles.

It’s not possible to measure the cost of the loop control code using my simple instrumentation (as execution is not linear – all the work after the loop initialisation is done at the loop word) but as the total execution time of innerproduct (as measured inside innerproduct itself) is estimated at 7.6 billion cycles it seems unlikely the do ... loop code is particularly expensive.

This is all good news at one level: there appear to be no bits of code here chewing up cycles outrageously – each of the 14 words of the first two lines takes roughly 19 cycles to execute, including the time taken to move from one word to another via the Forth threading. And the arithmetic tallies: 27,000,000 iterations × 14 words × ~19 cycles ≈ 7.2 billion cycles, close to the 7.6 billion measured for innerproduct as a whole.

But at another level it represents a bit of a dead end for efforts to optimise the code: all the words being executed here are pared down to the minimum and there is no room to squeeze anything out of them.

I covered that before in the initial posts but here’s a quick reminder/example:

		CODEHEADERZ TOR, >R, TOR2, 0x01 #>R
		POP t0				#take the top of the data stack
		addi s9, s9, -STACKOFFSET	#grow the return stack
		sd t0, 0(s9)			#and store the value there
		tail NEXT

		CODEHEADERZ RFROM, R>, TOR, 0x01 #R>
		ld t0, 0(s9)			#take the top of the return stack
		PUSH t0				#push it onto the data stack
		addi s9, s9, STACKOFFSET	#shrink the return stack
		tail NEXT

The >R and R> words are called repeatedly in the loop. They use the PUSH and POP macros, which add a 64 bit value to, or remove one from, the data stack:

.macro PUSH register
        addi sp, sp, -8
        sd  \register, 0(sp)
.endm

.macro POP register
        ld \register, 0(sp)
        addi sp, sp, 8
.endm

The s9 register is used to manage the return stack (ie, as a stack pointer for that stack) – so you can see these words just transfer a 64 bit value between the two stacks, and there is simply nothing that can be done to save cycles here (or if there is, I’d very much welcome being told about it).

So, it seems, GForth is the winner here – a Forth based on optimised C is not just portable but would appear to be faster too.

But, actually, I have some counter evidence.

The Forth program factoriel.f calculates the digits of factorials. In Riscyforth I can calculate the factorial of 10000 (ie, 10000!) in about 26 seconds:

But GForth cannot manage it at all – though if I double the number of cells allocated in the GForth version (which gives both systems the same amount of memory, allowing for the difference between 64 and 32 bit cells) it will work – but it takes around 36 seconds.

GForth will calculate the factorial of smaller numbers faster than Riscyforth though: if we try 3200! (close to the limit GForth can manage with the original settings for cell numbers) then GForth takes about 7 seconds and Riscyforth takes about 8 seconds.

Taken together these results suggest to me that at least there is nothing broken in Riscyforth. It’s about as fast as it could be with 64 bit arithmetic and the indirect threaded model.

Forth code optimisation revisited


I have spent much of the last week working on optimising my Forth code, or more accurately, optimising the RISC-V assembly which actually does the work of executing the Forth words. (See this for the initial discussion.)

I cannot properly profile the assembly itself – my efforts to compile the code using gcc -pg aren’t working because of linker issues (I’ll need to investigate whether they can be made to work) – so I am stuck with “poor man’s profiling”: that is to say, just interrupting the code over repeated runs, noting where it is and building up a sample from that.

So with 90 samples from the completely unoptimised code the share of hit count looked like this:

Unoptimised hit count

We can see that “RUN” and “NEXT” (which aren’t actual Forth words but assembly labels) dominate but – as I discussed before – there really isn’t any room to optimise them: they are about as lean as we can get already.

What I did do was target my efforts on the code that handles constants (here marked as “ROW-BYTE-SIZE” and “ROW-SIZE”). The constant handling code used two generic routines to ensure that fixed-size blocks of code placed a fixed value on the stack and then passed control to “NEXT”.

The code was long-winded because it needed to handle any values without being tripped up by sign extension. For the jump code a number had to be loaded into a register and then a jump to the value of the register executed. The code was again designed to handle any value for the jump-to address – but that meant that, while NEXT was always at a 16 bit address, code to handle a full 64 bit address was being executed.

The simple thing to do was to place the needed values (both the value of the constant and the address of NEXT) into memory at a known address at compile time and then just load those addresses at run time.

And, in fact, this method could be used in some other places where I had relied on my hand crafted “build a 64 bit number through register adds” code – and it saved a lot of space and instructions.

So now (122 samples this time), the (crude, remember) profile count looks like this:

We can see that “NEXT” and “RUN” (NEXT is called by every compiled word at the end of execution and then passes control to RUN) are maybe even more dominant, ROW-BYTE-SIZE (which is called 27,000,000 times) has shrunk dramatically, and ROW-SIZE has gone completely.

The optimisation really does speed the code up too – execution time falls from about 9.5 seconds to about 8.5 seconds with all the various tweaks I’ve added.

But still the (32 bit) GForth version runs around 1 second faster than the (64 bit) Riscyforth version. Now, I didn’t start this project (Riscyforth that is) with the aim of writing the fastest-ever Forth but a 13% slow down does feel like a lot even if you are getting double the bits for it: so where can I go?

Some code is called a lot (eg, @/FETCH, which has a body of just four instructions) but it is already (afaics) fully optimised:

		CODEHEADERZ FETCH, @,  BASE, 0x01
		ld t0, 0(sp)			#address on top of the stack
		ld t1, 0(t0)			#fetch the value at that address
		sd t1, 0(sp)			#replace the address with the value
		tail NEXT

PLUS/+ is bigger but it too looks as thin as it can be to me (it loads the top two numbers off the stack, adds them, stores the result back on the stack and shrinks the stack to match the loss of one value):

		CODEHEADERZ PLUS, +, DOT, 0x01
		ld a0, 0(sp)			#top of stack
		ld a1, 8(sp)			#second item on the stack
		add t0, a0, a1
		sd t0, 8(sp)			#result overwrites the second item
		addi sp, sp, 8			#shrink the stack by one entry
		tail NEXT

In fact everything looks pretty lean with the possible exception of the LOOP/+LOOP code, which share common execution paths:

		CODEHEADER _PLUSLOOP, _UNLOOP, 0x0
		ld t3, 0(sp)			#the increment, from the data stack
		addi sp, sp, 8
		ld t0, 0(s10)			#current loop index
		add t0, t0, t3
		bgt t3, zero, loop_plusloop_entry
		# decrement on this path
		ld t1, 8(s10)			#limit
		ld t2, 16(s10)			#return point
		bge t1, t0, loop_over		#reverse check
		sd t0, 0(s10)
		mv s7, t2
		tail NEXT

		CODEHEADER _LOOP, _PLUSLOOP, 0x0
		ld t0, 0(s10)			#current value
		addi t0, t0, 1
  loop_plusloop_entry:
		ld t1, 8(s10)			#limit
		bge t0, t1, loop_over
		ld s7, 16(s10)			#return point
		sd t0, 0(s10)
		tail NEXT
  loop_over:
		addi s10, s10, 24		#drop the loop frame
		tail NEXT

I have implemented +LOOP in a non-standard way – assuming that a negative increment implies a countdown. That should probably go (and instead I should add a non-standard -LOOP word, in the same way GForth does), but at best that is going to save one comparison and one jump (though cache-miss effects may make the jump more costly than it looks on the surface – there is a decent amount of material in the word header which might cause some cache misses).

But, in general, it looks as though it will be hard to squeeze much more out of this.

Won’t hurt to try I suppose.

The actual optimised code


I discovered when I tried to write my optimised code from last night that there was a problem with it – the various register add instructions use a 5 bit field to identify each register being accessed, allowing access to registers 0 to 31.

But the program counter is not one of those 32 general-purpose registers, so it just can’t be used in these instructions. After a bit of head scratching I came up with a solution:

auipc t0, 0
ld t1, 24(t0)
addi sp, sp, -8
sd t1, 0(sp)
ld t1, 32(t0)
jr t1

That auipc t0, 0 adds an upper immediate of zero to the value of the pc and stores the result in t0 – the equivalent of mv t0, pc – and then, by just taking offsets straight from t0 (the six instructions occupy 24 bytes, so the constant sits at offset 24 and the address of NEXT at offset 32), I also saved another instruction.

Does it work? Yes, it compiles and runs.

But does it work by saving time? Yes – knocks about half a second off the benchmark execution time: roughly a 6% saving, which isn’t bad. But my code is still rather adrift of GForth, so I need to look deeper.

Optimising my Forth code


A while ago I ran some very simple pieces of Forth – essentially nothing more than empty loops – and was pleased to see that Riscyforth – my assembly-based Forth for RISC-V ran about 10% faster than GForth – the GNU project’s C-based Forth.

But a more serious benchmark – an O(n**3) matrix multiplication program (see below) – tells a different story.

\ .( Loading Matrix Multiplication benchmark...) cr
\ NOTE: This version needs 0.5MB data space

\ A classical benchmark of an O(n**3) algorithm; Matrix Multiplication
\
\ Part of the programs gathered by John Hennessy for the MIPS
\ RISC project at Stanford. Translated to forth by  Marty Fraeman,
\ Johns Hopkins University/Applied Physics Laboratory.

\ MM forth2c doesn't have it !
: mybounds  over + swap ;

variable seed
\ : cell 8 ;

: initiate-seed ( -- )  74755 seed ! ;
: random  ( -- n )  seed @ 1309 * 13849 + 65535 and dup seed ! ;

300 constant row-size
row-size cells constant row-byte-size

row-size row-size * constant mat-size
mat-size cells constant mat-byte-size

align create ima mat-byte-size allot
align create imb mat-byte-size allot
align create imr mat-byte-size allot

: initiate-matrix ( m[row-size][row-size] -- )
  mat-byte-size mybounds do
    random dup 120 / 120 * - 60 - i !
  cell +loop
;

: innerproduct ( a[row][*] b[*][column] -- int)
  0 row-size 0 do
    >r over @ over @ * r> + >r
    swap cell+ swap row-byte-size +
    r>
  loop
  >r 2drop r>
;

: main  ( -- )
  initiate-seed
  ima initiate-matrix
  imb initiate-matrix 
  imr ima mat-byte-size mybounds do
    imb row-byte-size mybounds do
      j i innerproduct over ! cell+ 
    cell +loop
  row-size cells +loop
  drop
;

This code runs somewhat more than 10% more slowly on Riscyforth than on GForth.

So how do I optimise it?

Unfortunately using a profiling tool isn’t an option as, as far as I am aware, there isn’t one. Instead I have to run the code, hit break a few times and see where it stops.

This is what this gives me:

(Here I have used ‘CONSTANT’ for the ‘ROW-BYTE-SIZE’ constant for ease of presentation.)

NEXT and RUN are the obvious candidates, you might think, but they are already thin offerings:

NEXT:
                ld s8, 0(s7)                        #word address register takes content of next secondary
                addi s7, s7, ADDRWIDTH              #next secondary along
  
  RUN:
                ld t0, 0(s8)                        #extract first instruction address of primitive
                addi s8, s8, ADDRWIDTH              #increment WA
                jalr zero, t0, 0                    #run the code    

Essentially I don’t see what I can shave off there – there are just five instructions between them and no bloat.

I did find a small change in LOOP:

Here I was able to shave a single instruction off – but that isn’t going to make a big difference. But the other functions mentioned above seem lean otherwise.

The real task is to optimise the CONSTANT code. The compilation process generates the constant code on the fly – as each constant needs its own code to return a number to the stack and to return to the normal stream of execution:

This is the code for ROW-BYTE-SIZE:

   0x0000003ff7aa2418:	li	t0,0
   0x0000003ff7aa241c:	slli	t0,t0,0x20
   0x0000003ff7aa2420:	lui	t1,0x0
   0x0000003ff7aa2424:	slli	t1,t1,0x20
   0x0000003ff7aa2428:	srli	t1,t1,0x20
   0x0000003ff7aa242c:	li	t2,1
   0x0000003ff7aa2430:	slli	t2,t2,0xb
   0x0000003ff7aa2434:	add	t0,t0,t2
   0x0000003ff7aa2438:	li	t2,352
   0x0000003ff7aa243c:	add	t0,t0,t2
   0x0000003ff7aa2440:	add	t0,t0,t1
   0x0000003ff7aa2444:	addi	sp,sp,-8
   0x0000003ff7aa2448:	sd	t0,0(sp)
   0x0000003ff7aa244c:	li	t0,0
   0x0000003ff7aa2450:	slli	t0,t0,0x20
   0x0000003ff7aa2454:	lui	t1,0x10
   0x0000003ff7aa2458:	slli	t1,t1,0x20
   0x0000003ff7aa245c:	srli	t1,t1,0x20
   0x0000003ff7aa2460:	li	t2,1
   0x0000003ff7aa2464:	slli	t2,t2,0xb
   0x0000003ff7aa2468:	add	t0,t0,t2
   0x0000003ff7aa246c:	li	t2,236
   0x0000003ff7aa2470:	add	t0,t0,t2
   0x0000003ff7aa2474:	add	t0,t0,t1
   0x0000003ff7aa2478:	jr	t0

It’s long (96 bytes is long compared to most Forth primitives) because it’s designed to ensure it will work with all addresses and numbers: so, eg, at 0x3FF7AA2424 and 0x3FF7AA2428 the number is shifted 32 bits to the left and then 32 bits to the right (the 0x20 in the listing) to avoid sign extension issues.

But there are ways to optimise this – the code is basically in two lumps: between 0x3FF7AA2418 and 0x3FF7AA2448 it builds a number in registers and stores it on the stack – this number is fixed at compile time.

Then between 0x3FF7AA244C and 0x3FF7AA2478 it first loads the address of the NEXT function into register t0 and then jumps to that address. Again this address is known at compile time and is fixed.

So the obvious solution is to write both numbers as 64-bit constants into memory at compile time and then read them back at run time – eg (I haven’t tested this code, just written it as I’ve gone along here):

mv t0, pc
addi t0, t0, 32
ld t1, 0(t0)
addi sp, sp, -8
sd t1, 0(sp)
ld t2, 8(t0)
jr t2
<alignment-buffer-4-bytes>
CONSTANT-NUMBER
ADDRESS-OF-NEXT

If that works it would replace 25 instructions with just 8 – the next task is to see if it does!

If it does I will be a little sad – I worked quite hard to get that position/number independent code to work. But it also won’t be the first time on this project that I have found a simpler approach gives better and faster code.

*If* it works of course.

(I recommend this book for RISC-V assembly: The RISC-V Reader)