Optimising my Forth code

A while ago I ran some very simple pieces of Forth – essentially nothing more than empty loops – and was pleased to see that Riscyforth – my assembly-based Forth for RISC-V ran about 10% faster than GForth – the GNU project’s C-based Forth.

But a more serious benchmark – an O(n**3) matrix multiplication program – (see below) – tells a different story.

\ .( Loading Matrix Multiplication benchmark...) cr
\ NOTE: This version needs 0.5MB data space

\ A classical benchmark of an O(n**3) algorithm; Matrix Multiplication
\
\ Part of the programs gathered by John Hennessy for the MIPS
\ RISC project at Stanford. Translated to forth by  Marty Fraeman,
\ Johns Hopkins University/Applied Physics Laboratory.

\ MM forth2c doesn't have it !
: mybounds  over + swap ;

variable seed
\ : cell 8 ;

: initiate-seed ( -- )  74755 seed ! ;
: random  ( -- n )  seed @ 1309 * 13849 + 65535 and dup seed ! ;

300 constant row-size
row-size cells constant row-byte-size

row-size row-size * constant mat-size
mat-size cells constant mat-byte-size

align create ima mat-byte-size allot
align create imb mat-byte-size allot
align create imr mat-byte-size allot

: initiate-matrix ( m[row-size][row-size] -- )
  mat-byte-size mybounds do
    random dup 120 / 120 * - 60 - i !
  cell +loop
;

: innerproduct ( a[row][*] b[*][column] -- int)
  0 row-size 0 do
    >r over @ over @ * r> + >r
    swap cell+ swap row-byte-size +
    r>
  loop
  >r 2drop r>
;

: main  ( -- )
  initiate-seed
  ima initiate-matrix
  imb initiate-matrix 
  imr ima mat-byte-size mybounds do
    imb row-byte-size mybounds do
      j i innerproduct over ! cell+ 
    cell +loop
  row-size cells +loop
  drop
;

This code runs somewhat more than 10% more slowly on Riscyforth than on GForth.

So how do I optimise it?

Unfortunately using a profiling tool isn’t an option as, as far I am aware, there isn’t one. Instead I have to run the code and just hit break a few times and see where it stops.

This is what this gives me:

(Here I have used ‘CONSTANT’ for the the ‘ROW-BYTE-SIZE’ constant for ease of presentation.)

NEXT and RUN are the obvious candidates, you might think, but they are already thin offerings:

NEXT:
                ld s8, 0(s7)                        #word address register takes content of next secondary
                addi s7, s7, ADDRWIDTH              #next secondary along
  
  RUN:
                ld t0, 0(s8)                        #extract first instruction address of primative
                addi s8, s8, ADDRWIDTH              #increment WA
                jalr zero, t0, 0                    #run the code    

Essentially I don’t see what I can shave off there – there are just five instructions between them and no bloat.

I did find a small change in LOOP:

Here I was able to shave a single instruction off – but that isn’t going to make a big difference. But the other functions mentioned above seem lean otherwise.

The real task is to optimise the CONSTANT code. The compilation process generates the constant code on the fly – as each constant needs its own code to return a number to the stack and to return to the normal stream of execution:

This is the code for ROW-BYTE-SIZE:

   0x0000003ff7aa2418:	li	t0,0
   0x0000003ff7aa241c:	slli	t0,t0,0x20
   0x0000003ff7aa2420:	lui	t1,0x0
   0x0000003ff7aa2424:	slli	t1,t1,0x20
   0x0000003ff7aa2428:	srli	t1,t1,0x20
   0x0000003ff7aa242c:	li	t2,1
   0x0000003ff7aa2430:	slli	t2,t2,0xb
   0x0000003ff7aa2434:	add	t0,t0,t2
   0x0000003ff7aa2438:	li	t2,352
   0x0000003ff7aa243c:	add	t0,t0,t2
   0x0000003ff7aa2440:	add	t0,t0,t1
   0x0000003ff7aa2444:	addi	sp,sp,-8
   0x0000003ff7aa2448:	sd	t0,0(sp)
   0x0000003ff7aa244c:	li	t0,0
   0x0000003ff7aa2450:	slli	t0,t0,0x20
   0x0000003ff7aa2454:	lui	t1,0x10
   0x0000003ff7aa2458:	slli	t1,t1,0x20
   0x0000003ff7aa245c:	srli	t1,t1,0x20
   0x0000003ff7aa2460:	li	t2,1
   0x0000003ff7aa2464:	slli	t2,t2,0xb
   0x0000003ff7aa2468:	add	t0,t0,t2
   0x0000003ff7aa246c:	li	t2,236
   0x0000003ff7aa2470:	add	t0,t0,t2
   0x0000003ff7aa2474:	add	t0,t0,t1
   0x0000003ff7aa2478:	jr	t0

It’s long (96 bytes is long compared to most Forth primitives) because it’s designed to ensure it will work with all addresses and numbers so, eg at 0x3FF7AA2424 and 0x3FF7AA2428 numbers are shifted 20 bits to the left and then 20 bits to the right to avoid sign extension issues.

But there are ways to optimize this – the code is basically in two lumps: between 0x3FF7AA2418 – 0x3FF7AA2448 it reads a number into reigisters and stores it on the stack – this number is fixed at compile time.

Then between 0x3FF7AA244C and 0x3FF7AA2478 it first loads the address of the NEXT function into register t0 then jumps to that address. Again this address is known at compile time and is fixed.

So the obvious solution is to write both numbers as 64-bit constants into memory at compile time and then read them eg (I haven’t tested this code – just written it as I’ve gone along here).

mv t0, pc
addi t0, t0, 32
ld t1, 0(t0)
addi sp, sp, -8
sd t1, 0(sp)
ld t2, 8(t0)
jr t2
<alignment-buffer-4-byes>
CONSTANT-NUMBER
ADDRESS-OF-NEXT

If that works it would replace 25 instructions with just 8 – the next task is to see if it does!

If it does I will be a little sad – I worked quite hard to get that position/number independent code to work. But it also won’t be the first time on this project that I have found a simpler approach gives better and faster code.

*If* it works of course.

(I recommend this book for RISC-V assembly: The RISC-V Reader)