A while ago I ran some very simple pieces of Forth (essentially nothing more than empty loops) and was pleased to see that Riscyforth, my assembly-based Forth for RISC-V, ran about 10% faster than GForth, the GNU project's C-based Forth.
But a more serious benchmark, an O(n**3) matrix multiplication program (see below), tells a different story.
\ .( Loading Matrix Multiplication benchmark...) cr
\ NOTE: This version needs 0.5MB data space
\ A classical benchmark of an O(n**3) algorithm; Matrix Multiplication
\
\ Part of the programs gathered by John Hennessy for the MIPS
\ RISC project at Stanford. Translated to forth by Marty Fraeman,
\ Johns Hopkins University/Applied Physics Laboratory.

\ MM forth2c doesn't have it !
: mybounds over + swap ;

variable seed
\ : cell 8 ;

: initiate-seed ( -- )  74755 seed ! ;
: random ( -- n )  seed @ 1309 * 13849 + 65535 and dup seed ! ;

300 constant row-size
row-size cells constant row-byte-size
row-size row-size * constant mat-size
mat-size cells constant mat-byte-size

align create ima mat-byte-size allot
align create imb mat-byte-size allot
align create imr mat-byte-size allot

: initiate-matrix ( m[row-size][row-size] -- )
  mat-byte-size mybounds do
    random dup 120 / 120 * - 60 - i !
  cell +loop
;

: innerproduct ( a[row][*] b[*][column] -- int )
  0 row-size 0 do
    >r over @ over @ * r> + >r
    swap cell+ swap row-byte-size +
    r>
  loop
  >r 2drop r>
;

: main ( -- )
  initiate-seed
  ima initiate-matrix
  imb initiate-matrix
  imr ima mat-byte-size mybounds do
    imb row-byte-size mybounds do
      j i innerproduct over ! cell+
    cell +loop
    row-size cells +loop
  drop
;
This code runs somewhat more than 10% slower on Riscyforth than on GForth.
So how do I optimise it?
Unfortunately, using a profiling tool isn't an option as, as far as I am aware, there isn't one. Instead I have to run the code, hit break a few times, and see where it stops.
This is what this gives me:
(Here I have used 'CONSTANT' for the 'ROW-BYTE-SIZE' constant for ease of presentation.)
NEXT and RUN are the obvious candidates, you might think, but they are already thin offerings:
NEXT:
    ld s8, 0(s7)            #word address register takes content of next secondary
    addi s7, s7, ADDRWIDTH  #next secondary along
RUN:
    ld t0, 0(s8)            #extract first instruction address of primitive
    addi s8, s8, ADDRWIDTH  #increment WA
    jalr zero, t0, 0        #run the code
Essentially I don’t see what I can shave off there – there are just five instructions between them and no bloat.
I did find a small change in LOOP:
Here I was able to shave a single instruction off, but that isn't going to make a big difference; the other functions mentioned above seem similarly lean.
The real task is to optimise the CONSTANT code. The compilation process generates the constant code on the fly, as each constant needs its own code to push its value on the stack and then return to the normal stream of execution:
This is the code for ROW-BYTE-SIZE:
0x0000003ff7aa2418: li   t0,0
0x0000003ff7aa241c: slli t0,t0,0x20
0x0000003ff7aa2420: lui  t1,0x0
0x0000003ff7aa2424: slli t1,t1,0x20
0x0000003ff7aa2428: srli t1,t1,0x20
0x0000003ff7aa242c: li   t2,1
0x0000003ff7aa2430: slli t2,t2,0xb
0x0000003ff7aa2434: add  t0,t0,t2
0x0000003ff7aa2438: li   t2,352
0x0000003ff7aa243c: add  t0,t0,t2
0x0000003ff7aa2440: add  t0,t0,t1
0x0000003ff7aa2444: addi sp,sp,-8
0x0000003ff7aa2448: sd   t0,0(sp)
0x0000003ff7aa244c: li   t0,0
0x0000003ff7aa2450: slli t0,t0,0x20
0x0000003ff7aa2454: lui  t1,0x10
0x0000003ff7aa2458: slli t1,t1,0x20
0x0000003ff7aa245c: srli t1,t1,0x20
0x0000003ff7aa2460: li   t2,1
0x0000003ff7aa2464: slli t2,t2,0xb
0x0000003ff7aa2468: add  t0,t0,t2
0x0000003ff7aa246c: li   t2,236
0x0000003ff7aa2470: add  t0,t0,t2
0x0000003ff7aa2474: add  t0,t0,t1
0x0000003ff7aa2478: jr   t0
It's long (100 bytes, or 25 four-byte instructions, which is long compared to most Forth primitives) because it's designed to ensure it will work with all addresses and numbers. For example, at 0x3FF7AA2424 and 0x3FF7AA2428 the value is shifted 32 bits (0x20) to the left and then 32 bits to the right to clear the upper bits and avoid sign-extension issues.
But there are ways to optimise this. The code is basically in two lumps: between 0x3FF7AA2418 and 0x3FF7AA2448 it builds a number in registers and stores it on the stack (here 2048 + 352 = 2400, which is ROW-BYTE-SIZE, i.e. 300 cells of 8 bytes). This number is fixed at compile time.
Then between 0x3FF7AA244C and 0x3FF7AA2478 it loads the address of the NEXT function into register t0 and jumps to that address. Again, this address is known and fixed at compile time.
So the obvious solution is to write both numbers into memory as 64-bit constants at compile time and then read them back, e.g. (I haven't tested this code; I've just written it as I've gone along):
auipc t0, 0       # t0 = address of this instruction (RISC-V has no readable pc register, so auipc stands in for mv t0, pc)
addi t0, t0, 32   # point t0 at the data past the code and alignment buffer
ld t1, 0(t0)      # load the constant's value
addi sp, sp, -8
sd t1, 0(sp)      # push it on the stack
ld t2, 8(t0)      # load the address of NEXT
jr t2             # jump to NEXT
<alignment-buffer-4-bytes>
CONSTANT-NUMBER
ADDRESS-OF-NEXT
If that works it would replace 25 instructions with just seven; the next task is to see whether it does!
If it does I will be a little sad – I worked quite hard to get that position/number independent code to work. But it also won’t be the first time on this project that I have found a simpler approach gives better and faster code.
*If* it works of course.
(I recommend this book for RISC-V assembly: The RISC-V Reader)