Adding dynamic linking to Riscyforth


This is one of those blogs one writes to clarify inchoate thoughts, so it’s sorry-not-sorry if it starts to wander around…

Last week I got to an important point with Riscyforth, my Forth for RISC-V single board computers: I tagged my first release (v0.1).

I did that because I had, finally, managed to cover all of the core and extended core (core extension) word sets of the Forth 2012 standard. Essentially that means something written using words from the core of the language should compile and run (and give correct results).

But merely covering the core definitions doesn’t deliver a system that many real-world users will want to play with except out of curiosity. No floating-point maths, no control over screen output beyond, essentially, printing line by line (though there are a few additions here, like coloured output) and certainly no graphics.

But I also don’t want to build an ever-larger monolithic system. Users should be able to pick and choose which parts of the extended language they get (as well, of course, being free to extend it in Forth itself).

That is going to require some form of dynamic loading and linking. Whilst all the tools are now available for anyone (with a screw loose) to write their own extended floating-point words using the integer primitives, it’s much more sensible to provide words (written in assembly) that call the library code provided by the operating system.

My initial thought was that there should be a riscyforth.conf file in the Riscyforth distribution listing which modules are loaded on startup. But I now think a more flexible way is to provide a new Forth word – LOADMODULE ("<name>" -- flag), say – that would load a named module. This word could be used at any time, but if we put it in a short Forth file/program – eg startup.fth – then Riscyforth could be made to execute that at start-up, and users would edit the file to control what they start with.
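A startup.fth might then look something like this – the module names, and the assumption that the flag returned is true on success, are mine and purely illustrative:

\ Hypothetical startup.fth – module names here are placeholders
LOADMODULE float   0= IF ." Could not load float module " CR THEN
LOADMODULE curses  0= IF ." Could not load curses module " CR THEN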

But how to implement it? The best way seems to be to supply object files (along with the source – this is a GPL-licensed project) and have mmap load them into memory. My code already uses mmap to create the space where new words can be written, so the theory seems sound enough, and I have found this on the web which says it can be done. It might be a bit of a challenge – but then again, it wouldn’t be any fun if it wasn’t.

Getting the POSTPONE word to work in Forth

RISCYFORTH in action

I have struggled with this issue for several weeks now and I didn’t find the explanations in Starting Forth or the 2012 Forth standard or even the GForth documentation particularly enlightening, so I wanted to write this to – hopefully – add some clarity to anyone else trying to get this right in their own Forth implementation.

The standard says this of POSTPONE:

Compilation:

( “<spaces>name” -- )

Skip leading space delimiters. Parse name delimited by a space. Find name. Append the compilation semantics of name to the current definition. An ambiguous condition exists if name is not found.

But what does that actually mean? I think there are three rules you need to apply (of which more later), but first you need to think about where they are applied. Typically you might have something like this:

: ENDIF POSTPONE THEN ; IMMEDIATE

: EXAMPLE IF ." Say something" ELSE ." Say something different" ENDIF ;

ENDIF here replaces the modern standard THEN (some older Forth code may contain the ENDIF word, and our ENDIF should be a perfect replacement for it).

Now, we have defined ENDIF in its own word but we want it to act inside the word EXAMPLE – in this case we want it to ensure that what we might call the contractual obligations of THEN are met (here, telling the ELSE clause where the code following the ‘true’ path should jump to when it hits ELSE).

Now, defining our ENDIF word as IMMEDIATE ensures it is activated when we are compiling EXAMPLE, but if all it did was execute THEN at that point it would do literally nothing (in Riscyforth, at any rate, THEN is a NOP at execution time).

What we actually want to see happen is that the compilation actions of THEN are activated and that their effects are seen in EXAMPLE.

In this case that means calculating and inserting the jump-to address for the ELSE. Exactly what a THEN would do if it were there in EXAMPLE.

(The badging of ENDIF as IMMEDIATE then gives us a second boost – because that labelling means it doesn’t get called when EXAMPLE is executed, so we don’t have to worry about it trying to compile in those actions every time EXAMPLE is called.)
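Put another way, once ENDIF has done its compile-time work, EXAMPLE ends up containing the same threaded code as if THEN had been written directly – a sketch of the intended result rather than actual Riscyforth output:

\ What EXAMPLE effectively compiles to once ENDIF has run:
: EXAMPLE IF ." Say something" ELSE ." Say something different" THEN ;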

In Riscyforth there are a number of words – currently LITERAL, DOES>, ABORT", .", S", , ['], COMPILE,, IS, TO, [CHAR], ACTION-OF, AGAIN, UNTIL, DO, ?DO, LOOP, +LOOP, -LOOP, UNLOOP, LEAVE, RECURSE, I, J, EXIT, IF, ELSE, THEN, OF, WHILE, REPEAT and ; (though others will probably need to be added as the code matures) – that need special handling on compilation, and that has to be reflected when they are POSTPONE’d: those special-handling routines are compiled into the colon word being worked on. For the other, “ordinary”, words what has to happen is that a reference to them is what gets compiled in. Here’s another example:

: -IF POSTPONE 0= POSTPONE IF ; IMMEDIATE

In this case we’ve defined a new word -IF which works in the opposite sense (Boolean-wise) to the standard IF, ie what was FALSE for IF is treated as TRUE by -IF and so on. The key here is that in, for example, this definition:

: WASFALSE -IF ." Result was FALSE " ELSE ." Result was TRUE " THEN ;

the 0= gets compiled into the definition of WASFALSE and not left to languish inside -IF. Because – again – -IF, being marked as IMMEDIATE, never gets called after its initial invocation at compile time: what it leaves behind is what gets called.
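Once -IF has run at compile time, then, the net effect is as if WASFALSE had been written like this (again a sketch of the compiled result, not actual Riscyforth output):

\ The threaded code WASFALSE ends up with after -IF's compile-time work:
: WASFALSE 0= IF ." Result was FALSE " ELSE ." Result was TRUE " THEN ;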

So – those three rules:

  1. If a word is marked as IMMEDIATE then POSTPONE strips it of that characteristic and it becomes JAW – just another word. Such words are not compiled into the new definition; they are simply executed when they are reached (that is, when the word containing the POSTPONE runs).
  2. If a word requires special handling on compilation, that special handling should be repeated when it is POSTPONE’d.
  3. All POSTPONE’d words that are not immediate should be compiled into the current colon definition being worked on (that phrase “the current definition” from the standard is extremely confusing).
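A minimal sketch of rule 3 at the interpreter prompt may help (standard Forth, nothing Riscyforth-specific; GREET, COMPILE-GREET and TEST are made-up names):

: GREET ." hello " ;                        \ an ordinary, non-immediate word
: COMPILE-GREET POSTPONE GREET ; IMMEDIATE  \ appends "compile a reference to GREET" to itself
: TEST COMPILE-GREET ;                      \ rule 3: GREET is compiled into TEST here
TEST                                        \ prints "hello " when TEST runs

Because COMPILE-GREET is immediate, it runs while TEST is being compiled, and what it leaves behind inside TEST is a reference to GREET.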





More on the Riscyforth versus Gforth shootout


This is a further post about optimising Riscyforth, my Forth language for RISC-V single board computers.

Riscyforth is a 64-bit Forth which runs on top of Linux – my test system is the Sipeed Nezha SBC – which runs a 1 GHz single core CPU (so 1 cycle is one nano-second). I’m testing it against GForth, the GNU Project’s Forth.

Whilst Riscyforth is written in RISC-V assembly, GForth is mainly Forth on top of C. For those familiar with Forth techniques – Riscyforth is a very traditional indirect threaded implementation, GForth uses a mixture of this and (faster) direct threaded code.

GForth here also defaults to 32-bit numbers as its cell size, while Riscyforth uses 64-bit numbers throughout.

GForth is a much more sophisticated (and portable) environment than Riscyforth – and certainly my suggestion/intention has never been to claim I will write a better Forth than GForth – the GForth effort goes back to the 1990s, Riscyforth goes back to 28 December 2020!

The two key posts before this are:

The quick summary of those two posts is that I made significant timing savings by replacing correct but long-winded compile-time-generated code – which loaded registers with values fixed at compile time via a sequence of adds and bit shifts – with code that simply loads those values from a known memory location. However, those optimisations were not enough to close the performance gap with the GNU Project’s GForth, which runs matrix.fth about 12% faster.

My code started out at taking about 9.5 seconds and I have got that down to about 8.5 seconds, whilst GForth is taking about 7.5 seconds.

Not satisfied with leaving it there I wrote some code to count cycles taken in executing Forth words or groups of words, and looked closer at what might be fixable.

		CODEHEADER DEBUGIN, Q, 0x0
		#(--)
		#debug word
		la t0, CYCLESTART
		rdcycle t1
		sd t1, 0(t0)
		tail NEXT
		
		CODEHEADER DEBUGOUT, DEBUGIN, 0x0
		#(--)
		#debug word
		rdcycle t0
		la t1, CYCLESTART
		ld t2, 0(t1)
		bgt t0, t2, debugout_normal
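		# current reading is not greater than the saved start value:
		# assume the cycle counter wrapped between DEBUGIN and DEBUGOUT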
		li t3, -1
		sub t4, t3, t2
		add t4, t4, t0
		j debugout_sum
  debugout_normal:
		sub t4, t0, t2
  debugout_sum:
		la t5, CYCLECOUNT
		ld t0, 0(t5)
		add t1, t0, t4
		sd t1, 0(t5)
		la t6, CYCLEPINGS
		ld t2, 0(t6)
		addi t2, t2, 1
		sd t2, 0(t6)
		tail NEXT

		CODEHEADER DEBUGRESULTS, DEBUGOUT, 0x0
		#(-- n n n)
		la t0, CYCLECOUNT
		la t1, CYCLEPINGS
		ld t2, 0(t0)
		ld t3, 0(t1)
		div t4, t2, t3
		addi sp, sp, -24
		sd t2, 0(sp)
		sd t3, 8(sp)
		sd t4, 16(sp)
		sd zero, 0(t0)
		sd zero, 0(t1)
		la t2, CYCLESTART
		sd zero, 0(t2)
		tail NEXT

DEBUGIN records the cycle count at the start of some block of code, whilst DEBUGOUT calculates the time since the previous call to DEBUGIN, adds that to a total of elapsed times and increments a counter. DEBUGRESULTS then places, on the stack, the average time per IN – OUT cycle, the total number of samples and the total elapsed cycles.
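At the Forth level the intended usage is simply to bracket whatever you want to measure and then read the counters back – something like this (the wrapper word TIMED-INNER is illustrative, not part of the benchmark itself):

\ Bracket a word with the instrumentation, then inspect the results:
: TIMED-INNER ( a b -- n ) DEBUGIN innerproduct DEBUGOUT ;
\ ...after a run with TIMED-INNER in place of innerproduct:
DEBUGRESULTS . . .   \ prints total cycles, sample count and the average, top of stack first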

Looking at the Forth code, the main word is what we call to run the program:

It initialises two 300 x 300 matrices (each initialisation taking about 53 million cycles) and then executes a loop and a nested loop: each loop runs 300 times, so the inner loop is executed 90000 times. The outer loop is recorded as taking just over 26 million cycles per execution whilst each inner loop takes about 87000 cycles per iteration.

The code to measure the cycles has an effect on the measurement so we can only be approximate here – eg while the inner loop is measured as taking 87000 cycles per iteration, each execution of innerproduct is measured as requiring around 88000 cycles (ie, longer than the code that calls it – an obvious contradiction). The test code itself is measured as taking about 15 cycles per execution – not much in itself, but rising to 409 million total cycles when called 27 million times inside innerproduct.

: innerproduct ( a[row][*] b[*][column] -- int)
  0 row-size 0 do
    >r over @ over @ * r> + >r
    swap cell+ swap row-byte-size +
    r>
  loop
  >r 2drop r>
;

In any case it is clear that innerproduct is where the real work is done:

Inside innerproduct lurks another loop, again called 300 times, meaning that the code inside this loop gets called 300 x 300 x 300 – 27 million – times: the graphic above shows the approximate (having accounted for the cost of calling the instrumentation code) total cycle count for each line of that loop.

Right at the start of this (see the first post linked above), the big cycle burner was the constant code: row-byte-size here, so it’s good that no longer seems to be the case.

It’s hard to be precise about this: a test of the row-size constant in main suggested, after accounting for the cost of the instrumentation code, that it took about 269 cycles per execution, but that looks to be a significant over-estimate – if row-byte-size took the same time as the other constant (as it should), then 27 million executions of it alone would take about 7 billion cycles. A better guide seems to be that the nine words of the first line take a little less than twice the cycles/time of the five words of the second line – suggesting that each word takes, on average, a roughly similar time.

The final line in innerproduct is not inside the innermost loop and so is only called 90,000 times and takes, roughly, 8 million cycles.

It’s not possible to measure the cost of the loop control code using my simple instrumentation (as execution is not linear – all the work after the loop initialisation is done at the loop word) but as the total execution time of innerproduct (as measured inside innerproduct itself) is estimated at 7.6 billion cycles it seems unlikely the do ... loop code is particularly expensive.

This is all good news at one level: there appear to be no bits of code here chewing up cycles outrageously (each of the 14 words of the first two lines take roughly 19 cycles to execute, including the time taken to move from one word to another via the Forth threading).

But at another level it represents a bit of a dead end for efforts to optimise the code. All the words being executed here are pared down to the minimum and there is no room to squeeze anything out of them.

I covered that before in the initial posts but here’s a quick reminder/example:

		CODEHEADERZ TOR, >R, TOR2, 0x01 #>R
		POP t0
		addi s9, s9, -STACKOFFSET
		sd t0, 0(s9)
		tail NEXT

		CODEHEADERZ RFROM, R>, TOR, 0x01 #R>
		ld t0, 0(s9)
		PUSH t0
		addi s9, s9, STACKOFFSET
		tail NEXT

The >R and R> words are called repeatedly in the loop. They use the PUSH and POP macros, which push a 64-bit register value onto the data stack or pop one off it:

.macro PUSH register
        addi sp, sp, -8
        sd  \register, 0(sp)
.endm

.macro POP register
        ld \register, 0(sp)
        addi sp, sp, 8
.endm

The s9 register is used to manage the return stack (ie as a stack pointer for that stack) – so you can see these words just transfer a 64-bit value between two different stacks, and there is just nothing that can be done to save cycles here (or if there is, I’d very much welcome being told about it).
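At the Forth level all the two words do is shuttle a single cell between those stacks – for instance (a small illustration, not a Riscyforth built-in; THIRD is a made-up name):

\ Copy the third item on the data stack to the top, using the return stack as scratch space:
: THIRD ( a b c -- a b c a ) >R OVER R> SWAP ;
1 2 3 THIRD . . . .   \ prints 1 3 2 1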

So, it seems, GForth is the winner here – a Forth based on optimised C is not just portable but would appear to be faster too.

But, actually, I have some counter evidence.

The Forth program factoriel.f calculates the digits of factorials. In Riscyforth I can calculate the factorial of 10000 (ie, 10000!) in about 26 seconds:

But GForth cannot manage it at all – though if I double the number of cells allocated in GForth (which gives both systems the same amount of memory, given the difference between 64 and 32 bits) it will work, taking around 36 seconds.

GForth will calculate the factorial of smaller numbers faster than Riscyforth though: if we try 3200! (close to the limit GForth can manage with the original settings for cell numbers) then GForth takes about 7 seconds and Riscyforth takes about 8 seconds.

Taken together these results suggest to me that at least there is nothing broken in Riscyforth. It’s about as fast as it could be with 64 bit arithmetic and the indirect threaded model.

Forth code optimisation revisited


I have spent much of the last week working on optimising my Forth code, or more accurately, optimising the RISC-V assembly which actually does the work of executing the Forth words. (See this for the initial discussion.)

I cannot properly profile the assembly itself – my efforts to compile the code using gcc -pg aren’t working because of linker issues (I’ll need to investigate whether they can be made to work) – so I am stuck with “poor man’s profiling”: that is to say, just interrupting the code over repeated runs, noting where it is and building up a sample from that.

So with 90 samples from the completely unoptimised code the share of hit count looked like this:

Unoptimised hit count

We can see that “RUN” and “NEXT” (which aren’t actual Forth words but assembly labels) dominate, but – as I discussed here – there really isn’t any room to optimise them: they are about as lean as we can get already.

What I did do was target my efforts on the code that handles constants (here marked as “ROW-BYTE-SIZE” and “ROW-SIZE”). The constant-handling code used two generic routines to ensure that fixed-sized blocks of code placed a fixed value on the stack and then passed control to “NEXT”.

The code was long-winded because it needed to handle any values without being tripped up by sign extension. For the jump code a number had to be loaded into a register and then a jump to the value of the register executed. The code was again designed to handle any value for the jump-to address – but that meant that, while NEXT was always at a 16 bit address, code to handle a full 64 bit address was being executed.

The simple thing to do was to place the needed values (both the value of the constant and the address of NEXT) into memory at a known address at compile time and then just load those addresses at run time.

And, in fact, this method could be used in some other places where I had relied on my hand crafted “build a 64 bit number through register adds” code – and it saved a lot of space and instructions.

So now (122 samples this time), the (crude, remember) profile count looks like this:

We can see that “NEXT” and “RUN” (NEXT is called by every compiled word at the end of execution and then passes control to RUN) are maybe even more dominant, and ROW-BYTE-SIZE (which is called 27,000,000 times) has shrunk dramatically (and ROW-SIZE has gone completely).

The optimisation really does speed the code up too – execution time falls from about 9.5 seconds to about 8.5 seconds with all the various tweaks I’ve added.

But still the (32 bit) GForth version runs around 1 second faster than the (64 bit) Riscyforth version. Now, I didn’t start this project (Riscyforth that is) with the aim of writing the fastest-ever Forth, but a 13% slowdown does feel like a lot even if you are getting double the bits for it: so where can I go?

Some code is called a lot (eg., @/FETCH which has a body of just four instructions) but it is already (afaics) fully optimised:

		CODEHEADERZ FETCH, @,  BASE, 0x01
		ld t0, 0(sp)
		ld t1, 0(t0)
		sd t1, 0(sp)
		tail NEXT

PLUS/+ is bigger but it too looks as thin as it can be to me (loads the top two numbers off the stack, adds them, stores back on the stack and shrinks the stack to match the loss of one value):

                CODEHEADERZ PLUS, +, DOT, 0x01
		ld a0, 0(sp)
		ld a1, 8(sp)
                add t0, a0, a1
		sd t0, 8(sp)
		addi sp, sp, 8
                tail NEXT

In fact everything looks pretty lean with the possible exception of the LOOP/+LOOP code, which share common execution paths:

		CODEHEADER _PLUSLOOP, _UNLOOP, 0x0
		ld t3, 0(sp)
		addi sp, sp, 8
		ld t0, 0(s10)
		add t0, t0, t3
		bgt t3, zero, loop_plusloop_entry
		# decrement on this path
		ld t1, 8(s10)
		ld t2, 16(s10)
		bge t1, t0, loop_over		#reverse check
		sd t0, 0(s10)
		mv s7, t2
		tail NEXT

		CODEHEADER _LOOP, _PLUSLOOP, 0x0
		ld t0, 0(s10)			#current value
		addi t0, t0, 1
  loop_plusloop_entry:
		ld t1, 8(s10)			#limit
		bge t0, t1, loop_over
		ld s7, 16(s10)			#return point
		sd t0, 0(s10)
		tail NEXT
  loop_over:
		addi s10, s10, 24
		tail NEXT

I have implemented +LOOP in a non-standard way, assuming that a negative increment implies a countdown. That should probably go (and instead I should add a non-standard -LOOP word in the same way GForth does), but at best that is going to save one comparison and one jump (though cache-miss effects may make the jump more costly than it looks on the surface – there is a decent amount of stuff in the function header which might cause some cache misses).
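For reference, standard +LOOP with a negative increment terminates when the index crosses the boundary between the limit minus one and the limit, so on a standard system a countdown looks like this (a standard-Forth sketch, not what Riscyforth currently does; STD-COUNTDOWN is a made-up name):

\ Standard-Forth countdown written with +LOOP and a negative increment:
: STD-COUNTDOWN 0 10 DO I . -1 +LOOP ;
STD-COUNTDOWN   \ prints 10 9 8 7 6 5 4 3 2 1 0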

But, in general, it looks as though it will be hard to squeeze much more out of this.

Won’t hurt to try I suppose.

The actual optimised code


I discovered when I tried to write my optimised code from last night that there was a problem with it – the various register add instructions use a 5 bit number to number the registers being accessed – allowing access to registers 0 to 31.

But the program counter is not one of those 32 general-purpose registers, so it just can’t be used in these instructions. After a bit of head scratching I came up with a solution:

auipc t0, 0
ld t1, 24(t0)
addi sp, sp, -8
sd t1, 0(sp)
ld t1, 32(t0)
jr t1

That auipc t0, 0 adds 0 to the value of the pc and stores the result in t0, so it is the equivalent of the mv t0, pc I wanted – and then by just taking an offset straight from t0 I also saved another instruction.

Does it work? Yes, it compiles and runs.

But does it work by saving time? Yes – knocks about half a second off the benchmark execution time: roughly a 6% saving, which isn’t bad. But my code is still rather adrift of GForth, so I need to look deeper.

Optimising my Forth code


A while ago I ran some very simple pieces of Forth – essentially nothing more than empty loops – and was pleased to see that Riscyforth – my assembly-based Forth for RISC-V ran about 10% faster than GForth – the GNU project’s C-based Forth.

But a more serious benchmark – an O(n**3) matrix multiplication program (see below) – tells a different story.

\ .( Loading Matrix Multiplication benchmark...) cr
\ NOTE: This version needs 0.5MB data space

\ A classical benchmark of an O(n**3) algorithm; Matrix Multiplication
\
\ Part of the programs gathered by John Hennessy for the MIPS
\ RISC project at Stanford. Translated to forth by  Marty Fraeman,
\ Johns Hopkins University/Applied Physics Laboratory.

\ MM forth2c doesn't have it !
: mybounds  over + swap ;

variable seed
\ : cell 8 ;

: initiate-seed ( -- )  74755 seed ! ;
: random  ( -- n )  seed @ 1309 * 13849 + 65535 and dup seed ! ;

300 constant row-size
row-size cells constant row-byte-size

row-size row-size * constant mat-size
mat-size cells constant mat-byte-size

align create ima mat-byte-size allot
align create imb mat-byte-size allot
align create imr mat-byte-size allot

: initiate-matrix ( m[row-size][row-size] -- )
  mat-byte-size mybounds do
    random dup 120 / 120 * - 60 - i !
  cell +loop
;

: innerproduct ( a[row][*] b[*][column] -- int)
  0 row-size 0 do
    >r over @ over @ * r> + >r
    swap cell+ swap row-byte-size +
    r>
  loop
  >r 2drop r>
;

: main  ( -- )
  initiate-seed
  ima initiate-matrix
  imb initiate-matrix 
  imr ima mat-byte-size mybounds do
    imb row-byte-size mybounds do
      j i innerproduct over ! cell+ 
    cell +loop
  row-size cells +loop
  drop
;

This code runs somewhat more than 10% more slowly on Riscyforth than on GForth.

So how do I optimise it?

Unfortunately using a profiling tool isn’t an option as, as far as I am aware, there isn’t one. Instead I have to run the code and just hit break a few times and see where it stops.

This is what this gives me:

(Here I have used ‘CONSTANT’ for the ‘ROW-BYTE-SIZE’ constant for ease of presentation.)

NEXT and RUN are the obvious candidates, you might think, but they are already thin offerings:

NEXT:
                ld s8, 0(s7)                        #word address register takes content of next secondary
                addi s7, s7, ADDRWIDTH              #next secondary along
  
  RUN:
                ld t0, 0(s8)                        #extract first instruction address of primitive
                addi s8, s8, ADDRWIDTH              #increment WA
                jalr zero, t0, 0                    #run the code    

Essentially I don’t see what I can shave off there – there are just five instructions between them and no bloat.

I did find a small change in LOOP:

Here I was able to shave a single instruction off – but that isn’t going to make a big difference. But the other functions mentioned above seem lean otherwise.

The real task is to optimise the CONSTANT code. The compilation process generates the constant code on the fly, as each constant needs its own code to place its value on the stack and then return to the normal stream of execution:

This is the code for ROW-BYTE-SIZE:

   0x0000003ff7aa2418:	li	t0,0
   0x0000003ff7aa241c:	slli	t0,t0,0x20
   0x0000003ff7aa2420:	lui	t1,0x0
   0x0000003ff7aa2424:	slli	t1,t1,0x20
   0x0000003ff7aa2428:	srli	t1,t1,0x20
   0x0000003ff7aa242c:	li	t2,1
   0x0000003ff7aa2430:	slli	t2,t2,0xb
   0x0000003ff7aa2434:	add	t0,t0,t2
   0x0000003ff7aa2438:	li	t2,352
   0x0000003ff7aa243c:	add	t0,t0,t2
   0x0000003ff7aa2440:	add	t0,t0,t1
   0x0000003ff7aa2444:	addi	sp,sp,-8
   0x0000003ff7aa2448:	sd	t0,0(sp)
   0x0000003ff7aa244c:	li	t0,0
   0x0000003ff7aa2450:	slli	t0,t0,0x20
   0x0000003ff7aa2454:	lui	t1,0x10
   0x0000003ff7aa2458:	slli	t1,t1,0x20
   0x0000003ff7aa245c:	srli	t1,t1,0x20
   0x0000003ff7aa2460:	li	t2,1
   0x0000003ff7aa2464:	slli	t2,t2,0xb
   0x0000003ff7aa2468:	add	t0,t0,t2
   0x0000003ff7aa246c:	li	t2,236
   0x0000003ff7aa2470:	add	t0,t0,t2
   0x0000003ff7aa2474:	add	t0,t0,t1
   0x0000003ff7aa2478:	jr	t0

It’s long (96 bytes is long compared to most Forth primitives) because it’s designed to ensure it will work with all addresses and numbers: so, eg at 0x3FF7AA2424 and 0x3FF7AA2428, values are shifted 32 bits (0x20) to the left and then 32 bits to the right to avoid sign-extension issues.

But there are ways to optimise this – the code is basically in two lumps: between 0x3FF7AA2418 and 0x3FF7AA2448 it builds a number in registers and stores it on the stack, and this number is fixed at compile time.

Then between 0x3FF7AA244C and 0x3FF7AA2478 it first loads the address of the NEXT function into register t0 then jumps to that address. Again this address is known at compile time and is fixed.

So the obvious solution is to write both numbers as 64-bit constants into memory at compile time and then read them back at run time, eg (I haven’t tested this code – just written it as I’ve gone along here):

mv t0, pc
addi t0, t0, 32
ld t1, 0(t0)
addi sp, sp, -8
sd t1, 0(sp)
ld t2, 8(t0)
jr t2
<alignment-buffer-4-bytes>
CONSTANT-NUMBER
ADDRESS-OF-NEXT

If that works it would replace 25 instructions with just 8 – the next task is to see if it does!

If it does I will be a little sad – I worked quite hard to get that position/number independent code to work. But it also won’t be the first time on this project that I have found a simpler approach gives better and faster code.

*If* it works of course.

(I recommend this book for RISC-V assembly: The RISC-V Reader)

Riscyforth: an update


Last Christmas I started work on a Forth for RISC-V single board computers (SBCs): Riscyforth.

Riscyforth is written in RISC-V assembly, with the aim of providing a good balance between ease of use and speed: the language can be run in an interpreted “immediate” mode or compiled if speed is the priority.

Forth is a stack based language – commands operate on what’s on top of the stack, hence:

2 2 +

Computes 2 + 2 and so on. It can sometimes make the code hard to read, but it’s also a powerful and extensible language – very much out of fashion these days but also very hacker (as in coder/experimenter) friendly.
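For a slightly longer taste of the notation (standard Forth, nothing Riscyforth-specific; SQUARED is just an example name):

\ Define a new word out of existing ones, then use it:
: SQUARED ( n -- n*n ) DUP * ;
5 SQUARED .   \ prints 25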

At the time I began, no RISC-V SBCs were available, but they were expected. Since then, one has turned up – the Nezha – while the one we were all expecting, the Beagle V, was cancelled; even so, things are looking better for 2022.

Progress on Riscyforth has been slower than I expected – certainly slower than the book on “Threaded Interpretive Languages” (my initial inspiration) led me to expect – but the delays in hardware have also given me more time.

So how does it look?

Well, it’s now pretty mature – I have implemented a substantial part of the Forth 2012 core specification (and some other extensions too), but there is still a lot of work to be done:

  • Output is still pretty basic – you get lines of text and there is no support for ncurses-type output, never mind graphics.
  • Some basic structures are still missing (such as the Forth equivalent of switch/case) – but I have recently finished a restructuring of the code/a pay-down of technical debt that makes all this much easier to implement.
  • Despite the initial idea of writing this to give developers faster access to the hardware, all the work has gone into the core language and no hardware specific extensions have yet been written.

But it works, and works well. A couple of crude tests show it is doing basic looping about 10 – 20% faster than GForth, the GNU Project’s Forth compiler (which works well for RISC-V) – though you are getting a lot else besides with GForth.

If you are at all interested your comments and – perhaps – participation in the project would be very welcome.

Riscyforth: the work goes on


So after my last post somebody gave me a “very poor” vote – I am consoling myself with the idea that this was because they are really very keen for an assembly-based Forth to run on their RISC-V SBC and not because they thought it was an incoherent ramble.

In any case I have continued to work on Riscyforth – adding to its functionality and edging closer all the time to matching the Forth standard – getting close enough now to describe it as “Forth” and not just “Forth-like”.

Eight months ago, when I started – having read an online copy of “Threaded Interpretive Languages”, which promised you could write your own TIL in a few weeks – I never considered using anything other than assembly to write it, just because I’m so old I thought that was the whole point of Forth: an interpreted language, but written in machine code for speed.

I’ve learnt a lot on the way, including not to believe you can get from scratch to a working language in a few weeks – especially not in the world of virtual memory. Though I dare say, in the extremely unlikely event of someone commissioning me to write them a Forth from the ground up, I wouldn’t take eight months to get this far next time.

One thing I have definitely learnt is the power and importance of unit testing – it matters quite a lot when you are putting something together on your own, because you can easily make one assumption too many, and a decent unit test soon clears up that misapprehension.

What I haven’t got includes any users (never mind co-developers) or any useful or interesting programs.

On the former, RISC-V single board computers (SBCs) were promised to be widely available just about now, but the cancelling of the Beagle V project means that the more expensive Nezha is all that’s out there for now (I have one and it’s what is running in the image above). Compared to the ultra-cheap ARM boards (or even the high-end examples like some of the Raspberry Pi SBC kits) it’s fair to say RISC-V is not yet at the races, and the only users are likely to be people like me – hoping and waiting for the revolution to get here.

I picked Forth because the hope was it would give the first generation of RISC-V SBC hackers an easy way to get at their hardware and that is still the case, but there has to be some hardware to get at.

So I still have the sense I am working away on something that nobody else might ever want to use (the fact that it’s written in assembly also makes it fundamentally non-portable – I just cannot say ‘oh well let’s switch to the Raspberry Pi’!).

But if you are interested, current features of Riscyforth include:

  • Support for deeply nested loops and conditionals
  • Write your own words using : and ; (ie it is easy to extend the language by writing secondary words out of the existing primary words)
  • A custom memory allocator for arbitrary blocks of memory
  • Support for VARIABLE and CONSTANT
  • All (or almost all?) the stack operations you would expect
  • Hex, octal, decimal and binary support

The main thing currently missing is support for CREATE and a data space, but now I’ve grappled with the memory allocator (for general memory allocation) I will move on to that – the sketch below shows the sort of standard usage it will enable. After that comes the listing of my test file, which might give you some idea of what’s there, though even that is now behind where the code base is (yes, I know this isn’t how test-driven development is meant to work!)
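For reference (standard Forth usage, not yet supported in Riscyforth at the time of writing; BUFFER is just an example name):

\ Standard CREATE/ALLOT usage – the target behaviour once a data space exists:
CREATE BUFFER 10 CELLS ALLOT   \ name a 10-cell region of data space
42 BUFFER !                    \ store into the first cell
BUFFER @ .                     \ prints 42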

\ Unit tests


\ Stack operations tests

: testOVER2
." Testing OVER2 " 5 SPACES
10 30 50 90 70 OVER2
50 = SWAP 30 = AND IF ." OVER2 passed " else ." OVER2 FAILED " then cr
;

: testDrop2
." Testing DROP2 " 5 spaces
99 2 3 4
drop2
2 = if ." DROP2 passed " else ." DROP2 FAILED " then cr
;

: testSquare
." Testing SQUARE " 5 spaces
7 square
49 = if ." SQUARE passed " else ." SQUARE FAILED " then cr
;

: testCube
." Testing CUBE " 5 spaces
5 cube
125 = if ." CUBE passed " else ." CUBE FAILED " then 5 spaces
3 cube
28 = if ." CUBE FAILED " else ." CUBE passed " then cr
;

: testNIP2
." Testing NIP2 " 5 spaces
10 20 30 40 50 NIP2
50 = SWAP 40 = AND SWAP 10 = AND if ." NIP2 passed " else ." NIP2 FAILED " then cr 
;

: testDUP2
." Testing DUP2 " 5 SPACES
900 800 700 DUP2
700 = SWAP 800 = AND SWAP 700 = AND SWAP 800 = AND if ." DUP2 passed" else ." DUP2 FAILED " then cr
;

: testtuck2
." Testing TUCK2 " 5 spaces
1 2 3 4 5 6 TUCK2
6 = swap 5 = and swap 4 = and swap 3 = and swap 6 = and swap 5 = and if ." TUCK2 passed" else ." TUCK2 FAILED " then cr
;

: testswap2
." Testing SWAP2 " 5 spaces
1 2 3 4 5 6 7 8 SWAP2
6 = swap 5 = and swap 8 = and if ." SWAP2 passed " else ." SWAP2 FAILED " then cr
;

: testrot2
." Testing ROT2 " 5 spaces
99 2 3 4 5 6 ROT2
2 = swap 99 = and swap 6 = and if ." ROT2 passed " else ." ROT2 FAILED " then cr
;

: TESTDUP
." Testing DUP " 5 spaces
34 45 DUP
45 = SWAP 45 = AND swap 34 = AND IF ." DUP passed " else ." DUP FAILED " then cr ;

: TESTBL
." Testing BL" 5 spaces BL 32 = if ." BL passed " else ." BL FAILED " then cr ;

: TESTDEPTH
." Testing DEPTH" 5 spaces
depth dup 0 < IF ." WARNING: Stack in negative territory " THEN 10 20 rot depth swap - 3 = IF ." DEPTH passed " ELSE ." DEPTH FAILED " then cr ;

: TESTINVERT
." Testing INVERT " 5 spaces
-1 INVERT IF ." INVERT FAILED " ELSE hex 0xF0F0F0F00F0F0F35 INVERT 0xF0F0F0FF0F0F0CA = IF ." INVERT passed " ELSE ." INVERT FAILED " then then decimal cr ;

\ Basic tests

: VERIFYTYPEPROMPT
." Verifying TYPEPROMPT " cr
TYPEPROMPT cr
;

: VERIFYGETNEXTLINE_IMM
." Verifying GETNEXTLINE_IMM - please press RETURN only " cr
GETNEXTLINE_IMM cr
;

: VERIFYOK
." Verifying OK " cr
OK cr
;

: VERIFYTOKENIZE_IMM
." Verifying TOKENIZE_IMM " cr
TOKENIZE_IMM
;

: VERIFYSEARCH
." Verifying SEARCH " cr
SEARCH
;

: TESTHEX
." Testing HEX " 5 SPACES
HEX 0x10 0xFF + DUP
0x10F = IF ." HEX passed with output 0x10F = " . ELSE ." HEX FAILED with output 0x10F =  " . then cr
\ ensure other tests keep testdup2
DECIMAL
;

: TESTDECIMAL
." Testing DECIMAL " 5 SPACES
DECIMAL 20 DUP
20 = IF ." DECIMAL passed with output 20 = " DUP . ." = " HEX . DECIMAL ELSE ." DECIMAL FAILED with output 20 = " DUP . ." = " HEX . DECIMAL THEN CR
;

: TESTOCTAL 
." Testing OCTAL " 5 SPACES OCTAL 20 DUP DECIMAL 16 = IF ." OCTAL passed with output 20o = " DUP OCTAL . ." = " DECIMAL .
ELSE ." OCTAL FAILED with 20o = " DUP OCTAL . ." = " DECIMAL . THEN CR ;

: VERIFYBINARY 
." Verifying BINARY  - " 1 2 4 8 16 32 64 128 256 512
BINARY ." powers of 2 from 9 to 0 in binary... " cr
. cr . cr . cr . cr . cr . cr . cr . cr . cr . cr DECIMAL ;

: VERIFYDOT
." Verifying DOT " 5 spaces 0 1 2 3 4 5
." ... should see countdown from 5 to 0: " . . . . . . CR ;

: TESTADD
." Testing ADD " 5 SPACES 900 -899 +
IF ." ADD passed " ELSE ." ADD FAILED " THEN CR ;

: testMUL
." Testing MUL " 5 spaces
5 5 5 * *
5 cube
= if ." MUL passed " else ." MUL FAILED " then cr
;

: TESTDIV
." Testing DIV " 5 SPACES 99 11 / 101 11 / * 81 =
IF ." DIV passed " else ." DIV FAILED " then cr ;

: TESTSUB
." Testing SUB " 5 spaces 
75 22 - 53 = IF ." SUB passed " else ." SUB FAILED " then cr ;

: TESTPLUS1
." Testing 1+ " 5 SPACES
10 1+ 11 = IF ." 1+ passed " ELSE ." 1+ FAILED " THEN CR ;

: TESTPLUS2
." Testing 2+ " 5 SPACES
10 2+ 12 = IF ." 2+ passed " ELSE ." 2+ FAILED " THEN CR ;

: TESTMINUS1
." Testing 1- " 5 spaces
-1 1- -2 = IF ." 1- passed " ELSE ." 1- FAILED " THEN CR ;

: TESTMINUS2
." Testing 2- " 5 SPACES
10 2- 8 = IF ." 2- passed " ELSE ." 2- FAILED " THEN CR ;

: TESTUNDERPLUS
." Testing UNDERPLUS" 5 spaces
10 15 20 underplus 30 = if ." UNDERPLUS passed" else ." UNDERPLUS FAILED" then cr ;

: TESTMOD
." Testing MOD" 5 spaces
13 7 mod 6 = if ." MOD passed" else ." MOD FAILED" then cr ;

: TESTSLMOD
." Testing /MOD" 5 spaces
13 7 /mod 1 = swap 6 = and if ." /MOD passed " else ." /MOD FAILED" then cr ;

: TESTNEGATE
." Testing NEGATE" 5 spaces 13 negate -13 =
if ." NEGATE passed" else ." NEGATE FAILED" then cr ;

: TESTABS
." Testing ABS" 5 spaces -13 abs 13 =
if ." ABS passed" else ." ABS FAILED" then cr ;

: TESTMINMAX
." Testing MAX and MIN" 5 spaces
20 10 dup2 MAX 20 = if ." MAX passed and " else ." MAX FAILED and " then min 10 = if ." MIN passed." else ." MIN FAILED." then cr ;

: TESTSHIFTS
." Testing LSHIFT and RSHIFT" 5 spaces
10 4 lshift 160 = IF ." LSHIFT passed " ELSE ." LSHIFT FAILED " then 48 2 rshift 12 = if ." RSHIFT passed " ELSE ." RSHIFT FAILED " THEN cr ;


: VERIFYWORDLIST 
." Verifying WORDS .... " WORDS CR ;

: TESTLITERALNUMB 
." Testing LITERALNUMB .... " 213 213 = IF ." LITERALNUMB passed " ELSE ." LITERALNUMB FAILED " THEN CR ;

: TESTVARIABLE 
." Testing VARIABLE and VARIN (and @ and !)" 5 SPACES
VARIABLE OLDGEEZER 901 OLDGEEZER ! OLDGEEZER DUP @ 1+ SWAP ! OLDGEEZER @ 902 =
IF ." VARIABLE, VARIN, @ and ! passed " ELSE ." VARIABLE, VARIN, @ and ! FAILED " THEN CR ;

: TESTCONSTANT
." Testing CONSTANT " 5 SPACES
365 CONSTANT DAYS 7 CONSTANT WEEK DAYS WEEK / 52 = IF -3 CONSTANT NEGNUMB NEGNUMB WEEK + 4 = IF ." CONSTANT passed " ELSE ." CONSTANT FAILED " THEN CR ELSE ." CONSTANT has FAILED " THEN CR ;

: TESTTYPE 
." Verifying GETLINE, TYPE and TIB " CR ." Please enter some text to be echoed back. " CR
GETLINE CR ." Echoing... " TIB SWAP TYPE CR ;

: TESTCHAR
." Testing CHAR" 5 spaces
char Z 90 = IF char z 122 = IF ." CHAR passed " else ." CHAR FAILED " then else ." CHAR FAILED " THEN cr ;

: VERIFYSOURCE 
." Verifying SOURCE" 5 spaces
source type cr ;

\ Test if else then
: TESTCONDITIONALS 
." Testing IF ... ELSE ... THEN conditionals. " CR
1 if ." Simple IF passed " else ." Simple IF FAILED " then cr
0 1 if ." Testing nested IF... " if ." Nested IF FAILED " else ." Nested IF passed " then 5 5 * . then ." = 25 " cr
1 0 if ." Failed a final test of IF " else ." A final test of IF ... " if ." is passed " else ." is FAILED " then then cr ;

\ Stuff to test EXIT
: EXITTEST1
EXIT ." If you see this EXIT FAILED " CR ;
: EXITTEST2
VARIABLE EXITVAR 200 EXITVAR ! ;
: EXITTEST3
EXITVAR DUP @ 1+ SWAP ! EXITVAR DUP @ 1+ SWAP ! EXIT EXITVAR DUP @ 1+ SWAP ! ;

: TESTEXIT
." Testing EXIT " 5 SPACES EXITTEST1
EXITTEST2 EXITTEST3 EXITVAR @ 202 = IF ." EXIT passed " ELSE ." EXIT FAILED " THEN CR ;

\ Test return stack words

: TESTRSTACKBASICS
." Testing >R, R@ and R> along with RDROP" cr
34 35 36 >R >R >R R@ 34 = RDROP R@ 35 = AND RDROP R@ 36 = AND RDROP if ." >R, R@ and RDROP passed " else ." >R, R@ and RDROP FAILED" then cr
99 >R R> 99 = if ." R> passed " else ." R> FAILED " then cr ;

\ loop
: TESTBEGINEND
." Testing BEGIN ... END loop " 5 SPACES
32 BEGIN DUP EMIT 1+ DUP 127 > END ."  BEGIN ... END passed " CR ;

: TESTBEGINWHILE
." Testing BEGIN ... WHILE " 5 spaces
32 BEGIN DUP space hex . space decimal DUP 100 < IF DUP EMIT 1+ ELSE DUP 32 - EMIT 1+ WHILE DUP 110 = END ."  BEGIN ... WHILE passed " cr ;

: TESTDOLOOP
." Testing DO ... LOOP " 5 spaces
1 10 1 DO DUP 1+ LOOP 10 = IF ." DO ... LOOP passed" ELSE ." DO ... LOOP FAILED" THEN CR ;

: TESTPLUSLOOP
." Testing DO .... +LOOP" 5 SPACES
1 100  1 DO DUP 1+ 101 +LOOP 2 = IF ." DO ... +LOOP passed" ELSE ." DO .... +LOOP FAILED" THEN CR ;

: VERIFYIJ
." Verifying I and J in nested loops" CR
10 0 DO 10 0 DO ." ( "  J . ." , " I . ." ) " LOOP CR LOOP
." I and J verified" CR ; 

: VERIFYLEAVE
." Verifying LEAVE and UNLOOP " CR
10 0 DO 10 0 DO J I > J I = OR IF ." ( "  J . ." , " I . ." ) " ELSE UNLOOP LEAVE 3 0 DO LOOP 3 0 DO LOOP 3 0 DO LOOP THEN LOOP CR LOOP
." LEAVE and UNLOOP verified" CR ;


\ Testing memory functions
: ZZ ." ', EXECUTE and C! passed " ;

: TESTINGTICK 
." Testing ', EXECUTE and C! " 5 spaces
hex 0x58 decimal ' ZZ 24 + C! ' XZ execute cr
\ Change back or else subsequent tests will break
." Testing one more time " 5 spaces
hex 0x5A decimal ' xz 24 + C! ' zZ exeCUTE  cr ;

: testcfetch 
." Testing C@" 5 spaces
' XOR 24 + c@ 88 = if ." C@ passed " else ." C@ FAILED " then cr ;

\ Dummy words to use in MOVE test
: ZM * ;
: ZD / ;
: reup decimal 68 ' ZM 25 + C! ;

: TESTINGMOVE
." Testing MOVE " 5 spaces
10 10 ZM 100 = IF ' ZM 24 + ' ZD 24 + 24 move 100 2 ' ZM execute 50 = IF ." MOVE passed " else ." MOVE FAILED " then cr else ." Test failure " then reup ;

: TESTFETCH
." Testing @ (and BASE)" 5 spaces
octal base @ 10 = hex base @ 0x10 = AND decimal base @ 10 = AND if ." @ and BASE passed" ELSE ." @ and BASE FAILED" then cr ;

: TESTPLUSSTORE
." Testing +! " 5 SPACES ' ZM 24 + -1  SWAP +! 5 5 ' YM EXECUTE 25 = IF ." +! passed " ELSE ." +! FAILED " THEN 2 SPACES
' YM 24 + 1 SWAP ' +! execute 5 5 ' ZM EXECUTE 25 = INVERT IF ." +! address find FAILED " THEN  CR ;

: TESTPADFILLERASE
." Testing PAD, FILL and ERASE " 5 SPACES
PAD 10 35 FILL PAD 3 + 1 ERASE PAD 2 + C@ 35 = PAD 3 + C@ 0 = AND PAD 4 + C@ 35 = AND IF ." PAD, FILL and ERASE passed" ELSE ." PAD, FILL and ERASE FAILED" THEN CR ;

\ Test groupings

\ Memory tests
: TESTMEMORY
." Testing memory manipulation words" cr
TESTINGTICK testcfetch testingmove testchar testfetch testplusstore TESTPADFILLERASE
." Testing of memory code complete" cr ;

\ Test loops
: TESTLOOPS
." Running tests of looping " cr
TESTBEGINEND testbeginwhile TESTDOLOOP TESTPLUSLOOP VERIFYIJ VERIFYLEAVE
." Testing of loops complete" CR ;

\ Test Rstack
: RSTACKTESTS
." Testing return stack" cr
testrstackbasics 
." Testing return stack complete" cr ;

\ Test listwords
: LISTWORDSTESTS
." Running 'listwords' group of tests " CR
VERIFYWORDLIST TESTLITERALNUMB TESTVARIABLE TESTTYPE
VERIFYSOURCE TESTCONSTANT
." 'listwords' group of tests complete " CR ;

\ Test integer
: INTEGERTESTS
." Running integer tests " cr
TESTADD TESTMUL TESTDIV TESTSUB TESTPLUS1 TESTMINUS1
TESTminus2 testplus2 testunderplus testminmax testmod testslmod testabs testnegate testshifts
." Integer tests complete " CR
;

\ Group of stack operations tests
: STACKOPTESTS
." Running stackop tests " cr
TESTOVER2
testDrop2 testSquare
testCube
testNIP2 testDUP2
testtuck2 testswap2 testrot2 testdup testbl testdepth
testinvert
." stackop tests over " cr
;

\ Group of Basics tests
: BASICSTESTS
." Running basics tests and verifications " cr
VERIFYTYPEPROMPT
VERIFYGETNEXTLINE_IMM
VERIFYOK
VERIFYTOKENIZE_IMM 
VERIFYSEARCH OK
." ***Any error message above can almost certainly be ignored*** " CR
TESTHEX TESTDECIMAL TESTOCTAL VERIFYBINARY TESTEXIT
." Verifying ENCSQ with this output " cr
." Verifying COMMENT " cr \ ." COMMENT verification FAILED " 
VERIFYDOT
." Basics tests and verifications over " cr
;


\ Run all the tests
: UTS
DECIMAL
." Running unit tests " cr
STACKOPTESTS
." Press enter to continue " GETLINE CR
BASICSTESTS
." Press enter to continue " GETLINE CR
INTEGERTESTS
." Press enter to continue " GETLINE CR
LISTWORDSTESTS
." Press enter to continue " GETLINE CR
TESTCONDITIONALS
." Press enter to continue " GETLINE CR
RSTACKTESTS
." Press enter to continue " GETLINE CR
TESTLOOPS
." Press enter to continue " GETLINE CR
TESTMEMORY
." Press enter to continue " GETLINE CR
 ABORT" Verifying ABORTCOMM and leaving tests with this message "  ." ABORTCOMM has FAILED"
;

A moment of doubt?


RISCYFORTH in action

I have been working on Riscyforth, my Forth-like language for RISC-V single board computers (SBCs) for seven months now.

When I started there weren’t even any (affordable) SBCs available. Now there is one (which is really just an evaluation board) and the mass market is still not here – so it’s not as if I am seriously lagging behind.

But I am wondering if I should keep going.

In truth progress on the project is accelerating. I have a better idea of what I am doing and I am writing better assembly than I was back in December.

But partly that is the problem – because I can also understand that just writing some assembly code to shave off a few cycles (maybe) compared to the C-based alternatives like Gforth isn’t really enough.

For while I have now got to the point where I can write and load external programs in what is (almost) Forth – and I spent the whole of last week debugging issues my test Forth revealed – I am still miles away from a powerful, near-complete product that the first purchasers of the Beagle-V, or whatever other RISC-V SBC makes it to the mass market first, might actually want.

Experience with the Raspberry Pi shows that SBCs can be popular and so interest in RISC-V based SBCs might be considerable if they come in at a competitive cost point (though that is probably some time away yet as the market will need to build).

I enjoy the effort but I am wondering if I should keep going.