Adding dynamic linking to Riscyforth


This is one of those blogs one writes to clarify inchoate thoughts, so it’s sorry-not-sorry if it starts to wander around…

Last week I got to an important point with Riscyforth, my Forth for RISC-V single board computers: I tagged my first release (v0.1).

I did that because I had, finally, managed to cover all of the core and extended core words of the Forth 2012 standard. Essentially that means something written using words from the core of the language should compile and run (and give correct results).

But merely covering the core definitions doesn’t deliver a system that many real-world users will want to play with except out of curiosity. No floating point maths, no control over screen output beyond, essentially, printing line by line (though there are a few additions here, like coloured output) and certainly no graphics.

But I also don’t want to build an ever-larger monolithic system. Users should be able to pick and choose which parts of the extended language they get (as well, of course, as being free to extend it in Forth itself).

That is going to require some form of dynamic loading and linking. Whilst all the tools are now available for anyone (with a screw loose) to write their own extended floating point words using the integer primitives, it’s much more sensible to provide words (written in assembly) that call the library code provided by the operating system.

My initial thought was that there should be a riscyforth.conf file in the Riscyforth distribution listing the modules to be loaded on startup. But I now think a more flexible way is to provide a new Forth word – LOADMODULE ( "<name>" -- flag ), say – that loads a named module. This word could be used at any time, but if we put it in a short Forth file/program – e.g. startup.fth – then Riscyforth could be made to execute this at start-up time, and users would edit that file to fix what they started with.
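A hypothetical startup.fth might look like the sketch below – LOADMODULE, ?MODULE and the module names are all assumptions for the sake of illustration, nothing here is shipped code yet:

\ startup.fth -- run automatically when Riscyforth starts (sketch only)
: ?MODULE ( flag -- ) 0= IF ." module load failed" CR THEN ;
LOADMODULE float   ?MODULE   \ floating point words
LOADMODULE screen  ?MODULE   \ cursor control and coloured output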

But how to implement? The best way seems to be to supply object files (along with the source – this is a GPL-licensed project) and have mmap load them into memory. My code already uses mmap to create the space where new words can be written, so the theory seems sound enough, and I have found this on the web which says it can be done. But it might also be a bit of a challenge – then again, it wouldn’t be any fun if it wasn’t.

Getting the POSTPONE word to work in Forth

RISCYFORTH in action

I have struggled with this issue for several weeks now and I didn’t find the explanations in Starting Forth or the 2012 Forth standard or even the GForth documentation particularly enlightening, so I wanted to write this to – hopefully – add some clarity to anyone else trying to get this right in their own Forth implementation.

The standard says this of POSTPONE:

Compilation:

( "<spaces>name" -- )

Skip leading space delimiters. Parse name delimited by a space. Find name. Append the compilation semantics of name to the current definition. An ambiguous condition exists if name is not found.

But what does that actually mean? I think there are three rules you need to apply (of which more later), but first you need to think about where they are applied. Typically you might have something like this:

: ENDIF POSTPONE THEN ; IMMEDIATE

: EXAMPLE IF ." Say something" ELSE ." Say something different" ENDIF ;

ENDIF here replaces the modern standard THEN (some older Forth code may contain the ENDIF word, and our ENDIF should be a perfect replacement for it).
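Used interactively the two should be indistinguishable (a sketch of the expected behaviour):

-1 EXAMPLE   \ prints: Say something
 0 EXAMPLE   \ prints: Say something different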

Now, we have defined ENDIF in its own word but we want it to act inside the word EXAMPLE – in this case we want it to meet what we might call the contractual obligations of THEN (here, to tell the ELSE clause where code following the ‘true’ path should jump to when it hits ELSE).

Now, defining our ENDIF word as IMMEDIATE ensures it is activated when we are compiling EXAMPLE, but if all it did was call THEN it would do literally nothing (in Riscyforth, in any case, THEN at execution time is a NOP).

What we actually want to see happen is that compilation actions of THEN are activated and that their effects are seen in EXAMPLE.

In this case that means calculating and inserting the jump-to address for the ELSE. Exactly what a THEN would do if it were there in EXAMPLE.

(The badging of ENDIF as IMMEDIATE then gives us a second boost – because that labelling means it doesn’t get called when EXAMPLE is executed, so we don’t have to worry about it trying to compile in those actions every time EXAMPLE is called.)

In Riscyforth there are a number of words – currently LITERAL, DOES>, ABORT", .", S", , ['], COMPILE,, IS, TO, [CHAR], ACTION-OF, AGAIN, UNTIL, DO, ?DO, LOOP, +LOOP, -LOOP, UNLOOP, LEAVE, RECURSE, I, J, EXIT, IF, ELSE, THEN, OF, WHILE, REPEAT and ; (though others will probably need to be added as the code matures) – that need special handling on compilation, and that special handling has to be reflected if they are POSTPONE’d: the special handling routines are compiled into the colon word being worked on. For the other, "ordinary", words what has to happen is that a reference to them is what gets compiled in – here’s another example:

: -IF POSTPONE 0= POSTPONE IF ; IMMEDIATE

In this case we’ve defined a new word -IF which works in the opposite sense (Boolean-wise) to the standard IF, i.e. what was FALSE for IF is treated as TRUE by -IF and so on. The key, in a definition such as:

: WASFALSE -IF ." Result was FALSE " ELSE ." Result was TRUE " THEN ;

is that the 0= gets compiled into the definition of WASFALSE and not left to languish inside -IF. Because – again – -IF, being marked as IMMEDIATE, never gets called after its initial invocation during compilation: what it leaves behind is what gets called.
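Tested at the prompt, WASFALSE should behave like this (a sketch of the expected session):

0 WASFALSE    \ prints: Result was FALSE
-1 WASFALSE   \ prints: Result was TRUE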

So – those three rules:

  1. If a word is marked as IMMEDIATE then POSTPONE strips it of that character and it becomes JAW – just another word: rather than executing on the spot during compilation, it is compiled into the definition containing the POSTPONE and is simply executed when it is reached.
  2. If a word requires special handling on compilation that special handling should be repeated when it is POSTPONE’d.
  3. All POSTPONE’d words that are not immediate should be compiled into the current colon definition being worked on (that phrase “the current definition” from the standard is extremely confusing).
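A tiny illustration of rule 1 (a sketch – GREET, A and B are invented for the example):

: GREET ." hello " ; IMMEDIATE
: A GREET ;            \ GREET is IMMEDIATE: "hello" prints now, as A compiles
: B POSTPONE GREET ;   \ immediacy stripped: GREET is compiled into B instead
B                      \ "hello" prints only when B executes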

More on the Riscyforth versus Gforth shootout


This is a further post about optimising Riscyforth, my Forth language for RISC-V single board computers.

Riscyforth is a 64-bit Forth which runs on top of Linux – my test system is the Sipeed Nezha SBC, which has a 1 GHz single-core CPU (so one cycle is one nanosecond). I’m testing it against GForth, the GNU Project’s Forth.

Whilst Riscyforth is written in RISC-V assembly, GForth is mainly Forth on top of C. For those familiar with Forth techniques – Riscyforth is a very traditional indirect threaded implementation, GForth uses a mixture of this and (faster) direct threaded code.

GForth also sticks to the Forth standard’s default of 32-bit numbers, while Riscyforth uses 64-bit numbers throughout.

GForth is a much more sophisticated (and portable) environment than Riscyforth – and certainly my suggestion/intention has never been to claim I will write a better Forth than GForth – the GForth effort goes back to the 1990s, Riscyforth goes back to 28 December 2020!

The two key posts before this are:

The quick summary of those two posts is that I could make significant timing savings with some optimisation steps: I replaced correct but long-winded compile-time generated code – which loaded registers with numbers fixed at compile time via a sequence of adds and bit shifts – with code that loads those values from a known memory location. However, those optimisations were not enough to close the performance gap with the GNU Project’s GForth – which runs matrix.fth about 12% faster.

My code started out at taking about 9.5 seconds and I have got that down to about 8.5 seconds, whilst GForth is taking about 7.5 seconds.

Not satisfied with leaving it there I wrote some code to count cycles taken in executing Forth words or groups of words, and looked closer at what might be fixable.

		CODEHEADER DEBUGIN, Q, 0x0
		#(--)
		#debug word: record the cycle counter at the start of a timed block
		la t0, CYCLESTART
		rdcycle t1			# read the cycle CSR
		sd t1, 0(t0)			# store it in CYCLESTART
		tail NEXT

		CODEHEADER DEBUGOUT, DEBUGIN, 0x0
		#(--)
		#debug word: add cycles elapsed since the last DEBUGIN to a running total
		rdcycle t0
		la t1, CYCLESTART
		ld t2, 0(t1)
		bgt t0, t2, debugout_normal	# has the cycle counter wrapped around?
		li t3, -1			# if so: count up to the top...
		sub t4, t3, t2
		add t4, t4, t0			# ...then on to the new reading
		j debugout_sum
  debugout_normal:
		sub t4, t0, t2			# elapsed = now - start
  debugout_sum:
		la t5, CYCLECOUNT
		ld t0, 0(t5)
		add t1, t0, t4
		sd t1, 0(t5)			# add the elapsed cycles to the total
		la t6, CYCLEPINGS
		ld t2, 0(t6)
		addi t2, t2, 1
		sd t2, 0(t6)			# and increment the sample count
		tail NEXT

		CODEHEADER DEBUGRESULTS, DEBUGOUT, 0x0
		#(-- n n n)
		#push average, sample count and total cycles, then reset the counters
		la t0, CYCLECOUNT
		la t1, CYCLEPINGS
		ld t2, 0(t0)
		ld t3, 0(t1)
		div t4, t2, t3			# average cycles per DEBUGIN-DEBUGOUT pair
		addi sp, sp, -24		# room for three cells on the data stack
		sd t2, 0(sp)			# total cycles on top
		sd t3, 8(sp)			# sample count beneath
		sd t4, 16(sp)			# average at the bottom
		sd zero, 0(t0)			# zero everything for the next run
		sd zero, 0(t1)
		la t2, CYCLESTART
		sd zero, 0(t2)
		tail NEXT

DEBUGIN records the cycle count at the start of some block of code, whilst DEBUGOUT calculates the time since the previous call to DEBUGIN, adds that to a total of elapsed times and increments a counter. DEBUGRESULTS then places, on the stack, the average time per IN – OUT cycle, the total number of samples and the total elapsed cycles.
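From the Forth side the instrumentation is used something like this (a sketch – TIMED-INNER is an invented wrapper, not part of matrix.fth):

: TIMED-INNER ( addr1 addr2 -- n ) DEBUGIN innerproduct DEBUGOUT ;

\ ...and after a run:
DEBUGRESULTS . . .   \ prints total cycles, then sample count, then the average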

Looking at the Forth code, the main word is what we call to run the program:
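(The listing appeared as a screenshot in the original post; the sketch below shows its shape only – initiate-a, initiate-b and the two addressing words are stand-ins, not the real source.)

: main ( -- )
  initiate-a initiate-b             \ fill two 300 x 300 matrices
  row-size 0 do                     \ outer loop: 300 iterations
    row-size 0 do                   \ inner loop: 90,000 iterations in all
      j row-address i column-address
      innerproduct drop             \ one element of the product (the real code stores it)
    loop
  loop ;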

It initialises two 300 x 300 matrices (each initialisation taking about 53 million cycles) and then executes a loop and a nested loop: each loop runs 300 times, so the inner loop is executed 90,000 times. The outer loop is recorded as taking just over 26 million cycles per execution whilst each inner loop takes about 87,000 cycles per iteration.

The code to measure the cycles has an effect on the measurement itself, so we can only be approximate here – e.g. while the inner loop is measured as taking 87,000 cycles per iteration, each execution of innerproduct is measured as requiring around 88,000 cycles (i.e. longer than the code that calls it – an obvious contradiction). The test code itself is measured as taking about 15 cycles per execution – not much in itself, but rising to 409 million total cycles when called 27 million times inside innerproduct.

: innerproduct ( a[row][*] b[*][column] -- int)
  0 row-size 0 do                      \ 0 is the running total
    >r over @ over @ * r> + >r         \ total += a[i] * b[i]
    swap cell+ swap row-byte-size +    \ advance a by a cell, b by a row
    r>                                 \ total back on the data stack
  loop
  >r 2drop r>                          \ drop the two addresses, keep the total
;

In any case it is clear that innerproduct is where the real work is done.

Inside innerproduct lurks another loop, again called 300 times, meaning that the code inside this loop gets called 300 x 300 x 300 – 27 million – times: the graphic above shows the approximate (having accounted for the cost of calling the instrumentation code) total cycle count for each line of that loop.

Right at the start of this (see the first post linked above) the big cycle burner was the constant-handling code – row-byte-size here – so it’s good that this no longer seems to be the case.

It’s hard to be precise about this: a test of the row-size constant in main suggested (accounting for the cost of the instrumentation code) that it took about 269 cycles per execution, but that looks to be a significant over-estimate, as it would imply – if row-byte-size took the same time as the other constant, as it should – that 27 million executions of it alone would take about 7 billion cycles. A better guide seems to be that the nine words of the first line take a little less than twice the cycles/time of the five words of the second line – suggesting that each word takes, on average, a roughly similar time.

The final line in innerproduct is not inside the innermost loop and so is only called 90,000 times and takes, roughly, 8 million cycles.

It’s not possible to measure the cost of the loop control code using my simple instrumentation (as execution is not linear – all the work after the loop initialisation is done at the loop word) but as the total execution time of innerproduct (as measured inside innerproduct itself) is estimated at 7.6 billion cycles it seems unlikely the do ... loop code is particularly expensive.

This is all good news at one level: there appear to be no bits of code here chewing up cycles outrageously (each of the 14 words of the first two lines takes roughly 19 cycles to execute, including the time taken to move from one word to the next via the Forth threading).

But at another level it represents a bit of a dead end for efforts to optimise the code. All the words being executed here are pared down to the minimum and there is no room to squeeze anything out of them.

I covered that before in the initial posts but here’s a quick reminder/example:

		CODEHEADERZ TOR, >R, TOR2, 0x01 #>R
		POP t0				# take the top of the data stack
		addi s9, s9, -STACKOFFSET	# grow the return stack...
		sd t0, 0(s9)			# ...and store the value there
		tail NEXT

		CODEHEADERZ RFROM, R>, TOR, 0x01 #R>
		ld t0, 0(s9)			# read the top of the return stack
		PUSH t0				# push it on to the data stack
		addi s9, s9, STACKOFFSET	# shrink the return stack
		tail NEXT

The >R and R> words are called repeatedly in the loop. They use the POP and PUSH macros, which remove or add a 64-bit register value from or to the data stack:

.macro PUSH register
        addi sp, sp, -8
        sd  \register, 0(sp)
.endm

.macro POP register
        ld \register, 0(sp)
        addi sp, sp, 8
.endm

The s9 register is used to manage the return stack (i.e. as that stack’s stack pointer) – so you can see these words just transfer a 64-bit value between two different stacks, and there is simply nothing that can be done to save cycles here (or if there is, I’d very much welcome being told about it).

So, it seems, GForth is the winner here – a Forth based on optimised C is not just portable but would appear to be faster too.

But, actually, I have some counter evidence.

The Forth program factoriel.f calculates the digits of factorials. In Riscyforth I can calculate the factorial of 10000 (i.e. 10000!) in about 26 seconds.

But GForth cannot manage it at all – though if I double the number of cells allocated in GForth (which gives both systems the same amount of memory, allowing for the difference between 64 and 32 bits) it will work – but it takes around 36 seconds.

GForth will calculate the factorial of smaller numbers faster than Riscyforth though: if we try 3200! (close to the limit GForth can manage with the original settings for cell numbers) then GForth takes about 7 seconds and Riscyforth takes about 8 seconds.

Taken together these results suggest to me that, at the least, there is nothing broken in Riscyforth. It’s about as fast as it could be with 64-bit arithmetic and the indirect threaded model.

A book recommendation


Lying in the bath this morning – relaxing into the Christmas holiday (despite the raging epidemic of omicron-variant Covid that is currently rampaging through London in an increasingly frightening manner), I had thoughts about several blog posts to write over the break, and this is the first…

It’s a book recommendation: Ian Stewart’s Concepts of Modern Mathematics. It’s around 50 years since its first edition (and even this Dover edition is 26 years old) but – apart from the disconcerting habit of assuming all humans are male – it doesn’t feel like a dated work at all.

In one other way, though, it is a product of its times – the controversy over “new mathematics”, the teaching methods adopted widely across the west in the 60s and 70s in an effort to improve students’ mathematical understanding and move beyond rote-teaching. Interestingly, as the Wikipedia article demonstrates, its adoption was driven, in part, by the post-Sputnik panic in US science and engineering.

Its adoption was certainly controversial, as Tom Lehrer makes clear here…

Now, having been educated in both Ireland (primary and secondary) and England (secondary) I have actually seen both approaches in operation at first hand – and it’s certainly the case that the traditional methods gave me a solid (algorithmic, in the sense of using a simple and repetitive method) grounding in such things as long division, which I think was missing in England; but it’s also not difficult to see why discussions of things like sets, modular arithmetic and matrices are pretty essential for anyone thinking of a career in science and technology.

That said, the UK’s School Mathematics Project seems to have avoided what appear to be the excesses of “new math” in the US. I cannot produce any evidence for it, but I am tempted to wonder whether the UK’s 1980s (and continuing) software boom was in part driven by its innovation in maths teaching (read Francis Spufford’s excellent Backroom Boys for more on that).

Indeed these things are so fundamental to software development that it’s hard to imagine a case for not teaching them – but 50 years ago they were clearly seen as in some way esoteric or even fetishistic.

Professor Stewart’s book concentrates on understanding the concepts rather than rigorous proofs – which makes it readable and digestible in settings like my bath. I am not going to say it could be mastered by anyone regardless of their current understanding of maths, but it’s not a traditional textbook and I would suggest it as a good backgrounder if you had basic (O level/GCSE) maths qualifications and were interested in working as a software developer.

One of the points the author makes in his preface is that the UK has seen its science and technology bases erode as more and more graduates look for careers in law and financial services. I hope that will become the subject of another blog here shortly.

Riscyforth: an update


Last Christmas I started work on a Forth for RISC-V single board computers (SBCs): Riscyforth.

Riscyforth is written in RISC-V assembly, with the aim of providing a good balance between the ease of use that comes from a programming language that can both be run in an interpreted “immediate” mode and compiled if speed is the priority.

Forth is a stack based language – commands operate on what’s on top of the stack, hence:

2 2 +

Computes 2 + 2 and so on. It can sometimes make the code hard to read, but it’s also a powerful and extensible language – very much out of fashion these days but also very hacker (as in coder/experimenter) friendly.
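To actually see the result you print the top of the stack with . (“dot”):

2 2 + .   \ prints 4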

At the time I began, no RISC-V SBCs existed, but they were expected. Since then one has turned up – the Nezha – while the one we were all expecting, the Beagle V, was cancelled; 2022, though, is looking better.

Progress on Riscyforth has been slower than I expected – certainly slower than reading the book on “threaded interpretive languages” (my initial inspiration) led me to expect – but the delays in hardware have also given me more time.

So how does it look?

Well, it’s now pretty mature – I have implemented a substantial part of the Forth 2012 core specification (and some other extensions too), but there is still a lot of work to be done:

  • Output is still pretty basic – you get lines of text and there is no support for nCurses type output, never mind graphics.
  • Some basic structures are still missing (such as the Forth equivalent of switch/case) – but I have recently finished a restructuring of the code and pay-down of technical debt that makes all this much easier to implement.
  • Despite the initial idea of writing this to give developers faster access to the hardware, all the work has gone into the core language and no hardware specific extensions have yet been written.

But it works, and works well. A couple of crude tests show it is doing basic looping about 10–20% faster than GForth, the GNU Project’s Forth compiler (which works well on RISC-V) – though you are getting a lot else besides with GForth.
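Those tests are no more sophisticated than timing a counted loop, something along these lines (a sketch – the bound is arbitrary):

: BENCH 100000000 0 DO LOOP ;   \ 100 million empty iterations
BENCH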

If you are at all interested your comments and – perhaps – participation in the project would be very welcome.

Overflow in unsigned arithmetic in C


This came up in work recently – and I got a bit confused about it, so I thought it was worth writing it down in case someone like me needs to look it up.

Let’s consider a real-world example – we have a 16-bit counter that increments on every event X and which we sample periodically: our last sample was 0xFFFD and our current sample is 0x03. How many incidents of X have occurred?

Ignoring the possibility of a ‘double rollover’ or anything similar, the answer is 6 (don’t forget the counter goes through 0). But how can we calculate this programmatically?

It turns out it’s very simple and this always works:

uint16_t numberEvents = currentSample - previousSample;

In C the calculation (constrained to 16 bits) of 0x03 – 0xFFFD with unsigned arithmetic results in an overflow (we cannot represent a negative number), but the language standard gives us a well-defined answer that works well.

The answer of -65530 is reinterpreted as modulo 0x10000 (65536) and this comes out at 6. To explain why think of a smaller number system – modulo 5 say. Then the number line (integers and integers mod 5) looks like this:

Integers:  10   9   8   7   6   5   4   3   2   1   0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
Modulo 5:   0   4   3   2   1   0   4   3   2   1   0   4   3   2   1   0   4   3   2   1   0

Under this system we can see that 1 – 4 = -3 for the integers but 2 mod 5 – which, if this were a counter that rolled over at five, would be the correct number of increments that had taken place.
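For what it’s worth, the same modular arithmetic can be demonstrated at a Riscyforth prompt – a sketch, masking to 16 bits by hand since cells there are 64-bit:

HEX
3 FFFD -       \ current sample minus previous sample: -65530
FFFF AND .     \ constrain to 16 bits and print: 6
DECIMAL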

Game of Life in Scratch


A few days ago I asked for volunteers to read a book I was writing on programming using Scratch, MIT’s visual, event-driven programming environment.

I have not yet had any volunteers, though the flurry of online interest did get me to complete the first draft – so alpha testers still needed.

In the meantime, I have also published the version of Conway’s Game of Life the text is based around. It’s a slightly unusual implementation in that the surface of the game is, in effect, a torus – i.e. the edges are effectively a feature of the projection of the surface: left joins to right, top to bottom and so on.

Update: It has been pointed out to me that this isn’t a sphere but a torus, and so I have updated the copy.

Squashing the LRU bug


Typical actions taken upon a virtual-physical address translation. (Photo credit: Wikipedia)

Just for once I did not rush to an online forum and say I had found a bug in a product – and I was right not to.

Having tried three different cross-compiler toolchains, I convinced myself that the issue was plainly not compiler (or, to be more accurate in this case, assembler) output but some process in my code that was causing corruption. And sure enough I found that I was mangling the physical addresses of my page frames.

Thanks to the way OVPsim operates – by default it provides physical mappings on demand for the full 4GB address space addressable with a 32-bit register – this mangling did not generate a page fault, but it did mean that for certain sequences of instructions – particularly those where lots of page faults were likely to occur – memory was being corrupted.

Changing one line of assembly – so that the virtual address was written to the virtual address slot, and not the physical address slot, in my simple list of page table entries – fixed that.

So now the code works – at least I think it does!

First working assembly code in 33 years


When I say working, I mean boots into an endless loop with no output to the user – but then, that’s what it’s meant to do:

/* 	Microblaze start up code
	Copyright Adrian McMenamin <acm538@york.ac.uk>, 2014
	Licensed under GPL v3
*/


	.global _start
	.section .vectors.reset, "ax"
	.align 2
	.ent _start
	.type _start, @function

_start:
	brai	_actualstart
	.end _start

	.section .vectors.sw_exception, "ax"
	.align 2
_vector_sw_exception:
	brai	_handle_exception

	.section .vectors.interrupt, "ax"
	.align 2
_vector_interrupt:
	brai	_interrupt_handler

	.section .vectors.breakpoint, "ax"
	.align 2
_vector_breakpoint:
	brai _handle_breakpoint

	.section .vectors.hw_exception, "ax"
	.align 2
_vector_hw_exception:
	brai	_handle_hwexception

	.section .text
	.global _actualstart
	.align 4
	.ent _actualstart
	.type _actualstart, @function

_actualstart:
	mts	rmsr, r0		/* clear the machine status register */
	mts	rslr, r0		/* stack low limit register: 0 */
	addi	r8, r0, 0xFFFFFFFF
	mts	rshr, r8		/* stack high limit register: top of memory */

	nop
	addik	r3, r0, 0x3F		/* 0x3F = 63 */
	mts	rtlbx, r3		/* select the last TLB entry */



_handle_exception:
_interrupt_handler:
_handle_hwexception:
_handle_breakpoint:
	nop
	bri 0 /*endless loop if it ever works */

Only have to get it to do something useful now – how hard could that be? (There is also a linker script needed to get this to work, as it has to be loaded at address 0x00000000 – have a look at my github (mcmenaminadrian) repos.)

Curses on ncurses


gdb icon, created for the Open Icon Library (Photo credit: Wikipedia)

Every programmer will be familiar with something like this…

A little while back I wrote a program that simulates – crudely but effectively – a multicore NoC device. I use it to model the execution times of different page replacement algorithms.

The input is XML generated via a step-by-step trace of a working program. The actual instructions being traced do not matter – what I care about are the memory access patterns.

To allow me to test more models more quickly I have now written some R code that generates a semi-random access pattern based, very loosely indeed, on the patterns seen in the real program. The advantage is I can test against a set number of memory accesses but with a range of pseudo-random access patterns, so although I am not running models against the “real” access pattern, neither am I taking three weeks per experiment.

But when I used the artificially generated access patterns, my program crashed with a seg fault. Even more confusingly, when I ran the code in GDB, the GNU debugger, it worked if I stepped through it, but if I just ran it inside the debugger it crashed exactly as it did outside the debugger.

After a few hours I realised why – in my artificial patterns, the first thing the first thread does is spawn all the other threads to be used. In real world code, of course, these spawns take place after quite some code has been executed.

Every thread spawn causes the ncurses code I am using to update the screen. When using ‘real’ access patterns these updates take place comfortably after the ncurses environment has been set up (by a separate thread), but with the artificial patterns the thread updates are the first thing posted to the screen – even before ncurses has been set up – hence the crash.

If I step through the code then the ncurses thread runs ahead and sets up the screen before I hit the thread update code and again it works.

The solution? Use a condition variable and a mutex to ensure that nothing executes before the ncurses environment is fully established.

Not a big deal – but perhaps, at some point in the future, someone will be struggling to understand why their code – which previously worked so well – has now stopped processing what seems to be well-formed input. Hope this helps!