More on the Riscyforth versus Gforth shootout

This is a further post about optimising Riscyforth, my Forth language for RISC-V single board computers.

Riscyforth is a 64-bit Forth which runs on top of Linux – my test system is the Sipeed Nezha SBC – which runs a 1 GHz single core CPU (so 1 cycle is one nano-second). I’m testing it against GForth, the GNU Project’s Forth.

Whilst Riscyforth is written in RISC-V assembly, GForth is mainly Forth on top of C. For those familiar with Forth techniques – Riscyforth is a very traditional indirect threaded implementation, GForth uses a mixture of this and (faster) direct threaded code.

GForth also sticks to the Forth standard of 32-bit numbers as the default, Riscyforth uses 64 bit numbers throughout.

GForth is a much more sophisticated (and portable) environment than Riscyforth – and certainly my suggestion/intention has never been to claim I will write a better Forth than GForth – the GForth effort goes back to the 1990s, Riscyforth goes back to 28 December 2020!

The two key posts before this are:

The quick summary of the material those two posts is that I could make significant timing savings in my code with some optimisation steps that replaced correct but long-winded compile-time generated code that loaded registers with fixed (at compile-time) numbers via a sequence of adds and bit shifts, with code that loaded those values from a known memory location.. However those optimisations were not enough to close the gap on performance with the GNU Project’s GForth – which runs matrix.fth about 12% faster.

My code started out at taking about 9.5 seconds and I have got that down to about 8.5 seconds, whilst GForth is taking about 7.5 seconds.

Not satisfied with leaving it there I wrote some code to count cycles taken in executing Forth words or groups of words, and looked closer at what might be fixable.

		#debug word
		rdcycle t1
		sd t1, 0(t0)
		tail NEXT
		#debug word
		rdcycle t0
		ld t2, 0(t1)
		bgt t0, t2, debugout_normal
		li t3, -1
		sub t4, t3, t2
		add t4, t4, t0
		j debugout_sum
		sub t4, t0, t2
		ld t0, 0(t5)
		add t1, t0, t4
		sd t1, 0(t5)
		ld t2, 0(t6)
		addi t2, t2, 1
		sd t2, 0(t6)
		tail NEXT

		#(-- n n n)
		ld t2, 0(t0)
		ld t3, 0(t1)
		div t4, t2, t3
		addi sp, sp, -24
		sd t2, 0(sp)
		sd t3, 8(sp)
		sd t4, 16(sp)
		sd zero, 0(t0)
		sd zero, 0(t1)
		sd zero, 0(t2)
		tail NEXT

DEBUGIN records the cycle count at the start of some block of code, whilst DEBUGOUT calculates the time since the previous call to DEBUGIN, adds that to a total of elapsed times and increments a counter. DEBUGRESULTS then places, on the stack, the average time per IN – OUT cycle, the total number of samples and the total elapsed cycles.

Looking at the Forth code, the main word is what we call to run the program:

It initialises two 300 x 300 matrices (each initialisation taking about 53 million cycles) and then executes a loop and a nested loop: each loop runs 300 times, so the inner loop is executed 90000 times. The outer loop is recorded as taking just over 26 million cycles per execution whilst each inner loop takes about 87000 cycles per iteration.

The code to measure the cycles has an effect on the measurement so we can only be approximate here – eg while the inner loop is measured as taking 87000 cycles per iteration, each execution of innerproduct is measured as requiring around 88000 cycles (ie, longer than the code that calls it – an obvious contradiction). The test code itself is measured as taking about 15 cycles per execution – not much in itself, but rising to 409 million total cycles when called 27 million times inside innerproduct.

: innerproduct ( a[row][*] b[*][column] -- int)
  0 row-size 0 do
    >r over @ over @ * r> + >r
    swap cell+ swap row-byte-size +
  >r 2drop r>

In any case it is clear that innerproduct is where the real work is done:

Inside innerproduct lurks another loop, again called 300 times, meaning that the code inside this loop gets called 300 x 300 x 300 – 27 million – times: the graphic above shows the approximate (having accounted for the cost of calling the instrumentation code) total cycle count for each line of that loop.

Right at the start of this (see the first post linked above), the big cycle burner was the constant code: row-byte-size here, so it’s good that no longer seems to be the case.

It’s hard to be precise about this (a test of the row-size constant in main suggested, accounting for the cost of the instrumentation code, that it took about 269 cycles per execution but that looks to be a significant over-estimate as that would imply, if row-byte-size took the same time as the other constant (as it should) that 27 million executions of it alone should take about 7 billion cycles. A better guide seems to be that the nine words of the first line take a little bit less than twice the cycles/time of the five words of the second line – suggesting that each word takes on average a roughly similar time.

The final line in innerproduct is not inside the innermost loop and so is only called 90,000 times and takes, roughly, 8 million cycles.

It’s not possible to measure the cost of the loop control code using my simple instrumentation (as execution is not linear – all the work after the loop initialisation is done at the loop word) but as the total execution time of innerproduct (as measured inside innerproduct itself) is estimated at 7.6 billion cycles it seems unlikely the do ... loop code is particularly expensive.

This is all good news at one level: there appear to be no bits of code here chewing up cycles outrageously (each of the 14 words of the first two lines take roughly 19 cycles to execute, including the time taken to move from one word to another via the Forth threading).

But at level another it represents a bit of a dead end for efforts to optimise the code. All the words being executed here are pared down to the minimum and there is no room to squeeze anything out of them.

I covered that before in the initial posts but here’s a quick reminder/example:

		POP t0
		addi s9, s9, -STACKOFFSET
		sd t0, 0(s9)
		tail NEXT

		ld t0, 0(s9)
		PUSH t0
		addi s9, s9, STACKOFFSET
		tail NEXT

The >R and R> words are called repeatedly in the loop. They use the POP and PUSH macros which add or remove a 64 bit register from the stack:

.macro PUSH register
        addi sp, sp, -8
        sd  \register, 0(sp)

.macro POP register
        ld \register, 0(sp)
        addi sp, sp, 8

While the s9 register is used to manage the return stack (ie as a stack pointer to that stack) – so you can see these words just transfer a 64 bit value to and from two different stacks and there is just nothing that can be done to save cycles here (or if there is, I’d very much welcome being told about it).

So, it seems, GForth is the winner here – a Forth based on optimised C is not just portable but would appear to be faster too.

But, actually, I have some counter evidence.

The Forth program factoriel.f calculates the numerals of factorials. In Riscyforth I can calculate the factorial of 10000 (ie, 10000!) in about 26 seconds:

But GForth cannot manage it at all – though if I double the number of cells allocated in GForth (which means both systems have the same amount of resources due to the difference between 64 and 32 bits) it will work – but it takes around 36 seconds.

GForth will calculate the factorial of smaller numbers faster than Riscyforth though: if we try 3200! (close to the limit GForth can manage with the original settings for cell numbers) then GForth takes about 7 seconds and Riscyforth takes about 8 seconds.

Taken together these results suggest to me that at least there is nothing broken in Riscyforth. It’s about as fast as it could be with 64 bit arithmetic and the indirect threaded model.

A book recommendation

Lying in the bath this morning – relaxing into the Christmas holiday (despite the raging epidemic of omicron-variant Covid that is currently rampaging through London in an increasingly frightening manner), I had thoughts about several blog posts to write over the break, and this is the first…

It’s a book recommendation: Ian Stewart‘s Concepts of Modern Mathematics. It’s around 50 years since its first edition (and even this Dover edition is 26 years old) but – part from the disconcerting habit of assuming all humans are male – it doesn’t feel like a dated work at all.

In one other way, though, it is a product of its times – the controversy over “new mathematics” – the teaching methods adopted widely across the west in the 60s and 70s in an effort to improve mathematical understanding of students and move beyond rote-teaching. Interestingly, as the wikipedia article demonstrates, its adoption was driven, in part, by the post-Sputnik panic in US science and engineering.

It’s adoption was certainly controversial as Tom Lehrer makes clear here…

Now, having been educated in both Ireland (primary and secondary) and England (secondary) I have actually seen both approaches in operation at first hand – and its certainly the case that the traditional methods gave me a solid (algorithmic in the sense of using a simple and repetitive method) grounding in such things as long division which I think were missing in England, but it’s also not difficult to see why discussions of things like sets, modular arithmetic and matrices are pretty essential for anyone thinking of a career in science and technology.

That said the UK’s School Mathematics Project seems to have avoided what appears to be the excesses of “new math” in the US. I cannot produce any evidence for it but I am tempted to wonder if the UK’s 1980s (and continuing) software boom was in part driven by its innovation in maths teaching (read Francis Spufford’s excellent Backroom Boys for more on that).

Indeed these things are so fundamental to software development that it’s hard to imagine a case for not teaching them – but 50 years ago they were clearly seen as in some way esoteric or even fetishistic.

Professor Stewart’s book concentrates on understanding of the concepts and not rigorous proofs – which makes the book readable and digestible in settings like my bath. I am not going to say it could be mastered by anyone regardless of their current understanding of maths, but its not a traditional textbook and I would suggest it would be a good backgrounder if you had basic (O level/GCSE) maths qualifications and were interested in working as a software developer.

One of the points the author makes in his preface is that the UK has seen its science and technology bases erode as more and more graduates look for careers in law and financial services. I hope that will become the subject of another blog here shortly.

Riscyforth: an update

Last Christmas I started work on a Forth for RISC-V single board computers (SBCs): Riscyforth.

Riscyforth is written in RISC-V assembly, with the aim of providing a good balance between the ease of use that comes from a programming language that can both be run in an interpreted “immediate” mode and compiled if speed is the priority.

Forth is a stack based language – commands operate on what’s on top of the stack, hence:

2 2 +

Computes 2 + 2 and so on. It can sometimes make the code hard to read, but it’s also a powerful and extensible language – very much out of fashion these days but also very hacker (as in coder/experimenter) friendly.

At the time I began no SBCs existed, but they were expected. Since then, one has turned up – the Nezha – while the one we were all expecting, the Beagle V, was cancelled but it is looking better for 2022.

Progress on Riscyforth has been slower than I probably expected, certainly slower than reading the book on “threaded interpreted languages“- my initial inspiration – led me to expect: but the delays in hardware have also given me more time.

So how does it look?

Well, it’s now pretty mature – I have implemented a substantial part of the Forth 2012 core specification (and some other extensions too), but there is still a lot of work to be done:

  • Output is still pretty basic – you get lines of text and there is no support for nCurses type output, never mind graphics.
  • Some basic structures are still missing (such as the Forth equivalent of switch/case – but I have recently finished a restructuring of code/ pay down of technical debt that makes all this much easier to implement.
  • Despite the initial idea of writing this to give developers faster access to the hardware, all the work has gone into the core language and no hardware specific extensions have yet been written.

But it works, and works well. A couple of crude tests show it is doing basic looping about 10 – 20% faster than GForth, the GNU Project’s Forth compiler (which works well for RISC-V) – though you are getting a lot else besides with GForth.

If you are at all interested your comments and – perhaps – participation in the project would be very welcome.

Overflow in unsigned arithmetic in C

This came up in work recently – and I got a bit confused about it, so I thought it was worth writing it down in case someone like me needs to look it up.

Let’s consider a real world example – we have a 16 bit counter that increments on every event X and which we sample periodically: at our last sampling was at value 0xFFFD and at our current sample is 0x03. How many incidents of X have occurred?

Ignoring the possibility of a ‘double rollover’ or anything similar, the answer is 6 (don’t forget the counter goes through 0). But how can we calculate this programmatically?

It turns out it’s very simple and this always works:

uint16_t numberEvents = currentSample - previousSample

In C the calculation (constrained to 16 bits) of 0x03 – 0xFFFD with unsigned arithmetic results in an overflow (we cannot calculate a negative number) but the language standard gives us a well-defined answer that works well.

The answer of -65530 is reinterpreted as modulo 0x10000 (65536) and this comes out at 6. To explain why think of a smaller number system – modulo 5 say. Then the number line (integers and integers mod 5) looks like this:

Integers: 10 9 8 7 6 5 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
Modulo 5: 0 4 3 2 1 0 4 3 2 1 0 4 3 2 1 0 4 3 2 1 0

Under this system we can see that 1 – 4 = -3 for integers but 2 mod 5 – which if this was a simple two bit counter would be the correct number of increments that had taken place.

Game of Life in Scratch

Game of LifeA few days ago I asked for volunteers to read a book I was writing on programming, using Scratch, MIT’s visual, event-driven, programming environment.

I have not yet had any volunteers, though the flurry of online interest did get me to complete the first draft – so alpha testers still needed.

In the meantime, I have also published the version of Conway’s Game of Life the text is based around. It’s a slightly unusual implementation in that the surface of the game is, in effect, spherical┬áin effect a torus – i.e. the edges are effectively a feature of the projection of the surface but left joins to right, top to bottom and so on.

Update: Been pointed out to me that this isn’t a sphere but a torus, and so I have updated the copy.

Squashing the LRU bug

English: Typical actions taken upon a virtual-...
English: Typical actions taken upon a virtual-physical address translation. (Photo credit: Wikipedia)

Just for once I did not rush to an online forum and say I had found a bug in a product – and I was right not too.

Having tried three different cross compiler toolchains I convinced myself that the issue was plainly neither compiler (or, to be more accurate in this case, assembler) output but some process in my code that was causing corruption. And sure enough I found that I was mangling the physical addresses of my page frames.

Thanks to the way OVPsim operates – by default it provides physical mappings on demand for the full 4GB address space of a 32 bit register – this mangling did not generate a page fault, but it did mean that for certain sequences of instructions – particularly those where lots of page faults were likely to occur, memory was being corrupted.

Changing one line of assembly – so that virtual address output was written to virtual address slot and not the physical address slot in my simple list of page table entries fixed that.

So now the code works – at least I think it does!

First working assembly code in 33 years

When I say working, I mean boots into an endless loop with no ouput to the user – but then, that’s what it’s meant to do:

/* 	Microblaze start up code
	Copyright Adrian McMenamin <>, 2014
	Licensed under GPL v3

	.global _start
	.section .vectors.reset, "ax"
	.align 2
	.ent _start
	.type _start, @function

	brai	_actualstart
	.end _start

	.section .vectors.sw_exception, "ax"
	.align 2
	brai	_handle_exception

	.section .vectors.interrupt, "ax"
	.align 2
	brai	_interrupt_handler

	.section .vectors.breakpoint, "ax"
	.align 2
	brai _handle_breakpoint

	.section .vectors.hw_exception, "ax"
	.align 2
	brai	_handle_hwexception

	.section .text
	.global _actualstart
	.align 4
	.ent _actualstart
	.type _actualstart, @function

	mts	rmsr, r0
	mts	rslr, r0
	addi	r8, r0, 0xFFFFFFFF
	mts	rshr, r8

	addik	r3, r0, 0x3F
	mts	rtlbx, r3

	bri 0 /*endless loop if it ever works */

Only have to get it to do something useful now, how hard could that be? (There is a linker script to get this to work also – as it has to be loaded to address 0x00000000. Have a look at my github (mcmenaminadrian) repos.

Curses on ncurses

gdb icon, created for the Open Icon Library
gdb icon, created for the Open Icon Library (Photo credit: Wikipedia)

Every programmer will be familiar with something like this…

A little while back I wrote a program that simulates – crudely but effectively – a multicore NoC device. I use it to model the execution times of different page replacement algorithms.

The input is XML generated via a step by step trace of a working program. The actually instructions being traced do not matter – what I care about are the memory access patterns.

To allow me to test more models more quickly I have now written some R code that generates a semi-random access pattern based, very loosely indeed, on the patterns seen in the real program. The advantage is I can test against a set number of memory accesses but with a range of pseudo-random access patterns, so although I am not running models against the “real” access pattern, neither am I taking three weeks per experiment.

But when I used the artificially generated access patterns, my program crashed with a seg fault. But even more confusingly, when I ran the code in GDB, the GNU Debugger, if I stepped through the code it worked, but I just ran the code in debugger then it crashed just as it did without using the debugger.

After a few hours I realised why – in my artificial patterns, the first thing the first thread does is spawn all the other threads to be used. In real world code, of course, these spawns take place after quite some code has been executed.

Every code spawn causes the ncurses code I am using to update the screen. When using ‘real’ access patterns these updates take place comfortably after all the ncurses environment has been set up (by a separate thread), but in the artificial code, the thread updates are the first thing that get posted to the screen, even before ncurses has been set up – hence the crash.

If I step through the code then the ncurses thread runs ahead and sets up the screen before I hit the thread update code and again it works.

The solution? Use a condition variable and a mutex to ensure that nothing executes before the ncurses environment is fully established.

Not a big deal – but perhaps, at some point in the future someone struggling to understand why their code – which previously worked so well – has now stopped processing what seems to be well-formed input. Hope this helps!

After paging?

Diagram of relationship between the virtual an...
Diagram of relationship between the virtual and physical address spaces. (Photo credit: Wikipedia)

Paging and virtual memory is at the heart of just about any computing device – more complex than a DVD player – we use everyday.

Paging is the memory management system based on the idea that we can divide the real memory of our computer into a sequence of smallish (typically 4,096 bytes) of “page frames” and then load the bits of data and program in and out of those frames (in “pages”) as we need them.

So, you can have pages from various different running programs in the page frames at any given time and then use a “virtual memory” system to map the pages placed in an arbitrary frame to the memory address the program thinks the page should be resident in.

It is not the only system we could use – “segments”, which involve moving large chunks (as opposed to small pages) of memory about is one approach, while “overlays” – using part of the memory space as sort of scratchpad working area – is another. More recently, with bigger “traditional” computers very large pages have been used as a way of making, at least in theory, more efficient use of memory now measured in billions (as opposed to a few tens) of bytes.

But paging is easily the most widely used approach and has been integral to the development of “multitasking” and similar shared resources approaches to computing – because paging allows us to just have the useful bits of a program and its data in memory we can have many more programs “running” at a given time.

But my PhD research is pointing me towards some of the weaknesses of the paging approach.

At the heart of the case for paging is the idea of “locality” in a computer’s use of resources: if you use one memory address at one instant there is a high probability you will use a nearby address very soon: think of any sort of sequential document or record and you can see why that idea is grounded in very many use cases of computing devices.

Locality means that it ought to make sense to read in memory in blocks and not just one little drop at a time.

But this principle may be in opposition to efficient usage of memory when competition for space in fierce: such as for the limited local memory resources we have on a Network-on-Chip computer.

Right now I am collecting data to measure the efficiency of 4k pages on such (simulated) devices. With 16 simulated cores trying to handle up to 18 threads of execution competition for pages is intense and the evidence suggests that they are resident, in some cases at least, for many fewer “ticks” than it takes to load them from the next lowest level in the memory hierarchy. On top of that many pages show that the principle of locality can be quite a weak one – pages of code are, in general, quite likely to demonstrate high locality (especially in loops) but pages of read-write memory may not.

I don’t have all the data to hand – essentially I am transforming one 200GB XML file into another XML file which will likely be around the same size and that takes time, even on quite a high spec computer (especially when you have to share resources with other researchers). But I expect some interesting results.

Enhanced by Zemanta

Passing by reference in C++ – a downside

Code (Photo credit: Riebart)

Once you realise what is happening this is obvious, but it took me a while…

I wanted to write some longs out as character strings in C++, so wrote some code like this:

void writeLongToFile(ofstream& xmlFile, long& value)
stringstream stringy;
stringy << value;
xmlFile << stringy.rdbuf();

But compile time error after compile timer error followed as I was told that I could not use this code with unsigned long or with const long and so on … but surely these are simple conversions I thought…

…Just me thinking like a C programmer again. Passing by reference is just that – there is no copying or conversion available and so long and unsigned long are fundamentally incompatible.

The solution? Just pass by value – afterall the saving in passing a more or less primative type like long by reference must be pretty minimal – passing the stream by reference is the real saving here (actually, it’s also practical – as we want the output to go to the real stream and not a copy: in C we’d use a pointer here).

Enhanced by Zemanta