What is “floating point”?
How do you represent a number like 2.56 × 10² in a compressed and useful way in binary?
Any computer science student will be able to tell you that you use a “floating point” representation and, like me a couple of months ago, could probably then move on to a hand-wavy explanation: you store a representation of the exponent (here the 2 of 10²) and what I am choosing to call the mantissa (a practice denounced in the multi-volume bibles of computing algorithms by Donald Knuth) – here the 2.56 – all in one “number”.
And, to be honest, until recently that, and the knowledge that integer (i.e. counting number-based) arithmetic was much, much faster, was all I needed to know.
What is “Riscyforth”?
I have long since grown out of the fantasy that Riscyforth will have a mass audience. Indeed there have been moments over the last two years where the idea of widespread adoption of RISC-V – so often touted as the route to an open hardware future to match that of open-source software – has also looked fantastical (though plans for more RISC-V SBCs are now starting to emerge again).
So that means Riscyforth primarily, for now at least, exists for my use, learning and pleasure. I’m not against others using it, indeed I’d love that to happen: being told well over a decade ago that some Perl I’d written was being packaged for a BSD distribution is still a highlight. But in the absence of that I am continuing to pursue the “because it’s there” approach.
Floating point in Forth
Forth’s basic tool is the stack – more or less everything comes and goes via a First In Last Out (FILO) list – so in (integer) Forth the command (known as a ‘word’ in Forth) * multiplies the top two numbers on the stack (consuming them as it does so) and leaves the result on the stack.
The “because it’s there” approach means avoiding, where reasonable and practical, just using existing C library routines and instead trying to write things myself – and so it is with parsing floating point numbers off the Forth input stream.
For a Forth floating point calculation we might have this:
3.141 2.56E2 F*
Which means: put 3.141 on the stack, then put 2.56E2 (i.e., 2.56 × 10², or 256) on the stack, and then multiply them (treating them both as floating point numbers) and put the result on the stack.
So my task here is to get the character strings “3.141” and “2.56E2” into floating point form and on to the stack. Implementing “F*” is much simpler: just take the numbers off the stack, place them in floating point registers (my RISC-V processor has 32 of these, each 64 bits long) and call the “fmul.d” built-in instruction to multiply the two (double-length or 64 bit) registers together using the floating point circuitry embedded in the central processing unit. That instruction places the result in a floating point register and I then have to place that back on the stack. That can all be done in just six assembly instructions, just as with the integer equivalent: though the calculation is still much, much slower even with the dedicated electronics.
The floating point standard
Floating point numbers are generally represented in today’s computers according to the IEEE754 specification, and as I have already noted I am using the “double” format as standard. Forth’s own standards generally grow out of the 1970s, when 16-bit and then 32-bit computing was the norm even for “mainframe” computers, but there seems little sense to me in restricting myself to that when 64 bits, as it were, “come for free”.
So I have 64 bits to play with, giving me one bit for the sign, 11 bits for the exponent and 52 bits for the mantissa (the Wikipedia article says 53 bits of precision, which is confusing at first – it is counting the leading 1, which is implicit and not stored).
The biggest complication is the exponent/index, because while humans generally count in base 10, your computer and its circuits use base 2 – and that includes the exponents. So we don’t want the 2 of 10² but the 6 of 2⁶ and – to make it harder still – when you calculate what that power of 2 should be, you have to change the mantissa too.
How does it work? Let’s look at 10²:
Of course 10² = 100, which is represented in binary as 1100100 – that is, 2⁶ + 2⁵ + 2² – and so, multiplying those powers of 2 out, you have this equality (for a mantissa m):
m × 10² = m × 2⁶ + m × 2⁵ + m × 2²
But while that gives us insight into how to solve this, it’s not the solution because we cannot store multiple exponents in our single representation. Furthermore, if we instead stored the multiplied out exponent we could get no higher (and assuming we used half the available bits to store negative exponents, though dropping that doesn’t actually help) than 10² = 100: for although we could represent -256 to 256, only -100, -10, -1, 1, 10 and 100 would correspond to powers of 10.
Instead we take the highest power here – the 6 from 2⁶ – use that as the exponent and add the other factors to the mantissa, recalling that all this is conducted in base 2 (binary):
m × 10² = 2⁶ × (m + m/2 + m/16)
Because, of course, 2⁵ is half of 2⁶ – or, simply, 2⁶/2 – and, likewise, 2² is 2⁶/16.
Hence we need to calculate the exponent and update the mantissa. The procedure for negative indices is based on similar principles but the details are – as they say – left as an exercise for the reader.
And, as alluded to above, we quickly start to use approximations: we can only use 11 bits for the exponent, and similarly, only having 52 bits available for the mantissa requires approximation there too. In fact approximations are overwhelmingly the general case with floating point numbers.
The algorithm in broad form
Here is a quick guide to the steps my code takes – I’m not saying this is the best approach, just one that seems to work:
1. Firstly, test whether this is a valid representation of a floating point number – does it have a decimal point or an “E” for an exponent, and does it only contain digits, signs, a single “.” and/or a single “E”? If it’s not valid, reject it as a floating point number and have the input parser test it for something else.
2. Note the sign of the mantissa but conduct all the calculations using positive numbers only.
3. Read the digits of the mantissa as though they formed an integer – hence 2.54 becomes 254 – noting that we need to adjust the exponent by (in this case) -2.
4. Adjust the supplied exponent as indicated in step 3 and then compute the index of the highest set bit of 10 raised to that adjusted exponent – that is our power of 2 exponent, whilst the other set bits (shortened if necessary) tell us which additions to make.
5. Compute the adjusted mantissa by adding right-shifted copies of the original mantissa wherever a lower bit found in step 4 is on: e.g., if the base 2 exponent is 6 and bit 5 is on (i.e. 1), add half of the original mantissa (i.e., the original mantissa shifted by one place to the right) to the original and store that as the (transitory) updated mantissa – repeat this process for all the on bits found in step 4 (always using the original mantissa as the basis of the shifts, which grow as we move down the index).
6. We use 64 bits to calculate the new mantissa, but now we shorten that to 52 bits by dropping the leading 1 (which is implicit in the stored format) and then rounding the bits that are left – noting that a rounding may require a further adjustment (e.g. if we round a mantissa of all 1s upwards it carries over, so we have to increment the exponent and store the mantissa as all 0s).
7. Adjust the sign bit as necessary depending on the result of step 2.
8. We store the exponent as a positive number only, so we apply a bias (1023 for doubles). If the exponent is out of range we can return the special negative and positive infinities or just 0.
A semi-worked example
Take 2.54E3. As in step 3 we read the mantissa as 254 and so we have adjusted the index back by two, so it’s now 1, and 10¹ = 10 = 1010 in binary. Hence the power of 2 exponent is 3.
So we have:
254 × 10 = 2³ × (254 + 254/4) = 2³ × 317.5 = 2540
Our result has now grown to the right by 1 digit in length (i.e., by a factor of 2), so we add 1 back to the exponent and we are storing a representation of 2540 as 2⁴ × 158.75.