Floating point, finally

What is “floating point”?

How do you represent a number like $3.141\times10^1$ in a compressed and useful way in binary?

Any computer science student will be able to tell you that you use a “floating point” representation and, like me a couple of months ago, could probably move on to a hand-wavy explanation that you stored a representation of the exponent (here $1$ from $10^1$) and what I am choosing to call the mantissa (a practice denounced in the multi-volume bibles of computing algorithms by Donald Knuth) – here $3.141$ – all in one “number”.

And, to be honest, until recently that, and the knowledge that integer (i.e. counting number-based) arithmetic was much, much faster, was all I needed to know.

What is “Riscyforth”?

But that changed when I decided I had to add support for floating point numbers to Riscyforth – my assembly-based Forth for single board RISC-V based computers (SBCs).

I have long-since grown out of the fantasy that Riscyforth will have a mass audience. Indeed there have been moments over that last two years where the idea that widespread adoption of RISC-V – so often touted as the route to an open hardware future to match that of open-source software – has also looked fantastical (though plans for more RISC-V SBCs are now starting to emerge again).

So that means Riscyforth primarily, for now at least, exists for my use, learning and pleasure. I’m not against others using it, indeed I’d love that to happen: being told well over a decade ago that some Perl I’d written was being packaged for a BSD distribution is still a highlight. But in the absence of that I am continuing to pursue the “because it’s there” approach.

Floating point in Forth

Forth’s basic tool is the stack – more or less everything comes and goes via a First In Last Out (FILO) list – so in (integer) Forth the command (known as a ‘word’ in Forth) * multiplies the top two numbers on the stack (consuming them as it does so) and leaves the result on the stack.

That means avoiding, where reasonable and practical, just using existing C library routines and instead trying to write stuff myself, and so it is with parsing floating point numbers off the Forth input stream.

For a Forth floating point calculation we might have this:

3.141 2.56E2 F*

Which means put 3.141 on the stack, then put $2.56\times10^2$ on the stack and then multiply them (treating them both as floating point numbers) and put the result on the stack.

So my task here is get the character strings “3.141” and “2.56E2” into floating point form and on to the stack. Implementing “F*” is much simpler: just take the numbers off the stack, place them in floating point registers (my RISC-V processor has 32 of these, each 64 bits long) and call the “fmul.d” built-in instruction to multiply the two (double-length or 64 bit) registers together using the floating point circuitry embedded in the central processing unit. That instruction places the result in a floating point register and I then have to place that back on the stack. That can all be done in just six assembly instructions, just as with the integer equivalent: though the calculation is still much, much slower even with the dedicated electronics.

The floating point standard

Floating point numbers are generally represented in today’s computers according to the IEEE754 specification, and as I have already noted I am using the “double” format as standard. Forth’s own standards generally grow out of the 1970s when 16-bit and then 32-bit computing was the norm for even “mainframe” computers, but there seems little sense to me to restrict myself to that when 64-bits, as it were “come for free”.

So I have 64 bits to play with, giving me one bit for the sign, 11 bits for the exponent and 52 bits (the Wikipedia article confusingly says 53, though clarifies why) for the mantissa.

The biggest complication is the exponent/index, because while humans generally count in base 10, your computer and its circuits use base 2 and that includes the exponents. So we don’t want the $x$ of $10^x$ but the $y$ of $2^y$ and – to make it harder still, when you calculate what $2^y$ should be, you have to change the mantissa too.

How does it work? Let’s look at $2 \times 10^2$:

Of course $10^2 =100$ which is represented in binary as $1100100 = 2^6 + 2^5 + 2^2$ and so, multiplying those powers of 2 out, you have this equality:

$2\times100 = 2\times64+2\times32+2\times4$

But while that gives us insight into how to solve this, it’s not the solution because we cannot store multiple exponents in our single representation. Furthermore, if we instead stored the multiplied out exponent we could get no higher (and assuming we used half the available bits to store negative exponents, though dropping that doesn’t actually help) than $10^2$: for although we could represent -256 to 256, only -100, -10, -1, 1, 10 and 100 would correspond to powers of 10.

Instead we take the highest power here – 6 from $2^6$, use that as the exponent and add the other factors to the mantissa – recalling that all this is conducted in base 2 (binary):

$1_2 \times 2^6+1_2 \times 2^5+1_2 \times 2^2 = 1.1001_2 \times 2^6$

Because, of course, $0.1_2 \times 2^6$ is half of $1_2 \times 2^6$ or, simply, $1_2 \times 2^5$.

Hence we need to calculate the exponent and update the mantissa. The procedure for negative indices is based on similar principles but the details are – as they say – left as an exercise for the reader.

And, as alluded to above, we quickly start to use approximations: as we can only use 11 bits. Similarly, only having 52 bits available for the mantissa requires approximation there too. In fact approximations are overwhelmingly the general case with floating point numbers.

Here is quick guide to the steps my code takes – I’m not saying this is the best approach, just one that seems to work:

1. Firstly, test if this is a valid representation of a floating point number – does it have a decimal point or an “E” for an exponent and does it only have digits, signs and a single . and/or single E. If it’s not valid reject it as a floating point and have the input parser test it for something else.
2. Note the sign of the mantissa but conduct all the calculations using positive numbers only.
3. Sum the mantissa as though it was an integer – hence 2.54 becomes 254, but noting that we need to adjust the exponent by (in this case) -2.
4. Adjust the supplied exponent as indicated in step 3 and then compute the index of highest bit in the adjusted number – that is our power of 2 exponent, whilst the other bits (shortened if necessary) allow us to calculate the additions.
5. Compute the adjusted mantissa by adding right shifted products where the bits of the exponent are on: e.g., if the base 2 exponent is 6 and bit 5 of the exponent index calculated in step 4 is on (i.e. 1), add half of the original mantissa (i.e., the original mantissa bit shifted by one place to the right) to the original and store that as the (transitory) updated mantissa – repeat this process for all the bits in the index calculated in step 4 (always using the original mantissa as the basis of the shifts, which grow as we move down the index).
6. We use 64 bits to calculate the new mantissa, but now we shorten that 52 by dropping the first $1_2$ and then rounding the numbers that are left – noting that a rounding may require a further adjustment (eg if we round $1.1111...1111..._2$ we get $10.0..._2$ and so we have to increment the exponent and store the mantissa as just $0$.
7. Adjust the sign bit as necessary depending on the result of 2
8. We store the exponent as a positive number only, so we apply a bias. If the exponent is out of range we can return the special negative and positive infinities or just 0.

A semi-worked example

$2.54\times10^3$

254 = $11111110_2$

We have adjusted the index back by two, so it’s now $254 \times 10^1$ and 10 = $1010_2$. Hence exponent is 3.

So we have $11111110_2\times 10^1 = 1.111111_2 \times 2^{10} + 1.111111_2 \times 2^8$

Mantissa becomes: $1.111111000_2 + 0.011111110_2 = 10.011110110_2$

Our result has now grown to the right by 1 digit in length (i.e., by a factor of $2^1$), we we add 1 back to the exponent and we are storing a representation of $1.0011110110_2\times 2^{11}$ :

$1.001111011_2 \times 2^{11} = 2^{11} + 2^8+2^7+2^6+2^5 +2^3 + 2^2$
$= 2048 + 256 + 128 + 64 + 32 + 8 + 4 = 2540 = 2.54 \times 10^3$

4 responses to “Floating point, finally”

1. Doing this correctly is surprisingly hard. Even figuring out what “correctly” should mean is hard.

I would look to Velvo Kahan, and his group at Berkeley (I’ve heard him talk about the problem) but I don’t find anything with a quick search.

This looks relevant and plausible: https://www.exploringbinary.com/correct-decimal-to-floating-point-using-big-integers/

That seems to be based on work by Clinger who I trust: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45.4152&rep=rep1&type=pdf

I am under the impression that GLIBC gets this stuff right these days. I vaguely recollect that this was due to Berkeley contributions.

1. s/Velvo/Velvel/

2. Thanks for the comment, very much appreciated.
This turned out to be much more difficult than I expected, for sure. I generally have the view/expectation that the interface to the hardware is simple and the hard stuff is in the algorithms implemented in software – this stuff breaks that world view.
Even today – when fixing and extending the article with a semi-worked example I got myself confused over it, and I have been plugging away at this for months now.
I still think I have a few rounding errors (of scale $1:2^{-52}$) for a few corner cases which I need to examine still.
I will look at your references – there is actually little of this stuff out there eg I was surprised that Knuth’s “Art…”, at least in volumes 1 – 3 (and it seems unlikely 4A and 4B will change this though I don’t have a copy of either at present), doesn’t seem to touch on it at all, though he does discuss general accuracy of FP representation.

2. I’ve taken 3.5 courses in Numerical Analysis (mostly in the 1970s). My main take-away was that doing it right, and knowing that you are doing it right, is very hard. Much harder than writing safe C code. This is on every scale.

(In my first assignment in my first course, we were to invert a Hibert Matrix to see how FP errors accumulated. Instead, I implemented rational numbers and had no error to explain.)

Kahan (among other things) is a very good critic of the lower level things. He was the key person behind the Intel 8087 FPU data-types and operations, out of which IEE 754 flowed. He also got IBM to respec the IBM/360’s floating point AFTER some were in the field (adding the “guard digit” — not a small or inexpensive change).

Kahan adds a lot of non-obvious properties to the requirements for elementary operations. The starting point is that most ops must yield the FP value closest to the correct value — not easy, especially in edge cases. Some are allowed a bit more leeway leeway, a whole ULP, if I remember correctly.

One property that you would like is that switching between binary and decimal over and over always yields the same value, except possibly for the initial conversion from decimal.

Another property that you’d like is determinism: with the same input, the output should be the same.

Still another tricky-to-implement property: (monotonicity) a larger or equal input yields a larger or equal result.

Berkeley has created a test suite for floating point. This appears to be a good starting point: http://www.jhauser.us/arithmetic/TestFloat.html