The unreasonable ineffectiveness of arithmetic with floating point numbers

Eugene Wigner’s wrote a famous short essay “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” in 1960. To be honest I find it slightly underwhelming: to me mathematics is the abstraction from physical reality and not – as might be implied by Wigner – that our physical universe is a concretised subset of a higher mathematical world.

Yes, it’s true that mathematics includes much that does not have a 1:1 correspondence with the physical world, but so what? Those things that fall out of the abstraction but are not physically real may indeed be beautiful as Wigner says and therefore worthy of study but I fail to see why that makes mathematics in some way a superior or even a governing science.

And we also know that our mathematics is flawed. There are things we know to be true but which cannot be stated in a consistent mathematical system.

Anyway, I am running away from what I really want to say here – which is actually based on the power of simple arithmetic to solve what seem to be complex computational problems and the frustration one feels when that general expectation breaks down.

I am continuing to work my way through an implementation of floating point numbers in Riscyforth, my (largely – see below) assembly-based implementation of the Forth programming language for RISC-V based “single board computers”.

One of the words I need to implement is “F.” – which as the standard says:

Display, with a trailing space, the top number on the floating-point stack using fixed-point notation:

[-] <digits>.<digits0>

This turns out (like quite a lot of floating point stuff) to be quite complex, as to illustrate:

2.567e34 (2.567\times10^{34} ) is stored in just 64 bits: 0x4713C683CDD0F1E4 but should give you an output requiring 288 bits: 25670000000000000000000000000000000.0

It’s a basic law (via thermodynamics) that in general one can’t get more information out of a system than is input and so this results in approximations and inaccuracies in floating point.

And sure enough my system renders this number as 25670000000000001040660444662064020.0 while a reference implementation gives me 25670000000000001087462021865144320.0. In this case my representation is actually more accurate than the reference by about a factor of 5\times 10^{-18} but both are out by around 1\times10^{-16} – which doesn’t sound like a lot and generally isn’t – it’s about a millimetre on an Earth – Jupiter scale, for instance, but even these small differences start to matter as you multiply them up – which is why planetary probes avoid floating point and prefer arbitrary length integer calculations (or so I am told).

Now 2.567e34 is actually an integer and the maths to generate the number to be displayed are not too complex: the floating point representation is an encoding of 5.6629395354058\times2^{114}, or if you prefer in binary:


And it’s not that hard to convert that to base 10 characters using integers – you repeatedly divide by 10 (1010 in binary) and take the remainder as the character. This is simple to show in base 10 – even if it sounds slightly tautological:

576 / 10 = 57 remainder 6

57 / 10 = 5 remainder 7

5 / 10 = 0 remainder 5

Reverse the output order of 675 and you have 576

That is all great – and so surely producing the digits for the fractional part will be similarly straight forward and an algorithmic process requiring only integer arithmetic available?

No, appears to be the answer, and I find this is deeply perplexing from a philosophical view – hence the idea that arithmetic is “unreasonably ineffective” in this case.

Volume two of Donald Knuth’s masterwork – The Art of Computer Programming – lists (pp 319 – 320 in my 2nd edition) two methods of calculating the fractional part…

(Knuth’s Method 2a):

For radix-B (for us B is 10) fraction U of a binary fraction u ie (.U_{-1}U_{-2}...)_B we have:

U_{-1} = \lfloor uB \rfloor  ... U_{-n} = \lfloor uB^n \rfloor

(where \lfloor X \rfloor translates to X mod B)

Or (Knuth’s method 2b):

Going from 0.u_{-1}...u{-n} we get U as the sum:

((...u_{-n}/b+u_{1-n})/b + ... + u_{-1})/b

The problem with both of these methods is that they involve fractional numbers. Now we could possibly/probably eliminate the fractions by, e.g., using 2b with lots and lots of bit shifting (300 or more in some cases) and storing results across a scratchpad of multiple bytes but it would be horrendously complex (even given that I have opted to implement this all in C as assembly is just too complex/inexpressive).

So I have opted for method 2a and using floating point to decode floating point:

int64_t fpStringProcessFraction(char* buffer, uint64_t mantissa, int64_t power, uint64_t sign, uint64_t radix, uint64_t endIndex)
        /* find smallest digit */
        uint64_t bit = 1;
        uint64_t displacement = 0;
        while ((mantissa & (bit << displacement)) == 0) {
        if (power >= 0) {
                power = 1;
        } else {
                power = -1 * power;
        double fraction = 0.0;
        for (uint i = 63; i >= displacement; i--) {
                if ((mantissa & (bit << i)) != 0) {
                        fraction += 1.0 / pow(2.0, power);
        bool startedOutput = false;
        uint64_t digitsLeft = 15; 
        uint64_t totalOutput = 0;
        uint64_t powerUp = 1;
        while (digitsLeft) {
                double bigResult = fraction * pow(radix, powerUp++);
                char digit = fpStringGetDigit(bigResult, radix);
                if (!startedOutput && digit != 0) {
                        startedOutput = true;
                if (digit > 9) {
                        digit += 55; 
                } else {
                        digit += 48; 
                buffer[endIndex--] = digit;
                if (startedOutput) {
        /* Now reverse the digits */
        uint64_t rightSide = 1023;
        uint64_t leftSide = endIndex + 1;
        while (leftSide < rightSide) {
                char rChar = buffer[rightSide];
                char lChar = buffer[leftSide];
                buffer[leftSide++] = rChar;
                buffer[rightSide--] = lChar;
        /* check for shift left */
        int shiftLeft = 0;
        uint64_t shiftLeftIndex = 1023;
        while (buffer[shiftLeftIndex--] == '0') {
        if (shiftLeft != 0) {
                shiftLeftIndex = 1023;
                uint64_t mvCharIndex = shiftLeftIndex - shiftLeft;
                while (mvCharIndex > endIndex) {
                        buffer[shiftLeftIndex--] = buffer[mvCharIndex--];
                endIndex += shiftLeft;
        return endIndex;

%d bloggers like this: