
Machine FP partial invariance issue

Invariance issue

In computer representation:

“a + b + c” and “a + c + b” are not the same!
(and the same goes for multiplication).

Hallelujah! I finally got that simple fact, after so many years of working in the IT industry and software development! Well, I kind of knew this, but never took it seriously until recently.
If you are curious how the ape dealt with its banana task, or if you are as late to this as I am, read below.

Floating point machine representation

Usually a floating point number is represented as follows:
v = m * b^e

Where

m – is the mantissa, an integer with a limited range. For example, for decimal numbers it could be in the range from 0 to 99. For a 24-bit binary mantissa it is in the range from 0 to (2^24 - 1), that is, from 0 to 16777215.
b – is the base, an integer value, usually b = 2.
e – is the exponent, an integer that can take both negative and positive values.
For example, in decimal representation 0.5 is represented as:
0.5 = 5 * 10^-1 (here m=5, b=10, e=-1)
In binary, 0.5 = 1 * 2^-1 (m=1, b=2, e=-1)

Most people know that storing bigger numbers takes more memory. But higher precision also takes more memory: it needs a wider mantissa, and therefore more bits to store it.

Integer vs float

When working with regular integers we also face data loss and overflow, and yet we are able to control it. We keep in mind the minimum and maximum possible results, and thus know when an overflow might happen.
Floating point numbers are different. AFAIK no sane person tracks mantissa overflow, except perhaps in some really rare cases. So here it is better to assume it just happens all the time.

Inevitable data loss

It is impossible to store numbers with infinite precision, so data loss is inevitable. It’s obvious, but easy to miss if you have never run into it.
We can’t work with the exact real number “N”.
We are only able to work with its nearest machine floating point representation, fp(N):
N* = fp(N)

For a mantissa in the range 0 .. 999 (and assuming the digits that do not fit are simply dropped) we get the following errors.
The number 9999 will be stored as
v = fp(9999) = 999e+1 = 9990
(here we lost the rightmost “9”)

and the number 1.001 will be stored just as
v = fp(1.001) = 1
(here we lost the rightmost “1”)

a + b + c

Actually v = a + b + c is performed in two steps:
Step 1: x = a + b
Step 2: v = x + c
Or with respect to fp transformation:
Step 1: x = fp(a + b)
Step 2: v = fp(x + c)
By changing the order of the summands, we in fact change what we lose at each step. So by swapping b and c we get a different data loss, and therefore a different final result.

Examples

Let’s demonstrate it with the following example.
  • the mantissa can store up to 2 decimal digits, so it is in the range 0 .. 99.
  • the base is 10.
  • the exponent can be anything; it doesn’t really matter here.
  • fp() simply drops the digits that do not fit (truncation).
Let’s use values:
a = 99 (m=99, e = 0)
b = 10 (m=1, e = 1)
c = 1 (m=1, e = 0)
And compare “a + b + c” with “a + c + b”:
a + b + c:
fp(a+b) = fp(99+10) = fp(109) = 100
v = fp( fp(a+b) + c ) = fp(100 + 1) = fp(101) = 100

a + c + b:
fp(a+c) = fp(99+1) = fp(100) = 100
v = fp( fp(a+c) + b ) = fp(100 + 10) = fp(110) = 110
Unbelievable for regular people, but so obvious to programmers (and yet unbelievable):
(a + b + c = 100) ≠ (a + c + b = 110)

Well, to be more precise:
( fp(a + b + c) = 100 ) ≠ ( fp(a + c + b) = 110)

Upd:

As one possible solution, a wider mantissa should be used for the intermediate result. Only after all operands have been accumulated may it be rounded to an fp number with the narrower mantissa.
If the items have a mantissa of N bits (and, for the sum, the same exponent), then

  • for a sum of M+1 items, a result mantissa of M+N bits is enough,
  • for a product of M items, a result mantissa of M*N bits is enough.

A real example written in C is below.


example.c

#include <stdio.h>

// Helper declarations; scroll down for the implementation
float getAllOnes(unsigned bits);
unsigned getMantissaBits(void);

int main(void) {

    // Determine the mantissa size in bits
    unsigned mantissaBits = getMantissaBits();

    // If the mantissa had only 3 bits, we would get:
    // a = 0b10   m=1,   e=1
    // b = 0b110  m=11,  e=1
    // c = 0b111  m=111, e=0

    float a = 2,
          b = getAllOnes(mantissaBits) - 1,
          c = b + 1;

    float ab = a + b;
    float ac = a + c;

    float abc = a + b + c;
    float acb = a + c + b;

    printf("\n"
           "FP partial invariance issue demo:\n"
           "\n"
           "mantissa size = %u bits\n"
           "\n"
           "a = %.1f\n"
           "b = %.1f\n"
           "c = %.1f\n"
           "(a+b) result: %.1f\n"
           "(a+c) result: %.1f\n"
           "(a + b + c) result: %.1f\n"
           "(a + c + b) result: %.1f\n"
           "---------------------------------\n"
           "diff(a + b + c, a + c + b) = %.1f\n\n",
           mantissaBits,
           a, b, c,
           ab, ac,
           abc, acb,
           abc - acb);

    return 0;
}

// Helpers

// Returns a float holding the integer with the lowest `bits` bits set.
// Only valid for bits < 32: a wider shift is undefined for unsigned.
float getAllOnes(unsigned bits) {
    return (float)((1u << bits) - 1);
}

// Finds the mantissa width by growing an all-ones value
// until adding 1 to it no longer changes it.
unsigned getMantissaBits(void) {

    unsigned sz = 1;
    unsigned maxSize = 31;  // keeps the shift in getAllOnes() defined
    float allOnes = 1;

    for (; sz != maxSize &&
           allOnes + 1 != allOnes;
           allOnes = getAllOnes(++sz)
        ) {}

    return sz - 1;
}

Output

FP partial invariance issue demo:

mantissa size = 24 bits

a = 2.0
b = 16777214.0
c = 16777215.0
(a+b) result: 16777216.0
(a+c) result: 16777216.0
(a + b + c) result: 33554432.0
(a + c + b) result: 33554430.0
---------------------------------
diff(a + b + c, a + c + b) = 2.0
