Apparently identical math expressions with different output - c++

The following code outputs different results for the variables 'e' and 'f' on an x86 32-bit machine but the same results on an x86 64-bit machine. Why? Theoretically the same expression is being evaluated, but technically it is not.
#include <cstdio>
int main()
{
double a,b,c,d,e,f;
a=-8988465674311578540726.0;
b=+8988465674311578540726.0;
c=1925283223.0;
d=4294967296.0;
e=(c/d)*(b-a)+a;
printf("%.80f\n",e);
f=c/d;
f*=(b-a);
f+=a;
printf("%.80f\n",f);
}
Note: 32-bit x86 code can be generated with 'gcc -m32'. Thanks to @Peter Cordes https://stackoverflow.com/users/224132/peter-cordes
See also
is boost::random::uniform_real_distribution supposed to be the same across processors?
--- update for user Madivad
64 bit output
-930037765265417043968.00000...
-930037765265417043968.00000...
32 bit output
-930037765265416519680.00000...
-930037765265417043968.00000...
The "mathematically correct" output can be given by this python code
from fractions import Fraction
a=-8988465674311578540726
b=8988465674311578540726
c=1925283223
d=4294967296
print("%.80f" % float(Fraction(c,d)*(b-a)+a))
-930037765265416519680.000...

FLT_EVAL_METHOD.
C allows intermediate FP calculations to occur at higher/wider types depending on FLT_EVAL_METHOD. So when wider types are used and code flow differs, though mathematically equal, slightly different results may occur.
Except for assignment and cast (which remove all extra range and precision), the values yielded by operators with floating operands and values subject to the usual arithmetic conversions and of floating constants are evaluated to a format whose range and precision may be greater than required by the type. The use of evaluation formats is characterized by the implementation-defined value of FLT_EVAL_METHOD:
-1. indeterminable;
0. evaluate all operations and constants just to the range and precision of the type;
1. evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;
2. evaluate all operations and constants to the range and precision of the long double type.
C11dr §5.2.4.2.2 9
[Edit]
@Pascal Cuoq has a useful comment on the veracity of FLT_EVAL_METHOD. In any case, FP code that is optimized differently along various code paths may present different results. This may occur when FLT_EVAL_METHOD != 0 or when the compiler is not strictly conforming.
Concerning a detail of the post: the operation X*Y + Z done in 2 operations of * and then + could be contrasted with fma() which "compute (x × y) + z, rounded as one ternary operation: they compute the value (as if) to infinite precision and round once to the result format, according to the current rounding mode." C11 §7.12.13.1 2. Another candidate for the difference in results could be the application of fma() to the line e=(c/d)*(b-a)+a;

Related

Does the same floating-point calculation producing different results when performed twice indicate IEEE 754 non-conformance?

In a specific online judge running 32-bit GCC 7.3.0, this:
#include <iostream>
volatile float three = 3.0f, seven = 7.0f;
int main()
{
float x = three / seven;
std::cout << x << '\n';
float y = three / seven;
std::cout << (x == y) << '\n';
}
Outputs
0.428571
0
To me this seems like it violates IEEE 754, since the standard requires basic operations to be correctly rounded. Now I know there are a couple reasons for IEEE 754 floating-point calculations to be non-deterministic as discussed here, but I don't see how any of them applies to this example. Here are some of the things I considered:
Excess precision and contraction: I'm doing a single calculation and assigning the result to a float, which should force both of the values to be rounded to float precision.
Compile-time calculations: three and seven are volatile so both calculations must be done at runtime.
Floating-point flags: The calculations are done in the same thread almost immediately after each other, so the flags should be the same.
Does this necessarily indicate that the online judge system doesn't conform to IEEE 754?
Also, removing the statement printing x, adding a statement to print y, or making y volatile all change the result. This seems to contradict my understanding of the C++ standard, which I think requires the assignments to round off any excess precision.
Thanks to geza for pointing out that this is a known issue. I would still like a definitive answer on whether this conforms to the C++ standard and IEEE 754 though, since the C++ standard appears to require assignments to round off excess precision. Here's the quote from draft N4860 [expr.pre]:
The values of the floating-point operands and the results of floating-point expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.50
50) The cast and assignment operators must still perform their specific conversions as described in 7.6.1.3, 7.6.3, 7.6.1.8 and 7.6.19.
Does this necessarily indicate that the online judge system doesn't conform to IEEE 754?
Yes, with minor caveats.
One, C++ cannot just “conform” to IEEE 754. There has to be some specification of how things in C++ bind (connect) to IEEE 754, such as statements that the float format is IEEE-754 binary32, that x / y uses IEEE-754 division, and so on. C++ 2017 draft N4659 refers to LIA-1, but I do not see that it clearly requires LIA-1 be used even if std::numeric_limits<float>::is_iec559 reports true, and LIA-1 apparently only suggests language bindings.
The C++ standard tells us the fact that std::numeric_limits<float>::is_iec559 reports true means the float type conforms to ISO/IEC/IEEE 60559, which is effectively IEEE 754-2008. But, in addition to the binding problem, I do not see a statement in the C++ standard that nullifies 8 [expr] 13 (“The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.”) when is_iec559 is true. Although it is true that the cast and conversion operators must “perform their specific conversions” (footnote 64), and this forces float y = three / seven; to produce the correct IEEE-754 binary32 results even if binary64 or Intel’s 80-bit floating-point are used for the division, it might not force it to produce the correct result if only a little excess precision is used. (If at least 48 bits of precision are used, no double-rounding errors occur for division when rounded to the 24-bits of the binary32 format. If fewer excess bits are used, there may be some cases that experience double rounding errors.)
I believe the intent of is_iec559 is to indicate a sensible binding, and the behavior shown in the question does violate this. In particular, the defect shown in the question is caused by failing to round the excess precision used in the division to the actual float type; it is not caused by the hypothetical use of less-than-enough excess precision mentioned above.

C - Summing double variables gives different outputs depending on how it's written

I have these variables:
double a = 5;
double b = 1E20;
double c = -b;
and I have summed them like this:
double temp = a+b;
double result = temp+c;
The result equals 0, as expected because 'b' is a big number compared to 'a' and as such, the result doesn't differ from b's number at all, and subtracting it with 'c', which is the same as 'b' but negative, gives us 0.
However, if I try it this way:
double result = (a+b)+c;
The result is actually 8. Why is that?
Presumably you are executing this program on an Intel processor. (When asking questions like this, you should always state which compiler you are using, including the version and the command-line switches, and which system you are running the program on.) Intel processors have an 80-bit floating-point format which has 64-bit significands. (A significand is the fraction portion of a floating-point number.)
It appears your compiler is using the processor’s 80-bit floating-point format for intermediate calculations, and it is probably using the IEEE-754 basic 64-bit binary format for double. The C standard allows C implementations to evaluate floating-point expressions with more range and precision than the nominal type. That means, when the compiler is evaluating (or generating code to evaluate) a double expression, it is allowed to use the 80-bit type.
When a floating-point expression is assigned to an object or there is an explicit cast to a floating-point type, the C standard requires the C implementation to “discard” the excess precision.
The above allows us to see what happened. 1e20 represents 10^20, which is a number between 2^66 and 2^67. Written in binary, its leading bit is in the position for value 2^66. Since the 80-bit format has 64-bit significands, the least significant bit that can be represented in the format is at position 2^3 (having bits from 3 to 66 is 64 bits). After b = 1e20, when you add 5 to b, the result has to be rounded to fit in bits from 2^66 to 2^3 (which is 8). This results in rounding the number up to the next multiple of 8. Thus, due to rounding, b+5 has the same result as b+8. Then, when you add c, which equals -b, you get 8.
In double temp = a+b;, the assignment forces the C implementation to “discard” the excess precision. Thus, it must convert the result to the double format, which has 53-bit significands. With a leading bit of 2^66, the least significant bit is 2^14. The bits for 2^13 to 2^3 are discarded, and the remaining bits are rounded (which does not cause any change in this case, as the discarded bits happen to be less than the midpoint). Thus, although a+b equals b+8, as we saw above, the result of converting b+8 to double is just b. Then adding c to this produces 0.

Is it ok to compare floating points to 0.0 without epsilon?

I am aware, that to compare two floating point values one needs to use some epsilon precision, as they are not exact. However, I wonder if there are edge cases, where I don't need that epsilon.
In particular, I would like to know if it is always safe to do something like this:
double foo(double x){
if (x < 0.0) return 0.0;
else return somethingelse(x); // somethingelse(x) != 0.0
}
int main(){
double x = -3.0;
if (foo(x) == 0.0) {
std::cout << "^- is this comparison ok?" << std::endl;
}
}
I know that there are better ways to write foo (e.g. returning a flag in addition), but I wonder if in general is it ok to assign 0.0 to a floating point variable and later compare it to 0.0.
Or more general, does the following comparison yield true always?
double x = 3.3;
double y = 3.3;
if (x == y) { std::cout << "is an epsilon required here?" << std::endl; }
When I tried it, it seems to work, but it might be that one should not rely on that.
Yes, in this example it is perfectly fine to check for == 0.0. This is not because 0.0 is special in any way, but because you only assign a value and compare it afterwards. You could also set it to 3.3 and compare for == 3.3, this would be fine too. You're storing a bit pattern, and comparing for that exact same bit pattern, as long as the values are not promoted to another type for doing the comparison.
However, calculation results that would mathematically equal zero would not always equal 0.0.
This Q/A has evolved to also include cases where different parts of the program are compiled by different compilers. The question does not mention this, my answer applies only when the same compiler is used for all relevant parts.
C++ 11 Standard,
§5.10 Equality operators
6 If both operands are of arithmetic or enumeration type, the usual
arithmetic conversions are performed on both operands; each of the
operators shall yield true if the specified relationship is true and
false if it is false.
The relationship is not defined further, so we have to use the common meaning of "equal".
§2.13.4 Floating literals
1 [...] If the scaled value is in the range of representable values
for its type, the result is the scaled value if representable, else
the larger or smaller representable value nearest the scaled value,
chosen in an implementation-defined manner. [...]
The compiler has to choose between exactly two values when converting a literal, when the value is not representable. If the same value is chosen for the same literal consistently, you are safe to compare values such as 3.3, because == means "equal".
Yes, if you return 0.0 you can compare it to 0.0; 0 is representable exactly as a floating-point value. If you return 3.3 you have to be much more careful, since 3.3 is not exactly representable, so a conversion from double to float, for example, will produce a different value.
correction: 0 as a floating point value is not unique, but IEEE 754 defines the comparison 0.0==-0.0 to be true (any zero for that matter).
So with 0.0 this works - for every other number it does not. The literal 3.3 in one compilation unit (e.g. a library) and another (e.g. your application) might differ. The standard only requires the compiler to use the same rounding it would use at runtime - but different compilers / compiler settings might use different rounding.
It will work most of the time (for 0), but is very bad practice.
As long as you are using the same compiler with the same settings (e.g. one compilation unit) it will work because the literal 0.0 or 0.0f will translate to the same bit pattern every time. The representation of zero is not unique though. So if foo is defined in a library and your call to it is in some application, the same function might fail.
You can rescue this very case by using std::fpclassify to check whether the returned value represents a zero. For every finite (non-zero) value you will have to use an epsilon-comparison though unless you stay within one compilation unit and perform no operations on the values.
As written, in both cases you are using identical constants in the same file fed to the same compiler. The string-to-float conversion the compiler uses should return the same bit pattern, so these should be equal not only in the sense that plus and minus zero compare equal, but equal bit for bit.
Were you to have one constant whose bit pattern is generated by the compiler using the operating system's C library, and another produced at run time by a string-to-float call, which can end up using a different C library if the binary is transported to a computer other than the one it was compiled on, you might have a problem.
Certainly if you compute 3.3 for one of the terms at run time and have the other 3.3 computed at compile time, you can and will get failures on the equality comparisons. Some constants obviously are more likely to work than others.
Of course as written your 3.3 comparison is dead code and the compiler just removes it if optimizations are enabled.
You didn't specify the floating-point format, nor the standard (if any) for the format you are interested in. Some formats have the +/- zero problem, some don't, for example.
It is a common misconception that floating point values are "not exact". In fact each of them is perfectly exact (except, maybe, some special cases such as -0.0 or Inf) and equal to s·2^(e - (p - 1)), where s, e, and p are the significand, exponent, and precision respectively, each of them an integer. E.g. in the IEEE 754-2008 binary32 format (aka float32) p = 24 and 1 is represented as 0x800000·2^(0 - 23). There are two things that are really not exact when you deal with floating point values:
Representation of a real value using a FP one. Obviously, not all real numbers can be represented using a given FP format, so they have to be somehow rounded. There are several rounding modes, but the most commonly used is the "Round to nearest, ties to even". If you always use the same rounding mode, which is almost certainly the case, the same real value is always represented with the same FP one. So you can be sure that if two real values are equal, their FP counterparts are exactly equal too (but not the reverse, obviously).
Operations with FP numbers are (mostly) inexact. So if you have some real-valued function φ(ξ) implemented in the computer as a function of an FP argument f(x), and you want to compare its result with some "true" value y, you need to use some ε in the comparison, because it is very hard (sometimes even impossible) to write a function giving exactly y. And the value of ε strongly depends on the nature of the FP operations involved, so in each particular case there may be a different optimal value.
For more details see D. Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, and J.-M. Muller et al., Handbook of Floating-Point Arithmetic. Both texts can be found on the Internet.

Differences in rounded result when calling pow()

OK, I know that there have been many questions about the pow function and casting its result to int, but I couldn't find an answer to this somewhat specific question.
OK, this is the C code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main()
{
int i = 5;
int j = 2;
double d1 = pow(i,j);
double d2 = pow(5,2);
int i1 = (int)d1;
int i2 = (int)d2;
int i3 = (int)pow(i,j);
int i4 = (int)pow(5,2);
printf("%d %d %d %d",i1,i2,i3,i4);
return 0;
}
And this is the output: "25 25 24 25". Notice that only in the third case, where the arguments to pow are not literals, do we get the wrong result, probably caused by rounding errors. The same thing happens without explicit casting. Could somebody explain what happens in these four cases?
I'm using CodeBlocks in Windows 7, and the MinGW gcc compiler that came with it.
The result of the pow operation is 25.0000 plus or minus some bit of rounding error. If the rounding error is positive or zero, 25 will result from the conversion to an integer. If the rounding error is negative, 24 will result. Both answers are correct.
What is most likely happening internally is that in one case a higher-precision, 80-bit FPU value is being used directly and in the other case, the result is being written from the FPU to memory (as a 64-bit double) and then read back in (converting it to a slightly different 80-bit value). This can make a microscopic difference in the final result, which is all it takes to change a 25.0000000001 to a 24.999999997.
Another possibility is that your compiler recognizes the constants passed to pow and does the calculation itself, substituting the result for the call to pow. Your compiler may use an internal arbitrary-precision math library or it may just use one that's different.
This is caused by a combination of two problems:
The implementation of pow you are using is not high quality. Floating-point arithmetic is necessarily approximate in many cases, but good implementations take care to ensure that simple cases such as pow(5, 2) return exact results. The pow you are using is returning a result that is less than 25 by an amount greater than 0 but less than or equal to 2^-49. For example, it might be returning 25 - 2^-50.
The C implementation you are using sometimes uses a 64-bit floating-point format and sometimes uses an 80-bit floating-point format. As long as the number is kept in the 80-bit format, it retains the complete value that pow returned. If you convert this value to an integer, it produces 24, because the value is less than 25 and conversion to integer truncates; it does not round. When the number is converted to the 64-bit format, it is rounded. Converting between floating-point formats rounds, so the result is rounded to the nearest representable value, 25. After that, conversion to integer produces 25.
The compiler may switch formats whenever it is “convenient” in some sense. For example, there are a limited number of registers with the 80-bit format. When they are full, the compiler may convert some values to the 64-bit format and store them in memory. The compiler may also rearrange expressions or perform parts of them at compile-time instead of run-time, and these can affect the arithmetic performed and the format used.
It is troublesome when a C implementation mixes floating-point formats, because users generally cannot predict or control when the conversions between formats occur. This leads to results that are not easily reproducible and interferes with deriving or controlling numerical properties of software. C implementations can be designed to use a single format throughout and avoid some of these problems, but your C implementation is apparently not so designed.
To add to the other answers here: just generally be very careful when working with floating point values.
I highly recommend reading this paper (even though it is a long read):
http://hal.archives-ouvertes.fr/docs/00/28/14/29/PDF/floating-point-article.pdf
Skip to section 3 for practical examples, but don't neglect the previous chapters!
I'm fairly sure this can be explained by "intermediate rounding" and the fact that pow is not simply looping j times multiplying by i, but calculating using exp(log(i)*j) as a floating point calculation. Intermediate rounding may well convert 24.999999999996 into 25.000000000, and even arbitrary storing and reloading of the value may cause differences in this sort of behaviour, so depending on how the code is generated, it may make a difference to the exact result.
And of course, in some cases, the compiler may even "know" what pow actually achieves, and replace the calculation with a constant result.

Floating-point comparison of constant assignment

When comparing doubles for equality, we need to give a tolerance level, because floating-point computation might introduce errors. For example:
double x;
double y;
x = f();
y = g();
if (fabs(x-y)<epsilon) {
// they are equal!
} else {
// they are not!
}
However, if I simply assign a constant value, without any computation, do I still need to check the epsilon?
double x = 1;
double y = 1;
if (x==y) {
// they are equal!
} else {
// no they are not!
}
Is == comparison good enough? Or do I need to do fabs(x-y)<epsilon again? Is it possible to introduce error in assigning? Am I too paranoid?
How about casting (double x = static_cast<double>(100))? Is that gonna introduce floating-point error as well?
I am using C++ on Linux, but if it differs by language, I would like to understand that as well.
Actually, it depends on the value and the implementation. The C++ standard (draft n3126) has this to say in 2.14.4 Floating literals:
If the scaled value is in the range of representable values for its type, the result is the scaled value if representable, else the larger or smaller representable value nearest the scaled value, chosen in an implementation-defined manner.
In other words, if the value is exactly representable (and 1 is, in IEEE754, as is 100 in your static cast), you get the value. Otherwise (such as with 0.1) you get an implementation-defined close match (a). Now I'd be very worried about an implementation that chose a different close match based on the same input token but it is possible.
(a) Actually, that paragraph can be read in two ways, either the implementation is free to choose either the closest higher or closest lower value regardless of which is actually the closest, or it must choose the closest to the desired value.
If the latter, it doesn't change this answer, however, since all you have to do is hardcode a floating point value exactly at the midpoint of two representable values and the implementation is once again free to choose either.
For example, it might alternate between the next higher and next lower for the same reason banker's rounding is applied - to reduce the cumulative errors.
No if you assign literals they should be the same :)
Also if you start with the same value and do the same operations, they should be the same.
Floating point values are non-exact, but the operations should produce consistent results :)
Both cases are ultimately subject to implementation defined representations.
Storage of floating point values and their representations take on many forms: load by address or constant? optimized out by fast math? what is the register width? is it stored in an SSE register? Many variations exist.
If you need precise behavior and portability, do not rely on this implementation defined behavior.
IEEE-754, which is a standard that common implementations of floating point numbers abide by, requires floating-point operations to produce a result that is the nearest representable value to an infinitely-precise result. Thus the only imprecision that you will face is rounding after each operation you perform, as well as propagation of rounding errors from the operations performed earlier in the chain. Floats are not per se inexact. And by the way, epsilon can and should be computed; you can consult any numerics book on that.
Floating point numbers can represent integers precisely up to the length of their mantissa. So for example if you cast from a 32-bit int to a double, it will always be exact, but when casting into a float, it will no longer be exact for very large integers.
There is one major example of extensive usage of floating point numbers as a substitute for integers: the Lua scripting language, which has no built-in integer type, and where floating-point numbers are used extensively for logic, flow control, etc. The performance and storage penalty from using floating-point numbers turns out to be smaller than the penalty of resolving multiple types at run time, and it makes the implementation lighter. Lua has been extensively used not only on PC, but also on game consoles.
Now, many compilers have an optional switch that disables IEEE-754 compatibility. Then compromises are made. Denormalized numbers (very very small numbers where the exponent has reached smallest possible value) are often treated as zero, and approximations in implementation of power, logarithm, sqrt, and 1/(x^2) can be made, but addition/subtraction, comparison and multiplication should retain their properties for numbers which can be exactly represented.
The easy answer: For constants == is ok.
There are two exceptions which you should be aware of:
First exception:
0.0 == -0.0
There is a negative zero which compares equal for the IEEE 754 standard. This means
1/INFINITY == 1/-INFINITY which breaks f(x) == f(y) => x == y
Second exception:
NaN != NaN
This is a special property of Not-a-Number which allows one to find out if a number is a NaN
on systems which do not have a test function available (yes, that happens).