Are floating point contractions only allowed within one expression? - c++

This C/C++ reduced testcase:
int x = ~~~;
const double s = 1.0 / x;
const double r = 1.0 - x * s;
assert(r >= 0); // fail
isn't numerically stable, and hits the assert. The reason is that the last calculation can be done with FMA, which brings r into the negatives.
Clang has enabled FMA by default (since version 14), so it is causing some interesting regressions. Here's a running version: https://godbolt.org/z/avvnnKo5E
Interestingly enough, if one splits the last calculation in two, then the FMA is not emitted and the result is always non-negative:
int x = ~~~;
const double s = 1.0 / x;
const double tmp = x * s;
const double r = 1.0 - tmp;
assert(r >= 0); // OK
Is this guaranteed behavior under IEEE 754 / FP_CONTRACT, or is this playing with fire, and one should just find a more numerically stable approach? I can't find any indication that FP contractions are only meant to happen "locally" (within one expression), or that a simple split like the one above is sufficient to prevent them.
(Of course, in due time, one could also think of replacing the algorithm with a more numerically stable one. Or add a clamp into the [0.0, 1.0] range, but that feels hackish.)

The C++ standard allows floating-point expressions to be evaluated with extra range and precision because C++ 2020 draft N4849 7.1 [expr.pre] 6 says:
The values of the floating-point operands and the results of floating-point expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.
However, note 51 tells us:
The cast and assignment operators must still perform their specific conversions as described in 7.6.1.3, 7.6.3, 7.6.1.8 and 7.6.19.
The meaning of this is that an assignment or a cast must convert the value to the nominal type. So if extra range or precision has been used, that result must be converted to an actual double value when an assignment to a double is performed. (I expect, for this purpose, assignment includes initialization in a definition.)
So 1.0 - x * s may use a fused multiply-add, but const double tmp = x * s; const double r = 1.0 - tmp; must compute a double result for x * s, then subtract that double result from 1.0.
Note that this does not preclude const double tmp = x * s; from using extra precision to compute x * s and then rounding again to get a double result. In rare instances, this can produce a double-rounding error, where the result differs slightly from what one would get by rounding the real-number-arithmetic result of x•s directly to a double. That is unlikely to occur in practice; a C++ implementation would have no reason to compute x * s with extra precision and then round it to double.
Also note that C and C++ implementations do not necessarily conform to the C or C++ standard.
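If the goal is simply to prevent the contraction, it can also be disabled explicitly instead of relying on the split. A minimal sketch, with the caveat that #pragma STDC FP_CONTRACT is specified for C and honored by Clang in C++ as well, while support in other compilers varies; the command-line flag is the more portable lever:

#pragma STDC FP_CONTRACT OFF // no FMAs may be formed below (where supported)

double residual(int x) {
    const double s = 1.0 / x;
    return 1.0 - x * s; // evaluated as a rounded multiply, then a subtract
}

Alternatively, clang++ -ffp-contract=off disables contraction for the whole translation unit, and -ffp-contract=on restricts it to within a single expression, as the language permits.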

Related

Warning about arithmetic overflow when multiplying numbers

I'm writing a program to calculate a result from two numbers:
#include <iostream>
using namespace std;

int main()
{
    float a, b;
    cin >> a >> b;
    float result = b + a * a * 0.4;
    cout << result;
}
but I have a warning at a * a and it says Warning C26451 Arithmetic overflow: Using operator '*' on a 4 byte value and then casting the result to a 8 byte value. Cast the value to the wider type before calling operator '*' to avoid overflow (io.2). Sorry if this is a newbie question, can anyone help me with this? Thank you!
In the C language as described in the first edition of K&R, all floating-point operations were performed by converting operands to a common type (specifically double), performing the operations in that type, and then, if necessary, converting the result to whatever type was needed. On many platforms, that was the most convenient and space-efficient way of handling floating-point math. While the Standard still allows implementations to behave that way, it also allows floating-point operations on smaller types to be performed using those types directly.
As written, the subexpression a * a * 0.4 would be performed by multiplying a * a together using float type, then multiplying by the value 0.4, which is of type double. This latter multiplication would require converting the float result of a * a to double. If e.g. a had been equal to 2E19f, then performing the multiply using type float would yield a value too large to be represented in that type. Had the code instead performed the multiplication using type double, the result 4E38 would be representable in that type, and the result of multiplying it by 0.4 (i.e. 1.6E38) would be within the range representable by float.
Note that in this particular situation, the use of float for the intermediate computations would only affect the result if a were within narrow ranges of very large or very small. If instead of multiplying by 0.4 one had multiplied by other values, however, the choice of whether to use float or double for the first multiplication could affect the accuracy of rounding. Generally, using double for both multiplies yields slightly more accurate results, at the expense of increased execution time. Using float for both may yield better execution speed, at the cost of reduced precision. If the floating-point constant is something that isn't precisely representable in float, converting to double and multiplying by a double constant may yield slightly more accurate results than using float for everything; but in most cases where one would want that precision, one would also want the increased precision achieved by using double for the first multiply as well.
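To make the range issue concrete, here is a small sketch of the answer's 2E19f example (assuming a typical SSE2 target where float arithmetic is actually done in float; FLT_MAX is about 3.4E38):

#include <cstdio>

int main() {
    float a = 2e19f;
    float r1 = a * a * 0.4f;                  // a * a overflows float: inf
    float r2 = (float)((double)a * a * 0.4);  // 4e38 * 0.4 = 1.6e38, fits in float
    printf("%g %g\n", r1, r2);                // prints: inf 1.6e+38
}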
Let's look at the error message.
Using operator '*' on a 4 byte value
It is describing this code:
a * a
Your float is 4 bytes. The result of the multiplication is 4 bytes. And the result of a multiplication may overflow.
and then casting the result to a 8 byte value.
It is describing this code:
(result) * 0.4;
Your result is 4 bytes. 0.4 is a double, which is 8 bytes. C++ will promote your float result to a double before performing this multiplication.
So...
The compiler is observing that you are doing float math that could overflow and then immediately converting the result to a double, making the potential overflow unnecessary.
Change the code to this to remove the float to double to float conversions.
float result = b + a * a * 0.4f;
I read the question as "how to change the code to remove the warning?".
If you take the advice in the warning's text literally:
float result = b + (double)a * a * 0.4;
But this is pointless: if an overflow happens, the result will probably not fit into float result anyway.
It looks like overflow is not possible in your case, and you feel perfectly fine doing all calculations with float. If so, just write 0.4f instead of 0.4: the constant will then have float type (instead of double), and the result will also be float.
If you want to "fix" the potential overflow:
double result = b + (double)a * a * 0.4;
But then you must also change the following code, which uses the result. And you don't remove the possibility of overflow, you just make it much less likely.

c++ (double)0.700 * int(1000) => 699 (Not the double precision issue)

using g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
I have tried different typecasts of scaledvalue2, but not until I stored the multiplication in a double variable and then converted to an int could I get the desired result... but I can't explain why???
I know double precision (0.6999999999999999555910790149937383830547332763671875) is an issue, but I don't understand why one way is OK and the other is not??
I would expect both to fail if precision is a problem.
I DON'T NEED a solution to fix it... just a WHY??
(the problem IS fixed)
#include <sstream>
#include <cstdio>

int main()
{
    double value = 0.7;
    int scaleFactor = 1000;
    double doubleScaled = (double)scaleFactor * value;
    int scaledvalue1 = doubleScaled; // = 700
    int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
    int scaledvalue3 = (double)(1000.0 * 0.7); // = 700
    std::ostringstream oss;
    oss << scaledvalue2;
    printf("convert FloatValue[%f] multi with %i to get %f = %i or %i or %i[%s]\r\n",
           value, scaleFactor, doubleScaled, scaledvalue1, scaledvalue2, scaledvalue3, oss.str().c_str());
}
or in short:
value = 0.6999999999999999555910790149937383830547332763671875;
int scaledvalue_a = (double)(1000 * value); // = 699??
int scaledvalue_b = (double)(1000 * 0.6999999999999999555910790149937383830547332763671875); // = 700
// scaledvalue_a = 699
// scaledvalue_b = 700
I can't figure out what is going wrong here.
Output :
convert FloatValue[0.700000] multi with 1000 to get 700.000000 = 700 or 699 or 700[699]
vendor_id : GenuineIntel
cpu family : 6
model : 54
model name : Intel(R) Atom(TM) CPU N2600 @ 1.60GHz
This is going to be a bit hand-wavy; I was up too late last night watching the Cubs win the World Series, so don't insist on precision.
The rules for evaluating floating-point expressions are somewhat flexible, and compilers typically treat floating-point expressions even more flexibly than the rules formally allow. This makes evaluation of floating-point expressions faster, at the expense of making the results somewhat less predictable. Speed is important for floating-point calculations. Java initially made the mistake of imposing exact requirements on floating-point expressions and the numerics community screamed with pain. Java had to give in to the real world and relax those requirements.
double f();
double g();
double d = f() + g(); // 1
double dd1 = 1.6 * d; // 2
double dd2 = 1.6 * (f() + g()); // 3
On x86 hardware (i.e., just about every desktop system in existence), floating-point calculations are in fact done with 80 bits of precision (unless you set some switches that kill performance, as Java required), even though double and float are 64 bits and 32 bits, respectively. So for arithmetic operations the operands are converted up to 80 bits and the results are converted back down to 64 or 32 bits. That's slow, so the generated code typically delays doing conversions as long as possible, doing all of the calculation with 80-bit precision.
But C and C++ both require that when a value is stored into a floating-point variable, the conversion has to be done. So, formally, in line //1, the compiler must convert the sum back to 64 bits to store it into the variable d. Then the value of dd1, calculated in line //2, must be computed using the value that was stored into d, i.e., a 64-bit value, while the value of dd2, calculated in line //3, can be calculated using f() + g(), i.e., a full 80-bit value. Those extra bits can make a difference, and the value of dd1 might be different from the value of dd2.
And often the compiler will hang on to the 80-bit value of f() + g() and use that instead of the value stored in d when it calculates the value of dd1. That's a non-conforming optimization, but as far as I know, every compiler does that sort of thing by default. They all have command-line switches to enforce the strictly-required behavior, so if you want slower code you can get it. <g>
For serious number crunching, speed is critical, so this flexibility is welcome, and number-crunching code is carefully written to avoid sensitivity to this kind of subtle difference. People get PhDs for figuring out how to make floating-point code fast and effective, so don't feel bad that the results you see don't seem to make sense. They don't, but they're close enough that, handled carefully, they give correct results without a speed penalty.
Since x86 floating-point unit performs its computations in extended precision floating point type (80 bits wide), the result might easily depend on whether the intermediate values were forcefully converted to double (64-bit floating-point type). In that respect, in non-optimized code it is not unusual to see compilers treat memory writes to double variables literally, but ignore "unnecessary" casts to double applied to temporary intermediate values.
In your example, the first part involves saving the intermediate result in a double variable
double doubleScaled = (double)scaleFactor * value;
int scaledvalue1 = doubleScaled; // = 700
The compiler takes it literally and does indeed store the product in a double variable doubleScaled, which unavoidably requires converting the 80-bit product to double. Later that double value is read from memory again and then converted to int type.
The second part
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
involves conversions that the compiler might see as unnecessary (and they indeed are unnecessary from the point of view of abstract C++ machine). The compiler ignores them, which means that the final int value is generated directly from the 80-bit product.
The presence of that intermediate conversion to double in the first variant (and its absence in the second one) is what causes that difference.
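If one actually wants the rounding-to-double to happen, a store the optimizer cannot elide forces it. A sketch, assuming an x87-style target where intermediates are wider than double:

volatile double forced = (double)scaleFactor * value; // volatile forces a real store,
                                                      // rounding the 80-bit product to double
int scaled = (int)forced; // 700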
I converted mindriot's example assembly code to Intel syntax to test with Visual Studio. I could only reproduce the error by setting the floating point control word to use extended precision.
The issue is that rounding is performed when converting from extended precision to double precision (when storing a double), whereas truncation is performed when converting from extended precision to integer (when storing an integer).
The extended precision multiply produces a product of 699.999..., but the product is rounded to 700.000... during the conversion from extended to double precision when the product is stored into doubleScaled.
double doubleScaled = (double)scaleFactor * value;
Since doubleScaled == 700.000..., when truncated to integer, it's still 700:
int scaledvalue1 = doubleScaled; // = 700
The product 699.999... is truncated when it's converted into an integer:
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??
My guess here is that the compiler generated a compile-time constant of 700.000... rather than doing the multiply at run time.
int scaledvalue3 = (double)(1000.0 * 0.7); // = 700
This truncation issue can be avoided by using the round() function from the C standard library (std::round in <cmath>):
int scaledvalue2 = (int)round(scaleFactor * value); // should == 700
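A hedged alternative, assuming C++11 is available: std::lround from <cmath> rounds and converts to an integer in one step, avoiding the explicit cast:

#include <cmath>
long scaled = std::lround(scaleFactor * value); // 700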
Depending on compiler and optimization flags, scaledvalue_a, involving a variable, may be evaluated at runtime using your processors floating point instructions, whereas scaledvalue_b, involving constants only, may be evaluated at compile time using a math library (e.g. gcc uses GMP - the GNU Multiple Precision math library for this). The difference you are seeing seems to be the difference between the precision and rounding of the runtime vs compile time evaluation of that expression.
Due to rounding errors, most floating-point numbers end up being slightly imprecise.
For the double-to-int conversion below, use the std::ceil() API:
int scaledvalue2 = (double)((double)(scaleFactor) * value); // = 699 ??

Will the compiler optimize division into multiplication

Based on this question, Floating point division vs floating point multiplication: division is slower than multiplication for several reasons.
Will the compiler usually replace a division with a multiplication if it is possible?
For example:
float a;
// During runtime a=5.4f
float b = a/10.f;
Will it be:
float a;
// During runtime a=5.4f
float b = a*0.1f;
If this is considered a compiler-dependent question, I am using the VS2013 default compiler. However, it would be nice to get a generic answer (on the theoretical validity of this optimization).
No, the compiler is not allowed to do that for the general case: the two operations could produce results that are not bit-identical due to the representation error of the reciprocal.
In your example, 0.1 does not have an exact representation as float. This causes the results of multiplication by 0.1 and division by 10 to differ:
#include <iostream>
using namespace std;

int main() {
    float f = 21736517;
    float a = f / 10.f;
    float b = f * 0.1f;
    cout << (a == b) << endl; // Prints zero
}
Note: As njuffa correctly notes in the comment below, there are situations when the compiler could make some optimizations for a wide set of numbers, as described in this paper. For example, multiplying or dividing by a power of two is equivalent to addition to the exponent portion of the IEEE-754 float representation.
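As a sketch of that last point: a power-of-two reciprocal is exactly representable in binary floating point, so that particular rewrite preserves every result and compilers routinely apply it, while the general case needs a fast-math style flag:

// x / 2.0f and x * 0.5f produce identical results for every x
// (0.5f is exact), so the compiler is free to use the multiply.
float halve(float x) { return x / 2.0f; }

// x / 10.f may NOT be rewritten as x * 0.1f under default flags;
// -ffast-math (GCC/Clang) or /fp:fast (MSVC) permits it anyway.
float tenth(float x) { return x / 10.f; }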

different behaviours for double precision on different compiler

my code is quite simple
double d = 405, g = 9.8, v = 63;
double r = d * g / (v * v);
printf("%s\n",(r>1.0)?"GT":"LE");
and here is my result
g++-mingw32-v4.8.1: LE (the exact result is EQ, in fact)
g++ on Ubuntu: GT (this result comes from my friend; I just do not have a Linux box at hand)
VC++ 11: GT
C# (.NET 4.5): GT
Python v2.7.3: GT (this also comes from my friend)
Haskell (GHCi v7.6.3): GT
g++-mingw, VC++, C#, and Haskell are all running on my machine with an i7-2630QM.
The Python win32 version comes from my friend; he also gets an LE from his g++-mingw-3.4.2.
And the Ubuntu version comes from another friend...
Only g++ gives me LE; the others are all GT.
I just want to know which one is wrong, g++ or the rest.
Or what SHOULD it be, GT or LE, in IEEE 754?
The IEEE 754 64-bit binary result is GT.
The two exactly representable values bracketing 9.8 are:
9.7999999999999989341858963598497211933135986328125
9.800000000000000710542735760100185871124267578125
The second one is closer to 9.8, so it should be chosen in the default round-to-nearest mode. It is slightly larger than 9.8, resulting in a product that is slightly larger than would have been produced in real-number arithmetic: 3969.00000000000045474735088646411895751953125. The conversion of v to double is exact, as is the v*v multiplication. The result is the division of a number slightly greater than 3969 by exactly 3969; the rounded quotient is 1.0000000000000002220446049250313080847263336181640625.
The conversion from a decimal fraction to a binary fraction is exact only if the decimal fraction can be written as a sum of binary fractions like 0.5, 0.25, 0.125, etc.
The number 9.8 in your example contains the fraction 0.8, which cannot be represented as an exact fraction in the binary number system. Thus different compilers may give you different results, depending on the precision used to represent fractional numbers.
Run your program using the number 9.75 and all the compilers will give you the same result, because
0.75 = 0.5 + 0.25 = 2^-1 + 2^-2
so the number 9.75 can be represented exactly using binary fractions.
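A quick way to see the difference is to print the converted literals with more digits than a double can hold (a small sketch):

#include <cstdio>

int main() {
    printf("%.20g\n", 9.75); // 9.75 exactly (0.75 = 2^-1 + 2^-2)
    printf("%.20g\n", 9.8);  // 9.8000000000000007105..., the nearest double
}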
The difference likely occurs when d * g is evaluated, because the mathematical result of that product must be rounded upward to produce a value representable in double, but the long double result is more accurate.
Most compilers convert 405 and 63 to double exactly and convert 9.8 to 9.800000000000000710542735760100185871124267578125, although the C++ standard gives them some leeway. The evaluation of v * v is also generally exact, since the mathematical result is exactly representable.
Commonly, on Intel processors, compilers evaluate d * g in one of two ways: Using double arithmetic or using long double with Intel’s 80-bit floating-point format. When evaluated with double, 405•9.800000000000000710542735760100185871124267578125 produces 3969.00000000000045474735088646411895751953125, and dividing this by 3969 yields a number slightly greater than one.
When evaluated with long double, the product is 3969.000000000000287769807982840575277805328369140625. That product, although greater than 3969, is slightly less than the double result, and dividing it by 3969 using long double arithmetic produces 1.000000000000000072533125339280246635098592378199100494384765625. When this value is assigned to r, the compiler is required to convert it to double. (Extra precision may be used only in intermediate expressions, not in assignments or casts.) This value is sufficiently close to one that rounding it to double produces exactly one.
You can mitigate some (but not all) of the variation between compilers by using casts or assignments with each individual operation:
double t0 = d * g;
double t1 = v * v;
double r = t0/t1;
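To see which way a given compiler went, print the quotient with enough digits to round-trip a double (a small sketch):

#include <cstdio>

int main() {
    double d = 405, g = 9.8, v = 63;
    double r = d * g / (v * v);
    printf("%.17g\n", r); // 1.0000000000000002 if evaluated in double,
                          // 1 if the division was carried out in long double
}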
To answer your question: since the expression mathematically evaluates to exactly one, the test r > 1.0 ought to be false, and the result printed should be "LE".
In reality you are running into the problem that a number like 9.8 cannot be represented exactly as a floating-point number (there are hundreds of good links on the web explaining why this is so). If you need exact math, you have to use integers, or BigDecimal, or some such thing.
I tested the code, this should return GT.
int main() {
double d = 405.0f, g = 9.8f, v = 63.0f;
double r = d * g / (v * v);
printf("%s\n",(r>1.0f)?"GT":"LE");
}
The GCC compiler sees 405 as an int, so "double d = 405" is actually "double d = (double)405".

Float addition promoted to double?

I had a small WTF moment this morning. This WTF can be summarized with this:
float x = 0.2f;
float y = 0.1f;
float z = x + y;
assert(z == x + y); // This assert is triggered! (At least with Visual Studio 2008)
The reason seems to be that the expression x + y is promoted to double and compared with the truncated version in z. (If I change z to double, the assert isn't triggered.)
I can see that for precision reasons it would make sense to perform all floating point arithmetics in double precision before converting the result to single precision. I found the following paragraph in the standard (which I guess I sort of already knew, but not in this context):
4.6.1.
"An rvalue of type float can be converted to an rvalue of type double. The value is unchanged"
My question is, is x + y guaranteed to be promoted to double or is at the compiler's discretion?
UPDATE: Since many people have claimed that one shouldn't use == for floating point, I just want to state that in the specific case I'm working with, an exact comparison is justified.
Floating-point comparison is tricky; here's an interesting link on the subject which I think hasn't been mentioned yet.
You can't generally assume that == will work as expected for floating point types. Compare rounded values or use constructs like abs(a-b) < tolerance instead.
Promotion is entirely at the compiler's discretion (and will depend on target hardware, optimisation level, etc.).
What's going on in this particular case is almost certainly that values are stored in FPU registers at a higher precision than in memory. In general, modern FPU hardware works internally with double or higher precision regardless of the precision the programmer asked for, and the compiler generates code to perform the appropriate conversions when values are stored to memory. In an unoptimised build, the result of x + y is still in a register at the point the comparison is made, but z will have been stored out to memory and fetched back, and thus truncated to float precision.
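Note that the standard still requires an assignment or cast to discard any extra precision (the same note discussed in the first section above), so forcing the conversion on the right-hand side makes the comparison well-defined on a conforming compiler. A minimal sketch:

#include <cassert>

int main() {
    float x = 0.2f, y = 0.1f;
    float z = x + y;
    assert(z == (float)(x + y)); // the cast forces rounding to float,
                                 // matching the rounding done when z was stored
}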
The Working draft for the next standard C++0x section 5 point 11 says
The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby
So at the compiler's discretion.
Using gcc 4.3.2, the assertion is not triggered, and indeed, the rvalue returned from x + y is a float, rather than a double.
So it's up to the compiler. This is why it's never wise to rely on exact equality between two floating point values.
The C++ FAQ lite has some further discussion on the topic:
Why is cos(x) != cos(y) even though x == y?
The problem is that converting a decimal float literal to binary does not always give an exact value.
Within sizeof(float) bytes the precise value of the number may not be representable, so arithmetic can introduce approximation, and hence equality fails.
See the example below; both values are exactly representable in a float, so there is no assert:
float x = 0.25f; // exactly representable in a float
float y = 0.50f; // exactly representable in a float
float z = x + y;
assert(z == x + y); // works fine, no assert
I would think it would be at the compiler's discretion, but you could always force the conversion with a cast, if that was your thinking.
Yet another reason never to compare floats directly:
if (fabs(result - expectedResult) < 0.00001)
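A fixed absolute epsilon like that only behaves well for values near 1. A more general sketch (the helper name and default tolerances are illustrative, not from any particular library) combines a relative and an absolute test:

#include <cmath>

// Hypothetical helper: true when a and b agree to within rel of their
// magnitude, or within abs_tol for values near zero.
bool nearly_equal(float a, float b, float rel = 1e-5f, float abs_tol = 1e-8f) {
    return std::fabs(a - b)
        <= std::fmax(abs_tol, rel * std::fmax(std::fabs(a), std::fabs(b)));
}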