Is this floating-point optimization allowed?

I tried to check out where float loses the ability to exactly represent large integer numbers. So I wrote this little snippet:
int main() {
    for (int i = 0; ; i++) {
        if ((float)i != i) {
            return i;
        }
    }
}
This code seems to work with all compilers, except clang. Clang generates a simple infinite loop. Godbolt.
Is this allowed? If yes, is it a QoI issue?

Note that the built-in operator != requires its operands to be of the same type, and will achieve that using promotions and conversions if necessary. In other words, your condition is equivalent to:
(float)i != (float)i
That comparison can never be true (converting an int to float can never produce NaN), and so the code will eventually overflow i, giving your program Undefined Behaviour. Any behaviour is therefore possible.
To correctly check what you want to check, you should cast the result back to int:
if ((int)(float)i != i)
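As a complete program, the fixed check looks like this (a minimal sketch, assuming 32-bit int and IEEE 754 binary32 float):

#include <cstdio>

int main() {
    for (int i = 0; ; i++) {
        if ((int)(float)i != i) {
            std::printf("%d\n", i); // prints 16777217, i.e. 2^24 + 1
            return 0;
        }
    }
}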

As @Angew pointed out, the != operator needs the same type on both sides.
(float)i != i results in promotion of the RHS to float as well, so we have (float)i != (float)i.
g++ also generates an infinite loop, but it doesn't optimize away the work inside it. You can see it converts int->float with cvtsi2ss and does ucomiss xmm0,xmm0 to compare (float)i with itself. (That was your first clue that your C++ source doesn't mean what you thought it did, as @Angew's answer explains.)
x != x is only true when it's "unordered" because x was NaN. (INFINITY compares equal to itself in IEEE math, but NaN doesn't. NAN == NAN is false, NAN != NAN is true).
gcc7.4 and older correctly optimizes your code to jnp as the loop branch (https://godbolt.org/z/fyOhW1): keep looping as long as the operands to x != x weren't NaN. (gcc8 and later also checks je to a break out of the loop, failing to optimize based on the fact that x == x will always be true for any non-NaN input.) x86 FP compares set PF on unordered.
And BTW, that means clang's optimization is also safe: it just has to CSE (float)i and the implicit conversion of i to float as being the same value, and prove that int -> float conversion never produces NaN for the possible range of int.
(Although given that this loop will hit signed-overflow UB, it's allowed to emit literally any asm it wants, including a ud2 illegal instruction, or an empty infinite loop regardless of what the loop body actually was.) But ignoring the signed-overflow UB, this optimization is still 100% legal.
GCC fails to optimize away the loop body even with -fwrapv to make signed-integer overflow well-defined (as 2's complement wraparound). https://godbolt.org/z/t9A8t_
Even enabling -fno-trapping-math doesn't help. (GCC's default is unfortunately to enable -ftrapping-math even though GCC's implementation of it is broken/buggy.) int->float conversion can cause an FP inexact exception (for numbers too large to be represented exactly), so with exceptions possibly unmasked it's reasonable not to optimize away the loop body. (Converting 16777217 to float could have an observable side-effect if the inexact exception is unmasked.)
But with -O3 -fwrapv -fno-trapping-math, it's 100% missed optimization not to compile this to an empty infinite loop. Without #pragma STDC FENV_ACCESS ON, the state of the sticky flags that record masked FP exceptions is not an observable side-effect of the code. No int->float conversion can result in NaN, so x != x can't be true.
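To see why those sticky flags matter: if the FP environment is made observable, the rounding in the int->float conversion sets the inexact flag (a sketch using <cfenv>; compiler support for the FENV_ACCESS pragma varies):

#include <cfenv>
#include <cstdio>
// #pragma STDC FENV_ACCESS ON   (honored by some compilers, ignored by others)

int main() {
    std::feclearexcept(FE_INEXACT);
    volatile int i = 16777217;    // 2^24 + 1, not exactly representable as float
    volatile float f = (float)i;  // rounds to 16777216.0f, raising FE_INEXACT
    (void)f;
    std::printf("inexact: %d\n", std::fetestexcept(FE_INEXACT) != 0);
    return 0;
}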
These compilers are all optimizing for C++ implementations that use IEEE 754 single-precision (binary32) float and 32-bit int.
The bugfixed (int)(float)i != i loop would have UB on C++ implementations with narrow 16-bit int and/or wider float, because you'd hit signed-integer overflow UB before reaching the first integer that wasn't exactly representable as a float.
But UB under a different set of implementation-defined choices doesn't have any negative consequences when compiling for an implementation like gcc or clang with the x86-64 System V ABI.
BTW, you could statically calculate the result of this loop from FLT_RADIX and FLT_MANT_DIG, defined in <cfloat>. Or at least you can in theory, if float actually fits the model of an IEEE float rather than some other kind of real-number representation like a Posit / unum.
I'm not sure how much the ISO C++ standard nails down about float behaviour and whether a format that wasn't based on fixed-width exponent and significand fields would be standards compliant.
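A sketch of that static calculation (assuming the usual fixed-radix model):

#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    // Every integer up to FLT_RADIX^FLT_MANT_DIG is exactly representable as float;
    // the first one that isn't is FLT_RADIX^FLT_MANT_DIG + 1.
    double first = std::pow(FLT_RADIX, FLT_MANT_DIG) + 1;
    std::printf("%.0f\n", first); // 16777217 for IEEE binary32
    return 0;
}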
In comments:
@geza: I would be interested to hear the resulting number!
@nada: it's 16777216
Are you claiming you got this loop to print / return 16777216?
Update: since that comment has been deleted, I think not. Probably the OP is just quoting the float before the first integer that can't be exactly represented as a 32-bit float. https://en.wikipedia.org/wiki/Single-precision_floating-point_format#Precision_limits_on_integer_values i.e. what they were hoping to verify with this buggy code.
The bugfixed version would of course print 16777217, the first integer that's not exactly representable, rather than the value before that.
(All the higher float values are exact integers, but they're multiples of 2, then 4, then 8, etc. for exponent values higher than the significand width. Many higher integer values can be represented, but 1 unit in the last place (of the significand) is greater than 1 so they're not contiguous integers. The largest finite float is just below 2^128, which is too large for even int64_t.)
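You can watch the gap between adjacent representable floats double with each binade (a sketch using std::nextafterf, assuming IEEE binary32):

#include <cmath>
#include <cstdio>

int main() {
    for (float x = 16777216.0f; x <= 268435456.0f; x *= 2) { // 2^24 up to 2^28
        // distance to the next representable float: 2, then 4, then 8, ...
        std::printf("%.0f -> gap %.0f\n", x, std::nextafterf(x, INFINITY) - x);
    }
    return 0;
}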
If any compiler did exit the original loop and print/return 16777216, it would be a compiler bug.

Related

Can multiplying a pair of almost-one values ever yield a result of 1.0?

I have two floating point values, a and b. I can guarantee they are values in the domain (0, 1). Is there any circumstance where a * b could equal one? I intend to calculate 1/(1 - a * b), and wish to avoid a divide by zero.
My instinct is that it cannot, because the result should be less than or equal to both a and b. But instincts are a poor replacement for understanding the correct behavior.
I do not get to specify the rounding mode, so if there's a rounding mode where I could get into trouble, I want to know about it.
Edit: I did not specify whether the compiler was IEEE compliant or not because I cannot guarantee that the compiler/CPU running my software will indeed be IEEE compliant.
I have two floating point values, a and b…
Since this says we have “values,” not “variables,” it admits the possibility that a*b may evaluate to 1. When writing about software, people sometimes use names as placeholders for more complicated expressions. For example, one might have an expression a that is sin(x)/x and an expression b that is 1-y*y and then ask about computing 1 - a*b when the code is actually 1 - (sin(x)/x)*(1-y*y). This would be a problem because C++ allows extra precision to be used when evaluating floating-point expressions.
The most common instances of this are that the compiler uses long double arithmetic while computing expressions containing double operands, or that it uses a fused multiply-add instruction while computing an expression of the form x + y*z.
Suppose expressions a and b have been computed with excess precision and are positive values less than 1 in that excess precision. E.g., for illustration, suppose double were implemented with four decimal digits but a and b were computed with long double with six decimal digits. a and b could both be .999999. Then a*b is .999998000001 before rounding, .999998 after rounding to six digits. Now suppose that at this point in the computation, the compiler converts from long double to double, perhaps because it decides to store this intermediate value on the stack temporarily while it computes some other things from nearby expressions. Converting it to four-digit double produces 1.000, because that is the four-decimal-digit number nearest .999998. When the compiler later loads this from the stack and continues evaluation, we have 1 - 1.000, and the result is zero.
On the other hand, if a and b are variables, I expect your expression is safe. When a value is assigned to a variable or is converted with a cast operation, the C++ standard requires it to be converted to the nominal type; the result must be a value in the nominal type, without any “extra precision.” Then, given 0 < a < 1 and 0 < b < 1, the mathematical value (that is, the value computed without floating-point rounding) a•b is less than a and is less than b. Then rounding of a•b to the nominal type cannot produce a value greater than a or b with any IEEE-754 rounding method, so it cannot produce 1. (The only requirement here is that the rounding method never skip over values: it might be constrained to round in a particular direction, upward or downward or toward zero or whatever, but it never goes past a representable value in that direction to get to a value farther away from the unrounded result. Since we know a•b is bounded above by both a and b, rounding cannot produce any result greater than the lesser of a and b.)
Formally, the C++ standard does not impose any requirements on the accuracy of floating-point results. So a C++ implementation could use a bonkers rounding mode that produced 3.14 for .9*.9. Aside from implementations flushing subnormals to zero, I am not aware of any C++ implementations that do not obey the requirement above. Flushing subnormals to zero will not affect calculations in 1 - a*b when a and b are near 1. (In a perverse floating-point format, with an exponent range narrower than the significand and no subnormal values, .9999 could be representable while .0001 is not because the exponent required for it is out of range. Then 1-.9999*.9999, which would produce .0002 in normal four-digit arithmetic, would produce 0 due to underflow. No such formats are in normal hardware.)
So, if a and b are variables, 0 < a < 1 and 0 < b < 1, and your C++ implementation is reasonable (may use extra precision, may flush subnormals, does not use perverse floating-point formats or rounding), then 1 - a*b does not evaluate to zero.
There is a mathematical proof that it will never be >= 1. I don't have it handy; you may want to ask on the Mathematics Stack Exchange site if you are interested in studying the proof. But your instincts are correct. It will never be >= 1.
Now, we must be careful because floating point arithmetic is only an approximation of math and has limitations. I'm not an expert on these limitations, but the floating-point standard is very carefully designed and provides certain guarantees. I'm pretty sure one of them includes (or implies) that x * y where 0 < x < 1 and 0 < y < 1 is guaranteed to be < 1.
You can check that even if you take the largest float or double that is lower than 1 and multiply it by itself, the result will be lower than 1. Multiplying any smaller positive numbers must give an even smaller result.
Here is the code I ran, with the results in comments:
#include <cmath>

float a = std::nextafterf(1, 0); // 0.999999940
double b = std::nextafter(1, 0); // 0.99999999999999989
float c = a * a;                 // 0.999999881
double d = b * b;                // 0.99999999999999978

Using scientific notation in for loops

I've recently come across some code which has a loop of the form
for (int i = 0; i < 1e7; i++) {
}
I question the wisdom of doing this, since 1e7 is a floating point type, and will cause i to be promoted when evaluating the stopping condition. Should this be cause for concern?
The elephant in the room here is that the range of an int could be as small as -32767 to +32767, and the behaviour on assigning a larger value than this to such an int is undefined.
But, as for your main point, indeed it should concern you, as it is a very bad habit. Things could go wrong, since 1e7 is indeed a floating point literal of type double.
The fact that i will be converted to a floating point due to type promotion rules is somewhat moot: the real damage is done if there is unexpected truncation of the apparent integral literal. By way of a "proof by example", consider first the loop
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 18446744073709551615ULL; ) {
    std::cout << i << "\n";
}
This outputs every consecutive value of i in the range, as you'd expect. Note that std::numeric_limits<std::uint64_t>::max() is 18446744073709551615ULL, which is 1 less than the 64th power of 2. (Here I'm using a slide-like "operator" ++<, which parses as (i++) < and is useful when working with unsigned types. Many folk consider --> and ++< obfuscating, but in scientific programming they are common, particularly -->.)
Now on my machine, a double is an IEEE754 64 bit floating point. (Such a scheme is particularly good at representing powers of 2 exactly: IEEE754 binary64 can represent powers of 2 up to 2^1023 exactly.) So 18,446,744,073,709,551,616 (the 64th power of 2) can be represented exactly as a double. The nearest representable number before that is 18,446,744,073,709,550,592 (which is 1024 less).
So now let's write the loop as
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 1.8446744073709551615e19; ) {
    std::cout << i << "\n";
}
On my machine that will only output one value of i: 18,446,744,073,709,550,592 (the number that we've already seen). This proves that 1.8446744073709551615e19 has a floating point type. If the compiler were allowed to treat the literal as an integral type, the output of the two loops would be equivalent.
It will work, assuming that your int is at least 32 bits.
However, if you really want to use exponential notation, you had better define an integer constant outside the loop and use proper casting, like this:
const int MAX_INDEX = static_cast<int>(1.0e7);
...
for (int i = 0; i < MAX_INDEX; i++) {
...
}
Considering this, I'd say it is much better to write
const int MAX_INDEX = 10000000;
or if you can use C++14
const int MAX_INDEX = 10'000'000;
1e7 is a literal of type double, and usually double is 64-bit IEEE 754 format with a 52-bit mantissa. Roughly every tenth power of 2 corresponds to a third power of 10 (2^10 ≈ 10^3), so double should be able to represent integers up to at least 10^15 exactly (since 2^50 = (2^10)^5 ≈ (10^3)^5 = 10^15). And if int is 32-bit then int has roughly 10^9 as max value (2^30 ≈ 10^9; asking Google search it says that 2^31 - 1 = 2 147 483 647, i.e. about twice the rough estimate).
So, in practice it's safe on current desktop systems and larger.
But C++ allows int to be just 16 bits, and on e.g. an embedded system with that small int, one would have Undefined Behavior.
If the intention is to loop for an exact integer number of iterations, for example when iterating over exactly all the elements in an array, then comparing against a floating point value may not be such a good idea, solely for accuracy reasons. The conversion of the integer counter to floating point rounds it to a nearby representable value, which can make the comparison fail one iteration early but never lets the loop run past the bound, so there's no real danger of out-of-bounds access; it will just end the loop short.
Now the question is: when do these effects actually kick in? Will your program experience them? The floating point representation usually used these days is IEEE 754. As long as its magnitude is small enough, a floating point value can hold an integer exactly. A C double precision float has a 53-bit significand (52 stored bits plus an implicit leading bit), which gives you exact integer precision up to 2^53, on the order of 9e15. Without specifying with a suffix f that you want a floating point literal to be interpreted as single precision, the literal will be double precision and the implicit conversion will target that as well. So as long as your loop end condition is less than 2^53 it will work reliably!
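A quick check of that boundary (a sketch, assuming IEEE 754 binary64 double):

#include <cstdint>
#include <cstdio>

int main() {
    std::int64_t n = 1LL << 53;                        // 2^53 = 9007199254740992
    std::printf("%d\n", (double)n == (double)(n + 1)); // 1: n+1 rounds back to n
    std::printf("%d\n", (double)(n - 1) == (double)n); // 0: below 2^53 every integer is exact
    return 0;
}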
Now one question you have to think about on the x86 architecture is efficiency. The very first 80x87 FPUs came as a separate chip in a separate package, and as a result getting values into and out of the FPU registers is a bit awkward at the x86 assembly level. Depending on what your intentions are, it might make a difference in runtime for a realtime application; but that's premature optimization.
TL;DR: Is it safe to do? Most certainly yes. Will it cause trouble? It could cause numerical problems. Could it invoke undefined behavior? Depends on how you use the loop end condition, but if i is used to index an array and the array length somehow ended up in a floating point variable, the comparison can at worst end the loop early, so it's not going to cause an out-of-bounds access. Is it a smart thing to do? Depends on the application.

Fast compare IEEE float greater than zero by cheating

I am working on a platform that has terrible stalls when comparing floats to zero. As an optimization I have seen the following code used:
inline bool GreaterThanZero( float value )
{
    const int value_as_int = *(int*)&value;
    return ( value_as_int > 0 );
}
Looking at the generated assembly the stalls are gone and the function is more performant.
Does this work? I'm confused because all of the optimizations for IEEE tricks use SIGNMASKS and lots of AND/OR operations (https://www.lomont.org/papers/2005/CompareFloat.pdf for example). Does the cast to a signed int help? Testing in a simple harness detects no problems.
Any insight would be good.
The expression *(int*)&value > 0 tests whether value is any positive float, from the smallest positive denormal (whose representation is 0x00000001) up through the largest finite float (representation 0x7f7fffff) and +inf (representation 0x7f800000). The trick also detects as positive some, but not all, NaN representations (those from 0x7f800001 to 0x7fffffff). It is fine if you don't care about some values of NaN making the test true.
This all works because of the representation of IEEE 754 formats.
The bit manipulation functions that you saw in the literature for the purpose of emulating IEEE 754 operations were probably aiming for perfect emulation, taking into account the particular behaviors of NaN and signed zeroes. For instance, the variation *(int*)&value >= 0 would not be equivalent to value >= 0.0f, because -0.0f, represented as 0x80000000 as an unsigned int and thus as -0x80000000 as a signed one, makes the latter condition true and the former one false. This can make such functions quite complicated.
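A small demonstration of that discrepancy (same type-punning caveat as the original code):

#include <cstdio>

int main() {
    float nz = -0.0f;
    const int bits = *(int*)&nz;                   // 0x80000000, i.e. INT_MIN
    std::printf("%d %d\n", nz >= 0.0f, bits >= 0); // prints 1 0: the two tests disagree on -0.0f
    return 0;
}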
Does the cast to a signed int help?
Well, yes, because the sign bits of float and int are in the same place and both indicate a positive number when unset. But the condition value > 0.0f could be implemented by re-interpreting value as an unsigned integer too.
Note: the conversion to int* of the address of value breaks strict aliasing rules, but this may be acceptable if your compiler guarantees that it gives meaning to these programs (perhaps with a command-line option).
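A strict-aliasing-safe variant of the same trick (a sketch using std::memcpy; on C++20 compilers std::bit_cast does the same job):

#include <cstdint>
#include <cstring>

inline bool GreaterThanZero(float value)
{
    std::int32_t value_as_int;
    std::memcpy(&value_as_int, &value, sizeof value_as_int); // well-defined way to reinterpret the bits
    return value_as_int > 0;
}

Compilers typically optimize the 4-byte memcpy down to a single register move, so the stall-free code generation should be preserved.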

Floating Point, is an equality comparison enough to prevent division by zero?

// value will always be in the range of [0.0 - maximum]
float obtainRatio(float value, float maximum) {
    if (maximum != 0.f) {
        return value / maximum;
    } else {
        return 0.f;
    }
}
The range of maximum can be anything, including negative numbers. The range of value can also be anything, though the function is only required to make "sense" when the input is in the range of [0.0 - maximum]. The output should always be in the range of [0.0 - 1.0]
I have two questions that I'm wondering about, with this:
Is this equality comparison enough to ensure the function never divides by zero?
If maximum is a degenerate value (extremely small or extremely large), is there a chance the function will return a result outside of [0.0 - 1.0] (assuming value is in the right range)?
Here is a late answer clarifying some concepts in relation to the question:
Just return value / maximum
In floating-point, division by zero is not a fatal error like integer division by zero is.
Since you know that value is between 0.0 and maximum, the only division by zero that can occur is 0.0 / 0.0, which is defined as producing NaN. The floating-point value NaN is a perfectly acceptable value for function obtainRatio to return, and is in fact a much better exceptional value to return than 0.0, as your proposed version is returning.
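That is, a sketch of the simplified version:

// value is assumed to be in [0.0 - maximum]
float obtainRatio(float value, float maximum) {
    return value / maximum; // 0.0f / 0.0f yields NaN; the caller can test for it with std::isnan
}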
Superstitions about floating-point are only superstitions
There is nothing approximate about the definition of <= between floats. a <= b does not sometimes evaluate to true when a is just a little above b. If a and b are two finite float variables, a <= b evaluates to true exactly when the rational represented by a is less than or equal to the rational represented by b. The only little glitch one may perceive is actually not a glitch but a strict interpretation of the rule above: +0.0 <= -0.0 evaluates to true, because “the rational represented by +0.0” and “the rational represented by -0.0” are both 0.
Similarly, there is nothing approximate about == between floats: two finite float variables a and b make a == b true if and only if the rational represented by a and the rational represented by b are the same.
Within an if (f != 0.0) condition, the value of f cannot be a representation of zero, and thus a division by f cannot be a division by zero. The division can still overflow. In the particular case of value / maximum, there cannot be an overflow because your function requires 0 ≤ value ≤ maximum. And we don't need to wonder whether ≤ in the precondition means the relation between rationals or the relation between floats, since the two are essentially the same.
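For contrast, here is what overflow in a division looks like when the operands are not constrained this way (a sketch; it cannot happen here given the precondition):

#include <cfloat>
#include <cstdio>

int main() {
    float big = FLT_MAX;    // largest finite float, about 3.4e38
    float tiny = FLT_MIN;   // smallest positive normal float, about 1.2e-38
    float q = big / tiny;   // the quotient, about 2.9e76, overflows float
    std::printf("%g\n", q); // prints inf: dividing by a nonzero value can still overflow
    return 0;
}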
This said
C99 allows extra precision for floating-point expressions, which has been in the past wrongly interpreted by compiler makers as a license to make floating-point behavior erratic (to the point that the program if (m != 0.) { if (m == 0.) printf("oh"); } could be expected to print “oh” in some circumstances).
In reality, a C99 compiler that offers IEEE 754 floating-point and defines FLT_EVAL_METHOD to a nonnegative value cannot change the value of m after it has been tested. The variable m was set to a value representable as float when it was last assigned, and that value either is a representation of 0 or it isn't. Only operations and constants can have excess precision (See the C99 standard, 5.2.4.2.2:8).
In the case of GCC, recent versions do what is proper with -fexcess-precision=standard, implied by -std=c99.
Further reading
David Monniaux's description of the sad state of floating-point in C a few years ago (first version published in 2007). David's report does not try to interpret the C99 standard but describes the reality of floating-point computation in C as it was then, with real examples. The situation has much improved since, thanks to improved standard-compliance in compilers that care and thanks to the SSE2 instruction set that renders the entire issue moot.
The 2008 mailing list post by Joseph S. Myers describing the then current GCC situation with floats in GCC (bad), how he interpreted the standard (good) and how he was implementing his interpretation in GCC (GOOD).
In this case with the limited range, it should be OK. In general a check for zero first will prevent division by zero, but there's still a chance of getting overflow if the divisor is close to zero and the dividend is a large number, but in this case the dividend will be small if the divisor is small (both could be close to zero without causing overflow).

Float addition promoted to double?

I had a small WTF moment this morning. The WTF can be summarized with this:
float x = 0.2f;
float y = 0.1f;
float z = x + y;
assert(z == x + y); // This assert is triggered! (At least with Visual Studio 2008)
The reason seems to be that the expression x + y is promoted to double and compared with the truncated version in z. (If I change z to double, the assert isn't triggered.)
I can see that for precision reasons it would make sense to perform all floating point arithmetics in double precision before converting the result to single precision. I found the following paragraph in the standard (which I guess I sort of already knew, but not in this context):
4.6.1.
"An rvalue of type float can be converted to an rvalue of type double. The value is unchanged"
My question is, is x + y guaranteed to be promoted to double or is at the compiler's discretion?
UPDATE: Since many people have claimed that one shouldn't use == for floating point, I just wanted to state that in the specific case I'm working with, an exact comparison is justified.
Floating point comparison is tricky; here's an interesting link on the subject which I think hasn't been mentioned.
You can't generally assume that == will work as expected for floating point types. Compare rounded values or use constructs like abs(a-b) < tolerance instead.
Promotion is entirely at the compiler's discretion (and will depend on target hardware, optimisation level, etc).
What's going on in this particular case is almost certainly that values are stored in FPU registers at a higher precision than in memory. In general, modern FPU hardware works internally with double or higher precision whatever precision the programmer asked for, with the compiler generating code to make the appropriate conversions when values are stored to memory. In an unoptimised build, the result of x + y is still in a register at the point the comparison is made, but z will have been stored out to memory and fetched back, and thus truncated to float precision.
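In practice you can make the comparison deterministic by forcing the sum back to float before comparing (a sketch; it relies on the rule that a cast or assignment must discard any excess precision):

#include <cassert>

int main() {
    float x = 0.2f;
    float y = 0.1f;
    float z = x + y;
    assert(z == (float)(x + y)); // the cast rounds the sum to float, matching z
    return 0;
}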
The Working draft for the next standard C++0x section 5 point 11 says
The values of the floating operands and the results of floating expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby
So at the compiler's discretion.
Using gcc 4.3.2, the assertion is not triggered, and indeed, the rvalue returned from x + y is a float, rather than a double.
So it's up to the compiler. This is why it's never wise to rely on exact equality between two floating point values.
The C++ FAQ lite has some further discussion on the topic:
Why is cos(x) != cos(y) even though x == y?
The problem is that converting a decimal fraction to binary floating point does not always give an exact result. Within sizeof(float) bytes the precise value cannot always be accommodated, and arithmetic operations may introduce rounding, hence the equality fails.
See the example below:
float x = 0.25f; // both 0.25 and 0.5 are exactly representable in 4 bytes
float y = 0.50f;
float z = x + y;
assert(z == x + y); // works fine; the assert is not triggered
I would think it would be at the compiler's discretion, but you could always force it with a cast if that was your thinking?
Yet another reason to never directly compare floats.
if (fabs(result - expectedResult) < 0.00001)