In my code,
float f = -0.0; // Negative
compared with negative zero
f == -0.0f
gives true.
But
float f = 0.0; // Positive
compared with negative zero
f == -0.0f
also gives true instead of false.
Why is the result true in both cases?
Here is an MCVE to test it (live on coliru):
#include <iostream>
int main()
{
float f = -0.0;
std::cout<<"==== > " << f <<std::endl<<std::endl;
if(f == -0.0f)
{
std::cout<<"true"<<std::endl;
}
else
{
std::cout<<"false"<<std::endl;
}
}
Output:
==== > -0 // Here print negative zero
true
C++11 introduced functions like std::signbit(), which can detect signed zeros, and std::copysign(), which can copy the sign bit between floating point values, if the implementation supports signed zero (e.g. due to using IEEE floating point). The specifications of those functions don't actually require that an implementation support distinct positive and negative zeros. That sort of thing aside, I'm unaware of any references in a C++ standard that even mention signed zeros, let alone what the result of comparing them should be.
The C++ standards also do not stipulate any floating point representation - that is implementation-defined.
Although not definitive, these observations suggest that support of signed zeros, or the result of comparing them, would be determined by what floating point representation the implementation supports.
IEEE-754 is a common (albeit not the only) floating point representation used by modern implementations (i.e. compilers on their host systems). The 2008 revision of IEEE-754, "IEEE Standard for Floating-Point Arithmetic", Section 5.11, second paragraph, says (bold emphasis mine)
Four mutually exclusive relations are possible: less than, equal, greater than, and unordered. The last case arises when at least one operand is NaN. Every NaN shall compare unordered with everything, including itself. Comparisons shall ignore the sign of zero (so +0 = −0). Infinite operands of the same sign shall compare equal.
Floating point arithmetic in C++ is often IEEE-754. This norm differs from the mathematical definition of the real number set.
This norm defines two different representations for the value zero: positive zero and negative zero. It also defines that those two representations must compare equal, so by definition:
+0.0 == -0.0
As to why it is so, in his paper What Every Computer Scientist Should Know About Floating Point Arithmetic, David Goldberg, 1991-03 (linked in the IEEE-754 page on the IEEE website) writes:
In IEEE arithmetic, it is natural to define log 0 = -∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return -∞.
That's because the signed negative zero must compare true with zero: i.e. -0.0 == 0.0, -0.0f == 0.0f, and -0.0L == 0.0L.
It's a requirement of any floating point scheme supported by a C++ compiler.
(Note that most platforms these days use IEEE754 floating point, and this behaviour is explicitly documented in that specification.)
Because 0.0f and -0.0f are the same value: the negative of zero is zero.
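To make the distinction in the answers above concrete, here is a minimal sketch, assuming an IEEE-754 float (checked with a static_assert). It shows that == ignores the sign of zero while std::signbit() does not. The helper names compares_equal, same_sign_bit and is_negative_zero are made up for this illustration.

```cpp
#include <cmath>
#include <limits>

// Assumption: float is an IEEE-754 (IEC 559) type with signed zeros.
static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE-754 floats");

// operator== compares values, and IEEE-754 says +0 and -0 are equal.
bool compares_equal(float a, float b) { return a == b; }

// std::signbit looks at the sign bit, so it can tell +0 and -0 apart.
bool same_sign_bit(float a, float b) {
    return std::signbit(a) == std::signbit(b);
}

// Hypothetical helper: true only for negative zero.
bool is_negative_zero(float f) { return f == 0.0f && std::signbit(f); }
```

So if the code really needs to treat -0.0f differently from 0.0f, a check like is_negative_zero(f) is one way to do it; f == -0.0f cannot.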
I have two floating point values, a and b. I can guarantee they are values in the domain (0, 1). Is there any circumstance where a * b could equal one? I intend to calculate 1/(1 - a * b), and wish to avoid a divide by zero.
My instinct is that it cannot, because the result should be equal to or smaller than a or b. But instincts are a poor replacement for understanding the correct behavior.
I do not get to specify the rounding mode, so if there's a rounding mode where I could get into trouble, I want to know about it.
Edit: I did not specify whether the compiler was IEEE compliant or not because I cannot guarantee that the compiler/CPU running my software will indeed be IEEE compliant.
I have two floating point values, a and b…
Since this says we have “values,” not “variables,” it admits a possibility that a*b may evaluate to 1, so that 1 - a*b is zero. When writing about software, people sometimes use names as placeholders for more complicated expressions. For example, one might have an expression a that is sin(x)/x and an expression b that is 1-y*y and then ask about computing 1 - a*b when the code is actually 1 - (sin(x)/x)*(1-y*y). This would be a problem because C++ allows extra precision to be used when evaluating floating-point expressions.
The most common instances of this are that the compiler uses long double arithmetic while computing expressions containing double operands, or that it uses a fused multiply-add instruction while computing an expression of the form x + y*z.
Suppose expressions a and b have been computed with excess precision and are positive values less than 1 in that excess precision. E.g., for illustration, suppose double were implemented with four decimal digits but a and b were computed with long double with six decimal digits. a and b could both be .999999. Then a*b is .999998000001 before rounding, .999998 after rounding to six digits. Now suppose that at this point in the computation, the compiler converts from long double to double, perhaps because it decides to store this intermediate value on the stack temporarily while it computes some other things from nearby expressions. Converting it to four-digit double produces 1.000, because that is the four-decimal-digit number nearest .999998. When the compiler later loads this from the stack and continues evaluation, we have 1 - 1.000, and the result is zero.
On the other hand, if a and b are variables, I expect your expression is safe. When a value is assigned to a variable or is converted with a cast operation, the C++ standard requires it to be converted to the nominal type; the result must be a value in the nominal type, without any “extra precision.” Then, given 0 < a < 1 and 0 < b < 1, the mathematical value (that is, without floating-point rounding) a•b is less than a and is less than b. Then rounding of a•b to the nominal type cannot produce a value greater than a or b with any IEEE-754 rounding method, so it cannot produce 1. (The only requirement here is that the rounding method never skip over values: it might be constrained to round in a particular direction, upward or downward or toward zero or whatever, but it never goes past a representable value in that direction to get to a value farther away from the unrounded result. Since we know a•b is bounded above by both a and b, rounding cannot produce any result greater than the lesser of a and b.)
Formally, the C++ standard does not impose any requirements on the accuracy of floating-point results. So a C++ implementation could use a bonkers rounding mode that produced 3.14 for .9*.9. Aside from implementations flushing subnormals to zero, I am not aware of any C++ implementations that do not obey the requirement above. Flushing subnormals to zero will not affect calculations in 1 - a*b when a and b are near 1. (In a perverse floating-point format, with an exponent range narrower than the significand and no subnormal values, .9999 could be representable while .0001 is not because the exponent required for it is out of range. Then 1-.9999*.9999, which would produce .0002 in normal four-digit arithmetic, would produce 0 due to underflow. No such formats are in normal hardware.)
So, if a and b are variables, 0 < a < 1 and 0 < b < 1, and your C++ implementation is reasonable (may use extra precision, may flush subnormals, does not use perverse floating-point formats or rounding), then 1 - a*b does not evaluate to zero.
There is a mathematical proof that it will never be >= 1. I don't have it handy.... you may want to ask on the math stack overflow site if you are interested in studying the proof. But your instincts are correct. It will never be >= 1.
Now, we must be careful because floating point arithmetic is only an approximation of math and has limitations. I'm not an expert on these limitations, but the floating-point standard is very carefully designed and provides certain guarantees. I'm pretty sure one of them includes (or implies) that x * y where x < 1 and y < 1 is guaranteed to be < 1.
You can check that even if using the highest float or double that is lower than 1, and multiplying by itself, the result will be lower than 1. Any multiplication of numbers lower than that must give a smaller result.
Here is the code I ran, with the results in comments:
float a = nextafterf(1, 0); // 0.999999940
double b = nextafter(1, 0); // 0.99999999999999989
float c = a * a; // 0.999999881
double d = b * b; // 0.99999999999999978
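The same check can be packaged as a small sketch that computes 1 - a*b for the worst case, the largest double below 1. The names largest_below_one and one_minus_product are made up for this illustration, and the volatile store is only a precaution I am assuming is enough to force the product to be rounded to double on compilers that keep excess precision in registers.

```cpp
#include <cmath>

// Largest double strictly below 1.0.
double largest_below_one() { return std::nextafter(1.0, 0.0); }

// Hypothetical helper: computes 1 - a*b with the product first stored
// in a double variable, so any excess precision is discarded.
double one_minus_product(double a, double b) {
    volatile double product = a * b;  // forces a real double-sized store
    return 1.0 - product;
}
```

With a and b both equal to largest_below_one(), the stored product is still below 1, so the result is a small positive number rather than zero and 1/(1 - a*b) stays finite.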
There are several posts here about floating point numbers and their nature. It is clear that comparing floats and doubles must always be done cautiously. Asking for equality has also been discussed and the recommendation is clearly to stay away from it.
But what if there is a direct assignment:
double a = 5.4;
double b = a;
assuming a is any non-NaN value - can a == b ever be false?
It seems that the answer is obviously no, yet I can't find any standard defining this behaviour in a C++ environment. IEEE-754 states that two floating point numbers with equal (non-NaN) bitset patterns are equal. Does it now mean that I can continue comparing my doubles this way without having to worry about maintainability? Do I have to worry about other compilers / operating systems and their implementation regarding these lines? Or maybe a compiler that optimizes some bits away and ruins their equality?
I wrote a little program that generates and compares non-NaN random doubles forever - until it finds a case where a == b yields false. Can I compile/run this code anywhere and anytime in the future without having to expect a halt? (ignoring endianness and assuming sign, exponent and mantissa bit sizes / positions stay the same).
#include <cstdint>  // std::uint64_t
#include <cstring>  // memcpy
#include <iostream>
#include <random>
struct double_content {
std::uint64_t mantissa : 52;
std::uint64_t exponent : 11;
std::uint64_t sign : 1;
};
static_assert(sizeof(double) == sizeof(double_content), "must be equal");
void set_double(double& n, std::uint64_t sign, std::uint64_t exponent, std::uint64_t mantissa) {
double_content convert;
memcpy(&convert, &n, sizeof(double));
convert.sign = sign;
convert.exponent = exponent;
convert.mantissa = mantissa;
memcpy(&n, &convert, sizeof(double_content));
}
void print_double(double& n) {
double_content convert;
memcpy(&convert, &n, sizeof(double));
std::cout << "sign: " << convert.sign << ", exponent: " << convert.exponent << ", mantissa: " << convert.mantissa << " --- " << n << '\n';
}
int main() {
std::random_device rd;
std::mt19937_64 engine(rd());
std::uniform_int_distribution<std::uint64_t> mantissa_distribution(0ull, (1ull << 52) - 1);
std::uniform_int_distribution<std::uint64_t> exponent_distribution(0ull, (1ull << 11) - 1);
std::uniform_int_distribution<std::uint64_t> sign_distribution(0ull, 1ull);
double a = 0.0;
double b = 0.0;
bool found = false;
while (!found){
auto sign = sign_distribution(engine);
auto exponent = exponent_distribution(engine);
auto mantissa = mantissa_distribution(engine);
//re-assign exponent for NaN cases
if (mantissa) {
while (exponent == (1ull << 11) - 1) {
exponent = exponent_distribution(engine);
}
}
//force -0.0 to be 0.0
if (mantissa == 0u && exponent == 0u) {
sign = 0u;
}
set_double(a, sign, exponent, mantissa);
b = a;
//here could be more (unmodifying) code to delay the next comparison
if (b != a) { //not equal!
print_double(a);
print_double(b);
found = true;
}
}
}
using Visual Studio Community 2017 Version 15.9.5
The C++ standard clearly specifies in [basic.types]#3:
For any trivially copyable type T, if two pointers to T point to distinct T objects obj1 and obj2, where neither obj1 nor obj2 is a potentially-overlapping subobject, if the underlying bytes ([intro.memory]) making up obj1 are copied into obj2, obj2 shall subsequently hold the same value as obj1.
It gives this example:
T* t1p;
T* t2p;
// provided that t2p points to an initialized object ...
std::memcpy(t1p, t2p, sizeof(T));
// at this point, every subobject of trivially copyable type in *t1p contains
// the same value as the corresponding subobject in *t2p
The remaining question is what a value is. We find in [basic.fundamental]#12 (emphasis mine):
There are three floating-point types: float, double, and long double.
The type double provides at least as much precision as float, and the type long double provides at least as much precision as double.
The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The value representation of floating-point types is implementation-defined.
Since the C++ standard has no further requirements on how floating point values are represented, this is all you will find as guarantee from the standard, as assignment is only required to preserve values ([expr.ass]#2):
In simple assignment (=), the object referred to by the left operand is modified by replacing its value with the result of the right operand.
As you correctly observed, IEEE-754 requires that non-NaN, non-zero floats compare equal if and only if they have the same bit pattern. So if your compiler uses IEEE-754-compliant floats, you should find that assignment of non-NaN, non-zero floating point numbers preserves bit patterns.
And indeed, your code
double a = 5.4;
double b = a;
should never allow (a == b) to return false. But as soon as you replace 5.4 with a more complicated expression, most of this nicety vanishes. It's not the exact subject of the article, but https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/ mentions several possible ways in which innocent looking code can yield different results (which breaks "identical to the bit pattern" assertions). In particular, you might be comparing an 80 bit intermediate result with a 64 bit rounded result, possibly yielding inequality.
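For the simple case, here is a minimal sketch that checks both the value and the bit pattern after an assignment. It assumes a 64-bit double, and bits_of and assignment_preserves are names invented for this illustration. It says nothing about what the standard guarantees; it only lets you observe what a given implementation does.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "this sketch assumes a 64-bit double");

// Returns the object representation of d as an integer.
std::uint64_t bits_of(double d) {
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof d);
    return u;
}

// Hypothetical check: assign a to b, then compare both value and bits.
bool assignment_preserves(double a) {
    double b = a;
    return a == b && bits_of(a) == bits_of(b);
}
```

On an IEEE-754 implementation I would expect assignment_preserves(5.4) to be true. Note that bits_of(0.0) and bits_of(-0.0) differ there even though 0.0 == -0.0, which is why equal values and equal bit patterns are separate questions.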
There are some complications here. First, note that the title asks a different question than the question. The title asks:
is assigning two doubles guaranteed to yield the same bitset patterns?
while the question asks:
can a == b ever be false?
The first of these asks whether different bits might occur from an assignment (which could be due to either the assignment not recording the same value as its right operand or due to the assignment using a different bit pattern that represents the same value), while the second asks whether, whatever bits are written by an assignment, the stored value must compare equal to the operand.
In full generality, the answer to the first question is no. Using IEEE-754 binary floating-point formats, there is a one-to-one map between non-zero numeric values and their encodings in bit patterns. However, this admits several cases where an assignment could produce a different bit pattern:
The right operand is the IEEE-754 −0 entity, but +0 is stored. This is not a proper IEEE-754 operation, but C++ is not required to conform to IEEE 754. Both −0 and +0 represent mathematical zero and would satisfy C++ requirements for assignment, so a C++ implementation could do this.
IEEE-754 decimal formats have one-to-many maps between numeric values and their encodings. By way of illustration, three hundred could be represented with bits whose direct meaning is 3•10^2 or bits whose direct meaning is 300•10^0. Again, since these represent the same mathematical value, it would be permissible under the C++ standard to store one in the left operand of an assignment when the right operand is the other.
IEEE-754 includes many non-numeric entities called NaNs (for Not a Number), and a C++ implementation might store a NaN different from the right operand. This could include either replacing any NaN with a “canonical” NaN for the implementation or, upon assignment of a signaling NaN, indicating the signal in some way and then converting the signaling NaN to a quiet NaN and storing that.
Non-IEEE-754 formats may have similar issues.
Regarding the latter question, can a == b be false after a = b, where both a and b have type double, the answer is no. The C++ standard does require that an assignment replace the value of the left operand with the value of the right operand. So, after a = b, a must have the value of b, and therefore they are equal.
Note that the C++ standard does not impose any restrictions on the accuracy of floating-point operations (although I see this only stated in non-normative notes). So, theoretically, one might interpret assignment or comparison of floating-point values to be floating-point operations and say that they do not need to be accurate, so the assignment could change the value or the comparison could return an inaccurate result. I do not believe this is a reasonable interpretation of the standard; the lack of restrictions on floating-point accuracy is intended to allow latitude in expression evaluation and library routines, not simple assignment or comparison.
One should note the above applies specifically to a double object that is assigned from a simple double operand. This should not lull readers into complacency. Several similar but different situations can result in failure of what might seem intuitive mathematically, such as:
After float x = 3.4;, the expression x == 3.4 will generally evaluate as false, since 3.4 is a double and has to be converted to a float for the assignment. That conversion reduces precision and alters the value.
After double x = 3.4 + 1.2;, the expression x == 3.4 + 1.2 is permitted by the C++ standard to evaluate to false. This is because the standard permits floating-point expressions to be evaluated with more precision than the nominal type requires. Thus, 3.4 + 1.2 might be evaluated with the precision of long double. When the result is assigned to x, the standard requires that the excess precision be “discarded,” so the value is converted to a double. As with the float example above, this conversion may change the value. Then the comparison x == 3.4 + 1.2 may compare a double value in x to what is essentially a long double value produced by 3.4 + 1.2.
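The float case above is easy to reproduce. This is a small sketch assuming IEEE-754 float and double with no excess precision for float constants (FLT_EVAL_METHOD of 0, as on typical x86-64 builds); the function names are made up for this illustration.

```cpp
// float initialised from a double literal, then compared with that literal.
bool float_matches_double_literal() {
    float x = 3.4;    // 3.4 is a double; the initialisation rounds it to float
    return x == 3.4;  // x is converted back to double, but the rounding stays
}

// Same comparison with float literals on both sides.
bool float_matches_float_literal() {
    float x = 3.4f;
    return x == 3.4f;
}
```

Under those assumptions the first function returns false and the second returns true: the mismatch comes from mixing types, not from == being approximate.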
I am working on a platform that has terrible stalls when comparing floats to zero. As an optimization I have seen the following code used:
inline bool GreaterThanZero( float value )
{
const int value_as_int = *(int*)&value;
return ( value_as_int > 0 );
}
Looking at the generated assembly the stalls are gone and the function is more performant.
Does this work? I'm confused because all of the optimizations for IEEE tricks use SIGNMASKS and lots of AND/OR operations (https://www.lomont.org/papers/2005/CompareFloat.pdf for example). Does the cast to a signed int help? Testing in a simple harness detects no problems.
Any insight would be good.
The expression *(int*)&value > 0 tests if value is any positive float, from the smallest positive denormal (which has the same representation as 0x00000001) to the largest finite float (with representation 0x7f7fffff) and +inf (which has the same representation as 0x7f800000). The trick also detects as positive some, but not all, NaN representations: those with the sign bit clear, 0x7f800001 through 0x7fffffff. It is fine if you don't care about some values of NaN making the test true.
This all works because of the representation of IEEE 754 formats.
The bit manipulation functions that you saw in the literature for the purpose of emulating IEEE 754 operations were probably aiming for perfect emulation, taking into account the particular behaviors of NaN and signed zeroes. For instance, the variation *(int*)&value >= 0 would not be equivalent to value >= 0.0f, because -0.0f, represented as 0x80000000 as an unsigned int and thus as -0x80000000 as a signed one, makes the latter condition true and the former one false. This can make such functions quite complicated.
Does the cast to a signed int help?
Well, yes, because the sign bits of float and int are in the same place and both indicate a positive number when unset. But the condition value > 0.0f could be implemented by re-interpreting value as an unsigned integer too.
Note: the conversion to int* of the address of value breaks strict aliasing rules, but this may be acceptable if your compiler guarantees that it gives meaning to these programs (perhaps with a command-line option).
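If the aliasing issue is a concern, one common alternative is to copy the bits with std::memcpy instead of casting the pointer. This is a sketch only, assuming a 32-bit IEEE-754 float; GreaterThanZeroBits is a name made up here, and whether it avoids the stalls on your platform is something you would have to check in the generated assembly.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::int32_t),
              "this sketch assumes a 32-bit float");
static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE-754 floats");

// Same test as the pointer-cast version, but the bits are copied with
// std::memcpy, which does not break strict aliasing rules.
inline bool GreaterThanZeroBits(float value) {
    std::int32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits > 0;  // true for positive floats, +inf, and sign-clear NaNs
}
```

Both zeros give false here, since 0.0f has all bits clear and -0.0f has only the sign bit set, which is a negative int.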
// value will always be in the range of [0.0 - maximum]
float obtainRatio(float value, float maximum){
if(maximum != 0.f){
return value / maximum;
}else{
return 0.f;
}
}
The range of maximum can be anything, including negative numbers. The range of value can also be anything, though the function is only required to make "sense" when the input is in the range of [0.0 - maximum]. The output should always be in the range of [0.0 - 1.0]
I have two questions that I'm wondering about, with this:
Is this equality comparison enough to ensure the function never divides by zero?
If maximum is a degenerate value (extremely small or extremely large), is there a chance the function will return a result outside of [0.0 - 1.0] (assuming value is in the right range)?
Here is a late answer clarifying some concepts in relation to the question:
Just return value / maximum
In floating-point, division by zero is not a fatal error like integer division by zero is.
Since you know that value is between 0.0 and maximum, the only division by zero that can occur is 0.0 / 0.0, which is defined as producing NaN. The floating-point value NaN is a perfectly acceptable value for function obtainRatio to return, and is in fact a much better exceptional value to return than 0.0, as your proposed version is returning.
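As a sketch of that suggestion, assuming IEEE-754 floats and default floating-point compiler settings (no fast-math style options), the function shrinks to one line and the caller can detect the degenerate case with std::isnan. The helper ratioIsDefined is hypothetical.

```cpp
#include <cmath>

// The one-line version: 0.0f / 0.0f yields NaN instead of trapping.
float obtainRatio(float value, float maximum) {
    return value / maximum;
}

// Hypothetical caller-side check for the degenerate 0/0 case.
bool ratioIsDefined(float value, float maximum) {
    return !std::isnan(obtainRatio(value, maximum));
}
```

A caller that really wants 0 for the degenerate case can then decide that explicitly, instead of having the function silently hide it.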
Superstitions about floating-point are only superstitions
There is nothing approximate about the definition of <= between floats. a <= b does not sometimes evaluate to true when a is just a little above b. If a and b are two finite float variables, a <= b evaluates to true exactly when the rational represented by a is less than or equal to the rational represented by b. The only little glitch one may perceive is actually not a glitch but a strict interpretation of the rule above: +0.0 <= -0.0 evaluates to true, because “the rational represented by +0.0” and “the rational represented by -0.0” are both 0.
Similarly, there is nothing approximate about == between floats: two finite float variables a and b make a == b true if and only if the rational represented by a and the rational represented by b are the same.
Within an if (f != 0.0) condition, the value of f cannot be a representation of zero, and thus a division by f cannot be a division by zero. The division can still overflow. In the particular case of value / maximum, there cannot be an overflow because your function requires 0 ≤ value ≤ maximum. And we don't need to wonder whether ≤ in the precondition means the relation between rationals or the relation between floats, since the two are essentially the same.
This said
C99 allows extra precision for floating-point expressions, which has been in the past wrongly interpreted by compiler makers as a license to make floating-point behavior erratic (to the point that the program if (m != 0.) { if (m == 0.) printf("oh"); } could be expected to print “oh” in some circumstances).
In reality, a C99 compiler that offers IEEE 754 floating-point and defines FLT_EVAL_METHOD to a nonnegative value cannot change the value of m after it has been tested. The variable m was set to a value representable as float when it was last assigned, and that value either is a representation of 0 or it isn't. Only operations and constants can have excess precision (See the C99 standard, 5.2.4.2.2:8).
In the case of GCC, recent versions do what is proper with -fexcess-precision=standard, implied by -std=c99.
Further reading
David Monniaux's description of the sad state of floating-point in C a few years ago (first version published in 2007). David's report does not try to interpret the C99 standard but describes the reality of floating-point computation in C as it was then, with real examples. The situation has much improved since, thanks to improved standard-compliance in compilers that care and thanks to the SSE2 instruction set that renders the entire issue moot.
The 2008 mailing list post by Joseph S. Myers describing the then-current state of floating point in GCC (bad), how he interpreted the standard (good) and how he was implementing his interpretation in GCC (GOOD).
In this case with the limited range, it should be OK. In general a check for zero first will prevent division by zero, but there's still a chance of getting overflow if the divisor is close to zero and the dividend is a large number, but in this case the dividend will be small if the divisor is small (both could be close to zero without causing overflow).