I have run into a situation in my code where a function returns a double, and it is possible for this double to be a zero, a negative zero, or another value entirely. I need to distinguish between zero and negative zero, but the default double comparison does not. Due to the format of doubles, C++ does not allow for comparison of doubles using bitwise operators, so I am unsure how to procede. How can I distinguish between the two?
Call std::signbit() to determine the state of the sign bit.
Due to the format of doubles, C++ does not allow for comparison of doubles using bitwise operators, so I am unsure how to procede.
First off, C++ doesn't mandate any standard for float/double types.
Assuming you're talking about IEEE 754's binary32 and binary64 formats, they're specifically designed to maintain their order when their bit patterns are interpreted as integers so that a non-FPU can sort them; this is the reason they have a biased exponent.
There're many SO posts discussing such comparisons; here's the most relevant one. A simple check would be
bool is_negative_zero(float val)
{
return ((val == 0.0f) && std::signbit(val));
}
This works since 0.0f == -0.0f, although there're places where the sign makes a difference like atan2 or when dividing by -0 as opposed to +0 leads to the respective infinities.
To test explicitly for a == -0 in C do the following:
if (*((long *)&a) == 0x8000000000000000) {
// a is -0
}
Related
In my code,
float f = -0.0; // Negative
and compared with negative zero
f == -0.0f
result will be true.
But
float f = 0.0; // Positive
and compared with negative zero
f == -0.0f
also, result will be true instead of false
Why in both cases result to be true?
Here is a MCVE to test it (live on coliru):
#include <iostream>
int main()
{
float f = -0.0;
std::cout<<"==== > " << f <<std::endl<<std::endl;
if(f == -0.0f)
{
std::cout<<"true"<<std::endl;
}
else
{
std::cout<<"false"<<std::endl;
}
}
Output:
==== > -0 // Here print negative zero
true
C++11 introduced functions like std::signbit() which can detect signed zeros, and std::copysign() which can copy the sign bit between floating point values, if the implementation supports signed zero (e.g. due to using IEEE floating point). The specifications of those functions don't actually require that an implementation support distinct positive and negative zeros. That sort of thing aside, I'm unaware of any references in a C++ standard that even mentions signed zeros, let alone what should be the result of comparing them.
The C++ standards also do not stipulate any floating point representation - that is implementation-defined.
Although not definitive, these observations suggest that support of signed zeros, or the result of comparing them, would be determined by what floating point representation the implementation supports.
IEEE-754 is a common (albeit not the only) floating point representation used by modern implementations (i.e. compilers on their host systems). The current (published in 2008) version of IEEE-758 "IEEE Standard for Floating -Point Arithmetic" Section 5.11, second paragraph, says (bold emphasis mine)
Four mutually exclusive relations are possible: less than, equal, greater than, and unordered. The last case arises when at least one operand is NaN. Every NaN shall compare unordered with everything, including itself. Comparisons shall ignore the sign of zero (so +0 = −0). Infinite operands of the same sign shall compare equal.
Floating point arithmetic in C++ is often IEEE-754. This norm differs from the mathematical definition of the real number set.
This norm defines two different representations for the value zero: positive zero and negative zero. It is also defined that those two representations must compare equals, so by definition:
+0.0 == -0.0
As to why it is so, in its paper What Every Computer Scientist Should Know About Floating Point Arithmetic, David Goldberg, 1991-03 (linked in the IEEE-754 page on the IEEE website) writes:
In IEEE arithmetic, it is natural to define log 0 = -∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return -∞.
That's because the signed negative zero must compare true with zero: i.e. -0.0 == 0.0, -0f == 0f, and -0l == 0l.
It's a requirement of any floating point scheme supported by a C++ compiler.
(Note that most platforms these days use IEEE754 floating point, and this behaviour is explicitly documented in that specification.)
Because 0.0f and -0.0f is same negative of a zero is zero
In my code,
float f = -0.0; // Negative
and compared with negative zero
f == -0.0f
result will be true.
But
float f = 0.0; // Positive
and compared with negative zero
f == -0.0f
also, result will be true instead of false
Why in both cases result to be true?
Here is a MCVE to test it (live on coliru):
#include <iostream>
int main()
{
float f = -0.0;
std::cout<<"==== > " << f <<std::endl<<std::endl;
if(f == -0.0f)
{
std::cout<<"true"<<std::endl;
}
else
{
std::cout<<"false"<<std::endl;
}
}
Output:
==== > -0 // Here print negative zero
true
C++11 introduced functions like std::signbit() which can detect signed zeros, and std::copysign() which can copy the sign bit between floating point values, if the implementation supports signed zero (e.g. due to using IEEE floating point). The specifications of those functions don't actually require that an implementation support distinct positive and negative zeros. That sort of thing aside, I'm unaware of any references in a C++ standard that even mentions signed zeros, let alone what should be the result of comparing them.
The C++ standards also do not stipulate any floating point representation - that is implementation-defined.
Although not definitive, these observations suggest that support of signed zeros, or the result of comparing them, would be determined by what floating point representation the implementation supports.
IEEE-754 is a common (albeit not the only) floating point representation used by modern implementations (i.e. compilers on their host systems). The current (published in 2008) version of IEEE-758 "IEEE Standard for Floating -Point Arithmetic" Section 5.11, second paragraph, says (bold emphasis mine)
Four mutually exclusive relations are possible: less than, equal, greater than, and unordered. The last case arises when at least one operand is NaN. Every NaN shall compare unordered with everything, including itself. Comparisons shall ignore the sign of zero (so +0 = −0). Infinite operands of the same sign shall compare equal.
Floating point arithmetic in C++ is often IEEE-754. This norm differs from the mathematical definition of the real number set.
This norm defines two different representations for the value zero: positive zero and negative zero. It is also defined that those two representations must compare equals, so by definition:
+0.0 == -0.0
As to why it is so, in its paper What Every Computer Scientist Should Know About Floating Point Arithmetic, David Goldberg, 1991-03 (linked in the IEEE-754 page on the IEEE website) writes:
In IEEE arithmetic, it is natural to define log 0 = -∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return -∞.
That's because the signed negative zero must compare true with zero: i.e. -0.0 == 0.0, -0f == 0f, and -0l == 0l.
It's a requirement of any floating point scheme supported by a C++ compiler.
(Note that most platforms these days use IEEE754 floating point, and this behaviour is explicitly documented in that specification.)
Because 0.0f and -0.0f is same negative of a zero is zero
I am working on a platform that has terrible stalls when comparing floats to zero. As an optimization I have seen the following code used:
inline bool GreaterThanZero( float value )
{
const int value_as_int = *(int*)&value;
return ( value_as_int > 0 );
}
Looking at the generated assembly the stalls are gone and the function is more performant.
Does this work? I'm confused because all of the optimizations for IEEE tricks use SIGNMASKS and lots of AND/OR operations (https://www.lomont.org/papers/2005/CompareFloat.pdf for example). Does the cast to a signed int help? Testing in a simple harness detects no problems.
Any insight would be good.
The expression *(int*)&value > 0 tests if value is any positive float, from the smallest positive denormal (which has the same representation as 0x00000001) to the largest finite float (with representation 0x7f7fffff) and +inf (which has the same representation as 0x7f800000). The trick detects as positive a number of, but not all, NaN representations (the NaN representations above 0x7f800001). It is fine if you don't care about some values of NaN making the test true.
This all works because of the representation of IEEE 754 formats.
The bit manipulation functions that you saw in the literature for the purpose of emulating IEEE 754 operations were probably aiming for perfect emulation, taking into account the particular behaviors of NaN and signed zeroes. For instance, the variation *(int*)&value >= 0 would not be equivalent to value >= 0.0f, because -0.0f, represented as 0x80000000 as an unsigned int and thus as -0x80000000 as a signed one, makes the latter condition true and the former one false. This can make such functions quite complicated.
Does the cast to a signed int help?
Well, yes, because the sign bits of float and int are in the same place and both indicate a positive number when unset. But the condition value > 0.0f could be implemented by re-interpreting value as an unsigned integer too.
Note: the conversion to int* of the address of value breaks strict aliasing rules, but this may be acceptable if your compiler guarantees that it gives meaning to these programs (perhaps with a command-line option).
On VC++ 2008, ceil(-0.5) is returning -0.0. Is this usual/expected behavior? What is the best practice to avoid printing a -0.0 to i/o streams.
ceil in C++ comes from the C standard library.
The C standard says that if a platform implements IEEE-754 arithmetic, ceil( ) behaves as though its argument were rounded to integral according to the IEEE-754 roundTowardPositive rounding attribute. The IEEE-754 standard says (clause 6.3):
the sign of the result of conversions,
the quantize operation, the
roundToIntegral operations, and the
roundToIntegralExact is the sign of
the first or only operand.
So the sign of the result should always match the sign of the input. For inputs in the range (-1,0), this means that the result should be -0.0.
This is correct behavior. See Unary Operator-() on zero values - c++ and http://en.wikipedia.org/wiki/Signed_zero
I am partial to doing a static_cast<int>(ceil(-0.5)); but I don't claim that is "best practice".
Edit: You could of course cast to whatever integral type was appropriate (uint64_t, long, etc.)
I can't say that I know that it is usual, but as for avoiding printing it, implement a check, something like this:
if(var == -0.0)
{
var = 0.0;
}
// continue
Yes this is usual.
int number = (int) ceil(-0.5);
number will be 0
I see why ceil(-0.5) is returning -0.0. It's because for negative numbers, ceil is returning the integer part of the operand:
double af8IntPart;
double af8FracPart = modf(-0.5, & af8IntPart);
cout << af8IntPart;
The output here is "-0.0"
Is there a standard and/or portable way to represent the smallest negative value (e.g. to use negative infinity) in a C(++) program?
DBL_MIN in float.h is the smallest positive number.
-DBL_MAX in ANSI C, which is defined in float.h.
Floating point numbers (IEEE 754) are symmetrical, so if you can represent the greatest value (DBL_MAX or numeric_limits<double>::max()), just prepend a minus sign.
And then is the cool way:
double f;
(*((uint64_t*)&f))= ~(1LL<<52);
In C, use
#include <float.h>
const double lowest_double = -DBL_MAX;
In C++pre-11, use
#include <limits>
const double lowest_double = -std::numeric_limits<double>::max();
In C++11 and onwards, use
#include <limits>
constexpr double lowest_double = std::numeric_limits<double>::lowest();
Try this:
-1 * numeric_limits<double>::max()
Reference: numeric_limits
This class is specialized for each of
the fundamental types, with its
members returning or set to the
different values that define the
properties that type has in the
specific platform in which it
compiles.
Are you looking for actual infinity or the minimal finite value? If the former, use
-numeric_limits<double>::infinity()
which only works if
numeric_limits<double>::has_infinity
Otherwise, you should use
numeric_limits<double>::lowest()
which was introduces in C++11.
If lowest() is not available, you can fall back to
-numeric_limits<double>::max()
which may differ from lowest() in principle, but normally doesn't in practice.
A truly portable C++ solution
As from C++11 you can use numeric_limits<double>::lowest().
According to the standard, it returns exactly what you're looking for:
A finite value x such that there is no other finite value y where y < x.
Meaningful for all specializations in which is_bounded != false.
Online demo
Lots of non portable C++ answers here !
There are many answers going for -std::numeric_limits<double>::max().
Fortunately, they will work well in most of the cases. Floating point encoding schemes decompose a number in a mantissa and an exponent and most of them (e.g. the popular IEEE-754) use a distinct sign bit, which doesn't belong to the mantissa. This allows to transform the largest positive in the smallest negative just by flipping the sign:
Why aren't these portable ?
The standard doesn't impose any floating point standard.
I agree that my argument is a little bit theoretic, but suppose that some excentric compiler maker would use a revolutionary encoding scheme with a mantissa encoded in some variations of a two's complement. Two's complement encoding are not symmetric. for example for a signed 8 bit char the maximum positive is 127, but the minimum negative is -128. So we could imagine some floating point encoding show similar asymmetric behavior.
I'm not aware of any encoding scheme like that, but the point is that the standard doesn't guarantee that the sign flipping yields the intended result. So this popular answer (sorry guys !) can't be considered as fully portable standard solution ! /* at least not if you didn't assert that numeric_limits<double>::is_iec559 is true */
- std::numeric_limits<double>::max()
should work just fine
Numeric limits
Is there a standard and/or portable way to represent the smallest negative value (e.g. to use negative infinity) in a C(++) program?
C approach.
Many implementations support +/- infinities, so the most negative double value is -INFINITY.
#include <math.h>
double most_negative = -INFINITY;
Is there a standard and/or portable way ....?
Now we need to also consider other cases:
No infinities
Simply -DBL_MAX.
Only an unsigned infinity.
I'd expect in this case, OP would prefer -DBL_MAX.
De-normal values greater in magnitude than DBL_MAX.
This is an unusual case, likely outside OP's concern. When double is encoded as a pair of a floating points to achieve desired range/precession, (see double-double) there exist a maximum normal double and perhaps a greater de-normal one. I have seen debate if DBL_MAX should refer to the greatest normal, of the greatest of both.
Fortunately this paired approach usually includes an -infinity, so the most negative value remains -INFINITY.
For more portability, code can go down the route
// HUGE_VAL is designed to be infinity or DBL_MAX (when infinites are not implemented)
// .. yet is problematic with unsigned infinity.
double most_negative1 = -HUGE_VAL;
// Fairly portable, unless system does not understand "INF"
double most_negative2 = strtod("-INF", (char **) NULL);
// Pragmatic
double most_negative3 = strtod("-1.0e999999999", (char **) NULL);
// Somewhat time-consuming
double most_negative4 = pow(-DBL_MAX, 0xFFFF /* odd value */);
// My suggestion
double most_negative5 = (-DBL_MAX)*DBL_MAX;
The original question concerns infinity.
So, why not use
#define Infinity ((double)(42 / 0.0))
according to the IEEE definition?
You can negate that of course.
If you do not have float exceptions enabled (which you shouldn't imho), you can simply say:
double neg_inf = -1/0.0;
This yields negative infinity. If you need a float, you can either cast the result
float neg_inf = (float)-1/0.0;
or use single precision arithmetic
float neg_inf = -1.0f/0.0f;
The result is always the same, there is exactly one representation of negative infinity in both single and double precision, and they convert to each other as you would expect.