C++ ceil and negative zero

On VC++ 2008, ceil(-0.5) is returning -0.0. Is this usual/expected behavior? What is the best practice to avoid printing a -0.0 to I/O streams?

ceil in C++ comes from the C standard library.
The C standard says that if a platform implements IEEE-754 arithmetic, ceil() behaves as though its argument were rounded to integral according to the IEEE-754 roundTowardPositive rounding attribute. The IEEE-754 standard says (clause 6.3):
the sign of the result of conversions, the quantize operation, the roundToIntegral operations, and the roundToIntegralExact is the sign of the first or only operand.
So the sign of the result should always match the sign of the input. For inputs in the range (-1,0), this means that the result should be -0.0.
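For illustration, here is a minimal sketch (output assumes an IEEE-754 platform; std::signbit is C++11):
#include <cmath>
#include <iostream>

int main() {
    double r = std::ceil(-0.5);
    std::cout << r << '\n';               // typically prints -0
    std::cout << (r == 0.0) << '\n';      // 1: -0.0 compares equal to 0.0
    std::cout << std::signbit(r) << '\n'; // 1: the sign bit is set
}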

This is correct behavior. See Unary Operator-() on zero values and http://en.wikipedia.org/wiki/Signed_zero
I am partial to doing a static_cast<int>(ceil(-0.5)); but I don't claim that is "best practice".
Edit: You could of course cast to whatever integral type was appropriate (uint64_t, long, etc.)

I can't say that I know that it is usual, but as for avoiding printing it, implement a check, something like this:
if (var == -0.0) // note: this also matches +0.0, since -0.0 == 0.0; that is harmless here
{
    var = 0.0;
}
// continue
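Alternatively, under IEEE-754 with the default round-to-nearest mode, adding positive zero normalizes a negative zero while leaving every other value unchanged (a one-line sketch, not necessarily best practice either):
var = var + 0.0; // -0.0 + 0.0 yields +0.0 under round-to-nearest; all other values are unchanged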

Yes, this is usual.
int number = (int) ceil(-0.5);
number will be 0

I see why ceil(-0.5) is returning -0.0. It's because for negative numbers, ceil returns the integer part of the operand, and for -0.5 that integer part is a signed zero:
double af8IntPart;
double af8FracPart = modf(-0.5, &af8IntPart); // from <cmath>; splits -0.5 into intpart -0.0 and fracpart -0.5
cout << af8IntPart;
The output here is "-0" (the default stream formatting of -0.0).

Related

is assigning two doubles guaranteed to yield the same bitset patterns?

There are several posts here about floating point numbers and their nature. It is clear that comparing floats and doubles must always be done cautiously. Asking for equality has also been discussed and the recommendation is clearly to stay away from it.
But what if there is a direct assignement:
double a = 5.4;
double b = a;
assuming a is any non-NaN value - can a == b ever be false?
It seems that the answer is obviously no, yet I can't find any standard defining this behaviour in a C++ environment. IEEE-754 states that two floating point numbers with equal (non-NaN) bitset patterns are equal. Does it now mean that I can continue comparing my doubles this way without having to worry about maintainability? Do I have to worry about other compilers / operating systems and their implementation regarding these lines? Or maybe a compiler that optimizes some bits away and ruins their equality?
I wrote a little program that generates and compares non-NaN random doubles forever - until it finds a case where a == b yields false. Can I compile/run this code anywhere and anytime in the future without having to expect a halt? (ignoring endianness and assuming sign, exponent and mantissa bit sizes / positions stay the same).
#include <cstdint>
#include <cstring>
#include <iostream>
#include <random>

struct double_content {
    std::uint64_t mantissa : 52;
    std::uint64_t exponent : 11;
    std::uint64_t sign : 1;
};
static_assert(sizeof(double) == sizeof(double_content), "must be equal");

void set_double(double& n, std::uint64_t sign, std::uint64_t exponent, std::uint64_t mantissa) {
    double_content convert;
    std::memcpy(&convert, &n, sizeof(double));
    convert.sign = sign;
    convert.exponent = exponent;
    convert.mantissa = mantissa;
    std::memcpy(&n, &convert, sizeof(double_content));
}

void print_double(double& n) {
    double_content convert;
    std::memcpy(&convert, &n, sizeof(double));
    std::cout << "sign: " << convert.sign << ", exponent: " << convert.exponent
              << ", mantissa: " << convert.mantissa << " --- " << n << '\n';
}

int main() {
    std::random_device rd;
    std::mt19937_64 engine(rd());
    std::uniform_int_distribution<std::uint64_t> mantissa_distribution(0ull, (1ull << 52) - 1);
    std::uniform_int_distribution<std::uint64_t> exponent_distribution(0ull, (1ull << 11) - 1);
    std::uniform_int_distribution<std::uint64_t> sign_distribution(0ull, 1ull);

    double a = 0.0;
    double b = 0.0;
    bool found = false;
    while (!found) {
        auto sign = sign_distribution(engine);
        auto exponent = exponent_distribution(engine);
        auto mantissa = mantissa_distribution(engine);
        // re-assign the exponent for NaN cases
        if (mantissa) {
            while (exponent == (1ull << 11) - 1) {
                exponent = exponent_distribution(engine);
            }
        }
        // force -0.0 to be 0.0
        if (mantissa == 0u && exponent == 0u) {
            sign = 0u;
        }
        set_double(a, sign, exponent, mantissa);
        b = a;
        // here could be more (non-modifying) code to delay the next comparison
        if (b != a) { // not equal!
            print_double(a);
            print_double(b);
            found = true;
        }
    }
}
using Visual Studio Community 2017 Version 15.9.5
The C++ standard clearly specifies in [basic.types]#3:
For any trivially copyable type T, if two pointers to T point to distinct T objects obj1 and obj2, where neither obj1 nor obj2 is a potentially-overlapping subobject, if the underlying bytes ([intro.memory]) making up obj1 are copied into obj2, obj2 shall subsequently hold the same value as obj1.
It gives this example:
T* t1p;
T* t2p;
// provided that t2p points to an initialized object ...
std::memcpy(t1p, t2p, sizeof(T));
// at this point, every subobject of trivially copyable type in *t1p contains
// the same value as the corresponding subobject in *t2p
The remaining question is what a value is. We find in [basic.fundamental]#12 (emphasis mine):
There are three floating-point types: float, double, and long double.
The type double provides at least as much precision as float, and the type long double provides at least as much precision as double.
The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The value representation of floating-point types is implementation-defined.
Since the C++ standard has no further requirements on how floating point values are represented, this is all you will find as guarantee from the standard, as assignment is only required to preserve values ([expr.ass]#2):
In simple assignment (=), the object referred to by the left operand is modified by replacing its value with the result of the right operand.
As you correctly observed, IEEE-754 requires that non-NaN, non-zero floats compare equal if and only if they have the same bit pattern. So if your compiler uses IEEE-754-compliant floats, you should find that assignment of non-NaN, non-zero floating point numbers preserves bit patterns.
And indeed, your code
double a = 5.4;
double b = a;
should never allow (a == b) to return false. But as soon as you replace 5.4 with a more complicated expression, most of this nicety vanishes. It's not the exact subject of the article, but https://randomascii.wordpress.com/2013/07/16/floating-point-determinism/ mentions several possible ways in which innocent looking code can yield different results (which breaks "identical to the bit pattern" assertions). In particular, you might be comparing an 80 bit intermediate result with a 64 bit rounded result, possibly yielding inequality.
There are some complications here. First, note that the title asks a different question than the body. The title asks:
is assigning two doubles guaranteed to yield the same bitset patterns?
while the question asks:
can a == b ever be false?
The first of these asks whether different bits might occur from an assignment (which could be due to either the assignment not recording the same value as its right operand or due to the assignment using a different bit pattern that represents the same value), while the second asks whether, whatever bits are written by an assignment, the stored value must compare equal to the operand.
In full generality, the answer to the first question is no. Using IEEE-754 binary floating-point formats, there is a one-to-one map between non-zero numeric values and their encodings in bit patterns. However, this admits several cases where an assignment could produce a different bit pattern:
The right operand is the IEEE-754 −0 entity, but +0 is stored. This is not a proper IEEE-754 operation, but C++ is not required to conform to IEEE 754. Both −0 and +0 represent mathematical zero and would satisfy C++ requirements for assignment, so a C++ implementation could do this.
IEEE-754 decimal formats have one-to-many maps between numeric values and their encodings. By way of illustration, three hundred could be represented with bits whose direct meaning is 3·10^2 or bits whose direct meaning is 300·10^0. Again, since these represent the same mathematical value, it would be permissible under the C++ standard to store one in the left operand of an assignment when the right operand is the other.
IEEE-754 includes many non-numeric entities called NaNs (for Not a Number), and a C++ implementation might store a NaN different from the right operand. This could include either replacing any NaN with a “canonical” NaN for the implementation or, upon assignment of a signaling NaN, indicating the signal in some way and then converting the signaling NaN to a quiet NaN and storing that.
Non-IEEE-754 formats may have similar issues.
Regarding the latter question, can a == b be false after a = b, where both a and b have type double, the answer is no. The C++ standard does require that an assignment replace the value of the left operand with the value of the right operand. So, after a = b, a must have the value of b, and therefore they are equal.
Note that the C++ standard does not impose any restrictions on the accuracy of floating-point operations (although I see this only stated in non-normative notes). So, theoretically, one might interpret assignment or comparison of floating-point values to be floating-point operations and say that they do not need to be accurate, so the assignment could change the value or the comparison could return an inaccurate result. I do not believe this is a reasonable interpretation of the standard; the lack of restrictions on floating-point accuracy is intended to allow latitude in expression evaluation and library routines, not simple assignment or comparison.
One should note the above applies specifically to a double object that is assigned from a simple double operand. This should not lull readers into complacency. Several similar but different situations can result in failure of what might seem intuitive mathematically, such as:
After float x = 3.4;, the expression x == 3.4 will generally evaluate as false, since 3.4 is a double and has to be converted to a float for the assignment. That conversion reduces precision and alters the value (see the sketch after this list).
After double x = 3.4 + 1.2;, the expression x == 3.4 + 1.2 is permitted by the C++ standard to evaluate to false. This is because the standard permits floating-point expressions to be evaluated with more precision than the nominal type requires. Thus, 3.4 + 1.2 might be evaluated with the precision of long double. When the result is assigned to x, the standard requires that the excess precision be “discarded,” so the value is converted to a double. As with the float example above, this conversion may change the value. Then the comparison x == 3.4 + 1.2 may compare a double value in x to what is essentially a long double value produced by 3.4 + 1.2.
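A minimal sketch of the first situation above (output assumes IEEE-754 float and double):
#include <iostream>
int main() {
    float x = 3.4;                     // 3.4 is a double literal; the conversion to float loses bits
    std::cout << (x == 3.4) << '\n';   // 0: x is widened back to double, but the lost bits do not return
    std::cout << (x == 3.4f) << '\n';  // 1: comparing against the same float value
}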

Behaviour of negative zero (-0.0) in comparison with positive zero (+0.0)

In my code,
float f = -0.0; // Negative
and compared with negative zero
f == -0.0f
the result will be true.
But
float f = 0.0; // Positive
and compared with negative zero
f == -0.0f
the result will also be true instead of false.
Why is the result true in both cases?
Here is an MCVE to test it (live on coliru):
#include <iostream>

int main()
{
    float f = -0.0;
    std::cout << "==== > " << f << std::endl << std::endl;

    if (f == -0.0f)
    {
        std::cout << "true" << std::endl;
    }
    else
    {
        std::cout << "false" << std::endl;
    }
}
Output:
==== > -0 // here the negative zero is printed
true
C++11 introduced functions like std::signbit() which can detect signed zeros, and std::copysign() which can copy the sign bit between floating point values, if the implementation supports signed zero (e.g. due to using IEEE floating point). The specifications of those functions don't actually require that an implementation support distinct positive and negative zeros. That sort of thing aside, I'm unaware of any references in a C++ standard that even mentions signed zeros, let alone what should be the result of comparing them.
The C++ standards also do not stipulate any floating point representation - that is implementation-defined.
Although not definitive, these observations suggest that support of signed zeros, or the result of comparing them, would be determined by what floating point representation the implementation supports.
IEEE-754 is a common (albeit not the only) floating point representation used by modern implementations (i.e. compilers on their host systems). The current (published in 2008) version of IEEE-754 "IEEE Standard for Floating-Point Arithmetic", Section 5.11, second paragraph, says (emphasis mine):
Four mutually exclusive relations are possible: less than, equal, greater than, and unordered. The last case arises when at least one operand is NaN. Every NaN shall compare unordered with everything, including itself. Comparisons shall ignore the sign of zero (so +0 = −0). Infinite operands of the same sign shall compare equal.
Floating point arithmetic in C++ is often IEEE-754. This standard differs from the mathematical definition of the real number set.
This standard defines two different representations for the value zero: positive zero and negative zero. It is also defined that those two representations must compare equal, so by definition:
+0.0 == -0.0
As to why it is so, in his paper What Every Computer Scientist Should Know About Floating Point Arithmetic, David Goldberg, 1991-03 (linked in the IEEE-754 page on the IEEE website) writes:
In IEEE arithmetic, it is natural to define log 0 = -∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return -∞.
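Even though +0.0 and -0.0 compare equal, they can still be told apart. A minimal sketch using std::signbit from <cmath> (C++11):
#include <cmath>
#include <iostream>
int main() {
    float pos = 0.0f;
    float neg = -0.0f;
    std::cout << (pos == neg) << '\n';      // 1: they compare equal
    std::cout << std::signbit(pos) << '\n'; // 0
    std::cout << std::signbit(neg) << '\n'; // 1: only the sign bit differs
}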
That's because a signed negative zero must compare equal to zero: i.e. -0.0 == 0.0, -0.0f == 0.0f, and -0.0L == 0.0L.
It's a requirement of any floating point scheme supported by a C++ compiler.
(Note that most platforms these days use IEEE754 floating point, and this behaviour is explicitly documented in that specification.)
Because 0.0f and -0.0f are the same value: the negative of a zero is still zero.

Distinguish zero and negative zero

I have run into a situation in my code where a function returns a double, and it is possible for this double to be a zero, a negative zero, or another value entirely. I need to distinguish between zero and negative zero, but the default double comparison does not. Due to the format of doubles, C++ does not allow for comparison of doubles using bitwise operators, so I am unsure how to proceed. How can I distinguish between the two?
Call std::signbit() to determine the state of the sign bit.
Due to the format of doubles, C++ does not allow for comparison of doubles using bitwise operators, so I am unsure how to proceed.
First off, C++ doesn't mandate any standard for float/double types.
Assuming you're talking about IEEE 754's binary32 and binary64 formats, they're specifically designed to maintain their order when their bit patterns are interpreted as integers, so that a machine without an FPU can sort them; this is the reason they have a biased exponent.
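A sketch of that ordering property (assumes IEEE-754 binary32 and non-negative values; the helper name float_bits is mine):
#include <cstdint>
#include <cstring>
#include <iostream>

// For non-negative IEEE-754 floats, interpreting the bits as an unsigned
// integer preserves the ordering of the floating point values.
std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

int main() {
    std::cout << (float_bits(1.0f) < float_bits(2.0f)) << '\n'; // 1
    std::cout << (float_bits(0.5f) < float_bits(1.5f)) << '\n'; // 1
}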
There are many SO posts discussing such comparisons; here's the most relevant one. A simple check would be
bool is_negative_zero(float val)
{
    // std::signbit is from <cmath> (C++11)
    return (val == 0.0f) && std::signbit(val);
}
This works since 0.0f == -0.0f, although there are places where the sign makes a difference, such as atan2, or where dividing by -0 as opposed to +0 leads to the respective infinities.
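A small sketch of cases where the sign of zero is observable (results per IEEE-754 and the <cmath> specifications):
#include <cmath>
#include <iostream>
int main() {
    std::cout << 1.0 / +0.0 << '\n';             // inf
    std::cout << 1.0 / -0.0 << '\n';             // -inf
    std::cout << std::atan2(+0.0, -1.0) << '\n'; // 3.14159 (pi)
    std::cout << std::atan2(-0.0, -1.0) << '\n'; // -3.14159 (-pi)
}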
To test explicitly for a == -0.0 in C, compare the bit pattern via memcpy (the often-seen *((long *)&a) cast violates strict aliasing and assumes a 64-bit long):
uint64_t bits;                  /* from <stdint.h> */
memcpy(&bits, &a, sizeof bits); /* from <string.h> */
if (bits == 0x8000000000000000ULL) {
    /* a is -0.0 */
}

minimum double value in C/C++

Is there a standard and/or portable way to represent the smallest negative value (e.g. to use negative infinity) in a C(++) program?
DBL_MIN in float.h is the smallest positive normalized number.
-DBL_MAX in ANSI C, which is defined in float.h.
Floating point numbers (IEEE 754) are symmetrical, so if you can represent the greatest value (DBL_MAX or numeric_limits<double>::max()), just prepend a minus sign.
And then there is the cool way:
double f;
*((uint64_t*)&f) = ~(1ULL << 52); // writes the bit pattern of -DBL_MAX; note this pointer cast violates strict aliasing
In C, use
#include <float.h>
const double lowest_double = -DBL_MAX;
In C++ before C++11, use
#include <limits>
const double lowest_double = -std::numeric_limits<double>::max();
In C++11 and onwards, use
#include <limits>
constexpr double lowest_double = std::numeric_limits<double>::lowest();
Try this:
-1 * numeric_limits<double>::max()
Reference: numeric_limits
This class is specialized for each of the fundamental types, with its members returning or set to the different values that define the properties that type has in the specific platform in which it compiles.
Are you looking for actual infinity or the minimal finite value? If the former, use
-numeric_limits<double>::infinity()
which only works if
numeric_limits<double>::has_infinity
Otherwise, you should use
numeric_limits<double>::lowest()
which was introduced in C++11.
If lowest() is not available, you can fall back to
-numeric_limits<double>::max()
which may differ from lowest() in principle, but normally doesn't in practice.
A truly portable C++ solution
As from C++11 you can use numeric_limits<double>::lowest().
According to the standard, it returns exactly what you're looking for:
A finite value x such that there is no other finite value y where y < x.
Meaningful for all specializations in which is_bounded != false.
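A minimal illustration (on IEEE-754 platforms, lowest() is the same value as -max()):
#include <iostream>
#include <limits>
int main() {
    std::cout << std::numeric_limits<double>::lowest() << '\n'; // e.g. -1.79769e+308
    std::cout << (std::numeric_limits<double>::lowest()
                  == -std::numeric_limits<double>::max()) << '\n'; // 1 on IEEE-754 platforms
}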
Lots of non-portable C++ answers here!
There are many answers going for -std::numeric_limits<double>::max().
Fortunately, they will work well in most cases. Floating point encoding schemes decompose a number into a mantissa and an exponent, and most of them (e.g. the popular IEEE-754) use a distinct sign bit, which doesn't belong to the mantissa. This allows transforming the largest positive value into the smallest negative one just by flipping the sign.
Why aren't these portable?
The standard doesn't impose any floating point standard.
I agree that my argument is a little bit theoretical, but suppose that some eccentric compiler maker would use a revolutionary encoding scheme with a mantissa encoded in some variation of two's complement. Two's complement encodings are not symmetric: for example, for a signed 8 bit char the maximum positive value is 127, but the minimum negative value is -128. So we could imagine some floating point encoding showing similar asymmetric behavior.
I'm not aware of any encoding scheme like that, but the point is that the standard doesn't guarantee that flipping the sign yields the intended result. So this popular answer (sorry guys!) can't be considered a fully portable standard solution! /* at least not if you didn't assert that numeric_limits<double>::is_iec559 is true */
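A sketch of how to make that assertion explicit (assuming a C++11 compiler):
#include <limits>
// Fail the build rather than silently rely on sign symmetry.
static_assert(std::numeric_limits<double>::is_iec559,
              "this code assumes IEEE-754 doubles, where -max() is the lowest value");
constexpr double lowest_double = -std::numeric_limits<double>::max();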
- std::numeric_limits<double>::max()
should work just fine
Numeric limits
Is there a standard and/or portable way to represent the smallest negative value (e.g. to use negative infinity) in a C(++) program?
C approach.
Many implementations support +/- infinities, so the most negative double value is -INFINITY.
#include <math.h>
double most_negative = -INFINITY;
Is there a standard and/or portable way ....?
Now we need to also consider other cases:
No infinities
Simply -DBL_MAX.
Only an unsigned infinity.
I'd expect in this case, OP would prefer -DBL_MAX.
De-normal values greater in magnitude than DBL_MAX.
This is an unusual case, likely outside OP's concern. When double is encoded as a pair of floating point values to achieve the desired range/precision (see double-double), there exists a maximum normal double and perhaps a greater de-normal one. I have seen debate about whether DBL_MAX should refer to the greatest normal value or the greatest of both.
Fortunately this paired approach usually includes an -infinity, so the most negative value remains -INFINITY.
For more portability, code can go down the route
// HUGE_VAL is designed to be infinity or DBL_MAX (when infinities are not implemented)
// .. yet is problematic with unsigned infinity.
double most_negative1 = -HUGE_VAL;
// Fairly portable, unless system does not understand "INF"
double most_negative2 = strtod("-INF", (char **) NULL);
// Pragmatic
double most_negative3 = strtod("-1.0e999999999", (char **) NULL);
// Somewhat time-consuming
double most_negative4 = pow(-DBL_MAX, 0xFFFF /* odd value */);
// My suggestion
double most_negative5 = (-DBL_MAX)*DBL_MAX;
The original question concerns infinity.
So, why not use
#define Infinity ((double)(42 / 0.0))
according to the IEEE definition?
You can negate that of course.
If you do not have float exceptions enabled (which you shouldn't imho), you can simply say:
double neg_inf = -1/0.0;
This yields negative infinity. If you need a float, you can either cast the result
float neg_inf = (float)(-1/0.0);
or use single precision arithmetic
float neg_inf = -1.0f/0.0f;
The result is always the same: there is exactly one representation of negative infinity in both single and double precision, and they convert to each other as you would expect.
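A tiny check of that last claim (a sketch; assumes IEEE-754 floats, i.e. is_iec559 is true):
#include <iostream>
#include <limits>
int main() {
    double d_inf = -1 / 0.0;                 // double negative infinity
    float f_inf = static_cast<float>(d_inf); // converts to float negative infinity
    std::cout << (f_inf == -std::numeric_limits<float>::infinity()) << '\n'; // 1
    std::cout << (static_cast<double>(f_inf) == d_inf) << '\n';              // 1
}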