What is (+0)+(-0) by IEEE floating point standard? - c++

Am I right that any arithmetic operation on any floating-point numbers is unambiguously defined by the IEEE floating-point standard? If so, just out of curiosity, what is (+0)+(-0)? And is there a way to check such things in practice, in C++ or another commonly used programming language?

The IEEE 754 rules of arithmetic for signed zeros state that +0.0 + -0.0 depends on the rounding mode. In the default rounding mode, it will be +0.0. When rounding towards -∞, it will be -0.0.
You can check this in C++ like so:
#include <iostream>

int main() {
    std::cout << "+0.0 + +0.0 == " << +0.0 + +0.0 << std::endl;
    std::cout << "+0.0 + -0.0 == " << +0.0 + -0.0 << std::endl;
    std::cout << "-0.0 + +0.0 == " << -0.0 + +0.0 << std::endl;
    std::cout << "-0.0 + -0.0 == " << -0.0 + -0.0 << std::endl;
    return 0;
}
Output:
+0.0 + +0.0 == 0
+0.0 + -0.0 == 0
-0.0 + +0.0 == 0
-0.0 + -0.0 == -0

My answer deals with IEEE 754-2008, the version of the standard current when this was written (the clauses quoted below carry over to IEEE 754-2019).
In the IEEE 754-2008 standard:
Section 4.3 deals with the rounding of values when performing arithmetic operations, which is needed to fit the result into the mantissa of the destination format.
4.3 Rounding-direction attributes
Rounding takes a number regarded as infinitely precise and, if necessary, modifies it to fit in the destination’s format while signaling the inexact exception, underflow, or overflow when appropriate (see 7). Except where stated otherwise, every operation shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result according to one of the attributes in this clause.
The rounding-direction attribute affects all computational operations that might be inexact. Inexact numeric floating-point results always have the same sign as the unrounded result.
The rounding-direction attribute affects the signs of exact zero sums (see 6.3), and also affects the thresholds beyond which overflow and underflow are signaled.
Section 6.3 prescribes the value of the sign bit when performing arithmetic with special values (NaN, infinities, +0, -0).
6.3 The sign bit
When the sum of two operands with opposite signs (or the difference of two operands with like signs) is exactly zero, the sign of that sum (or difference) shall be +0 in all rounding-direction attributes except roundTowardNegative; under that attribute, the sign of an exact zero sum (or difference) shall be −0.
However, x + x = x − (−x) retains the same sign as x even when x is zero.
(emphasis mine)
In other words, (+0) + (-0) = +0 except when the rounding mode is roundTowardNegative, in which case it is (+0) + (-0) = -0.
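One can try to observe the roundTowardNegative case from C++ with std::fesetround from <cfenv> — a minimal sketch; whether the rounding-mode change survives optimization is implementation-dependent (GCC, for example, may need -frounding-math for a reliable result):

#include <cfenv>    // std::fesetround, FE_DOWNWARD
#include <cmath>    // std::signbit
#include <iostream>

int main() {
    volatile double a = +0.0, b = -0.0; // volatile discourages constant folding
    std::fesetround(FE_DOWNWARD);       // IEEE 754 roundTowardNegative
    volatile double sum = a + b;
    // signbit reports the sign even for zeros; expected output: 1 (i.e. -0.0)
    std::cout << std::signbit(sum) << '\n';
}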
In the context of C#:
According to §7.7.4 of the C# Language Specification (emphasis mine):
Floating-point addition:
float operator +(float x, float y);
double operator +(double x, double y);
The sum is computed according to the rules of IEEE 754 arithmetic. The following table lists the results of all possible combinations of nonzero finite values, zeros, infinities, and NaN's. In the table, x and y are nonzero finite values, and z is the result of x + y. If x and y have the same magnitude but opposite signs, z is positive zero. If x + y is too large to represent in the destination type, z is an infinity with the same sign as x + y.
  +  |  x    +0   -0   +∞   -∞   NaN
-----+---------------------------------
  y  |  z    y    y    +∞   -∞   NaN
 +0  |  x    +0   +0   +∞   -∞   NaN
 -0  |  x    +0   -0   +∞   -∞   NaN
 +∞  |  +∞   +∞   +∞   +∞   NaN  NaN
 -∞  |  -∞   -∞   -∞   NaN  -∞   NaN
 NaN |  NaN  NaN  NaN  NaN  NaN  NaN
(+0) + (-0) in C#:
In other words, based on the specification, the addition of two zeros only results in negative zero if both are negative zero. Therefore, the answer to the original question
What is (+0)+(-0) by IEEE floating point standard?
is +0.
Rounding modes in C#:
In case anyone is interested in changing the rounding mode in C#, in "Is there an C# equivalent of c++ fesetround() function?", Hans Passant states:
Never tinker with the FPU control word in C#. It is the worst possible global variable you can imagine. With the standard misery that globals cause, your changes cannot last and will arbitrarily disappear. The internal exception handling code in the CLR resets it when it processes an exception.

Assume the default rounding mode (which is what you are using if you don't know what a rounding mode is or how to change it).
If the exact result is non-zero but so small that it gets rounded to zero, the result is +0 if the exact result is greater than 0, and -0 if the exact result is less than 0. This situation only happens for multiplication and division, not for addition and subtraction.
There are several cases where the exact result is zero. In that case the result is -0 in the following cases: adding (-0) + (-0); subtracting (-0) - (+0); multiplying where one factor is a zero and the other factor has the opposite sign (including (+0) * (-0)); dividing a zero by a non-zero number (including infinity) of the opposite sign. In all other cases, the result is +0.
An unfortunate side effect of this rule is that x + 0.0 is not always identical to x (namely not when x is -0). On the other hand, x - 0.0 is always identical to x. Also, x * 0.0 may be +0 or -0, depending on the sign of x. This prevents some optimisations by compilers that support IEEE 754 precisely, or makes them more difficult.
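A quick way to see these identities in C++ (default rounding mode; a sketch using std::signbit, which reports the sign bit even for zeros):

#include <cmath>    // std::signbit
#include <iostream>

int main() {
    double x = -0.0;
    std::cout << std::signbit(x + 0.0) << '\n';    // 0: -0 + +0 gives +0, not x
    std::cout << std::signbit(x - 0.0) << '\n';    // 1: x - 0.0 keeps the sign of x
    std::cout << std::signbit(x * 0.0) << '\n';    // 1: opposite signs give -0
    std::cout << std::signbit(0.25 * 0.0) << '\n'; // 0: like signs give +0
}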

The answer, by the IEEE floating point standard, is +0.

Related

Issue related to double precision floating point division in C++

In C++, we know that we can find the minimum representable double precision value using std::numeric_limits<double>::min(). The value turns out to be 2.22507e-308 when printed.
Now if this minimum value is subtracted from a given double value (say val) and the difference is then divided by the same value, (val - minval) / val, I was expecting floor((val - minval) / val) to round the quotient down to 0.
To my surprise, the answer is delivered as 1. Can someone please explain this anomalous behavior?
Consider the following code:
#include <cmath>
#include <iostream>
#include <limits>

int main()
{
    double minval = std::numeric_limits<double>::min(), wg = 8038,
           ans = std::floor((wg - minval) / wg); // expecting the answer to round to 0
    std::cout << ans;                            // but the answer actually resulted as 1!
}
A double typically has around 16 digits of precision.
You're starting with 8038. For simplicity, I'm going to call that 8.038e3. Since we have around 16 digits of precision, the smallest number we can subtract from it and still get a result different from 8038 is on the order of 8.038e(3-16) = 8.038e-13.
8038 - 2.2e-308 is like reducing the mass of the universe by one electron, and expecting that to affect the mass of the universe by a significant amount.
Actually, relatively speaking, 8038-2.2e-308 is a much smaller change than removing a whole electron from the universe--more like removing a minuscule fraction of a single electron from the universe, if that were possible. Even if we were to assume that string theory were correct, even removing one string from the universe would still be a huge change compared to subtracting 2.2e-308 from 8038.
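The scale argument is easy to verify — a small sketch showing that the gap between 8038 and the nearest smaller double is around 9e-13, so subtracting 2.2e-308 rounds straight back to 8038:

#include <cmath>
#include <iostream>
#include <limits>

int main() {
    double wg = 8038.0;
    // distance to the largest double below 8038: 2^-40, about 9.09e-13
    std::cout << wg - std::nextafter(wg, 0.0) << '\n';
    // the subtraction rounds back to 8038 exactly, so this prints 1
    std::cout << (wg - std::numeric_limits<double>::min() == wg) << '\n';
}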
The comments and the previous answer correctly attribute the cause to floating-point precision, but some additional detail is needed to explain the behavior fully. In fact, even when the exact result of a subtraction cannot be represented in the finite precision of floating-point numbers, the subtraction is not simply discarded: the inexact result is still rounded to a representable value.
As an example, consider the code below.
#include <cmath>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    double b, c, d;
    vector<double> a{0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7};
    cout << "Subtraction Possible?" << "\t" << "Floor Result" << "\n";
    for( int i = 0; i < 9; i++ ) {
        b = std::nextafter( a[i], 0 ); // largest double below a[i]
        c = a[i] - b;                  // gap between a[i] and its lower neighbour
        d = 1e-17;
        if( d > c )
            cout << "True" << "\t";
        else
            cout << "False" << "\t";
        cout << setprecision(52) << floor((a[i] - d)/a[i]) << "\n";
    }
    return 0;
}
The code takes different double-precision values in the vector a and subtracts 1e-17 from each. Using std::nextafter, the gap between 0.07 and the largest double below it turns out to be 1.387778780781445675529539585113525390625e-17, and 1e-17 is smaller than the corresponding gap for every number in the list. Hence, naively, the subtraction should not be possible for any of the numbers in vector a: if the subtraction results were simply discarded, the floor result would always stay 1. Yet it turns out that sometimes the answer is 0 and other times 1.
This can be observed from the output of the C++ program as shown below:
Subtraction Possible? Floor Result
False 0
False 0
False 0
False 0
False 1
False 1
False 1
False 1
False 1
The reasons lie buried in the floating-point specification prescribed in the IEEE 754 document. The standard specifically states that even when the result of an operation cannot be represented exactly, rounding must be carried out. I quote page 27, Section 4.3 of the IEEE 754-2019 document:
Except where stated otherwise, every operation shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result according to one of the attributes in this clause.
The statement is repeated in Section 5.1, on page 29:
Unless otherwise specified, each of the computational operations specified by this standard that returns a numeric result shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result, if necessary, to fit in the destination's format (see Clause 4 and Clause 7).
g++ (the compiler I have been testing with) follows the standard here, producing code that rounds to nearest as described in Section 4.3.1 of the IEEE 754 document. The implication is that even when a[i] - d is not exactly representable, a numeric result is delivered as if the subtraction first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result. Hence a[i] - d may or may not compare equal to a[i], which means the answer may be 1 or 0 depending on whether the exact difference is closer to a[i] or to the next representable value below a[i].
It turns out that 8038 - 2.22507e-308 is closer to 8038, so it is rounded (to nearest) back to 8038 and the final answer is 1. But this behavior follows directly from the standard's rounding rules; it is not something arbitrary.
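The boundary can be checked directly — a sketch contrasting 0.07 (where 1e-17 exceeds half the gap below it, so the subtraction moves to the neighbouring double) with 0.2 (where it does not):

#include <iostream>

int main() {
    // gap below 0.07 is ~1.39e-17; half of it (~6.9e-18) is less than 1e-17,
    // so 0.07 - 1e-17 rounds to the next double below 0.07 -> prints 0
    std::cout << (0.07 - 1e-17 == 0.07) << '\n';
    // gap below 0.2 is ~2.78e-17; half of it (~1.39e-17) exceeds 1e-17,
    // so 0.2 - 1e-17 rounds back to 0.2 -> prints 1
    std::cout << (0.2 - 1e-17 == 0.2) << '\n';
}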
I found the references below on floating-point numbers very useful. I would recommend reading Cleve Moler's (founder of MATLAB) piece on floating-point numbers before going through the IEEE specification, for a quick and easy understanding of their behavior.
"IEEE Standard for Floating-Point Arithmetic," in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
Moler, Cleve. “Floating Points.” MATLAB News and Notes. Fall, 1996.

Why does adding 1 to numeric_limits<float>::min() return 1?

How come subtracting 1 from float max returns a sensible value, but adding 1 to float min returns 1?
I thought that if you added or subtracted a value smaller than the epsilon for that particular magnitude, then nothing would happen and there would be no increase or decrease.
Here is the code I compiled with g++ with no flags and ran on x86_64.
#include <limits>
#include <iostream>

int main() {
    float min = std::numeric_limits<float>::min() + 1;
    float max = std::numeric_limits<float>::max() - 1;
    std::cout << min << std::endl << max << std::endl;
    return 0;
}
Outputs this:
1
3.40282e+38
I would expect it to output this:
-3.40282e+38
3.40282e+38
std::numeric_limits<float>::min() returns the smallest normalized positive value. To get the value that no other value is lower than, use std::numeric_limits<float>::lowest().
https://en.cppreference.com/w/cpp/types/numeric_limits/min
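For example, printing all three side by side (a minimal sketch):

#include <iostream>
#include <limits>

int main() {
    std::cout << std::numeric_limits<float>::min()    << '\n'; //  1.17549e-38
    std::cout << std::numeric_limits<float>::lowest() << '\n'; // -3.40282e+38
    std::cout << std::numeric_limits<float>::max()    << '\n'; //  3.40282e+38
}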
min is the smallest-magnitude positive normalized float, a very tiny positive number (about 1.17549e-38), not a negative number of large magnitude. Notice that the - is in the exponent of the scientific notation: e-38 means the decimal point is shifted 38 places to the left. Try https://www.h-schmidt.net/FloatConverter/IEEE754.html to play with the bits in a binary float.
std::numeric_limits<float>::min() is the minimum magnitude normalized float, not -max. CppReference even has a note about this possibly being surprising.
Do you know why that was picked to be the value for min() rather than the lowest negative value? Seems to be an outlier with regards to all the other types.
Some of the sophistication in numeric_limits<T>, like lowest and denorm_min, is new in C++11. Before that, the choice of what to define mostly followed C. Historical C valued economy and didn't define a lot of different names. (Smaller was better on ancient computers, and it also meant less stuff in the global namespace, which is all C had access to.)
Float types are normally¹ symmetric around 0 (sign/magnitude representation), so C didn't have a separate named constant for the most-negative float / double / long double, just the FLT_MAX and FLT_MIN CPP macros. C doesn't have templates, so you know when you're writing FP code and can put a - on the appropriate constant if necessary.
If you're only going to have a few named constants, the three most interesting ones are:
FLT_EPSILON tells you about the available precision (mantissa bits): nextafter(1.0, +INF) - 1.0
FLT_MIN / FLT_MAX min (normalized) and max magnitudes of finite floats. This depends mostly on how many exponent bits a float has.
They're not quite symmetric around 1.0, for two reasons: the all-ones mantissa in FLT_MAX, and gradual underflow (subnormals) occupying the lowest exponent field (0 with bias) while FLT_MIN ignores subnormals. FLT_MIN * FLT_MAX is about 3.99999976 for IEEE 754 binary32 float. (You normally want to avoid subnormals for performance reasons, and so you have room for gradual underflow, so it makes sense that FLT_MIN isn't denorm_min.)
(Fun fact: 0.0 is a special case of a subnormal: exponent field = 0 implying a mantissa of 0.xxx instead of 1.xxx).
Footnote 1: CppReference points out that C++11 std::numeric_limits<T>::lowest() could be different from -max for 3rd-party FP types, but isn't for standard C++ FP types.
lowest is what you wanted: the most-negative finite value. It's consistent across integer and FP types as being the most-negative value, so for example you could use it as the initializer for a templated search loop that uses std::max to find the largest value in an array.
C++11 also introduced denorm_min, the minimum positive subnormal aka denormal value for FP types. In IEEE754, the object representation has all bits 0 except for a 1 in the low bit of the mantissa.
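All of these constants are reachable through numeric_limits — a quick sketch printing them for float (note min * max landing just below 4, as described above):

#include <iomanip>
#include <iostream>
#include <limits>

int main() {
    using L = std::numeric_limits<float>;
    std::cout << std::setprecision(9);
    std::cout << "epsilon:    " << L::epsilon()    << '\n'; // 1.1920929e-07
    std::cout << "min:        " << L::min()        << '\n'; // 1.17549435e-38
    std::cout << "denorm_min: " << L::denorm_min() << '\n'; // 1.40129846e-45
    std::cout << "min * max:  " << L::min() * L::max() << '\n'; // 3.99999976
}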
The float result of 1.0 + 1.17549e-38 (after rounding to the nearest float) is exactly 1.0: min is smaller than std::numeric_limits<float>::epsilon(), so the entire change is lost to rounding error when added to 1.0.
So even if you did print the float with full precision (or as a hex float), it would be 1.0. You're just printing with the default formatting for cout, which rounds to some limited precision, like 6 decimal digits. https://en.cppreference.com/w/cpp/io/manip/setprecision
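Printing with std::hexfloat confirms that the sum is exactly 1.0 rather than merely displayed as 1 (a sketch; the exact hex spelling of the output is implementation-dependent):

#include <iostream>
#include <limits>

int main() {
    float x = 1.0f + std::numeric_limits<float>::min();
    std::cout << std::hexfloat << x << '\n'; // 0x1p+0, i.e. exactly 1.0
    std::cout << (x == 1.0f) << '\n';        // 1
}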
(An earlier version of the question included the numeric value of min ~= 1.17549e-38; this answer started out addressing that mixup and I haven't bothered to fully rewrite those parts).

When Will static_casting the Result of ceil Compromise the Result?

static_casting from a floating-point type to an integer truncates toward zero, discarding the fractional part of the number. For example static_cast<int>(13.9999999) yields 13.
Not all integers are representable as floating point numbers. For example internally the closest float to 13,000,000 may be: 12999999.999999.
In this hypothetical case, I'd expect to get an unexpected result from:
const auto foo = 12'999'999.5F;
const auto bar = static_cast<long long>(ceil(foo));
My assumption is that such a breakdown does occur at some point, if not necessarily at 13,000,000. I'd just like to know the range over which I can trust static_cast<long long>(ceil(foo)).
For example internally the closest float to 13,000,000 may be: 12999999.999999.
That is not possible in any normal floating-point format. The floating-point representation of numbers is equivalent to M·b^e, where b is a fixed base (e.g., 2 for binary floating-point) and M and e are integers with some restrictions on their values. In order for a value like 13,000,000−x to be represented, where x is some positive value less than 1, e must be negative (because M·b^e for a non-negative e is an integer). If so, then M·b^0 is an integer larger than M·b^e, so it is larger than 13,000,000, and so 13,000,000 can be represented as M′·b^0, where M′ is a positive integer less than M and hence fits in the range of allowed values for M (in any normal floating-point format). (Perhaps some bizarre floating-point format might impose a strange range on M or e that prevents this, but no normal format does.)
Regarding your code:
#include <cmath>
#include <iostream>
int main() {
    auto test = 0LL;
    const auto floater = 0.5F;
    for (auto i = 0LL; i == test; i = std::ceil(i + floater)) ++test;
    std::cout << test << std::endl;
}
When i was 8,388,608, the mathematical result of 8,388,608 + .5 is 8,388,608.5. This is not representable in the float format on your system, so it was rounded to 8,388,608. The ceil of this is 8,388,608. At this point, test was 8,388,609, so the loop stopped. So this code does not demonstrate that 8,388,608.5 is representable and 8,388,609 is not.
Behavior seems to return to normal if I do: ceil(8'388'609.5F) which will correctly return 8,388,610.
8,388,609.5 is not representable in the float format on your system, so it was rounded by the rule "round to nearest, ties to even." The two nearest representable values are 8,388,609 and 8,388,610. Since the exact value is equally far from both, the tie went to the even one, 8,388,610. That value was passed to ceil, which of course returned 8,388,610.
On Visual Studio 2015 I got 8,388,609 which is a horrifying small safe range.
In the IEEE-754 basic 32-bit binary format, all integers from -16,777,216 to +16,777,216 are representable, because the format has a 24-bit significand.
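That contiguous range is easy to probe — a small sketch at the 2^24 boundary:

#include <iostream>

int main() {
    float f = 16777216.0f;                          // 2^24
    std::cout << (16777215.0f + 1.0f == f) << '\n'; // 1: exact up to 2^24
    std::cout << (f + 1.0f == f) << '\n';           // 1: 16,777,217 has no float representation
}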
Floating point numbers are represented by three integers c, b, and q, combined as c·b^q, where:
c is the mantissa (so for the number 12,999,999.999999, c would be 12,999,999,999,999)
q is the exponent (so for the number 12,999,999.999999, q would be -6)
b is the base (IEEE-754 requires b to be either 10 or 2; in the representation above b is 10)
From this it's easy to see that a floating point with the capability of representing 12,999,999.999999 also has the capability of representing 13,000,000.000000, using a c of 1,300,000,000,000 and a q of -5.
This example is a bit contrived in that the chosen b is 10, where in almost all implementations the chosen base is 2. But it's worth pointing out that even with a b of 2, the q functions as a left or right shift of the mantissa.
Next let's talk about range. Obviously a 32-bit floating point cannot represent all the integers represented by a 32-bit integer, as the floating point must also represent so many much larger or smaller numbers. Since the exponent is simply shifting the mantissa, a floating point number can always exactly represent every integer that can be represented by its mantissa. Given the traditional IEEE-754 binary floating point formats:
A 32-bit (float) has a 24-bit mantissa, so it can represent all integers in the range [-16,777,215, 16,777,215]
A 64-bit (double) has a 53-bit mantissa, so it can represent all integers in the range [-9,007,199,254,740,991, 9,007,199,254,740,991]
A 128-bit (long double, depending upon implementation) has a 113-bit mantissa, so it can represent all integers in the range [-10,384,593,717,069,655,257,060,992,658,440,191, 10,384,593,717,069,655,257,060,992,658,440,191]
C++ provides numeric_limits<T>::digits as a way of finding this number for a given floating point type. (Though admittedly even a long long is too small to represent a 113-bit mantissa.) For example a float's maximum mantissa value could be found by:
(1LL << numeric_limits<float>::digits) - 1LL
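A runnable sketch of that expression for float and double:

#include <iostream>
#include <limits>

int main() {
    // 24 mantissa bits -> 16777215
    std::cout << (1LL << std::numeric_limits<float>::digits) - 1LL << '\n';
    // 53 mantissa bits -> 9007199254740991
    std::cout << (1LL << std::numeric_limits<double>::digits) - 1LL << '\n';
}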
Having thoroughly explained the mantissa, let's revisit the exponent to talk about how a floating point is actually stored. Take 13,000,000.0, which could be represented as:
c = 13, q = 6, b = 10
c = 130, q = 5, b = 10
c = 1,300, q = 4, b = 10
And so on. For the traditional binary format IEEE-754 requires:
The representation is made unique by choosing the smallest representable exponent that retains the most significant bit (MSB) within the selected word size and format. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers
To explain this in the more familiar base 10: if our mantissa had 14 decimal places, the implementation would look like this:
c = 13,000,000,000,000, so the MSB will be used in the represented number
q = 6; this is a little confusing because of the bias introduced here: logically q = -6, but the bias is set so that when q = 0, only the MSB of c is immediately to the left of the decimal point, meaning that c = 13,000,000,000,000, q = 0, b = 10 would represent 1.3
b = 10; again, the above rules are really only required for base 2, but I've shown them as they would apply to base 10 for the purpose of explanation
Translated back to base 2, this means that a number whose q is numeric_limits<T>::digits - 1 or larger has only zeros after the decimal point, and ceil only has an effect if there is a fractional part of the number.
A final point of explanation here is the range over which ceil will have an effect. Once the exponent of a floating point is larger than numeric_limits<T>::digits, continuing to increase it only introduces trailing zeros into the resulting number; thus calling ceil when q is greater than or equal to numeric_limits<T>::digits - 2LL does nothing. And since we know the MSB of c will be used in the number, this means that c must be smaller than (1LL << (numeric_limits<T>::digits - 1LL)) - 1LL. Thus for ceil to have an effect on the traditional binary IEEE-754 floating point:
A 32-bit (float) must be smaller than 8,388,607
A 64-bit (double) must be smaller than 4,503,599,627,370,495
A 128-bit (long double depending upon implementation) must be smaller than 5,192,296,858,534,827,628,530,496,329,220,095

Interchangeability of IEEE 754 floating-point addition and multiplication

Is the addition x + x interchangeable by the multiplication 2 * x in IEEE 754 (IEC 559) floating-point standard, or more generally speaking is there any guarantee that case_add and case_mul always give exactly the same result?
#include <cstddef>
#include <limits>

template <typename T>
T case_add(T x, size_t n)
{
    static_assert(std::numeric_limits<T>::is_iec559, "invalid type");
    T result(x);
    for (size_t i = 1; i < n; ++i)
    {
        result += x;
    }
    return result;
}

template <typename T>
T case_mul(T x, size_t n)
{
    static_assert(std::numeric_limits<T>::is_iec559, "invalid type");
    return x * static_cast<T>(n);
}
Is the addition x + x interchangeable by the multiplication 2 * x in IEEE 754 (IEC 559) floating-point standard
Yes: x + x and 2 * x are mathematically identical, and doubling a binary floating-point number is exact (it only increments the exponent, barring overflow, where both forms overflow identically), so they give the same result.
or more generally speaking is there any guarantee that case_add and case_mul always give exactly the same result?
Not generally, no. From what I can tell, it seems to hold for n <= 5:
n=3: x+x is exact (i.e. involves no rounding), so (x+x)+x involves only one rounding, at the final step.
n=4 (assuming the default rounding mode):
if the last bit of x is 0, then x+x+x is exact, and so the results are equal by the same argument as n=3.
if the last 2 bits are 01, then the exact value of x+x+x will have last 2 bits of 1|1 (where | indicates the boundary of the representable bits), which will be rounded up to 0|0. The next addition will give an exact result ending in |01, so the result will be rounded down, cancelling out the previous error.
if the last 2 bits are 11, then the exact value of x+x+x will have last 2 bits of 0|1, which will be rounded down to 0|0. The next addition will give an exact result ending in |11, so the result will be rounded up, again cancelling out the previous error.
n=5 (again, assuming default rounding): since x+x+x+x is exact, it holds for the same reason as n=3.
For n=6 it fails, e.g. take x to be 1.0000000000000002 (the next double after 1.0), in which case 6x is 6.000000000000002 and x+x+x+x+x+x is 6.000000000000001
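A direct check of that n=6 counterexample (a sketch; std::nextafter(1.0, 2.0) produces the 1.0000000000000002 mentioned above):

#include <cmath>
#include <iostream>

int main() {
    double x = std::nextafter(1.0, 2.0);        // 1.0000000000000002
    double add = ((((x + x) + x) + x) + x) + x; // case_add order of operations
    double mul = x * 6.0;                       // case_mul
    std::cout.precision(17);
    std::cout << add << '\n';                   // 6.0000000000000009
    std::cout << mul << '\n';                   // 6.0000000000000018
    std::cout << (add == mul) << '\n';          // 0
}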
If n is, for example, pow(2, 54), then the multiplication will work just fine, but on the addition path, once the running result is sufficiently larger than the input x, result += x will leave result unchanged.
Yes, but it doesn't hold generally. Multiplication by a number greater than 2 might not give the same result, as it changes the exponent and can drop a bit if you replace it with adds. Multiplication by two can't drop a bit if replaced by add operations, however.
If the accumulator result in case_add becomes too large, adding x will introduce rounding errors. At a certain point, adding x won't have an effect at all. So the functions won't give the same result.
For example if double x = 0x1.0000000000001p0 (hexadecimal float notation):
n   case_add                case_mul
1   0x1.0000000000001p+0    0x1.0000000000001p+0
2   0x1.0000000000001p+1    0x1.0000000000001p+1
3   0x1.8000000000002p+1    0x1.8000000000002p+1
4   0x1.0000000000001p+2    0x1.0000000000001p+2
5   0x1.4000000000001p+2    0x1.4000000000001p+2
6   0x1.8000000000001p+2    0x1.8000000000002p+2

Division z / (x/n) when n is 0

I have an arithmetic expression, for example:
#include <iostream>
int main() {
    float z = 8.0f;
    float x = 3.0f;
    float n = 0.0f;
    std::cout << z / (x / n) + 1 << std::endl;
}
Why do I get a normal answer equal to 1, when it should be "nan", "1.#inf", etc.?
I assume you're using floating point arithmetic (though one can't be sure, because you're not telling us).
IEEE 754 floating-point semantics work on the extended real line and include infinities at both ends. This makes division with a non-zero numerator well-defined for any (non-NaN) denominator, consistent with (i.e. extending continuously) the usual arithmetic rules: x / n is infinity, and z divided by infinity is zero, just as if you had simplified the expression to n * z / x.
The only genuinely undefined quantities are 0/0 and inf/inf, which are represented by the special value NaN.
IEEE 754 specifies that 3/0 = Inf (or anything positive instead of 3), and 8/Inf gives 0; add 1 and you get 1. This is because 0 denotes "0 or something very close to it" and Inf denotes "infinity or a very big number". This also allows some operations on limits to be performed, as it effectively extends the real numbers with infinities. NaNs are reserved for when the limit is not achievable (or not easily computable by a simple implementation).
As a side effect you get some oddities, like 0 == -0 but 1/0 == Inf and 1/-0 == -Inf. It is also important to remember that FP arithmetic is not exact real arithmetic: for example cos(x) * cos(x) + sin(x) * sin(x) - 1 != 0 even if x != NaN && x != Inf && x != -Inf. For floats and x == 1 the result is -5.9604645e-8. Therefore not every expectation transfers easily to it, division by 0 in this case.
While C/C++ does not mandate that the IEEE 754 specification be used for floating-point numbers, it is the prevailing specification, implemented on virtually all hardware, and for that reason it is used by most C/C++ implementations.
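Putting the pieces together (a sketch assuming IEEE 754 semantics; the exact spellings "inf"/"nan" in the output vary by implementation):

#include <iostream>

int main() {
    float z = 8.0f, x = 3.0f, n = 0.0f;
    std::cout << x / n << '\n';           // inf:  positive / +0
    std::cout << z / (x / n) << '\n';     // 0:    finite / infinity
    std::cout << z / (x / n) + 1 << '\n'; // 1
    std::cout << n / n << '\n';           // nan:  0/0 is genuinely undefined
    std::cout << 1.0f / -0.0f << '\n';    // -inf: the zeros carry a sign
}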