Floating point multiplication results in infinity even though rounding to nearest - c++

Consider the following C++ code:
#include <fenv.h>
#include <iostream>
using namespace std;
int main() {
    fesetround(FE_TONEAREST);
    double a = 0x1.efc7f0001p+376;
    double b = -0x1.0fdfdp+961;
    double c = a * b;
    cout << a << " " << b << " " << c << endl;
}
The output that I see is
2.98077e+113 -2.06992e+289 -inf
I do not understand why c is infinity. My understanding is that the most negative finite floating point value should be chosen instead: it is finite, and any finite number is closer to the exact value of a*b than negative infinity is. Why is infinity output here?
This was run on 64-bit x86, and the assembly uses SSE instructions. It was compiled with -O0, and the behaviour occurs with both clang and gcc.
The result is the most negative finite floating point value (-DBL_MAX) if the round-toward-zero mode is used, so I conclude that the issue is rounding related.

Rounding is not the primary issue here. The infinite result is caused by overflow.
This answer follows the rules in IEEE 754-2019. The real-number-arithmetic product of 0x1.EFC7F0001p+376 and −0x1.0FDFDp+961 is around −0x1.07430C8649FEFE8p+1338. In normal conditions, floating-point arithmetic produces a result “as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result…” (IEEE 754-2019 4.3). However, we do not have normal conditions. IEEE 754-2019 7.4 says:
The overflow exception shall be signaled if and only if the destination format’s largest finite number is exceeded in magnitude by what would have been the rounded floating-point result (see 4) were the exponent range unbounded…
In other words, if we rounded the result as if we could have any exponent (so we are just rounding the significand), the result would be −0x1.07430C8649FF0p+1338. But the magnitude of that exceeds the largest finite number that double can represent, ±0x1.FFFFFFFFFFFFFp+1023. Therefore, an overflow exception is signaled, and, since you do not catch the exception, a default result is delivered:
… The default result shall be determined by the rounding-direction attribute and the sign of the intermediate result as follows:
a) roundTiesToEven and roundTiesToAway carry all overflows to ∞ with the sign of the intermediate result…
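This exception can be observed directly. Below is a minimal sketch, assuming an implementation that exposes the IEEE exception flags through <cfenv> (strictly conforming access may also need #pragma STDC FENV_ACCESS or a flag such as GCC's -frounding-math):
#include <cfenv>
#include <iostream>
int main() {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double a = 0x1.efc7f0001p+376; // volatile keeps the multiply
    volatile double b = -0x1.0fdfdp+961;    // from being constant-folded
    volatile double c = a * b;
    (void)c;
    // Expected to print 1: the multiplication raised the overflow flag.
    std::cout << (std::fetestexcept(FE_OVERFLOW) != 0) << '\n';
}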

The behaviour you note (when using the FE_TONEAREST rounding mode) conforms to the IEEE-754 (or the equivalent ISO/IEC/IEEE 60559:2011) Standard. (Your examples imply that your platform uses IEEE-754 representation, as many – if not most – platforms do, these days.)
From this Wikipedia page, footnote #18, which cites IEEE-754 §4.3.1, dealing with the "rounding to nearest" modes:
In the following two rounding-direction attributes, an infinitely precise result with magnitude at least b^emax × (b − ½b^(1−p)) shall round to ∞ (infinity) with no change in sign.
The 'infinitely precise' result of your a * b calculation does, indeed, have a magnitude greater than the specified value, so the rounding is Standard-conformant.
I can't find a similar IEEE-754 citation for the "round-towards-zero" mode, but this GNU libc manual has this to say:
Round toward zero. All results are rounded to the largest representable value whose magnitude is less than that of the result. In other words, if the result is negative it is rounded up; if it is positive, it is rounded down.
So, again, when using that mode, rounding to -DBL_MAX is appropriate/conformant.
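The round-toward-zero behaviour can be checked with a small sketch (same assumptions as the question's setup: an IEEE-754 platform, compiled at -O0 or with -frounding-math so the rounding-mode change is honoured):
#include <cfenv>
#include <cfloat>
#include <iostream>
int main() {
    std::fesetround(FE_TOWARDZERO);
    volatile double a = 0x1.efc7f0001p+376; // volatile: force a run-time multiply
    volatile double b = -0x1.0fdfdp+961;
    double c = a * b;
    // Expected to print 1: the product rounds toward zero to -DBL_MAX
    // instead of overflowing to -inf.
    std::cout << (c == -DBL_MAX) << '\n';
}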

Related

C++ std::stof doesn't work for float less than FLT_MIN?

I'm having issues with the following line of code:
float temp = std::stof("-8.34416e-46");
My program aborts as soon as it reaches this line. It's clear that this float is less than FLT_MIN, but such floats are allowed to exist. (For example, float temp = -8.34416e-46; works fine.) Is the stof method supposed to only work for values between FLT_MIN and FLT_MAX?
If so, what would be a good alternative to get a string ("-8.34416e-46") into a float?
An Alternative
Convert to double with std::stod and then assign to float. Add bounds checking if desired.
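A minimal sketch of that workaround (parse_float is a hypothetical helper name, and the clamping policy shown is just one possible choice):
#include <iostream>
#include <limits>
#include <string>
float parse_float(const std::string& s) {
    double d = std::stod(s); // double has a far wider range than float
    // Optional bounds checking: clamp values beyond float's finite range.
    if (d > std::numeric_limits<float>::max()) d = std::numeric_limits<float>::max();
    if (d < std::numeric_limits<float>::lowest()) d = std::numeric_limits<float>::lowest();
    return static_cast<float>(d); // converts with rounding to the nearest float
}
int main() {
    // Expected to print the nearest float, the subnormal -1.4013e-45.
    std::cout << parse_float("-8.34416e-46") << '\n';
}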
Converting “-8.34416e-46”
The C++ standard allows conversions of strings to float to report underflow if the result is in the subnormal range even though it is representable.
When rounding to the nearest representable value is used, −8.34416•10⁻⁴⁶ is within the range of float (in C++ implementations that use IEEE-754 binary32 for float, which is common), but it is in the subnormal range. The C++ standard says stof calls strtof and then defers to the C standard to define strtof. The C standard indicates that strtof may underflow, about which it says “The result underflows if the magnitude of the mathematical result is so small that the mathematical result cannot be represented, without extraordinary roundoff error, in an object of the specified type.” That is awkward phrasing, but it refers to the rounding errors that occur when subnormal values are encountered. (Subnormal values are subject to larger relative errors than normal values, so their rounding errors might be said to be extraordinary.)
Thus, a C++ implementation is allowed by the C++ standard to underflow for subnormal values even though they are representable.
The smallest positive magnitude in the binary32 format is 2⁻¹⁴⁹, about 1.4•10⁻⁴⁵. 8.34416•10⁻⁴⁶ is smaller than this, but it is greater than half of 2⁻¹⁴⁹. That means, between 0 and 2⁻¹⁴⁹, it is closer to the latter, so conversion with rounding to the nearest representable value will produce 2⁻¹⁴⁹ rather than zero. Unfortunately, your strtof implementation chooses to report underflow rather than completing a conversion to the nearest representable value.
Normal and subnormal values
For IEEE-754 32-bit binary floating-point, the normal range is from 2⁻¹²⁶ to 2¹²⁸−2¹⁰⁴. Within this range, every number is represented with a significand (the fraction portion of the floating-point representation) that has a leading 1 bit followed by 23 additional bits, and so the error that occurs when rounding any real number in this range to the nearest representable value is at most 2⁻²⁴ times the position value of the leading bit.
In addition to this normal range, there is a subnormal range from 2⁻¹⁴⁹ to 2⁻¹²⁶−2⁻¹⁴⁹. In this interval, the exponent part of the floating-point format has reached its smallest value and cannot be decreased any more. To represent smaller and smaller numbers in this interval, the significand is reduced below the normal minimum of 1. It starts with a 0 and is followed by 23 additional bits. In this interval, the error that occurs when rounding a real number to the nearest representable value may be larger than 2⁻²⁴ times the position value of the leading bit. Since the exponent cannot be decreased any further, numbers in this interval have increasing numbers of leading 0 bits as they get smaller and smaller. Thus the relative errors involved in using these numbers grow.
For whatever reasons, the C++ standard says that implementations may report underflow in this interval. (The IEEE-754 standard defines underflow in complicated ways and also allows implementations some choices.)
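The boundary values described above can be inspected directly; a small sketch (%a prints hexadecimal floating point, and the last line assumes round-to-nearest conversion of the literal):
#include <cstdio>
#include <limits>
int main() {
    std::printf("%a\n", std::numeric_limits<float>::min());        // 0x1p-126, smallest normal
    std::printf("%a\n", std::numeric_limits<float>::denorm_min()); // 0x1p-149, smallest subnormal
    std::printf("%a\n", -8.34416e-46f); // the literal rounds to -0x1p-149
}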

Does IEEE 754 float division or subtraction by itself always result in the same value?

Unless an IEEE 754 value is NaN, ±0.0 or ±Infinity, is dividing it by itself guaranteed to result in exactly 1.0?
Similarly, is subtracting it from itself guaranteed to always result in ±0.0?
IEEE 754-2008 4.3 says:
… Except where stated otherwise, every operation shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result according to one of the attributes in this clause.
When an intermediate result is representable, all of the rounding attributes round it to itself; rounding changes a value only when it is not representable. Rules for subtraction and division are given in 5.4, and they do not state exceptions for the above.
Zero and one are representable per 3.3, which specifies the sets of numbers representable in any format conforming to the standard. In particular, zero may be represented by a significand with all zero digits, and one may be represented with a significand starting with 1 and followed by “.000…000” and an exponent of zero. The minimum and maximum exponents of a format are defined so that they always include zero (emin is 1−emax, so zero is between them, inclusive, unless emax is less than 1, in which case no numbers are representable, i.e. the format is empty and not an actual format).
Since rounding does not change representable values and zero and one are representable, dividing a finite non-zero value by itself always produces one, and subtracting a finite value from itself always produces zero.
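A quick illustration of that conclusion (a sketch, assuming an IEEE-754-conforming double):
#include <iostream>
int main() {
    double x = 0.1; // not exactly representable, but finite and non-zero
    // Both comparisons are expected to print 1 (true).
    std::cout << (x / x == 1.0) << '\n' << (x - x == 0.0) << '\n';
}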
Dividing a value by itself and subtracting it from itself, if the same value is used: yes, IEEE 754 requires these operations to produce the closest, consistent result.

Is it possible in floating point to return 0.0 subtracting two different values?

Due to the approximate nature of floating point, it's possible that two different sets of values produce the same result.
Example:
#include <iostream>
int main() {
    std::cout.precision(100);
    double a = 0.5;
    double b = 0.5;
    double c = 0.49999999999999994;
    std::cout << a + b << std::endl; // output "exact" 1.0
    std::cout << a + c << std::endl; // output "exact" 1.0
}
But is it also possible with subtraction? I mean: are there two sets of different values (keeping one value common between them) that return 0.0?
i.e. a - b = 0.0 and a - c = 0.0, given some sets of a,b and a,c with b != c?
The IEEE-754 standard was deliberately designed so that subtracting two values produces zero if and only if the two values are equal, except that subtracting an infinity from itself produces NaN and/or an exception.
Unfortunately, C++ does not require conformance to IEEE-754, and many C++ implementations use some features of IEEE-754 but do not fully conform.
A not uncommon behavior is to “flush” subnormal results to zero. This is part of a hardware design to avoid the burden of handling subnormal results correctly. If this behavior is in effect, the subtraction of two very small but different numbers can yield zero. (The numbers would have to be near the bottom of the normal range, having some significand bits in the subnormal range.)
Sometimes systems with this behavior may offer a way of disabling it.
Another behavior to beware of is that C++ does not require floating-point operations to be carried out precisely as written. It allows “excess precision” to be used in intermediate operations and “contractions” of some expressions. For example, a*b - c*d may be computed by using one operation that multiplies a and b and then another that multiplies c and d and subtracts the result from the previously computed a*b. This latter operation acts as if c*d were computed with infinite precision rather than rounded to the nominal floating-point format. In this case, a*b - c*d may produce a non-zero result even though a*b == c*d evaluates to true.
Some C++ implementations offer ways to disable or limit such behavior.
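A sketch of the contraction effect, using std::fma to emulate what a contracting compiler may generate for a*b - c*d (the value 1 + 2⁻²⁷ is just a convenient example whose square is inexact):
#include <cmath>
#include <cstdio>
int main() {
    double a = 1.0 + 0x1p-27; // plays the role of a, b, c and d
    double p = a * a;         // the rounded product; here a*b == c*d trivially
    // fma computes a*a - p with the product held at infinite precision,
    // exposing the rounding error of a*a: prints 0x1p-54, not zero.
    std::printf("%a\n", std::fma(a, a, -p));
}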
The gradual underflow feature of the IEEE floating point standard prevents this. Gradual underflow is achieved by subnormal (denormal) numbers, which are spaced evenly (as opposed to logarithmically, like normal floating point numbers) and located between the smallest negative and positive normal numbers, with the zeroes in the middle. As they are evenly spaced, the addition of two subnormal numbers of differing sign (i.e. subtraction towards zero) is exact and therefore won't reproduce what you ask. The smallest subnormal is (much) less than the smallest distance between normal numbers, and therefore any subtraction between unequal normal numbers is going to be closer to a subnormal than to zero.
If you disable IEEE conformance using a special denormals-are-zero (DAZ) or flush-to-zero (FTZ) mode of the CPU, then indeed you could subtract two small, close numbers which would otherwise result in a subnormal number, which would be treated as zero due to the mode of the CPU. A working example (Linux):
#include <xmmintrin.h> // _MM_SET_FLUSH_ZERO_MODE (x86 SSE)
#include <cmath>
#include <iostream>
#include <limits>
int main() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); // system specific
    double d = std::numeric_limits<double>::min(); // smallest normal
    double n = std::nextafter(d, 10.0); // second smallest normal
    double z = d - n; // a negative subnormal (flushed to zero)
    std::cout << (z == 0) << '\n' << (d == n);
}
This should print
1
0
The first 1 indicates that the result of the subtraction is exactly zero, while the second 0 indicates that the operands are not equal.
Unfortunately the answer is dependent on your implementation and the way it is configured. C and C++ don't demand any specific floating point representation or behavior. Most implementations use the IEEE 754 representations, but they don't always precisely implement IEEE 754 arithmetic behaviour.
To understand the answer to this question we must first understand how floating point numbers work.
A naive floating point representation would have an exponent, a sign and a mantissa. Its value would be
(−1)^s × 2^(e − e₀) × (m / 2^M)
Where:
s is the sign bit, with a value of 0 or 1.
e is the exponent field
e0 is the exponent bias. It essentially sets the overall range of the floating point number.
M is the number of mantissa bits.
m is the mantissa, with a value between 0 and 2^M − 1
This is similar in concept to the scientific notation you were taught in school.
However, this format has many different representations of the same number; nearly a whole bit's worth of encoding space is wasted. To fix this we can add an "implicit 1" to the mantissa.
(−1)^s × 2^(e − e₀) × (1 + m / 2^M)
This format has exactly one representation of each number. However, there is a problem with it: it can't represent zero or numbers close to zero.
To fix this IEEE floating point reserves a couple of exponent values for special cases. An exponent value of zero is reserved for representing small numbers known as subnormals. The highest possible exponent value is reserved for NaNs and infinities (which I will ignore in this post since they aren't relevant here). So the definition now becomes:
(−1)^s × 2^(1 − e₀) × (m / 2^M) when e = 0
(−1)^s × 2^(e − e₀) × (1 + m / 2^M) when 0 < e < 2^E − 1 (where E is the number of exponent bits)
With this representation, smaller numbers always have a step size that is less than or equal to that of larger ones. So provided the result of the subtraction is smaller in magnitude than both operands, it can be represented exactly. In particular, results close to but not exactly zero can be represented exactly.
This does not apply if the result is larger in magnitude than one or both of the operands, for example subtracting a small value from a large value or subtracting two values of opposite signs. In those cases the result may be imprecise but it clearly can't be zero.
Unfortunately, FPU designers cut corners. Rather than including the logic to handle subnormal numbers quickly and correctly, they either did not support (non-zero) subnormals at all, or provided slow support for subnormals and gave the user the option to turn it on and off. If support for proper subnormal calculations is not present or is disabled, and the number is too small to represent in normalized form, then it will be "flushed to zero".
So in the real world under some systems and configurations subtracting two different very-small floating point numbers can result in a zero answer.
Excluding funny numbers like NaN, I don't think it's possible.
Let's say a and b are normal finite IEEE 754 floats, and |a - b| is less than or equal to both |a| and |b| (otherwise the result is clearly not zero).
That means the exponent of a - b is less than or equal to both a's and b's exponents, so its absolute precision is at least as high, which makes the subtraction exactly representable. That means that if a - b == 0, then it is exactly zero, so a == b.
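A small demonstration of that argument (a sketch; 1.0000000000000002 is the decimal literal for the double just above 1.0):
#include <cstdio>
int main() {
    double a = 1.0000000000000002; // 1 + 2^-52
    double b = 1.0;
    // The difference is exactly representable: prints 0x1p-52, not 0.
    std::printf("%a\n", a - b);
}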

How does C++ round int to float/double?

How does C++ round, if signed/unsigned integers are implicitly converted to floats/doubles?
Like:
int myInt = SomeNumberWeCantExpressWithAFloat;
float myFloat = myInt;
My university lecture notes say the following: The resulting value is the representable value nearest to the original value, where ties are broken in an implementation-defined fashion.
Please explain how the "nearest representable value" is calculated and what "where ties are broken in an implementation-defined fashion" is supposed to mean.
Edit:
Since I work most of my time with the GCC, please give additional information about what floating point representation the GCC uses by default, if there is one.
Single-precision floating point numbers have a 24-bit mantissa. On systems with a 32-bit int representation, values above 2²⁴ and below −2²⁴ may require rounding.
The value 2²⁴ + 1 = 16777217 is the first int that cannot be represented exactly in IEEE binary32 format. Two float representations are available: 16777216, which is below the exact value by 1, and 16777218, which is above the exact value, also by 1. Hence, we have a tie, meaning that C++ is allowed to choose either one of these two representations.
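For example (a sketch; the printed value assumes the common round-to-nearest, ties-to-even behaviour described below):
#include <iostream>
int main() {
    int myInt = 16777217; // 2^24 + 1, exactly halfway between two floats
    float myFloat = myInt;
    std::cout.precision(10);
    std::cout << myFloat << '\n'; // 16777216: the tie went to the even significand
}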
IEEE 754 specifies five different rounding modes, which also govern how such conversions from integers are rounded:
A very common mode is called round to nearest, ties to even.
From GCC Wiki:
Without any explicit options, GCC assumes round to nearest or even and does not care about signalling NaNs. Compare with C99's #pragma STDC FENV_ACCESS OFF. Also, see note on x86 and m68080.
Round to nearest, ties to even
From Wikipedia:
Rounding a number y to the nearest integer requires some tie-breaking rule for those cases when y is exactly half-way between two integers — that is, when the fraction part of y is exactly 0.5.
In such a situation the even one would be chosen. This applies to positive as well as negative numbers.
Sources:
https://en.wikipedia.org/wiki/Rounding#Tie-breaking
https://gcc.gnu.org/wiki/FloatingPointMath
https://en.wikipedia.org/wiki/IEEE_floating_point
Feel free to edit. Additional information about conversion rules for rational/irrational numbers is appreciated.

How does computer convert floating point numbers to decimal string?

When I run the following code, the output is accurately the number 2⁵⁰⁰ in decimal.
(g++ 5.3.1 on ubuntu)
#include <iostream>
#include <cmath>
using namespace std;
int main() {
    cout.precision(0);
    cout << fixed << pow(2.0, 500.0);
    return 0;
}
I wonder how C++ converted this floating point number to its decimal string at such a high precision.
I know that 2⁵⁰⁰ can be accurately represented in IEEE 754 format. But I think mod 10 and divide by 10 can cause precision loss on floating point numbers. What algorithm is used when the conversion proceeds?
Yes, there exists an exact double-precision floating-point representation for 2⁵⁰⁰. You should not assume that pow(2.0,500.0) produces this value, though. There is no guarantee of accuracy for the function pow, and you may find SO questions that arose from pow(10.0, 2.0) not producing 100.0, although the mathematical result was perfectly representable too.
But anyway, to answer your question, the conversion from the floating-point binary representation to decimal does not in general rely on floating-point operations, which indeed would be too inaccurate for the intended accuracy of the end result. In general, accurate conversion requires reliance on big integer arithmetic. In the case of 2⁵⁰⁰, for instance, the naïve algorithm would be to repeatedly divide the big integer written in binary 1000…<500 zeroes in total>… by ten.
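A toy variant of that big-integer approach (a sketch; instead of repeatedly dividing the binary number by ten, it computes the same exact digits by doubling a decimal digit string 500 times, which is likewise pure integer arithmetic):
#include <iostream>
#include <string>
#include <vector>
int main() {
    std::vector<int> digits{1}; // little-endian decimal digits, starting at 1
    for (int i = 0; i < 500; ++i) { // doubling 500 times yields 2^500
        int carry = 0;
        for (int& d : digits) {
            int v = d * 2 + carry;
            d = v % 10;
            carry = v / 10;
        }
        if (carry) digits.push_back(carry);
    }
    std::string s;
    for (auto it = digits.rbegin(); it != digits.rend(); ++it)
        s += static_cast<char>('0' + *it);
    std::cout << s << '\n'; // the exact decimal expansion of 2^500
}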
There are some cases where floating-point arithmetic can be used, for instance taking advantage of the fact that powers of 10 up to 10²² are represented exactly in IEEE 754 double-precision. But correctly rounded conversion between binary floating-point and decimal floating-point always requires big integer arithmetic in general, and this is particularly visible far away from 1.0.
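That boundary is easy to observe (a sketch; the second line prints the double nearest to 10²³, since 10²³ itself is not exactly representable):
#include <cstdio>
int main() {
    std::printf("%.0f\n", 1e22); // 10000000000000000000000 (exact)
    std::printf("%.0f\n", 1e23); // 99999999999999991611392 (nearest double)
}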