An MSDN article mentions that when /fp:fast mode is enabled, operations such as the additive identity (a ± 0.0 = a, 0.0 - a = -a) are unsafe. Is there an example where a + 0 != a under this mode?
EDIT: As someone mentioned below, this sort of issue normally comes up when doing comparisons. My issue is from a comparison; the pseudocode looks like this:
for (i = 0; i < v.len; i++)
{
    sum += v[i];
    if (sum >= threshold) break;
}
It breaks after adding a v[i] whose value is 0. The v[i] is not computed; it is assigned. I understand that if my v[i] came from a calculation then rounding might come into play, but why, even though I give v[i] a zero value, do I still get sum < threshold but sum + v[i] >= threshold?
The reason that it's "unsafe" is that what the compiler assumes to be zero may not really end up being zero, due to rounding errors.
Take this example, which adds two floats at the edge of the precision that 32-bit floats allow:
#include <iostream>
int main() {
    float a = 33554430, b = 16777215;
    float x = a + b;
    float y = x - a - b;
    float z = 1;
    z = z + y;
    std::cout << z << '\n'; // 0 without /fp:fast; may be 1 with /fp:fast
}
With fp:fast, the compiler says "since x = a + b, y = x - a - b = 0, so 'z + y' is just z". However, due to rounding errors, y actually ends up being -1, not 0. So you would get a different result without fp:fast.
It's not saying something 'fixed' like, "if you set /fp:fast, and variable a happens to be 3.12345, then a+0 might not be a". It's saying that when you set /fp:fast, the compiler will take shortcuts that mean that if you compute a+0, and then compare that to what you stored for a, there is no guarantee that they'll be the same.
There is a great write up on this class of problems (which are endemic to floating point calculations on computers) here: http://www.parashift.com/c++-faq-lite/floating-point-arith2.html
If a is -0.0, then a + 0.0 is +0.0.
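A minimal way to observe this (a sketch; std::signbit is from <cmath>, and volatile is only there to discourage constant folding):

#include <cmath>
#include <iostream>

int main() {
    volatile double a = -0.0;
    std::cout << std::signbit(a) << ' '  // 1: a carries a negative sign
              << std::signbit(a + 0.0)   // 0: -0.0 + 0.0 is +0.0 under IEEE 754
              << '\n';
}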
it mentions that when fp:fast mode is enabled, operations like the additive identity (a±0.0 = a, 0.0-a = -a) are unsafe.
What that article says is
Any of the following (unsafe) algebraic rules may be employed by the optimizer when the fp:fast mode is enabled:
And then it lists a±0.0 = a and 0.0-a = -a
It is not saying that these identities are unsafe when fp:fast is enabled. It is saying that these identities are not true for IEEE 754 floating point arithmetic but that /fp:fast will optimize as though they are true.
I'm not certain of an example that shows a + 0.0 == a evaluating to false (except for NaN, obviously), but IEEE 754 has a lot of subtleties, such as when intermediate values should be truncated. One possibility is that an expression including + 0.0 might, under IEEE 754, require truncating an intermediate value, but /fp:fast will generate code that skips the truncation, so later results may differ from what the standard strictly requires.
Using Pascal Cuoq's info, here's a program that produces different output depending on whether /fp:fast is used:
#include <cmath>
#include <iostream>

int main() {
    volatile double a = -0.0;
    if (_copysign(1.0, a + 0.0) == _copysign(1.0, 0.0)) {
        std::cout << "correct IEEE 754 results\n";
    } else {
        std::cout << "result not IEEE 754 conformant\n";
    }
}
When built with /fp:fast the program outputs "result not IEEE 754 conformant", while building with /fp:strict causes the program to output "correct IEEE 754 results".
In C++, we can find the smallest positive normalized double precision value using std::numeric_limits<double>::min(). The value turns out to be 2.22507e-308 when printed.
Now if this minimum value is subtracted from a given double value (say, val) and the difference is then divided by the same value, as in (val - minval) / val, I was expecting the answer to be rounded down to 0 when floor((val - minval) / val) was applied.
To my surprise, the answer is delivered as 1. Can someone please explain this anomalous behavior?
Consider the following code:
#include <cmath>
#include <iostream>
#include <limits>
using namespace std;

int main()
{
    double minval = std::numeric_limits<double>::min(), wg = 8038,
           ans = floor((wg - minval) / wg); // expecting the answer to round to 0
    cout << ans; // but the answer actually resulted as 1!
}
A double typically has around 16 digits of precision.
You're starting with 8038. For simplicity, I'm going to call that 8.038e3. Since we have around 16 digits of precision, the smallest number we can subtract from that and still get a result different from 8038 is on the order of 8.038e3 * 1e-16, i.e. about 8e-13.
8038 - 2.2e-308 is like reducing the mass of the universe by one electron, and expecting that to affect the mass of the universe by a significant amount.
Actually, relatively speaking, 8038-2.2e-308 is a much smaller change than removing a whole electron from the universe--more like removing a minuscule fraction of a single electron from the universe, if that were possible. Even if we assume string theory is correct, removing one string from the universe would still be a huge change compared to subtracting 2.2e-308 from 8038.
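To see the absorption concretely, here is a minimal sketch (the cutoff is about half an ulp of 8038, roughly 4.5e-13):

#include <iostream>

int main() {
    double wg = 8038;
    std::cout << (wg - 1e-13 == wg) << '\n'; // 1: below half an ulp, absorbed
    std::cout << (wg - 1e-12 == wg) << '\n'; // 0: above half an ulp, result is the next double below 8038
}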
The comments and the previous answer correctly attribute the cause to floating point precision, but some additional detail is needed to explain the behavior correctly. Even when the exact result of a subtraction cannot be represented in the finite precision of floating point numbers, the operation is still rounded (inexactly) to a representable value; the subtraction is not simply discarded.
As an example, consider the code below.
#include <cmath>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    double b, c, d;
    vector<double> a{0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7};
    cout << "Subtraction Possible?" << "\t" << "Floor Result" << "\n";
    for (int i = 0; i < 9; i++) {
        b = std::nextafter(a[i], 0); // next double below a[i]
        c = a[i] - b;                // one ulp of a[i]
        d = 1e-17;
        if (d > c)
            cout << "True" << "\t";
        else
            cout << "False" << "\t";
        cout << setprecision(52) << floor((a[i] - d) / a[i]) << "\n";
    }
    return 0;
}
The code takes different double precision values (vector a) and subtracts 1e-17 from each. Note that std::nextafter shows the gap below 0.07 to be 1.387778780781445675529539585113525390625e-17, which means 1e-17 is smaller than the gap below any of the numbers listed. Hence, naively, subtracting 1e-17 should have no effect on any of the numbers in vector a. If the subtraction results were simply discarded, the answer would always be 1, but it turns out that sometimes the answer is 0 and other times 1.
This can be observed from the output of the C++ program as shown below:
Subtraction Possible? Floor Result
False 0
False 0
False 0
False 0
False 1
False 1
False 1
False 1
False 1
The reason lies buried in the floating point specification prescribed by the IEEE 754 document. The standard specifically states that even in cases where the result of an operation cannot be represented exactly, rounding must be carried out. I quote Page 27, Section 4.3 of the IEEE 754-2019 document:
Except where stated otherwise, every operation shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result according to one of the attributes in this clause.
The statement is repeated in Section 5.1 on Page 29:
Unless otherwise specified, each of the computational operations specified by this standard that returns a numeric result shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result, if necessary, to fit in the destination's format (see Clause 4 and Clause 7).
The g++ compiler (which I have been testing) correctly and precisely implements the standard by using the round-to-nearest behavior stated in Section 4.3.1 of the IEEE 754 document. The implication is that even when a[i] - d is not representable, a numeric result is delivered as if the subtraction first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that intermediate result. Hence, a[i] - d may or may not equal a[i], which means the answer may or may not be 1, depending on whether the exact value of a[i] - d is closer to a[i] or to the next representable value below a[i].
It turns out that 8038 - 2.22507e-308 is closer to 8038, so the intermediate result is rounded (to nearest) to 8038 and the final answer is 1. The point is that this behavior follows from a correct implementation of the standard and is not something arbitrary.
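A minimal sketch of that conclusion:

#include <cmath>
#include <iostream>
#include <limits>

int main() {
    double wg = 8038;
    double minval = std::numeric_limits<double>::min(); // ~2.22507e-308
    // The infinitely precise 8038 - minval is far closer to 8038 than to the
    // next double below it, so it rounds back to exactly 8038:
    std::cout << ((wg - minval) == wg) << '\n';          // 1
    std::cout << std::floor((wg - minval) / wg) << '\n'; // 1
}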
I found the references below on floating point numbers very useful. I would recommend reading Cleve Moler's (founder of MATLAB) note on floating point numbers before going through the IEEE specification, for a quick and easy understanding of their behavior.
"IEEE Standard for Floating-Point Arithmetic," in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
Moler, Cleve. “Floating Points.” MATLAB News and Notes. Fall, 1996.
I'm currently trying to learn about floating point representation in depth, so I played around a bit. While doing so, I stumbled on some strange behaviour; I can't really work out what's happening, and I'd be very grateful for some insight. Apologies if this has been answered, I found it quite hard to google!
#include <iostream>
#include <cmath>
using namespace std;

int main(){
    float minVal = pow(2,-149); // set to smallest float possible
    float nextCheck = static_cast<float>(minVal/2.0f); // divide by two
    bool isZero = (static_cast<float>(minVal/2.0f) == 0.0f); // this evaluates to false
    bool isZero2 = (nextCheck == 0.0f); // this evaluates to true
    cout << nextCheck << " " << isZero << " " << isZero2 << endl;
    // this outputs 0 0 1
    return 0;
}
Essentially what's happening is:

- I set minVal to the smallest float that can be represented using single precision.
- Dividing by 2 should yield 0; we're at the minimum.
- Indeed, isZero2 evaluates to true, but isZero evaluates to false.

What's going on? I would have thought them to be identical. Is the compiler trying to be clever, saying that dividing a nonzero number cannot possibly yield zero?
Thanks for your help!
The reason isZero and isZero2 can evaluate to different values, and isZero can be false, is that the C++ compiler is allowed to implement intermediate floating-point operations with more precision than the type of the expression would indicate, but the extra precision has to be dropped on assignment.
Typically, when generating code for the 387 historical FPU, the generated instructions work on either the 80-bit extended-precision type, or, if the FPU is set to a 53-bit significand (e.g. on Windows), a strange floating-point type with 53-bit significands and 15-bit exponents.
Either way, minVal/2.0f is evaluated exactly, because the exponent range allows it to be represented, but assigning it to nextCheck rounds it to zero.
If you are using GCC, there is the additional problem that -fexcess-precision=standard has not yet been implemented for the C++ front-end, meaning that the code generated by g++ does not implement exactly what the standard recommends.
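A sketch of forcing the rounding described above (assuming, as the question's code does, that pow computes the power of two exactly; whether the unstored comparison differs depends on the target, e.g. x87 vs SSE2):

#include <cmath>
#include <iostream>

int main() {
    float minVal = std::pow(2.0f, -149.0f); // smallest positive (denormal) float
    // Storing through a volatile forces the value out of any wider register,
    // rounding it to a genuine 32-bit float: 2^-150 is halfway between
    // 0 and 2^-149 and rounds to 0 (ties-to-even).
    volatile float stored = minVal / 2.0f;
    std::cout << (stored == 0.0f) << '\n'; // 1
}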
I have an arithmetic expression, for example:
float z = 8.0;
float x = 3.0;
float n = 0;
cout << z / (x/n) + 1 << endl;
Why do I get a normal answer equal to 1, when it should be "nan", "1.#inf", etc.?
I assume you're using floating point arithmetic (though one can't be sure, because you're not telling us).
IEEE754 floating point semantics work on the extended real line and include infinities on both ends. This makes divisions with non-zero numerator well-defined for any (non-NaN) denominator, "consistent with" (i.e. extending continuously) the usual arithmetic rules: x / n is infinity, and z divided by infinity is zero — just as if you had simplified the expression as n * z / x.
The only genuinely undefined quantities are 0/0 and inf/inf, which are represented by the special value NaN.
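A quick check (a sketch; the exact formatting of infinities and NaNs in the output varies by implementation):

#include <iostream>

int main() {
    float z = 8.0f, x = 3.0f, n = 0.0f;
    std::cout << x / n << '\n';           // inf: nonzero divided by zero
    std::cout << z / (x / n) << '\n';     // 0: finite divided by infinity
    std::cout << z / (x / n) + 1 << '\n'; // 1
    std::cout << n / n << '\n';           // nan: 0/0 is genuinely undefined
}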
IEEE 754 specifies that 3/0 = Inf (or any positive number instead of 3). 8/Inf gives 0. If you add 1 you get 1. This is because 0 denotes "0 or something very close to it" and Inf denotes "infinity or a very big number". This also allows some operations on limits to be performed, as it effectively extends the real numbers with infinities. NaNs are reserved for cases where the limit is not achievable (or not easily computable by a simple implementation).
As a side effect you have some strange results, like 0 == -0 but 1/0 == Inf and 1/-0 == -Inf. It is important to remember that FP arithmetic does not obey the usual algebraic identities: for example, cos(x)*cos(x) + sin(x)*sin(x) - 1 != 0 even for ordinary finite x. For floats and x == 1 the result is -5.9604645e-8. Therefore not all expectations transfer easily to it, like division by 0 in this case.
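For instance, a quick check of the trigonometric identity above (a sketch; the exact residual depends on the math library):

#include <cmath>
#include <iostream>

int main() {
    float x = 1.0f;
    float r = std::cos(x) * std::cos(x) + std::sin(x) * std::sin(x) - 1.0f;
    std::cout << r << '\n'; // a tiny nonzero value, e.g. -5.9604645e-8
}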
While C/C++ does not mandate that the IEEE 754 specification be used for floating point numbers, it is the dominant specification, implemented on virtually all hardware, and for that reason it is used by most C/C++ implementations.
I noticed two things:
1. std::numeric_limits<float>::max() + (a small number) gives std::numeric_limits<float>::max().
2. std::numeric_limits<float>::max() + (a large number, like std::numeric_limits<float>::max()/3) gives inf.
Why this difference? Does 1 or 2 result in an overflow and thus in undefined behavior?
Edit: Code for testing this:
1.
float d = std::numeric_limits<float>::max();
float q = d + 100;
cout << "q: " << q << endl;
2.
float d = std::numeric_limits<float>::max();
float q = d + (d/3);
cout << "q: " << q << endl;
Formally, the behavior is undefined. On a machine with IEEE floating point, however, overflow after rounding results in Inf. The precision is limited, though, and the result of FLT_MAX + 1 after rounding is FLT_MAX.
You can see the same effect with values well under FLT_MAX. Try something like:
float f1 = 1e20; // less than FLT_MAX
float f2 = f1 + 1.0;
if ( f1 == f2 ) ...
The if will evaluate to true, at least with IEEE arithmetic. (There do exist, or at least have existed, machines where float has enough precision for the if to evaluate to false, but they aren't very common today.)
It depends on what you are doing. If the float "overflow" comes in an expression which is directly returned, i.e.
return std::numeric_limits<float>::max() + std::numeric_limits<float>::max();
the operation might not result in an overflow. I cite from the C standard [ISO/IEC 9899:2011]:
The return statement is not an assignment. The overlap restriction of subclause 6.5.16.1 does not apply to the case of function return. The representation of floating-point values may have wider range or precision than implied by the type; a cast may be used to remove this extra range and precision.
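A sketch of the cast mentioned in the quote (sum_of_two_maxes is a hypothetical helper; whether the uncast return would actually carry extra range depends on the target):

#include <limits>

// The addition may be carried out in a wider register; the cast discards any
// extra range and precision before the value is returned, so the caller sees
// a genuine float (here, +infinity).
float sum_of_two_maxes() {
    return static_cast<float>(std::numeric_limits<float>::max() +
                              std::numeric_limits<float>::max());
}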
I was wondering if there is a way of overcoming an accuracy problem that seems to be the result of my machine's internal representation of floating-point numbers:
For the sake of clarity the problem is summarized as:
// str is "4.600"; atof( str ) is 4.5999999999999996
double mw = atof( str );
// The variables used in the columns calculation below are:
//
// mw = 4.5999999999999996
// p = 0.2
// g = 0.2
// h = 1 (integer)
int columns = (int) ( ( mw - ( h * 11 * p ) ) / ( ( h * 11 * p ) + g ) ) + 1;
Prior to casting to an integer type the result of the columns calculation is 1.9999999999999996; so near yet so far from the desired result of 2.0.
Any suggestions most welcome.
When you use floating point arithmetic, strict equality is almost meaningless. You usually want to compare against a range of acceptable values.
Note that some values cannot be represented exactly as floating point values.
See What Every Computer Scientist Should Know About Floating-Point Arithmetic and Comparing floating point numbers.
There's no accuracy problem.
The result you got (1.9999999999999996) differs from the mathematical result (2) by a margin of 1E-16. That's quite accurate, considering your input "4.600".
You do have a rounding problem, of course. The default rounding in C++ is truncation; you want something similar to Kip's solution. Details depend on your exact domain: do you expect round(-x) == -round(x)?
If you haven't read it, the title of this paper is really correct. Please consider reading it, to learn more about the fundamentals of floating-point arithmetic on modern computers, some pitfalls, and explanations as to why they behave the way they do.
A very simple and effective way to round a floating point number to an integer:
int rounded = (int)(f + 0.5);
Note: this only works if f is always positive. (thanks j random hacker)
If accuracy is really important then you should consider using double precision floating point numbers rather than single precision. Though from your question it does appear that you already are. However, you still have a problem when checking for specific values. You need code along the lines of (assuming you're checking your value against zero):
if (std::abs(value) < epsilon)
{
// Do Stuff
}
where "epsilon" is some small, but non zero value.
On computers, floating point results are rarely exact. They are usually just close approximations. (An error of 1e-16 is close.)
Sometimes there are hidden bits you don't see. Sometimes basic rules of algebra no longer apply: (a + b) + c != a + (b + c). Sometimes comparing a register to memory shows up these subtle differences. Or using a math coprocessor vs a runtime floating point library. (I've been doing this waayyy tooo long.)
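A quick illustration of the broken associativity (a sketch using doubles):

#include <iostream>

int main() {
    double a = 1e16, b = -1e16, c = 1.0;
    std::cout << ((a + b) + c) << '\n'; // 1: a + b is exactly 0
    std::cout << (a + (b + c)) << '\n'; // 0: b + c rounds back to -1e16
}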
C99 defines (look in math.h):

double round(double x);
float roundf(float x);
long double roundl(long double x);
Or you can roll your own:
template<class TYPE> inline int ROUND(const TYPE & x)
{ return int( (x > 0) ? (x + 0.5) : (x - 0.5) ); }
For floating point equivalence, try:
template<class TYPE> inline TYPE ABS(const TYPE & t)
{ return t>=0 ? t : - t; }
template<class TYPE> inline bool FLOAT_EQUIVALENT(
const TYPE & x, const TYPE & y, const TYPE & epsilon )
{ return ABS(x-y) < epsilon; }
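A usage sketch, assuming the two templates above are in scope (the tolerance here is arbitrary):

#include <iostream>

int main() {
    double a = 0.1 + 0.2;                                // 0.30000000000000004...
    std::cout << (a == 0.3) << '\n';                     // 0: exact comparison fails
    std::cout << FLOAT_EQUIVALENT(a, 0.3, 1e-9) << '\n'; // 1
}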
Use decimals: decNumber++
You can read the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, referenced above, to find what you are looking for.
You can compare the absolute value of the difference, as seen here:
double x = 0.2;
double y = 0.3;
bool equal = (std::abs(x - y) < 0.000001); // false here, since |x - y| is 0.1