I have an arithmetic expression, for example:
float z = 8.0
float x = 3.0;
float n = 0;
cout << z / (x/n) + 1 << endl;
Why I get normal answer equal to 1, when it should be "nan", "1.#inf", etc.?
I assume you're using floating point arithmetic (though one can't be sure, because you're not telling us).
IEEE754 floating point semantics work on the extended real line and include infinities on both ends. This makes divisions with non-zero numerator well-defined for any (non-NaN) denominator, "consistent with" (i.e. extending continuously) the usual arithmetic rules: x / n is infinity, and z divided by infinity is zero — just as if you had simplified the expression as n * z / x.
The only genuinely undefined quantities are 0/0 and inf/inf, which are represented by the special value NaN.
The IEEE 754 specifies that 3/0 = Inf (or anything positive instead of 3). 8/Inf gives 0. If you add 1 you'll receive 1. This is because 0 denotes "0 or something very close to it" and Inf "Infinity or very big number". It also allows to perform some operations on limits as it effectively extends the real numbers into by infinities. NaN's are reserved when the limit is not achievable (or not easily computable by simple implementation).
As a side effect you have some strange effects like 0 == -0 but 1/0 == Inf and 1/-0 == -Inf. It is important to remember that FP arithmetic is not normal - for example cos(x) * cos(x) + sin(x) * sin(x) - 1 != 0 even if x != NaN && x != Inf && x != -Inf. For floats and x == 1 the result is -5.9604645e-8. Therefore not all expectation can be easily transferred to it - like division by 0 in this case.
While C/C++ does not mandate that IEE 754 specification will be used for floating point numbers it is the specification right now and is implemented on virtually any hardware and for that reason used by most C/C++ implementations.
Related
In C++, we know that we can find the minimum representable double precision value using std::numeric_limits<double>::min(). The value turns out to be 2.22507e-308 when printed.
Now if a given double value (say val) is subtracted from this minimum value and then a division is undertaken with the same previous double value (val - minval) / val, I was expecting the answer to be rounded to 0 if the operation floor((val - minval ) / val) was performed on the resulting divided value.
To my surprise, the answer is delivered as 1. Can someone please explain this anomalous behavior?
Consider the following code:
int main()
{
double minval = std::numeric_limits<double>::min(), wg = 8038,
ans = floor((wg - minval) / wg); // expecting the answer to round to 0
cout << ans; // but the answer actually resulted as 1!
}
A double typically has around 16 digits of precision.
You're starting with 8038. For simplicity, I'm going to call that 8.038e3. Since we have around 16 digits of precision, the smallest number we can subtract from that and still get a result different from 8038 is 8038e(3-16) = 8038e-13.
8038 - 2.2e-308 is like reducing the mass of the universe by one electron, and expecting that to affect the mass of the universe by a significant amount.
Actually, relatively speaking, 8038-2.2e-308 is a much smaller change than removing a whole electron from the universe--more like removing a minuscule fraction of a single electron from the universe, if that were possible. Even if we were to assume that string theory were correct, even removing one string from the universe would still be a huge change compared to subtracting 2.2e-308 from 8038.
The comments and the previous answer correctly attribute the cause to floating point precision issues but there are additional details needed to explain the correct behavior. In fact, even in cases where subtraction cannot be carried out such that the results of the subtraction cannot be represented with the finite precision of floating point numbers, inexact rounding is still performed by the compiler and subtraction is not completely discarded.
As an example, consider the code below.
int main()
{
double b, c, d;
vector<double> a{0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7};
cout << "Subtraction Possible?" << "\t" << "Floor Result" << "\n";
for( int i = 0; i < 9; i++ ) {
b = std::nextafter( a[i], 0 );
c = a[i] - b;
d = 1e-17;
if( (bool)(d > c) )
cout << "True" << "\t";
else
cout << "False" << "\t";
cout << setprecision(52) << floor((a[i] - d)/a[i]) << "\n";
}
return 0;
}
The code takes in different double precision values in the form of vector a and performs subtraction from 1e-17. It must be noted that the smallest value that can be subtracted from 0.07 is shown to be 1.387778780781445675529539585113525390625e-17 using std::nextafter for the value 0.07. This means that 1e-17 is smaller than the smallest value which can be subtracted from any of these numbers. Hence, theoretically, subtraction should not be possible for any of the numbers listed in vector a. If we assume that the subtraction results are discarded, then the answer must always stay 1 but it turns out that sometimes the answer is 0 and other times 1.
This can be observed from the output of the C++ program as shown below:
Subtraction Possible? Floor Result
False 0
False 0
False 0
False 0
False 1
False 1
False 1
False 1
False 1
The reasons lay buried within the Floating Point specification prescribed in the IEEE 754 document. In general the standard specifically states that even in cases where the results of an operation cannot be represented, rounding must be carried out. I quote Page 27, Section 4.3 of the IEEE 754, 2019 document:
Except where stated otherwise, every operation shall be performed as if it first produced an
intermediate result correct to infinite precision and with unbounded range, and then rounded that result
according to one of the attributes in this clause
The statement in further repeated in Section 5.1 of Page 29 as shown below:
Unless otherwise specified, each of the computational
operations specified by this standard that returns a numeric result shall be performed as if it first produced
an intermediate result correct to infinite precision and with unbounded range, and then rounded that
intermediate result, if necessary, to fit in the destination’s format (see Clause 4 and Clause 7).
C++'s g++ compiler (which I have been testing) correctly and very precisely interprets the standard by implementing nearest rounding stated in Section 4.3.1 of the IEEE 754 document. This has the implication that even when a[i] - b is not representable, a numeric result is delivered as if the subtraction first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that
intermediate result. Hence, it may or may not be the case that a[i] - b == a[i] which means that the answer may or may not be 1 depending on whether a[i] - b is closer to a[i] or it is closer to the next representable value after a[i].
It turns out that 8038 - 2.22507e-308 is closer to 8038 due to which the answer is rounded (using nearest rounding) to 8038 and the final answer is 1 but this is to only state that this behavior does result from the compiler's interpretation of the standard and is not something arbitrary.
I found below references on Floating Point numbers to be very useful. I would recommend reading Cleve Moler's (founder of MATLAB) reference on floating point numbers before going through the IEEE specification for a quick and easy understanding of their behavior.
"IEEE Standard for Floating-Point Arithmetic," in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
Moler, Cleve. “Floating Points.” MATLAB News and Notes. Fall, 1996.
Real Close to the Machine: Floating Point in D
https://dlang.org/articles/d-floating-point.html
says
Useful relations for a floating point type F, where x and y are of type F
...
x>0 if and only if 1/(1/x) > 0; x<0 if and only if 1/(1/x) < 0.
what is the meaning of this sentence?
In the text you're quoting, we're looking at how the representation is symmetric around 1, and that the rounding doesn't break this. That is, for any number 0 < x < 1, there's a corresponding number 1 < y < ∞, such that y = 1/x and 1/y = x. That's the first half - the second is simply the same for negative numbers: 0 > x > -1 and -1 > y > -∞.
It may not be immediately obvious how this can be a problem, but consider x = 3.
y must then be 1/3 = 0.333.... With a limited precision of 3 decimal digits, 1/y would then be 3.003003003.... IEEE 754 defines how this should work, and says that the rounding should ensure 1/(1/x) should be equal to x, and thus that the result should be 3, even if there are rounding errors in both 1/x and 1/y - they should cancel each other out.
Older floating-point systems weren't as well-behaved as IEEE 754. I'm not sure if any of them weren't symmetric around 1, but that's certainly within the realm of possibility.
Am I right that any arithmetic operation on any floating numbers is unambiguously defined by IEEE floating point standard? If yes, just for curiosity, what is (+0)+(-0)? And is there a way to check such things in practice, in C++ or other commonly used programming language?
The IEEE 754 rules of arithmetic for signed zeros state that +0.0 + -0.0 depends on the rounding mode. In the default rounding mode, it will be +0.0. When rounding towards -∞, it will be -0.0.
You can check this in C++ like so:
#include <iostream>
int main() {
std::cout << "+0.0 + +0.0 == " << +0.0 + +0.0 << std::endl;
std::cout << "+0.0 + -0.0 == " << +0.0 + -0.0 << std::endl;
std::cout << "-0.0 + +0.0 == " << -0.0 + +0.0 << std::endl;
std::cout << "-0.0 + -0.0 == " << -0.0 + -0.0 << std::endl;
return 0;
}
Output:
+0.0 + +0.0 == 0
+0.0 + -0.0 == 0
-0.0 + +0.0 == 0
-0.0 + -0.0 == -0
My answer deals with IEEE 754:2008, which is the current version of the standard.
In the IEEE 754:2008 standard:
Section 4.3 deals with the rounding of values when performing arithmetic operations in order to fit the bits into the mantissa.
4.3 Rounding-direction attributes
Rounding takes a number regarded as infinitely precise and, if necessary, modifies it to fit in the destination’s format while signaling the inexact exception, underflow, or overflow when appropriate (see 7). Except where stated otherwise, every operation shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result according to one of the attributes in this clause.
The rounding-direction attribute affects all computational operations that might be inexact. Inexact numeric floating-point results always have the same sign as the unrounded result.
The rounding-direction attribute affects the signs of exact zero sums (see 6.3), and also affects the thresholds beyond which overflow and underflow are signaled.
Section 6.3 prescribes the value of the sign bit when performing arithmetic with special values (NaN, infinities, +0, -0).
6.3 The sign bit
When the sum of two operands with opposite signs (or the difference of two operands with like signs) is exactly zero, the sign of that sum (or difference) shall be +0 in all rounding-direction attributes except roundTowardNegative; under that attribute, the sign of an exact zero sum (or difference) shall be −0.
However, x + x = x − (−x) retains the same sign as x even when x is zero.
(emphasis mine)
In other words, (+0) + (-0) = +0 except when the rounding mode is roundTowardNegative, in which case it is (+0) + (-0) = -0.
In the context of C#:
According to §7.7.4 of the C# Language Specification (emphasis mine):
Floating-point addition:
float operator +(float x, float y);
double operator +(double x, double y);
The sum is computed according to the rules of IEEE 754 arithmetic. The following table lists the results of all possible combinations of nonzero finite values, zeros, infinities, and NaN's. In the table, x and y are nonzero finite values, and z is the result of x + y. If x and y have the same magnitude but opposite signs, z is positive zero. If x + y is too large to represent in the destination type, z is an infinity with the same sign as x + y.
+ • x +0 -0 +∞ -∞ NaN
•••••••••••••••••••••••••••••••••••••••••••••
y • z y y +∞ -∞ NaN
+0 • x +0 +0 +∞ -∞ NaN
-0 • x +0 -0 +∞ -∞ NaN
+∞ • +∞ +∞ +∞ +∞ NaN NaN
-∞ • -∞ -∞ -∞ NaN -∞ NaN
NaN • NaN NaN NaN NaN NaN NaN
(+0) + (-0) in C#:
In other words, based on the specification, the addition of two zeros only results in negative zero if both are negative zero. Therefore, the answer to the original question
What is (+0)+(-0) by IEEE floating point standard?
is +0.
Rounding modes in C#:
In case anyone is interested in changing the rounding mode in C#, in "Is there an C# equivalent of c++ fesetround() function?", Hans Passant states:
Never tinker with the FPU control word in C#. It is the worst possible global variable you can imagine. With the standard misery that globals cause, your changes cannot last and will arbitrarily disappear. The internal exception handling code in the CLR resets it when it processes an exception.
Assume standard rounding mode (which you are using if you don't know what a rounding mode is and how to change it).
If the exact result is non-zero but so small that it gets rounded to zero, the result is +0 if the exact result is greater than 0, and -0 if the exact result is less than 0. This situation only happens for multiplication and division, not for addition and subtraction.
There are several cases where the exact result is zero. In that case the result is -0 in the following cases: Adding (-0) + (-0). Subtracting (-0) - (+0). Multiplying where one factor is a zero, and the other factor has the opposite sign (including (+0) * (-0). Dividing a zero by a non-zero number including infinity of the opposite sign. In all other cases, the result is +0.
An unfortunate side effect of this rule is that x + 0.0 is not always identical to x (not when x is -0). On the other hand, x - 0.0 is always identical to x. Also, x * 0.0 may be +0 or -0, depending on x. This prevents some optimisations by compilers that support IEE754 precisely, or makes them more difficult.
The answer, by the IEEE floating point standard, is +0.
I know it is incorrect to compare double (equality) and the best is to use an epsilon factor as described into the Knuth book (Art of programming). Nevertheless, I am working on a legacy code (C++), where there are a lot of devision like:
// b,c double from previous computation
if( b == 50.0)
b += 0.001;
double a = c/(b - 50.0);
Do we perform the conditional statement (b == 50) on the "bit representation" (mantissa-exponenent) or the decimal one ? I do not find this information on my C++ book. If it is the decimal, I think I can trough away the conditional statement.
The == operator is applied to the run-time representation of the floating-point value, ideally with exactly the exponent and significand numbers of bits implied by the type, but unfortunately, sometimes in a wider format, as allowed by the standard.
In b == 50.0, the decimal representation 50.0 is converted to such a floating-point representation at compile-time once and for all. That value is then used (or the program behaves as if it was used) each time this expression 50.0 is involved. In the case of 50.0, it does not make a difference because the number 50 can be represented exactly as a binary floating-point value.
As an example, b == 50.0000000000000000000001 is likely to behave exactly as b == 50.0 because 50.0000000000000000000001 represents the same floating-point value as 50.0.
For the specific piece of code, the use of exact comparison is correct:
// b,c double from previous computation
if( b == 50.0)
b += 0.001;
double a = c/(b - 50.0);
The purpose seems to be to ensure that the division will not be a division by zero. The code may have been written to be compatible with systems in which division by 0 causes failure, rather than infinity. Subtracting 50 from any double that is not exactly 50 will have a non-zero result, so the 0.001 fudge factor only needs to be added in the case of exact equality.
I have a float value between 0 and 1. I need to convert it with -120 to 80.
To do this, first I multiply with 200 after 120 subtract.
When subtract is made I had rounding error.
Let's look my example.
float val = 0.6050f;
val *= 200.f;
Now val is 121.0 as I expected.
val -= 120.0f;
Now val is 0.99999992
I thought maybe I can avoid this problem with multiplication and division.
float val = 0.6050f;
val *= 200.f;
val *= 100.f;
val -= 12000.0f;
val /= 100.f;
But it didn't help. I have still 0.99 on my hand.
Is there a solution for it?
Edit: After with detailed logging, I understand there is no problem with this part of code. Before my log shows me "0.605", after I had detailed log and I saw "0.60499995946884155273437500000000000000000000000000"
the problem is in different place.
Edit2: I think I found the guilty. The initialised value is 0.5750.
std::string floatToStr(double d)
{
std::stringstream ss;
ss << std::fixed << std::setprecision(15) << d;
return ss.str();
}
int main()
{
float val88 = 0.57500000000f;
std::cout << floatToStr(val88) << std::endl;
}
The result is 0.574999988079071
Actually I need to add and sub 0.0025 from this value every time.
Normally I expected 0.575, 0.5775, 0.5800, 0.5825 ....
Edit3: Actually I tried all of them with double. And it is working for my example.
std::string doubleToStr(double d)
{
std::stringstream ss;
ss << std::fixed << std::setprecision(15) << d;
return ss.str();
}
int main()
{
double val88 = 0.575;
std::cout << doubleToStr(val88) << std::endl;
val88 += 0.0025;
std::cout << doubleToStr(val88) << std::endl;
val88 += 0.0025;
std::cout << doubleToStr(val88) << std::endl;
val88 += 0.0025;
std::cout << doubleToStr(val88) << std::endl;
return 0;
}
The results are:
0.575000000000000
0.577500000000000
0.580000000000000
0.582500000000000
But I bound to float unfortunately. I need to change lots of things.
Thank you for all to help.
Edit4: I have found my solution with strings. I use ostringstream's rounding and convert to double after that. I can have 4 precision right numbers.
std::string doubleToStr(double d, int precision)
{
std::stringstream ss;
ss << std::fixed << std::setprecision(precision) << d;
return ss.str();
}
double val945 = (double)0.575f;
std::cout << doubleToStr(val945, 4) << std::endl;
std::cout << doubleToStr(val945, 15) << std::endl;
std::cout << atof(doubleToStr(val945, 4).c_str()) << std::endl;
and results are:
0.5750
0.574999988079071
0.575
Let us assume that your compiler implements IEEE 754 binary32 and binary64 exactly for float and double values and operations.
First, you must understand that 0.6050f does not represent the mathematical quantity 6050 / 10000. It is exactly 0.605000019073486328125, the nearest float to that. Even if you write perfect computations from there, you have to remember that these computations start from 0.605000019073486328125 and not from 0.6050.
Second, you can solve nearly all your accumulated roundoff problems by computing with double and converting to float only in the end:
$ cat t.c
#include <stdio.h>
int main(){
printf("0.6050f is %.53f\n", 0.6050f);
printf("%.53f\n", (float)((double)0.605f * 200. - 120.));
}
$ gcc t.c && ./a.out
0.6050f is 0.60500001907348632812500000000000000000000000000000000
1.00000381469726562500000000000000000000000000000000000
In the above code, all computations and intermediate values are double-precision.
This 1.0000038… is a very good answer if you remember that you started with 0.605000019073486328125 and not 0.6050 (which doesn't exist as a float).
If you really care about the difference between 0.99999992 and 1.0, float is not precise enough for your application. You need to at least change to double.
If you need an answer in a specific range, and you are getting answers slightly outside that range but within rounding error of one of the ends, replace the answer with the appropriate range end.
The point everybody is making can be summarised: in general, floating point is precise but not exact.
How precise is governed by the number of bits in the mantissa -- which is 24 for float, and 53 for double (assuming IEEE 754 binary formats, which is pretty safe these days ! [1]).
If you are looking for an exact result, you have to be ready to deal with values that differ (ever so slightly) from that exact result, but...
(1) The Exact Binary Fraction Problem
...the first issue is whether the exact value you are looking for can be represented exactly in binary floating point form...
...and that is rare -- which is often a disappointing surprise.
The binary floating point representation of a given value can be exact, but only under the following, restricted circumstances:
the value is an integer, < 2^24 (float) or < 2^53 (double).
this is the simplest case, and perhaps obvious. Since you are looking a result >= -120 and <= 80, this is sufficient.
or:
the value is an integer which divides exactly by 2^n and is then (as above) < 2^24 or < 2^53.
this includes the first rule, but is more general.
or:
the value has a fractional part, but when the value is multiplied by the smallest 2^n necessary to produce an integer, that integer is < 2^24 (float) or 2^53 (double).
This is the part which may come as a surprise.
Consider 27.01, which is a simple enough decimal value, and clearly well within the ~7 decimal digit precision of a float. Unfortunately, it does not have an exact binary floating point form -- you can multiply 27.01 by any 2^n you like, for example:
27.01 * (2^ 6) = 1728.64 (multiply by 64)
27.01 * (2^ 7) = 3457.28 (multiply by 128)
...
27.01 * (2^10) = 27658.24
...
27.01 * (2^20) = 28322037.76
...
27.01 * (2^25) = 906305208.32 (> 2^24 !)
and you never get an integer, let alone one < 2^24 or < 2^53.
Actually, all these rules boil down to one rule... if you can find an 'n' (positive or negative, integer) such that y = value * (2^n), and where y is an exact, odd integer, then value has an exact representation if y < 2^24 (float) or if y < 2^53 (double) -- assuming no under- or over-flow, which is another story.
This looks complicated, but the rule of thumb is simply: "very few decimal fractions can be represented exactly as binary fractions".
To illustrate how few, let us consider all the 4 digit decimal fractions, of which there are 10000, that is 0.0000 up to 0.9999 -- including the trivial, integer case 0.0000. We can enumerate how many of those have exact binary equivalents:
1: 0.0000 = 0/16 or 0/1
2: 0.0625 = 1/16
3: 0.1250 = 2/16 or 1/8
4: 0.1875 = 3/16
5: 0.2500 = 4/16 or 1/4
6: 0.3125 = 5/16
7: 0.3750 = 6/16 or 3/8
8: 0.4375 = 7/16
9: 0.5000 = 8/16 or 1/2
10: 0.5625 = 9/16
11: 0.6250 = 10/16 or 5/8
12: 0.6875 = 11/16
13: 0.7500 = 12/16 or 3/4
14: 0.8125 = 13/16
15: 0.8750 = 14/16 or 7/8
16: 0.9375 = 15/16
That's it ! Just 16/10000 possible 4 digit decimal fractions (including the trivial 0 case) have exact binary fraction equivalents, at any precision. All the other 9984/10000 possible decimal fractions give rise to recurring binary fractions. So, for 'n' digit decimal fractions only (2^n) / (10^n) can be represented exactly -- that's 1/(5^n) !!
This is, of course, because your decimal fraction is actually the rational x / (10^n)[2] and your binary fraction is y / (2^m) (for integer x, y, n and m), and for a given binary fraction to be exactly equal to a decimal fraction we must have:
y = (x / (10^n)) * (2^m)
= (x / ( 5^n)) * (2^(m-n))
which is only the case when x is an exact multiple of (5^n) -- for otherwise y is not an integer. (Noting that n <= m, assuming that x has no (spurious) trailing zeros, and hence n is as small as possible.)
(2) The Rounding Problem
The result of a floating point operation may need to be rounded to the precision of the destination variable. IEEE 754 requires that the operation is done as if there were no limit to the precision, and the ("true") result is then rounded to the nearest value at the precision of the destination. So, the final result is as precise as it can be... given the limitations on how precise the arguments are, and how precise the destination is... but not exact !
(With floats and doubles, 'C' may promote float arguments to double (or long double) before performing an operation, and the result of that will be rounded to double. The final result of an expression may then be a double (or long double), which is then rounded (again) if it is to be stored in a float variable. All of this adds to the fun ! See FLT_EVAL_METHOD for what your system does -- noting the default for a floating point constant is double.)
So, the other rules to remember are:
floating point values are not reals (they are, in fact, rationals with a limited denominator).
The precision of a floating point value may be large, but there are lots of real numbers that cannot be represented exactly !
floating point expressions are not algebra.
For example, converting from degrees to radians requires division by π. Any arithmetic with π has a problem ('cos it's irrational), and with floating point the value for π is rounded to whatever floating precision we are using. So, the conversion of (say) 27 (which is exact) degrees to radians involves division by 180 (which is exact) and multiplication by our "π". However exact the arguments, the division and the multiplication may round, so the result is may only approximate. Taking:
float pi = 3.14159265358979 ; /* plenty for float */
float x = 27.0 ;
float y = (x / 180.0) * pi ;
float z = (y / pi) * 180.0 ;
printf("z-x = %+6.3e\n", z-x) ;
my (pretty ordinary) machine gave: "z-x = +1.907e-06"... so, for our floating point:
x != (((x / 180.0) * pi) / pi) * 180 ;
at least, not for all x. In the case shown, the relative difference is small -- ~ 1.2 / (2^24) -- but not zero, which simple algebra might lead us to expect.
hence: floating point equality is a slippery notion.
For all the reasons above, the test x == y for two floating values is problematic. Depending on how x and y have been calculated, if you expect the two to be exactly the same, you may very well be sadly disappointed.
[1] There exists a standard for decimal floating point, but generally binary floating point is what people use.
[2] For any decimal fraction you can write down with a finite number of digits !
Even with double precision, you'll run into issues such as:
200. * .60499999999999992 = 120.99999999999997
It appears that you want some type of rounding so that 0.99999992 is rounded to 1.00000000 .
If the goal is to produce values to the nearest multiple of 1/1000, try:
#include <math.h>
val = (float) floor((200000.0f*val)-119999.5f)/1000.0f;
If the goal is to produce values to the nearest multiple of 1/200, try:
val = (float) floor((40000.0f*val)-23999.5f)/200.0f;
If the goal is to produce values to the nearest integer, try:
val = (float) floor((200.0f*val)-119.5f);