I studied IEEE floating-point numbers from the following link:
IEEE floating Point Number
In that article, the logic behind the special operations is not clear to me. Why were the special operations defined this way (for example, why is Infinity - Infinity NaN), and likewise for all the others?
If anyone knows, please help me.
NaN is something like “error” or “unknown”. Almost every arithmetic operation with a NaN operand results in NaN. Infinity and -Infinity were introduced to handle overflow in a well-defined way. When overflow happens, you cannot tell by how much the result overflowed, so two infinite values may in fact stand for different quantities. Therefore Infinity - Infinity does not have any well-defined value, and the result is NaN.
C example:
#include <stdio.h>
#include <float.h>
int main() {
    double e, f;
    /* values stored in e and f are rounded to double, so both overflow to inf */
    /* but intermediate values inside one expression can be kept in a wider type */
    e = DBL_MAX * 2.0;
    f = DBL_MAX * 2.5;
    printf("%-12s%e\n", "DBL_MAX", DBL_MAX);
    printf("%-12s%e\n", "e", e);
    printf("%-12s%e\n", "f", f);
    printf("%-12s%e\n", "f - e", f - e);
    printf("%-12s%e\n", "f - e expr", DBL_MAX * 2.5 - DBL_MAX * 2.0);
    return 0;
}
Output (GCC 4.3.2, Linux 2.6.26-2-686):
DBL_MAX     1.79769e+308
e           inf
f           inf
f - e       nan
f - e expr  8.98847e+307
The last expression is a tricky one. Why this happens is described elsewhere. Basically, the intermediate calculation is done in a wider type than double, so the two products do not overflow before the subtraction takes place.
Why does these two code variants produce different floating-point results?
Intermediate Floating-Point Precision (AltDevBlog)
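To see the inf - inf case without depending on how the compiler handles intermediate precision, here is a minimal C++ sketch. It assumes double is IEEE-754 binary64, and the function name is only illustrative. The volatile variables force each product to be rounded to double before the subtraction.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Assumption: double is IEEE-754 binary64.
static_assert(std::numeric_limits<double>::is_iec559, "IEEE-754 assumed");

// Hypothetical helper: returns true when (max * 2.5) - (max * 2.0) is NaN.
// volatile forces every product to be stored as a double, so neither
// operand survives in a wider intermediate format.
bool overflowed_difference_is_nan() {
    volatile double max = std::numeric_limits<double>::max();
    volatile double e = max * 2.0;   // overflows to +infinity
    volatile double f = max * 2.5;   // also +infinity; how far it overflowed is lost
    assert(std::isinf(e) && std::isinf(f));
    return std::isnan(f - e);        // inf - inf has no defined value
}
```

Calling overflowed_difference_is_nan() should return true on such a platform, matching the "f - e" line of the output above.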
C++ Scenario: I have two variables of type double a and b.
Goal: a should be set to the closest multiple of b that is smaller than a.
First approach: Use fmod() or remainder() to get r. Then do a = a - r.
I know that due to the representation of decimal numbers in memory fmod() or remainder() can never guarantee 100% accuracy. In my tests I found that I cannot use fmod() at all, as the variance of its results is too unpredictable (at least as far as I understand). There are many questions and discussions out there talking about this phenomenon.
So is there something I could do to still use fmod()?
By “something” I mean some trick similar to checking whether a equals b by employing a tolerance value:
double EPSILON = 0.005;
if (std::abs(a-b) < EPSILON)
    std::cout << "equal" << '\n';
My second approach works but seems not to be very elegant. I am just subtracting b from a until there is nothing left to subtract:
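#include <limits>

double findRemainder(double x, double y) {
    double rest = 0;  // stays 0 if the loop below never runs
    if (y > x) {
        // make sure x holds the larger value
        double temp = x;
        x = y;
        y = temp;
    }
    // keep subtracting y until no more than y is left
    while (x > y) {
        rest = x - y;
        x = x - y;
    }
    return rest;
}

int main()
{
    typedef std::numeric_limits<double> dbl;
    double a = 13.78, b = 2.2, r = 0;
    r = findRemainder(a, b);
    return 0;
}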
Any suggestions for me?
The problem is impossible, both as stated and as intended.
Remainders are exact
This statement is incorrect: “fmod() or remainder() can never guarantee 100% accuracy.” If the floating-point format supports subnormal numbers (as IEEE-754 does), then fmod(x, y) and remainder are both exact; they produce a result with no rounding error (barring bugs in their implementation). The remainder, as defined for either of them, is always less than y and not more than x in magnitude. Therefore, it is always in a portion of the floating-point format that is at least as fine as y and as x, so all the bits needed for the real-arithmetic remainder can be represented in the floating-point remainder. So a correct implementation will return the exact remainder.
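One way to convince yourself that the remainder carries no rounding error is to rebuild x from it. The following is a minimal sketch, assuming IEEE-754 doubles, a correct std::fmod, and a quotient small enough to be held exactly in a double. The helper name is made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: checks that q*y + r reproduces x exactly, where
// r = std::fmod(x, y) and q is the integer quotient.
// std::fma evaluates q*y + r with a single rounding, so if r is the exact
// remainder, the exact sum is x itself and the comparison succeeds.
bool fmod_reconstructs_exactly(double x, double y) {
    assert(y != 0.0 && std::isfinite(x) && std::isfinite(y));
    const double r = std::fmod(x, y);
    const double q = std::round((x - r) / y);  // integer quotient, assumed modest in size
    return std::fma(q, y, r) == x;
}
```

For the values from the question, fmod_reconstructs_exactly(13.78, 2.2) should return true: the remainder itself is exact, even though 13.78 and 2.2 are not exactly representable.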
Multiples may not be representable
For simplicity of illustration, I will use IEEE-754 binary32, the format commonly used for float. The issues are the same for other formats. In this format, all integers with magnitude up to 2^24 (16,777,216) are representable. After that, due to the scaling by the floating-point exponent, the representable values increase by two: 16,777,218, 16,777,220, and so on. At 2^25 (33,554,432), they increase by four: 33,554,436, 33,554,440. At 2^26 (67,108,864), they increase by eight.
100,000,000 is representable, and so are 99,999,992 and 100,000,008. Now consider asking what multiple of 3 is the closest to 100,000,000. It is 99,999,999. But 99,999,999 is not representable in the binary32 format.
Thus, it is not always possible for a function to take two representable values, a and b, and return the greatest multiple of b that is less than a, using the same floating-point format. This is not because of any difficulty computing the multiple but simply because it is impossible to represent the true multiple in the floating-point format.
In fact, given the standard library, it is easy to compute the remainder; std::fmod(100000000.f, 3.f) is 1. But it is impossible to compute 100000000.f − 1 in the binary32 format.
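The spacing argument and the 100,000,000 example can be checked with a short C++ sketch. It assumes float is IEEE-754 binary32; both helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Assumption: float is IEEE-754 binary32.
static_assert(std::numeric_limits<float>::is_iec559, "binary32 assumed");

// Gap between x and the next representable float above it.
float spacing_above(float x) {
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// The remainder is computed exactly, but a - r has to be rounded back
// to a representable binary32 value.
float subtract_remainder(float a, float b) {
    volatile float r = std::fmod(a, b);
    volatile float result = a - r;   // rounded to the nearest float
    return result;
}
```

With these, spacing_above(100000000.f) is 8, std::fmod(100000000.f, 3.f) is 1, and subtract_remainder(100000000.f, 3.f) comes back as 100000000.f, because 99,999,999 lies between two representable values and rounds to the nearer one.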
The intended question is impossible
The examples shown, 13.78 for a and 2.2 for b, suggest the desire is to produce a multiple for floating-point numbers a and b that are the results of converting decimal numerals to the floating-point format. However, once such conversions are performed, the original numerals cannot be recovered from the resulting a and b.
To see this, consider an original numeral for a of either 99,999,997 or 100,000,002 while the numeral for b is 10. The greatest multiple of 10 less than 99,999,997 is 99,999,990, and the greatest multiple of 10 less than 100,000,002 is 100,000,000.
When either 99,999,997 or 100,000,002 is converted to the binary32 format (using the common method, round-to-nearest-ties-to-even), the result for a is 100,000,000. Converting 10 of course yields 10 for b.
Then a function that computes the greatest multiple of b that is less than a can return only one result. Even if this function uses extended precision (say binary64) so that it can return either 99,999,990 or 100,000,000 even though those are not representable in binary32, it has no way to distinguish the two cases. Whether the original numeral is 99,999,997 or 100,000,002, the a given to the function is 100,000,000, so there is no way for it to know the original numeral and no way for it to decide which result to return.
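The loss of information can be seen directly in a small sketch. It assumes float is IEEE-754 binary32 with the default round-to-nearest mode, and to_binary32 is a hypothetical helper name.

```cpp
#include <cassert>
#include <limits>

// Assumption: float is IEEE-754 binary32, rounding to nearest.
static_assert(std::numeric_limits<float>::is_iec559, "binary32 assumed");

// Hypothetical helper: converts a decimal value (passed as a double, which
// holds these integers exactly) to the nearest binary32 value.
float to_binary32(double decimal_value) {
    assert(decimal_value == decimal_value);  // reject NaN
    volatile float converted = static_cast<float>(decimal_value);
    return converted;
}
```

Both to_binary32(99999997.0) and to_binary32(100000002.0) yield 100000000.f, so any function that only receives the converted value cannot tell the two inputs apart.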
There really is a problem of definition, because most multiples of a floating-point number won't be exactly representable, except maybe if the multiplier is a power of two.
Taking your example and Smalltalk notation (which does not really matter; I use it just because I can evaluate and verify the expressions I propose), the exact fractional representation of double precision 0.1 and 0.9 can be written:
(1+(1<<54)reciprocal) / 10 = 0.1.
(9+(1<<52)reciprocal) / 10 = 0.9.
<< is a bitshift, 1<<54 is 2 raised to the power of 54, and reciprocal is its inverse, 2^-54.
As you can easily see:
(1+(1<<54)reciprocal) * 9 > (9+(1<<52)reciprocal)
That is, the exact multiple of 0.1 is greater than 0.9.
Thus, technically, the answer is 8*0.1 (which is exact in this lucky case)
(8+(1<<51)reciprocal) / 10 = 0.8.
What remainder does is give the EXACT remainder of the division, so it is related to the computations above.
You can try it: you will find something like -2.77555...e-17, or exactly (1<<55) reciprocal negated. The negative sign indicates that 0.9 is close to the nearest multiple (9 times 0.1), but slightly below it.
However, if your problem is to find the greatest value <= 0.9 among the rounded-to-nearest multiples of 0.1, then your answer will be 0.9, because the rounded product is 0.1*9 = 0.9.
You have to resolve that ambiguity first. And if you are interested not in multiples of 0.1 but in multiples of (1/10), then it is again a different matter...
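For readers who prefer C++ to Smalltalk, the same two observations can be reproduced as follows. This is a sketch that assumes IEEE-754 doubles and a correctly implemented std::remainder; the function names are illustrative only.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Assumption: double is IEEE-754 binary64.
static_assert(std::numeric_limits<double>::is_iec559, "binary64 assumed");

// Exact IEEE remainder of the doubles nearest to 0.9 and 0.1.
double exact_remainder() {
    volatile double x = 0.9, y = 0.1;
    const double r = std::remainder(x, y);
    assert(std::fabs(r) <= y / 2);   // the IEEE remainder never exceeds |y|/2
    return r;
}

// 9 times the double nearest to 0.1, rounded to nearest as usual.
double rounded_ninth_multiple() {
    volatile double y = 0.1;
    return 9.0 * y;
}
```

exact_remainder() returns -2^-55 (about -2.77555e-17), while rounded_ninth_multiple() compares equal to 0.9, which is exactly the ambiguity described above.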
double d = 0.1;
float f = 0.1;
should the expression
(f > d)
return true or false?
Empirically, the answer is true. However, I expected it to be false.
0.1 cannot be represented exactly in binary; double has 15 to 16 decimal digits of precision, while float has only about 7. So I assumed that both values are less than 0.1, with the double being closer to 0.1.
I need an exact explanation for why the result is true.
I'd say the answer depends on the rounding mode when converting the double to float. float has 24 binary bits of precision, and double has 53. In binary, 0.1 is:
0.1₁₀ = 0.0001100110011001100110011001100110011001100110011…₂
             ^        ^         ^   ^
             1        10        20  24
So if we round up at the 24th digit, we'll get
0.1₁₀ ~ 0.000110011001100110011001101
which is greater than the exact value and the more precise approximation at 53 digits.
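The rounding at the 24th digit is visible in the stored bit pattern. Here is a small sketch, assuming float is a 32-bit IEEE-754 binary32 value; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Assumption: float is a 32-bit IEEE-754 binary32 value.
static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "binary32 assumed");

// Returns the raw 32-bit pattern of a float.
std::uint32_t bits_of(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Low 23 bits: the stored fraction (the leading 1 of the significand is implicit).
std::uint32_t fraction_field(float value) {
    assert(std::isfinite(value));  // the field has this meaning only for finite values
    return bits_of(value) & 0x007FFFFFu;
}
```

fraction_field(0.1f) is 0x4CCCCD. Plain truncation of the repeating pattern would end in ...CCCC, so the final D is the result of rounding up.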
The number 0.1 will be rounded to the closest floating-point representation with the given precision. This approximation might be either greater than or less than 0.1, so without looking at the actual values, you can't predict whether the single precision or double precision approximation is greater.
Here's what the double precision value gets rounded to (using a Python interpreter):
>>> "%.55f" % 0.1
'0.1000000000000000055511151231257827021181583404541015625'
And here's the single precision value:
>>> "%.55f" % numpy.float32("0.1")
'0.1000000014901161193847656250000000000000000000000000000'
So you can see that the single precision approximation is greater.
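The same check can be made in C++ for arbitrary decimal literals. This is a sketch assuming IEEE-754 float and double and the default "C" locale; compare_float_to_double is a made-up name.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: parses the same decimal literal as float and as double
// and reports which approximation is larger.
// Returns 1 if the float is greater, -1 if it is smaller, 0 if they are equal.
// The float is widened to double for the comparison, which is exact.
int compare_float_to_double(const char* literal) {
    assert(literal != nullptr);
    const float f = std::strtof(literal, nullptr);
    const double d = std::strtod(literal, nullptr);
    return (f > d) - (f < d);
}
```

compare_float_to_double("0.1") gives 1, compare_float_to_double("0.7") gives -1, and compare_float_to_double("0.5") gives 0, so the direction really does depend on the particular value.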
If you convert .1 to binary you get:
%.000110011001100110011001100110011... repeating forever
Mapping to data types, you get:
float(.1) = %.000110011001100110011001101
                                        ^--- note rounding
double(.1) = %.00011001100110011001100110011001100110011001100110011010
Convert that to base 10:
float(.1) = .100000001490116119384765625
double(.1) = .1000000000000000055511151231257827021181583404541015625
This was taken from an article written by Bruce Dawson. It can be found here:
Doubles are not floats, so don’t compare them
I think Eric Lippert's comment on the question is actually the clearest explanation, so I'll repost it as an answer:
Suppose you are computing 1/9 in 3-digit decimal and 6-digit decimal. 0.111 < 0.111111, right?
Now suppose you are computing 6/9. 0.667 > 0.666667, right?
You can't have it that 6/9 in three digit decimal is 0.666 because that is not the closest 3-digit decimal to 6/9!
Since it can't be exactly represented, comparing 1/10 in base 2 is like comparing 1/7 in base 10.
1/7 = 0.142857142857... but comparing at different base 10 precisions (3 versus 6 decimal places) we have 0.143 > 0.142857.
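The decimal analogy can be played through in code. The sketch below uses a hypothetical round_to_places helper; the scaling is itself done in binary floating point, so it is only an illustration of the idea, not an exact decimal rounding routine.

```cpp
#include <cassert>
#include <cmath>

// Rounds x to the given number of decimal places (to nearest, halves away
// from zero). Illustration only: x * scale is computed in binary.
double round_to_places(double x, int places) {
    assert(places >= 0 && places <= 15);
    const double scale = std::pow(10.0, places);
    return std::round(x * scale) / scale;
}
```

round_to_places(1.0 / 9, 3) is smaller than round_to_places(1.0 / 9, 6), but round_to_places(6.0 / 9, 3) is larger than round_to_places(6.0 / 9, 6): the lower-precision value can land on either side.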
Just to add to the other answers talking about IEEE-754 and x86: the issue is even more complicated than they make it seem. In practice there is not always "one" representation of 0.1; depending on how the value is computed, the last digit may end up rounded down or up. This difference can and does actually occur, because the x87 floating-point unit on x86 does not use 64 bits for its internal computations; it uses 80 bits. This is called double extended precision.
So, even among just x86 compilers, it sometimes happens that the same number is represented in two different ways, because some compute its binary representation with 64 bits, while others use 80.
In fact, it can happen even with the same compiler, even on the same machine!
#include <iostream>
#include <cmath>

void foo(double x, double y)
{
    if (std::cos(x) != std::cos(y)) {
        std::cout << "Huh?!?\n"; //← you might end up here when x == y!!
    }
}

int main()
{
    foo(1.0, 1.0);
    return 0;
}
See Why is cos(x) != cos(y) even though x == y? for more info.
The rank of double is greater than that of float in conversions, so in the comparison f is converted to double before the two values are compared. Unsuffixed floating constants are double; the f suffix makes the constant a float. On IEEE-754 implementations 0.1f is still greater than the double 0.1, so the comparison yields true (1). Note that the result of (f > d) is an int, so it must be printed with %d; passing it to %f is undefined behaviour and may print a misleading 0.000000.
#include <stdio.h>
#include <float.h>

int main()
{
    double d = 0.1;
    float f = 0.1f;
    printf("%d\n", (f > d));   /* the comparison result is an int */
    return 0;
}
An MSDN article mentions that when fp:fast mode is enabled, operations like additive identity (a±0.0 = a, 0.0-a = -a) are unsafe. Is there any example where a+0 != a under this mode?
EDIT: As someone mentioned below, this sort of issue normally comes up when doing comparisons. My issue comes from a comparison; the pseudocode looks like this:
if( sum >= threshold) break;
It breaks after adding a value of 0 (v[i]). The v[i] is not the result of a calculation; it is assigned. I understand that if v[i] came from a calculation then rounding might come into play, but why, even though I give v[i] a value of zero, do I still see sum < threshold but sum + v[i] >= threshold?
The reason that it's "unsafe" is that what the compiler assumes to be zero may not really end up being zero, due to rounding errors.
Take this example, which adds two floats at the edge of the precision that 32-bit floats allow:
float a = 33554430, b = 16777215;
float x = a + b;
float y = x - a - b;
float z = 1;
z = z + y;
With fp:fast, the compiler says "since x = a + b, y = x - a - b = 0, so 'z + y' is just z". However, due to rounding errors (a + b is 50331645, which is not representable as a float and is rounded to 50331644), y actually ends up being -1, not 0. So you would get a different result without fp:fast.
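The strict, step-by-step evaluation can be reproduced with the following sketch. It assumes float is IEEE-754 binary32; residual_after_sum is a hypothetical name, and volatile is used here only to stop the compiler from simplifying the expression algebraically.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Assumption: float is IEEE-754 binary32.
static_assert(std::numeric_limits<float>::is_iec559, "binary32 assumed");

// Evaluates (a + b) - a - b one operation at a time. Every intermediate is
// stored in a volatile float, so each step is rounded to binary32 and none
// of the steps can be cancelled symbolically.
float residual_after_sum(float a_in, float b_in) {
    assert(std::isfinite(a_in) && std::isfinite(b_in));
    volatile float a = a_in, b = b_in;
    volatile float x = a + b;   // rounded to the nearest float
    volatile float t = x - a;
    volatile float y = t - b;
    return y;
}
```

residual_after_sum(33554430.f, 16777215.f) returns -1, so z = 1 + y ends up as 0 rather than 1.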
It's not saying something 'fixed' like, "if you set /fp:fast, and variable a happens to be 3.12345, then a+0 might not be a". It's saying that when you set /fp:fast, the compiler will take shortcuts that mean that if you compute a+0, and then compare that to what you stored for a, there is no guarantee that they'll be the same.
There is a great write up on this class of problems (which are endemic to floating point calculations on computers) here: http://www.parashift.com/c++-faq-lite/floating-point-arith2.html
If a is -0.0, then a + 0.0 is +0.0.
it mentions when fp:fast mode is enabled, operations like additive identity (a±0.0 = a, 0.0-a = -a) is unsafe.
What that article says is
Any of the following (unsafe) algebraic rules may be employed by the optimizer when the fp:fast mode is enabled:
And then it lists a±0.0 = a and 0.0-a = -a
It is not saying that these identities are unsafe when fp:fast is enabled. It is saying that these identities are not true for IEEE 754 floating point arithmetic but that /fp:fast will optimize as though they are true.
I'm not certain of an example that shows a + 0.0 == a to be false (except for NaN, obviously), but IEEE 754 has a lot of subtleties, such as when intermediate values should be truncated. One possibility is that if you have some expression that includes + 0.0, that might result in a requirement under IEEE 754 to do truncation on an intermediate value, but that /fp:fast will generate code that doesn't do the truncation, and consequently later results may differ from what is strictly required by IEEE 754.
Using Pascal Cuoq's info here's a program that produces different output based on /fp:fast
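#include <cmath>
#include <float.h>   // _copysign (MSVC-specific)
#include <iostream>

int main() {
    volatile double a = -0.0;
    // IEEE 754: -0.0 + 0.0 is +0.0, so the sign must match that of +0.0
    if (_copysign(1.0, a + 0.0) == _copysign(1.0, 0.0)) {
        std::cout << "correct IEEE 754 results\n";
    } else {
        std::cout << "result not IEEE 754 conformant\n";
    }
    return 0;
}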
When built with /fp:fast the program outputs "result not IEEE 754 conformant", while building with /fp:strict causes the program to output "correct IEEE 754 results".
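A portable variant of the same check can be sketched with std::signbit from <cmath> in place of the MSVC-specific _copysign. It assumes IEEE-754 doubles, the default rounding mode, and a build without fast-math style options; the helper names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Adds +0.0 at run time; volatile keeps the addition from being removed
// by an "a + 0.0 is just a" simplification.
double add_positive_zero(double a_in) {
    volatile double a = a_in;
    volatile double zero = 0.0;
    return a + zero;
}

// True when the sign bit is set, which distinguishes -0.0 from +0.0
// even though the two compare equal.
bool has_sign_bit(double v) {
    assert(!std::isnan(v));
    return std::signbit(v);
}
```

has_sign_bit(-0.0) is true, while has_sign_bit(add_positive_zero(-0.0)) is false under strict IEEE 754 semantics: the sum is +0.0.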
I have an arithmetic expression, for example:
float z = 8.0;
float x = 3.0;
float n = 0;
cout << z / (x/n) + 1 << endl;
Why do I get a normal answer equal to 1, when it should be "nan", "1.#inf", etc.?
I assume you're using floating point arithmetic (though one can't be sure, because you're not telling us).
IEEE754 floating point semantics work on the extended real line and include infinities on both ends. This makes divisions with non-zero numerator well-defined for any (non-NaN) denominator, "consistent with" (i.e. extending continuously) the usual arithmetic rules: x / n is infinity, and z divided by infinity is zero — just as if you had simplified the expression as n * z / x.
The genuinely undefined quantities, such as 0/0, inf/inf, inf - inf and 0 * inf, are represented by the special value NaN.
IEEE 754 specifies that 3/0 = Inf (and likewise for any positive numerator in place of 3). 8/Inf gives 0, and adding 1 gives 1. This works because 0 denotes "0 or something very close to it" and Inf denotes "infinity or a very big number". It also allows some operations on limits to be performed, as it effectively extends the real numbers by the infinities. NaNs are reserved for cases where the limit does not exist (or is not easily computable by a simple implementation).
As a side effect you get some strange effects like 0 == -0 but 1/0 == Inf and 1/-0 == -Inf. It is important to remember that FP arithmetic is not normal arithmetic; for example cos(x) * cos(x) + sin(x) * sin(x) - 1 != 0 even if x is neither NaN nor Inf nor -Inf. For floats and x == 1 the result is -5.9604645e-8. Therefore not all expectations can be easily transferred to it, such as the expectation about division by 0 in this case.
While C/C++ does not mandate that the IEEE 754 specification be used for floating-point numbers, it is the prevailing specification right now and is implemented on virtually any hardware, and for that reason it is used by most C/C++ implementations.
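The expression from the question can be traced step by step with the sketch below. It assumes the implementation follows IEEE 754 for float (checked by the static_assert); without that guarantee, division by zero is not defined by the C++ standard. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Assumption: float follows IEEE-754, which defines division by zero.
static_assert(std::numeric_limits<float>::is_iec559, "IEEE-754 assumed");

// Run-time division; volatile avoids compile-time folding and warnings.
float divide(float num, float den) {
    volatile float d = den;
    return num / d;
}

// Evaluates z / (x / n) + 1 as in the question.
float evaluate(float z, float x, float n) {
    assert(x != 0.0f);                  // 0 / 0 would give NaN instead
    const float inner = divide(x, n);   // 3 / 0 -> +infinity
    return divide(z, inner) + 1.0f;     // 8 / inf -> 0, then 0 + 1 -> 1
}
```

evaluate(8.0f, 3.0f, 0.0f) returns 1, divide(0.0f, 0.0f) is NaN, and divide(1.0f, -0.0f) is negative infinity.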
I was wondering if there is a way of overcoming an accuracy problem that seems to be the result of my machine's internal representation of floating-point numbers:
For the sake of clarity the problem is summarized as:
// str is "4.600"; atof( str ) is 4.5999999999999996
double mw = atof( str );
// The variables used in the columns calculation below are:
// mw = 4.5999999999999996
// p = 0.2
// g = 0.2
// h = 1 (integer)
int columns = (int) ( ( mw - ( h * 11 * p ) ) / ( ( h * 11 * p ) + g ) ) + 1;
Prior to casting to an integer type the result of the columns calculation is 1.9999999999999996; so near yet so far from the desired result of 2.0.
Any suggestions most welcome.
When you use floating point arithmetic strict equality is almost meaningless. You usually want to compare with a range of acceptable values.
Note that some values cannot be represented exactly as floating-point values.
See What Every Computer Scientist Should Know About Floating-Point Arithmetic and Comparing floating point numbers.
There's no accuracy problem.
The result you got (1.9999999999999996) differs from the mathematical result (2) by a margin on the order of 1E-16. That's quite accurate, considering your input "4.600".
You do have a rounding problem, of course. The default conversion from floating point to integer in C++ is truncation; you want something similar to Kip's solution. The details depend on your exact domain: do you expect round(-x) == -round(x)?
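One possible way to apply this to the column count from the question is to take the floor with a small tolerance instead of truncating directly. The sketch below is only an illustration: the function name, the parameter layout and the tolerance of 1e-9 are assumptions, and the tolerance has to be larger than the accumulated rounding error but smaller than any genuine fractional part in your domain.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper mirroring the question's formula.
// tolerance = 0 reproduces plain truncation for non-negative ratios.
int columns(double mw, double p, double g, int h, double tolerance = 1e-9) {
    const double unit = h * 11 * p;
    assert(unit + g > 0.0);
    const double ratio = (mw - unit) / (unit + g);
    // Nudge the ratio up slightly so a value just below an integer counts as it.
    return static_cast<int>(std::floor(ratio + tolerance)) + 1;
}
```

With the question's values, columns(4.6, 0.2, 0.2, 1) gives 2, whereas columns(4.6, 0.2, 0.2, 1, 0.0) gives 1 because the ratio falls just short of 1.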
If you haven't read it, the title of this paper is really correct. Please consider reading it, to learn more about the fundamentals of floating-point arithmetic on modern computers, some pitfalls, and explanations as to why they behave the way they do.
A very simple and effective way to round a floating point number to an integer:
int rounded = (int)(f + 0.5);
Note: this only works if f is always positive. (thanks j random hacker)
If accuracy is really important then you should consider using double precision floating-point numbers rather than single precision. Though from your question it does appear that you already are. However, you still have a problem with checking for specific values. You need code along the lines of (assuming you're checking your value against zero):
if (fabs(value) < epsilon)
    // Do Stuff
where "epsilon" is some small, but non zero value.
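A fixed epsilon only suits values of a known magnitude. A common refinement is to combine an absolute and a relative tolerance, sketched below. The function name and both default tolerances are assumptions that need to be chosen for the application.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Sketch of a combined absolute/relative comparison.
// abs_tol handles values near zero, rel_tol scales with the operands.
bool nearly_equal(double a, double b,
                  double abs_tol = 1e-12, double rel_tol = 1e-9) {
    assert(abs_tol >= 0.0 && rel_tol >= 0.0);
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}
```

For example, 0.1 + 0.2 == 0.3 is false, but nearly_equal(0.1 + 0.2, 0.3) is true, while nearly_equal(1.0, 1.001) is still false.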
On computers, floating-point numbers are usually not exact. Most values are just a close approximation. (1e-16 is close.)
Sometimes there are hidden bits you don't see. Sometimes basic rules of algebra no longer apply: (a+b)+c != a+(b+c). Sometimes comparing a register to memory shows up these subtle differences. Or using a math coprocessor vs a runtime floating point library. (I've been doing this waayyy tooo long.)
C99 defines: (Look in math.h)
double round(double x);
float roundf(float x);
long double roundl(long double x);
Or you can roll your own:
template<class TYPE> inline int ROUND(const TYPE & x)
{ return int( (x > 0) ? (x + 0.5) : (x - 0.5) ); }
For floating point equivalence, try:
template<class TYPE> inline TYPE ABS(const TYPE & t)
{ return t>=0 ? t : - t; }
template<class TYPE> inline bool FLOAT_EQUIVALENT(
const TYPE & x, const TYPE & y, const TYPE & epsilon )
{ return ABS(x-y) < epsilon; }
Use decimals: decNumber++
You can read this paper to find what you are looking for.
You can get the absolute value of the result as seen here:
x = 0.2;
y = 0.3;
equal = (Math.abs(x - y) < 0.000001)