Why float taking 0.699999 instead of 0.7 [duplicate] - c++

This question already has answers here:
Floating point comparison [duplicate]
(5 answers)
Closed 9 years ago.
Here x takes the value 0.699999 instead of 0.7, but y takes 0.5 as assigned. Can you tell me the exact reason for this behavior?
#include <iostream>
using namespace std;

int main()
{
    float x = 0.7;
    float y = 0.5;
    if (x < 0.7)
    {
        if (y < 0.5)
            cout << "2 is right" << endl;
        else
            cout << "1 is right" << endl;
    }
    else
        cout << "0 is right" << endl;
    cin.get();
    return 0;
}

There are lots of things on the internet about IEEE floating point.
0.5 = 1/2
so can be written exactly as a sum of powers of two
0.7 = 7/10 = 1/2 + 1/5 = 1/2 + 1/8 + a bit more... etc
The "bit more" can never be written as a finite sum of powers of two, so you get the closest value the format can manage.
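For example, printing the stored values with a few extra digits makes the difference visible; the digits shown assume the usual IEEE 754 32-bit float:

#include <cstdio>

int main()
{
    float x = 0.7f;   // nearest float is slightly below 0.7
    float y = 0.5f;   // 0.5 is a power of two and is stored exactly
    std::printf("%.9f\n", x);   // prints 0.699999988
    std::printf("%.9f\n", y);   // prints 0.500000000
}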

It has to do with how floating-point numbers are represented in memory. They have a limited number of bits (usually 32 for a float). This means only a limited number of values can be represented, so many numbers from the infinite set of real numbers cannot be represented at all.

If you want to understand exactly why, then have a look at the floating-point representation used by your machine (most probably it's IEEE 754, https://en.wikipedia.org/wiki/IEEE_floating_point ).
If you want to write robust and portable code, never compare floating-point values for equality. You should always compare them with some tolerance (e.g. instead of x == y you should write fabs(x - y) < eps, where eps is, say, 1e-6).
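A minimal sketch of that tolerance comparison (the helper name and the default eps are just illustrative; pick a tolerance that suits the magnitude of your data):

#include <cmath>

// Compare with a tolerance instead of ==, as described above.
bool nearly_equal(double a, double b, double eps = 1e-6)
{
    return std::fabs(a - b) < eps;
}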

Floating-point representation is only approximate, because most real numbers (including many rationals such as 0.7) cannot be represented precisely on a computer.
When operating on floats, errors will in general accumulate.
However, there are some reals which can be represented exactly on a digital computer using its native datatype for this purpose (*), 0.5 being one of them.
(*) meaning the format the floating-point unit of the CPU operates on (standardized in IEEE 754). Specialized libraries can represent integer and rational numbers exactly, beyond the limits of the processor's internal formats. Rounding errors may still occur when converting to a human-readable decimal expansion, the approach does not extend to irrational numbers (e.g. sqrt(3)), and, of course, these libraries come at the cost of speed.

Related

Find float a to closest multiple of float b

C++ Scenario: I have two variables of type double a and b.
Goal: a should be set to the closest multiple of b that is smaller than a.
First approach: Use fmod() or remainder() to get r. Then do a = a - r.
I know that due to the representation of decimal numbers in memory fmod() or remainder() can never guarantee 100% accuracy. In my tests I found that I cannot use fmod() at all, as the variance of its results is too unpredictable (at least as far as I understand). There are many questions and discussions out there talking about this phenomenon.
So is there something I could do to still use fmod()?
With "something" I mean some trick similar to checking whether a equals b by employing an epsilon value:
double EPSILON = 0.005;
if (std::abs(a - b) < EPSILON)
    std::cout << "equal" << '\n';
My second approach works but seems not to be very elegant. I am just subtracting b from a until there is nothing left to subtract:
#include <iostream>
#include <limits>

double findRemainder(double x, double y) {
    double rest;
    if (y > x)
    {
        double temp = x;
        x = y;
        y = temp;
    }
    while (x > y)
    {
        rest = x - y;
        x = x - y;
    }
    return rest;
}

int main()
{
    typedef std::numeric_limits<double> dbl;
    std::cout.precision(dbl::max_digits10);
    double a = 13.78, b = 2.2, r = 0;
    r = findRemainder(a, b);
    return 0;
}
Any suggestions for me?
Preamble
The problem is impossible, both as stated and as intended.
Remainders are exact
This statement is incorrect: “fmod() or remainder() can never guarantee 100% accuracy.” If the floating-point format supports subnormal numbers (as IEEE-754 does), then fmod(x, y) and remainder are both exact; they produce a result with no rounding error (barring bugs in their implementation). The remainder, as defined for either of them, is always less than y and not more than x in magnitude. Therefore, it is always in a portion of the floating-point format that is at least as fine as y and as x, so all the bits needed for the real-arithmetic remainder can be represented in the floating-point remainder. So a correct implementation will return the exact remainder.
Multiples may not be representable
For simplicity of illustration, I will use IEEE-754 binary32, the format commonly used for float. The issues are the same for other formats. In this format, all integers with magnitude up to 2^24, 16,777,216, are representable. After that, due to the scaling by the floating-point exponent, the representable values increase by two: 16,777,218, 16,777,220, and so on. At 2^25, 33,554,432, they increase by four: 33,554,436, 33,554,440. At 2^26, 67,108,864, they increase by eight.
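A small check of that spacing, assuming IEEE-754 binary32 float:

#include <cstdio>

int main()
{
    float f = 16777216.0f;   // 2^24
    float g = f + 1.0f;      // exact sum 16,777,217 is not representable and rounds back down
    float h = f + 2.0f;      // 16,777,218 is representable
    std::printf("%.1f\n%.1f\n%.1f\n", f, g, h);   // 16777216.0, 16777216.0, 16777218.0
}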
100,000,000 is representable, and so are 99,999,992 and 100,000,008. Now consider asking what multiple of 3 is the closest to 100,000,000. It is 99,999,999. But 99,999,999 is not representable in the binary32 format.
Thus, it is not always possible for a function to take two representable values, a and b, and return the greatest multiple of b that is less than a, using the same floating-point format. This is not because of any difficulty computing the multiple but simply because it is impossible to represent the true multiple in the floating-point format.
In fact, given the standard library, it is easy to compute the remainder; std::fmod(100000000.f, 3.f) is 1. But it is impossible to compute 100000000.f − 1 exactly in the binary32 format.
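A quick way to see both halves of that claim, assuming IEEE-754 binary32 float:

#include <cmath>
#include <cstdio>

int main()
{
    // The remainder itself comes out exact...
    float r = std::fmod(100000000.0f, 3.0f);
    std::printf("remainder = %.1f\n", r);   // 1.0

    // ...but the multiple 99,999,999 cannot be stored in a float:
    float m = 100000000.0f - r;             // true result 99,999,999 must be rounded
    std::printf("multiple  = %.1f\n", m);   // 100000000.0, not 99999999.0
}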
The intended question is impossible
The examples shown, 13.78 for a and 2.2 for b, suggest the desire is to produce a multiple for some floating-point numbers a and b that are the results of converting decimal numerals a and b to the floating-point format. However, once such conversions are performed, the original numbers cannot be known from the results a and b.
To see this, consider values for a of either 99,999,997 or 100,000,002 while b is 10. The greatest multiple of 10 less than 99,999,997 is 99,999,990, and the greatest multiple of 10 less than 100,000,002 is 100,000,000.
When either 99,999,997 or 100,000,002 is converted to the binary32 format (using the common method, round-to-nearest-ties-to-even), the result for a is 100,000,000. Converting b of course yields 10 for b.
Then a function that computes the greatest multiple of b that is less than a can return only one result. Even if this function uses extended precision (say binary64) so that it can return either 99,999,990 or 100,000,000 (the former is not even representable in binary32), it has no way to distinguish them. Whether the original a is 99,999,997 or 100,000,002, the a given to the function is 100,000,000, so there is no way for it to know the original a and no way for it to decide which result to return.
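A short demonstration of that loss of information, assuming IEEE-754 binary32 float:

#include <cstdio>

int main()
{
    float a1 = 99999997.0f;    // from the decimal numeral 99,999,997
    float a2 = 100000002.0f;   // from the decimal numeral 100,000,002
    std::printf("%.1f\n%.1f\n", a1, a2);   // both print 100000000.0
    std::printf("%s\n", a1 == a2 ? "same float" : "different floats");   // same float
}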
Hmm,
there really is a problem of definition, because most multiples of a floating point won't be representable exactly, except maybe if the multiplier is a power of two.
Taking your example and Smalltalk notations (which does not really matter, I do it just because i can evaluate and verify the expressions I propose), the exact fractional representation of double precision 0.1 and 0.9 can be written:
(1+(1<<54)reciprocal) / 10 = 0.1.
(9+(1<<52)reciprocal) / 10 = 0.9.
<< is a bitshift, 1<<54 is 2 raised to the power of 54, and reciprocal is its inverse 2^-54.
As you can easily see:
(1+(1<<54)reciprocal) * 9 > (9+(1<<52)reciprocal)
That is, the exact multiple of 0.1 is greater than 0.9.
Thus, technically, the answer is 8*0.1 (which is exact in this lucky case)
(8+(1<<51)reciprocal) / 10 = 0.8.
What remainder does is give the EXACT remainder of the division, so it is related to the above computations.
You can try it; you will find something like -2.77555...e-17, or exactly the negative of (1<<55) reciprocal. The negative sign indicates that 0.9 falls a bit below the nearest exact multiple of 0.1 (i.e. 9*0.1, computed exactly, is slightly greater than 0.9, as shown above).
However, if your problem is to find the greatest <= 0.9, among the rounded to nearest multiple of 0.1, then your answer will be 0.9, because the rounded product is 0.1*9 = 0.9.
You have to resolve that ambiguity first. And if you are not interested in multiples of the floating-point value 0.1, but in multiples of the exact fraction 1/10, then it's again a different matter...
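The same exact remainder can be observed from C++ (the values shown assume IEEE-754 doubles):

#include <cmath>
#include <cstdio>

int main()
{
    // remainder() measures against the *nearest* multiple; the negative sign
    // shows that 9 * 0.1 (computed exactly from the stored doubles) lands just above 0.9.
    std::printf("%.20g\n", std::remainder(0.9, 0.1));   // about -2.7755575615628914e-17
    std::printf("%.20g\n", std::fmod(0.9, 0.1));        // about 0.099999999999999978
}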

problem in using 'double' data type in for loops with fractional incrementation [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 3 years ago.
I had a program which requires one to search values from -100.00 to +100.00 with an increment of 0.01 inside a for loop. But the if conditions aren't working properly even though the code looks correct...
As an example I tried printing a small section, i.e. if(i==1.5){cout<<"yes...";}
It was not working even though the loop was attaining the value i=1.5; I verified that by printing each of the values too.
#include <iostream>
#include <stdio.h>
using namespace std;

int main()
{
    double i;
    for (i = -1.00; i < 1.00; i = i + 0.01)
    {
        if (i > -0.04 && i < 0.04)
        {
            cout << i;
            if (i == 0.01)
                cout << "->yes ";
            else
                cout << "->no ";
        }
    }
    return 0;
}
Output:
-0.04->no -0.03->no -0.02->no -0.02->no -0.01->no 7.5287e-016->no 0.01->no 0.02->no 0.03->no
Process returned 0 (0x0) execution time : 1.391
(notice that 0.01 is being attained but still it prints 'no')
(also notice that 0.04 is being printed even if it wasn't instructed to do so)
Use if (std::fabs(i - 0.01) < 0.00000000001) instead.
double - double precision floating point type. Usually an IEEE-754 64-bit floating point type.
The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.01, which is 1/100) whose denominator is not a power of two cannot be exactly represented.
In simple words, if the number can't be represented by a finite sum of terms 1/(2^n), you don't have the exact number you want to use. So to compare two double numbers, calculate the absolute difference between them and use a tolerance value, e.g. 0.00000000001.
Doubles are stored in binary format. To cut things short, the fractional part is written in binary. Now let's imagine its size is 1 bit. So you have two possible values (for the fraction only): .0 and .5. With two bits you have: .0 .25 .5 .75. With three bits: .125 .25 .375 .5 .625 .75 .875. And so on. But you'll never get exactly 0.1. So what does the computer do? It cheats. It lies to you that the 0.1 you see is 0.1, while it is really something more like 0.1000000000000000002. Why does it look like 0.1? Because formatting of floating-point values has a long-standing tradition of rounding numbers, so 0.10000000000001 becomes 0.1. As a result, adding 0.1 ten times won't give exactly 1.0.
The correct solution is to avoid floating-point numbers unless you don't care about precision. If your program breaks once your floating-point value "changes" by a minuscule amount, then you need to find another way. In your case, using non-fractional loop values will be enough:
for (auto ii = -100; ii < 100; ++ii)
{
    if (ii > -4 && ii < 4)
    {
        cout << (ii / 100.0);
        if (ii == 1)
            cout << "->yes ";
        else
            cout << "->no ";
    }
}

0.1 float is greater than 0.1 double. I expected it to be false [duplicate]

This question already has answers here:
If operator< works properly for floating-point types, why can't we use it for equality testing?
(5 answers)
Closed 9 years ago.
Let:
double d = 0.1;
float f = 0.1;
should the expression
(f > d)
return true or false?
Empirically, the answer is true. However, I expected it to be false.
Since 0.1 cannot be perfectly represented in binary, and double has 15 to 16 decimal digits of precision while float has only 7, I expected both to be less than 0.1, with the double closer to 0.1.
I need an exact explanation for why the comparison is true.
I'd say the answer depends on the rounding mode when converting the double to float. float has 24 binary bits of precision, and double has 53. In binary, 0.1 is:
0.1₁₀ = 0.0001100110011001100110011001100110011001100110011…₂
(the 24th significant binary digit, counting from the first 1, is where single precision must stop)
So if we round up at the 24th digit, we'll get
0.1₁₀ ~ 0.000110011001100110011001101
which is greater than the exact value and the more precise approximation at 53 digits.
The number 0.1 will be rounded to the closest floating-point representation with the given precision. This approximation might be either greater than or less than 0.1, so without looking at the actual values, you can't predict whether the single precision or double precision approximation is greater.
Here's what the double precision value gets rounded to (using a Python interpreter):
>>> "%.55f" % 0.1
'0.1000000000000000055511151231257827021181583404541015625'
And here's the single precision value:
>>> "%.55f" % numpy.float32("0.1")
'0.1000000014901161193847656250000000000000000000000000000'
So you can see that the single precision approximation is greater.
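The same check can be done directly in C++; the digits shown assume IEEE-754 float and double and a runtime that prints correctly rounded conversions:

#include <cstdio>

int main()
{
    std::printf("double 0.1 = %.55f\n", 0.1);
    std::printf("float  0.1 = %.55f\n", (double)0.1f);
    // double 0.1 = 0.1000000000000000055511151231257827021181583404541015625
    // float  0.1 = 0.1000000014901161193847656250000000000000000000000000000
}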
If you convert .1 to binary you get:
0.000110011001100110011001100110011001100110011001100...
repeating forever.
Mapping to data types, you get:
float(.1)  = 0.000110011001100110011001101 (note the round-up in the last bit)
double(.1) = 0.0001100110011001100110011001100110011001100110011001101
Convert that to base 10:
float(.1)  = 0.100000001490116119384765625
double(.1) = 0.1000000000000000055511151231257827021181583404541015625
This approach was taken from an article written by Bruce Dawson. It can be found here:
Doubles are not floats, so don’t compare them
I think Eric Lippert's comment on the question is actually the clearest explanation, so I'll repost it as an answer:
Suppose you are computing 1/9 in 3-digit decimal and 6-digit decimal. 0.111 < 0.111111, right?
Now suppose you are computing 6/9. 0.667 > 0.666667, right?
You can't have it that 6/9 in three digit decimal is 0.666 because that is not the closest 3-digit decimal to 6/9!
Since it can't be exactly represented, comparing 1/10 in base 2 is like comparing 1/7 in base 10.
1/7 = 0.142857142857... but comparing at different base 10 precisions (3 versus 6 decimal places) we have 0.143 > 0.142857.
Just to add to the other answers talking about IEEE-754 and x86: the issue is even more complicated than they make it seem. There is not "one" representation of 0.1 in IEEE-754 - there are two. Either rounding the last digit down or up would be valid. This difference can and does actually occur, because x86 does not have to use 64 bits for its internal floating-point computations; it can use 80 bits! This is called double extended precision.
So, even among just x86 compilers, it sometimes happens that the same number is represented in two different ways, because some compute its binary representation with 64 bits, while others use 80.
In fact, it can happen even with the same compiler, even on the same machine!
#include <iostream>
#include <cmath>

void foo(double x, double y)
{
    if (std::cos(x) != std::cos(y)) {
        std::cout << "Huh?!?\n";  // <- you might end up here when x == y!!
    }
}

int main()
{
    foo(1.0, 1.0);
    return 0;
}
See Why is cos(x) != cos(y) even though x == y? for more info.
The rank of double is greater than that of float in conversions: in the comparison, f is converted to double. Also note that printing the result of the comparison with printf's %f passes an int where a double is expected, which is undefined behavior and can misleadingly print 0.00; use %d to see the actual result. Unsuffixed floating-point literals are of type double, so suffix the float literal with f.
#include <stdio.h>
#include <float.h>

int main()
{
    double d = 0.1;
    float f = 0.1f;
    printf("%d\n", (f > d));   // %d: the comparison yields an int, not a double
    return 0;
}

How does Excel successfully round floating point numbers even though they are imprecise?

For example, this blog says 0.005 is not exactly 0.005, but rounding that number yields the right result.
I have tried all kinds of rounding in C++ and it fails when rounding numbers to certain decimal places. For example, Round(x,y) rounds x to a multiple of y. So Round(37.785,0.01) should give you 37.79 and not 37.78.
I am reopening this question to ask the community for help. The problem is with the impreciseness of floating point numbers (37.785 is represented as 37.78499999999...).
The question is how does Excel get around this problem?
The solution in this round() for float in C++ is incorrect for the above problem.
"Round(37.785,0.01) should give you 37.79 and not 37.78."
First off, there is no consensus that 37.79 rather than 37.78 is the "right" answer here. Tie-breakers are always a bit tough. While always rounding up in the case of a tie is a widely-used approach, it certainly is not the only approach.
Secondly, this isn't a tie-breaking situation. The numerical value in the IEEE binary64 floating point format is 37.784999999999997 (approximately). There are lots of ways to get a value of 37.784999999999997 besides a human typing in a value of 37.785 and happening to have it converted to that floating point representation. In most of these cases, the correct answer is 37.78 rather than 37.79.
Addendum
Consider the following Excel formulae:
=ROUND(37785/1000,2)
=ROUND(19810222/2^19+21474836/2^47,2)
Both cells will display the same value, 37.79. There is a legitimate argument over whether 37785/1000 should round to 37.78 or 37.79 with two place accuracy. How to deal with these corner cases is a bit arbitrary, and there is no consensus answer. There isn't even a consensus answer inside Microsoft: "the Round() function is not implemented in a consistent fashion among different Microsoft products for historical reasons." ( http://support.microsoft.com/kb/196652 ) Given an infinite precision machine, Microsoft's VBA would round 37.785 to 37.78 (banker's round) while Excel would yield 37.79 (symmetric arithmetic round).
There is no argument over the rounding of the latter formula. It is strictly less than 37.785, so it should round to 37.78, not 37.79. Yet Excel rounds it up. Why?
The reason has to do with how real numbers are represented in a computer. Microsoft, like many others, uses the IEEE 64 bit floating point format. The number 37785/1000 suffers from precision loss when expressed in this format. This precision loss does not occur with 19810222/2^19+21474836/2^47; it is an "exact number".
I intentionally constructed that exact number to have the same floating point representation as does the inexact 37785/1000. That Excel rounds this exact value up rather than down is the key to determining how Excel's ROUND() function works: It is a variant of symmetric arithmetic rounding. It rounds based on a comparison to the floating point representation of the corner case.
The algorithm in C++:
#include <cmath>  // std::floor

// Compute 10 to some positive integral power.
// Dealing with overflow (exponent > 308) is an exercise left to the reader.
double pow10 (unsigned int exponent) {
    double result = 1.0;
    double base = 10.0;
    while (exponent > 0) {
        if ((exponent & 1) != 0) result *= base;
        exponent >>= 1;
        base *= base;
    }
    return result;
}

// Round the same way Excel does.
// Dealing with nonsense such as nplaces=400 is an exercise left to the reader.
double excel_round (double x, int nplaces) {
    bool is_neg = false;

    // Excel uses symmetric arithmetic round: Round away from zero.
    // The algorithm will be easier if we only deal with positive numbers.
    if (x < 0.0) {
        is_neg = true;
        x = -x;
    }

    // Construct the nearest rounded values and the nasty corner case.
    // Note: We really do not want an optimizing compiler to put the corner
    // case in an extended double precision register. Hence the volatile.
    double round_down, round_up;
    volatile double corner_case;
    if (nplaces < 0) {
        double scale = pow10 (-nplaces);
        round_down  = std::floor (x * scale);
        corner_case = (round_down + 0.5) / scale;
        round_up    = (round_down + 1.0) / scale;
        round_down /= scale;
    }
    else {
        double scale = pow10 (nplaces);
        round_down  = std::floor (x / scale);
        corner_case = (round_down + 0.5) * scale;
        round_up    = (round_down + 1.0) * scale;
        round_down *= scale;
    }

    // Round by comparing to the corner case.
    x = (x < corner_case) ? round_down : round_up;

    // Correct the sign if needed.
    if (is_neg) x = -x;

    return x;
}
For very accurate arbitrary precision and rounding of floating point numbers to a fixed set of decimal places, you should take a look at a math library like GNU MPFR. While it's a C-library, the web-page I posted also links to a couple different C++ bindings if you want to avoid using C.
You may also want to read a paper entitled "What every computer scientist should know about floating point arithmetic" by David Goldberg at the Xerox Palo Alto Research Center. It's an excellent article demonstrating the underlying process that allows floating point numbers to be approximated in a computer that represents everything in binary data, and how rounding errors and other problems can creep up in FPU-based floating point math.
I don't know how Excel does it, but printing floating point numbers nicely is a hard problem: http://www.serpentine.com/blog/2011/06/29/here-be-dragons-advances-in-problems-you-didnt-even-know-you-had/
So your actual question seems to be, how to get correctly rounded floating point -> string conversions. By googling for those terms you'll get a bunch of articles, but if you're interested in something to use, most platforms provide reasonably competent implementations of sprintf()/snprintf(). So just use those, and if you find bugs, file a report to the vendor.
A function that takes a floating-point number as argument and returns another floating-point number, rounded exactly to a given number of decimal digits, cannot be written, because there are many numbers with a finite decimal representation that have an infinite binary representation; one of the simplest examples is 0.1.
To achieve what you want you must accept to use a different type as a result of your rounding function. If your immediate need is printing the number you can use a string and a formatting function: the problem becomes how to obtain exactly the formatting you expect. Otherwise if you need to store this number in order to perform exact calculations on it, for instance if you are doing accounting, you need a library that's capable of representing decimal numbers exactly. In this case the most common approach is to use a scaled representation: an integer for the value together with the number of decimal digits. Dividing the value by ten raised to the scale gives you the original number.
If any of these approaches is suitable, I'll try and expand my answer with practical suggestions.
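As a minimal illustration of the scaled representation mentioned above (illustrative only: no overflow handling, rounding policy, or mixed-scale arithmetic):

#include <cstdint>
#include <cstdio>

int main()
{
    // Keep three decimal digits by storing thousandths in an integer.
    const std::int64_t scale = 1000;
    std::int64_t a = 37785;            // exactly 37.785
    std::int64_t b = a + 215;          // 37.785 + 0.215 = 38.000, exactly
    std::printf("%lld.%03lld\n", (long long)(b / scale), (long long)(b % scale));   // 38.000
}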
Excel rounds numbers like this "correctly" by doing WORK. They started in 1985 with a fairly "normal" set of floating-point routines, added some scaled-integer fake floating point, and they've been tuning those things and adding special cases ever since. The app DID have most of the same "obvious" bugs that everybody else did; it's just that it mostly had them a long time ago. I filed a couple myself, back when I was doing tech support for them in the early 90s.
I believe the following C# code rounds numbers as they are rounded in Excel. To exactly replicate the behavior in C++ you might need to use a special decimal type.
In plain English, the double-precision number is converted to a decimal and then rounded to fifteen significant digits (not to be confused with fifteen decimal places). The result is rounded a second time to the specified number of decimal places.
That might seem weird, but what you have to understand is that Excel always displays numbers that are rounded to 15 significant figures. If the ROUND() function weren't using that display value as a starting point, and used the internal double representation instead, then there would be cases where ROUND(A1,N) did not seem to correspond to the actual value in A1. That would be very confusing to a non-technical user.
The double which is closest to 37.785 has an exact decimal value of 37.784999999999996589394868351519107818603515625. (Any double can be represented precisely by a finite base ten decimal because one quarter, one eighth, one sixteenth, and so forth all have finite decimal expansions.) If that number were rounded directly to two decimal places, there would be no tie to break and the result would be 37.78. If you round to 15 significant figures first you get 37.7850000000000. If this is further rounded to two decimal places, then you get 37.79, so there is no real mystery after all.
// Convert to a floating decimal point number, round to fifteen
// significant digits, and then round to the number of places
// indicated.
static decimal SmartRoundDouble(double input, int places)
{
    int numLeadingDigits = (int)Math.Log10(Math.Abs(input)) + 1;

    decimal inputDec = GetAccurateDecimal(input);
    inputDec = MoveDecimalPointRight(inputDec, -numLeadingDigits);

    decimal round1 = Math.Round(inputDec, 15);
    round1 = MoveDecimalPointRight(round1, numLeadingDigits);

    decimal round2 = Math.Round(round1, places, MidpointRounding.AwayFromZero);
    return round2;
}

static decimal MoveDecimalPointRight(decimal d, int n)
{
    if (n > 0)
        for (int i = 0; i < n; i++)
            d *= 10.0m;
    else
        for (int i = 0; i > n; i--)
            d /= 10.0m;
    return d;
}

// The constructor for decimal that accepts a double does
// some rounding by default. This gets a more exact number.
static decimal GetAccurateDecimal(double r)
{
    string accurateStr = r.ToString("G17", CultureInfo.InvariantCulture);
    return Decimal.Parse(accurateStr, CultureInfo.InvariantCulture);
}
What you NEED is this:
double f = 22.0/7.0;
cout.setf(ios::fixed, ios::floatfield);
cout.precision(6);
cout<<f<<endl;
How it can be implemented (just an overview for rounding the last digit):
long getRoundedPrec(double d, double precision = 9)
{
    precision = (int)precision;
    // Scale the fractional part up by one extra digit so it can be used for rounding.
    long l = (long)((d - (double)((long)d)) * pow(10.0, precision + 1));
    int lastDigit = (int)(l - ((l / 10) * 10));
    if (lastDigit >= 5)
        l = l / 10 + 1;
    else
        l = l / 10;
    return l;
}
Just as base-10 numbers must be rounded as they are converted to base-2, it is possible to round a number as it is converted from base-2 to base-10. Once the number has a base-10 representation it can be rounded again in a straightforward manner by looking at the digit to the right of the one you wish to round.
While there's nothing wrong with the above assertion, there's a much more pragmatic solution. The problem is that the binary representation tries to get as close as possible to the decimal number, even if that binary is less than the decimal. The amount of error is within [-0.5,0.5] least significant bits (LSB) of the true value. For rounding purposes you'd rather it be within [0,1] LSB so that the error is always positive, but that's not possible without changing all the rules of floating point math.
The one thing you can do is add 1 LSB to the value, so the error is within [0.5,1.5] LSB of the true value. This is less accurate overall, but only by a very tiny amount; when the value is rounded for representation as a decimal number it is much more likely to be rounded to a proper decimal number because the error is always positive.
To add 1 LSB to the value before rounding it, see the answers to this question. For example in Visual Studio C++ 2010 the procedure would be:
Round(_nextafter(37.785,37.785*1.1),0.01);
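A sketch of the same idea using only standard C++, with std::nextafter and a plain round-to-two-places standing in for the Round() from the question:

#include <cmath>
#include <cstdio>

int main()
{
    double x = 37.785;                                            // stored as 37.784999999999996...

    // Rounding the stored value directly gives 37.78:
    std::printf("%.2f\n", std::round(x * 100.0) / 100.0);        // 37.78

    // Nudging it up by one representable step first tips it past the corner case:
    double bumped = std::nextafter(x, x * 1.1);
    std::printf("%.2f\n", std::round(bumped * 100.0) / 100.0);   // 37.79
}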
There are many ways to clean up the displayed result of a floating-point value, using statistical, numerical, and other algorithms.
The easiest one is probably searching for repetitive 9s or 0s in the range of precision. If there are any, maybe those 9s are redundant, just round them up. But this may not work in many cases. Here's an example for a float with 6 digits of precision:
2.67899999 → 2.679
12.3499999 → 12.35
1.20000001 → 1.2
Excel always limits the input range to 15 digits and rounds the output to a maximum of 15 digits, so this might be one of the ways Excel uses.
Or you can carry the precision along with the number. After each step, adjust the accuracy depending on the precision of the operands. For example
1.113 → 3 decimal digits
6.15634 → 5 decimal digits
Since both numbers are inside the double's 16-17 digit precision range, their sum will be accurate to the larger of the two, which is 5 decimal digits. Similarly, 3+5 < 16, so their product will be precise to 8 decimal digits
1.113 + 6.15634 = 7.26934 → 5 decimal digits
1.113 * 6.15634 = 6.85200642 → 8 decimal digits
But 4.1341677841 * 2.251457145 will only get double's accuracy, because the exact result exceeds double's precision
Another efficient algorithm is Grisu but I haven't had an opportunity to try.
In 2010, Florian Loitsch published a wonderful paper in PLDI, "Printing floating-point numbers quickly and accurately with integers", which represents the biggest step in this field in 20 years: he mostly figured out how to use machine integers to perform accurate rendering! Why do I say "mostly"? Because although Loitsch's "Grisu3" algorithm is very fast, it gives up on about 0.5% of numbers, in which case you have to fall back to Dragon4 or a derivative
Here be dragons: advances in problems you didn’t even know you had
In fact I think Excel must combine many different methods to achieve the best result of all
Example When a Value Reaches Zero
In Excel 95 or earlier, enter the following into a new workbook:
A1: =1.333+1.225-1.333-1.225
Right-click cell A1, and then click Format Cells. On the Number tab, click Scientific under Category. Set the Decimal places to 15.
Rather than displaying 0, Excel 95 displays -2.22044604925031E-16.
Excel 97, however, introduced an optimization that attempts to correct for this problem. Should an addition or subtraction operation result in a value at or very close to zero, Excel 97 and later will compensate for any error introduced as a result of converting an operand to and from binary. The example above when performed in Excel 97 and later correctly displays 0 or 0.000000000000000E+00 in scientific notation.
Floating-point arithmetic may give inaccurate results in Excel
As mjfgates says, Excel does hard work to get this "right". The first thing to do when you try to reimplement this, is define what you mean by "right". Obvious solutions:
implement rational arithmetic
Slow but reliable.
implement a bunch of heuristics
Fast but tricky to get right (think "years of bug reports").
It really depends on your application.
Most decimal fractions can't be accurately represented in binary.
double x = 0.0;
for (int i = 1; i <= 10; i++)
{
    x += 0.1;
}
// x should now be 1.0, right?
//
// It isn't. Test it and see.
One solution is to use BCD. It's old. But, it's also tried and true. We have a lot of other old ideas that we use every day (like using a 0 to represent nothing...).
Another technique uses scaling upon input/output. This has the advantage of nearly all math being integer math.
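For instance, the loop above becomes exact if the accumulation is done in scaled integers and the value is only converted at output (a small sketch of the scaling idea):

#include <cstdio>

int main()
{
    int tenths = 0;                    // accumulate in units of 0.1
    for (int i = 1; i <= 10; i++)
    {
        tenths += 1;                   // adding one tenth is exact integer math
    }
    std::printf("%d.%d\n", tenths / 10, tenths % 10);   // prints 1.0
}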

Why comparing double and float leads to unexpected result? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
strange output in comparison of float with float literal
float f = 1.1;
double d = 1.1;
if(f == d) // returns false!
Why is it so?
The important factors under consideration with float or double numbers are:
Precision & Rounding
Precision:
The precision of a floating point number is how many digits it can represent without losing any information it contains.
Consider the fraction 1/3. The decimal representation of this number is 0.33333333333333…, with the 3s going out to infinity. An infinite-length number would require infinite memory to be stored with exact precision, but float and double data types typically have only 4 or 8 bytes. Thus float and double values can only store a certain number of digits, and the rest are bound to get lost. There is no accurate way of representing numbers that require more precision than the variable can hold.
Rounding:
There is a non-obvious difference between binary and decimal (base 10) numbers.
Consider the fraction 1/10. In decimal, this can be easily represented as 0.1, and 0.1 can be thought of as an easily representable number. However, in binary, 0.1 is represented by the infinite sequence: 0.00011001100110011…
An example:
#include <iostream>
#include <iomanip>

int main()
{
    using namespace std;
    cout << setprecision(17);
    double dValue = 0.1;
    cout << dValue << endl;
}
This output is:
0.10000000000000001
And not
0.1.
This is because the double had to truncate the approximation due to its limited memory, which results in a number that is not exactly 0.1. Such a scenario is called a rounding error.
Whenever you compare two nearly equal float or double values, such rounding errors kick in and the comparison can yield incorrect results. This is the reason you should never compare floating-point values using ==.
The best you can do is to take their difference and check if it is less than an epsilon.
abs(x - y) < epsilon
Try running this code; the results will make the reason obvious.
#include <iomanip>
#include <iostream>

int main()
{
    std::cout << std::setprecision(100) << (double)1.1 << std::endl;
    std::cout << std::setprecision(100) << (float)1.1 << std::endl;
    std::cout << std::setprecision(100) << (double)((float)1.1) << std::endl;
}
The output:
1.100000000000000088817841970012523233890533447265625
1.10000002384185791015625
1.10000002384185791015625
Neither float nor double can represent 1.1 accurately. When you try to do the comparison the float number is implicitly upconverted to a double. The double data type can accurately represent the contents of the float, so the comparison yields false.
Generally you shouldn't compare floats to floats, doubles to doubles, or floats to doubles using ==.
The best practice is to subtract them, and check if the absolute value of the difference is less than a small epsilon.
if (std::fabs(f - d) < std::numeric_limits<float>::epsilon())
{
    // ...
}
One reason is that floating-point numbers are (more or less) binary fractions, and can only approximate many decimal numbers. Many decimal numbers must necessarily be converted to repeating binary "decimals", which cannot be stored in a finite number of bits. This introduces a rounding error.
From wikipedia:
For instance, 1/5 cannot be represented exactly as a floating point number using a binary base but can be represented exactly using a decimal base.
In your particular case, a float and a double will have different rounding for the repeating binary fraction that must be used to represent 1.1. You will be hard pressed to get them to be "equal" after their corresponding conversions have introduced different levels of rounding error.
The code I gave above solves this by simply checking if the values are within a very short delta. Your comparison changes from "are these values equal?" to "are these values within a small margin of error from each other?"
Also, see this question: What is the most effective way for float and double comparison?
There are also a lot of other oddities about floating point numbers that break a simple equality comparison. Check this article for a description of some of them:
http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm
The IEEE 754 32-bit float can store: 1.1000000238...
The IEEE 754 64-bit double can store: 1.1000000000000000888...
See why they're not "equal"?
In IEEE 754, fractions are stored in powers of 2:
2^(-1), 2^(-2), 2^(-3), ...
1/2, 1/4, 1/8, ...
Now we need a way to represent 0.1 (the fractional part of 1.1). This is (a simplified version of) the 32-bit IEEE 754 representation (float):
2^(-4) + 2^(-5) + 2^(-8) + 2^(-9) + 2^(-12) + 2^(-13) + ... + 2^(-20) + 2^(-21) + 2^(-23)
00011001100110011001101
1.10000002384185791015625
With 64-bit double, it's even more accurate. It doesn't stop at 2^(-23); it keeps going for about twice as many bits, ending with 2^(-48) + 2^(-49) + 2^(-51).
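That decomposition can be checked by rebuilding the value from those powers of two (bit positions taken from the expansion above; IEEE 754 float and double assumed):

#include <cmath>
#include <cstdio>

int main()
{
    // The leading 1 plus the set fraction bits of the float value of 1.1.
    const int bits[] = {4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 23};
    double sum = 1.0;
    for (int b : bits)
        sum += std::ldexp(1.0, -b);                    // add 2^(-b), exact in a double

    std::printf("sum        = %.30f\n", sum);          // 1.100000023841857910156250000000
    std::printf("(float)1.1 = %.30f\n", (double)(float)1.1);
    std::printf("equal: %s\n", sum == (double)(float)1.1 ? "yes" : "no");   // yes
}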
Resources
IEEE 754 Converter (32-bit)
Floats and doubles are stored in a binary format that can not represent every number exactly (it's impossible to represent the infinitely many possible different numbers in a finite space).
As a result they do rounding. A float has to round more than a double because it is smaller, so 1.1 rounded to the nearest valid float is different from 1.1 rounded to the nearest valid double.
To see what numbers are valid floats and doubles see Floating Point