0 + 0 + 0... + 0 != 0 - c++

I have a program that is finding paths in a graph and outputting the cumulative weight. All of the edges in the graph have an individual weight of 0 to 100 in the form of a float with at most 2 decimal places.
On Windows/Visual Studio 2010, for a particular path consisting of edges with 0 weight, it outputs the correct total weight of 0. However on Linux/GCC the program is saying the path has a weight of 2.35503e-38. I have had plenty of experiences with crazy bugs caused by floats, but when would 0 + 0 ever equal anything other than 0?
The only thing I can think of that is causing this is the program does treat some of the weights as integers and uses implicit coercion to add them to the total. But 0 + 0.0f still equals 0.0f!
As a quick fix I reduce the total to 0 when less then 0.00001 and that is sufficient for my needs, for now. But what vodoo causes this?
NOTE: I am 100% confident that none of the weights in the graph exceed the range I mentioned and that all of the weights in this particular path are all 0.
EDIT: To elaborate, I have tried both reading the weights from a file and setting them in the code manually as equal to 0.0f No other operation is being performed on them other than adding them to the total.

Because it's an IEEE floating point number, and it's not exactly equal to zero.
http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm

[...] in the form of a float with at most 2 decimal places.
There is no such thing as a float with at most 2 decimal places. Floats are almost always represented as a binary floating point number (fractional binary mantissa and integer exponent). So many (most) numbers with 2 decimal places cannot be represented exactly.
For example, 0.20f may look as an innocent and round fraction, but
printf("%.40f\n", 0.20f);
will print: 0.2000000029802322387695312500000000000000.
See, it does not have 2 decimal places, it has 26!!!
Naturally, for most practical uses the difference in negligible. But if you do some calculations you may end up increasing the rounding error and making it visible, particularly around 0.

It may be that your floats containing values of "0.0f" aren't actually 0.0f (bit representation 0x00000000), but a very, very small number that evaluates to about 0.0. Because of the way IEEE754 spec defines float representations, if you have, for example, a very small mantissa and a 0 exponent, while it's not equal to absolute 0, it will round to 0. However, if you add these numbers together a sufficiently number of times, the very small amount will accumulate into a value that eventually will become non-zero.
Here is an example case which gives the illusion of 0 being non-zero:
float f = 0.1f / 1000000000;
printf("%f, %08x\n", f, *(unsigned int *)&f);
float f2 = f * 10000;
printf("%f, %08x\n", f2, *(unsigned int *)&f2);
If you are assigning literals to your variables and adding them, though, it is possible that either the compiler is not translating 0 into 0x0 in memory. If it is, and this still is happening, then it's also possible that your CPU hardware has a bug relating to turning 0s into non-zero when doing ALU operations that may have squeaked by their validation efforts.
However, it is good to remember that IEEE floating point is only an approximation, and not an exact representation of any particular float value. So any floating-point operations are bound to have some amount of error.

Related

If a floating-point number is representable in my machine, will its inverse be representable in my machine?

Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such as 0 <= v < 1/x.
It seems fairly obvious that I could write double v = DBL_EPSILON;, but, if x is big enough, could it happen that v end up being bigger than the real value 1/x because it's so small that cannot be represented in my machine?
In other words, and more precisely, if I have a positive real number x and an object x1 of type double whose stored value represents x exactly, is it guaranteed that the value represented by DBL_EPSILON is less than the real number 1/x?
In case it is not guaranteed, how can I calculate the biggest value of type double that ensures that DBL_EPSILON is less than the real number 1/x?
I will assume double is IEEE 754 binary64.
If a floating-point number is representable in my machine, will its inverse be representable in my machine?
Not necessarily, for two reasons:
The inverse might not be a floating-point number.
For example, although 3 is a floating-point number, 1/3 is not.
The inverse might overflow.
For example, the inverse of 2βˆ’1074 is 21074, which is not only larger than all finite floating-point numbers but more than halfway from the largest finite floating-point number, 1.fffffffffffffp+1023 = 21024 βˆ’ 2971, to what would be the next one after that, 21024, if the range of exponents were larger.
So the inverse of 2βˆ’1074 is rounded to infinity.
Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such as 0 <= v < 1/x.
The smallest such 𝑣 is always zero.
If you restrict it to be nonzero, it will always be the smallest subnormal floating-point number, 0x1pβˆ’1074, or roughly 4.9406564584124654 × 10βˆ’324, irrespective of π‘₯ (unless π‘₯ is infinite).
But perhaps you want the largest such 𝑣 rather than the smallest such 𝑣.
The largest such 𝑣 is always either 1 ⊘ π‘₯ = fl(1/π‘₯) (that is, the floating-point number nearest to 1/π‘₯, which is what you get by writing 1/x in C), or the next floating-point number closer to zero (which you can get by writing nextafter(1/x, 0) in C): in the default rounding mode, the division operator always returns the nearest floating-point number to the true quotient, or one of the two nearest ones if there is a tie.
You can also get the largest such 𝑣 by setting the rounding mode with fesetround(FE_DOWNWARD) or fesetround(FE_TOWARDZERO) and then just computing 1/x, although toolchain support for non-default rounding modes is spotty and mostly they serve to shake out bugs in ill-conditioned code rather than to give reliable rounding semantics.
It seems fairly obvious that I could write double v = DBL_EPSILON;, but, if x is big enough, could it happen that v end up being bigger than the real value 1/x because it's so small that cannot be represented in my machine?
1/x is never rounded to zero unless π‘₯ is infinite or you have nonstandard flush-to-zero semantics enabled (so results which would ordinarily be subnormal are instead rounded to zero, such as when π‘₯ is the largest finite floating-point number 0x1.fffffffffffffp+1023).
But flush-to-zero aside, there are many values of π‘₯ for which 1/π‘₯ and fl(1/π‘₯) = 1/x is smaller than DBL_EPSILON.
For example, if π‘₯ = 0x1p+1000 (that is, 21000 β‰ˆ 1.0715086071862673 × 10301), then 1/π‘₯ = fl(1/π‘₯) = 1/x = 0x1pβˆ’1000 (that is, 2βˆ’1000 β‰ˆ 9.332636185032189 × 10βˆ’302) is far below DBL_EPSILON = 0x1pβˆ’52 (that is, 2βˆ’52 β‰ˆ 2.220446049250313  × 10βˆ’16).
1/π‘₯ in this case is a floating-point number, so the reciprocal is computed exactly in floating-point arithmetic; there is no rounding at all.
The largest floating-point number below 1/π‘₯ in this case is 0x1.fffffffffffffpβˆ’1001, or 2βˆ’1000 βˆ’ 2βˆ’1053.
DBL_EPSILON (2βˆ’52) is not the smallest floating-point number (2βˆ’1074), or even the smallest normal floating-point number (2βˆ’1022).
Rather, DBL_EPSILON is the distance from 1 to the next larger floating-point number, 1 + 2βˆ’52, sometimes written ulp(1) to indicate that it is the magnitude of the least significant digit, or unit in the last place, in the floating-point representation of 1.
In case it is not guaranteed, how can I calculate the biggest value of type double that ensures that DBL_EPSILON is less than the real number 1/x?
That would be 1/DBL_EPSILON - 1, or 252 βˆ’ 1.
But what do you want this number for?
Why are you trying to use DBL_EPSILON here?
The inverse of positive infinity is, of course, smaller than any positive rational number. Beyond that, even the largest finite floating point number has a multiplicative inverse well above the smallest representable floating point number of equivalent width, thanks to denormal numbers.
If a floating-point number is representable in my machine, will its inverse be representable in my machine?
No. There is no specification that 1.0/DBL_MIN <= DBL_MAX and 1.0/DBL_MAX <= DBL_MIN both must be true. One is usually true. With sub-normals, 1.0/sub-normal is often > DBL_MAX.
Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such as 0 <= v < 1/x.
This is true as v could be zero unless for some large x like DBL_MAX, 1.0/x is zero. That is a possibility. With sub-normals, that is rarely the case as 1.0/DBL_MAX is representable as a value more than 0.
DBL_EPSILON has little to do with the above. OP's issues are more dependent on DBL_MAX, DBL_MIN and is the double supports sub-normals. Many FP encodings about balanced where 1/DBL_MIN is somewhat about DBL_MIN, yet C does not require that symmetry.
No. Floating point numbers are balanced around 1.0 to minimize the effect of calculating inverses, but this balance is not exact, ad the middle point for the exponent (the value 0x3fff... fot the exponent, gives the same number of powers of two above and below 1.0. But the exponent value 0x4ffff... is reserved for infinity and then nans, while the value 0x0000... is reserved for denormals (also called subnormals) These values are not normalized (and some architectures don't even implement them), but in those that implement, they add as many bits as the width of the mantissa as powers of 2 in addition (but with with lower precision) to the normalized ones, in the range of
the negative exponents. This means that you have a set o numbers, quite close to zero, for which when you compute their inverses, you always get infinity.
For doubles you have 52 more powers of two, or around 15 more powers of ten. For floats, this is around 7 more powers of ten.
But this also means that if you calculate the inverse of a large number you'll always get a number different than zero.

Why does adding 1 to numeric_limits<float>::min() return 1?

How come subtracting 1 from float max returns a sensible value, but adding 1 to float min returns 1?
I thought that if you added or subtracted a value smaller than the epsilon for that particular magnitude, then nothing would happen and there would be no increase or decrease.
Here is the code I compiled with g++ with no flags and ran on x86_64.
#include <limits>
#include <iostream>
int main() {
float min = std::numeric_limits<float>::min() + 1;
float max = std::numeric_limits<float>::max() - 1;
std::cout << min << std::endl << max << std::endl;
return 0;
}
Outputs this:
1
3.40282e+38
I would expect it to output this:
-3.40282e+38
3.40282e+38
std::numeric_limits<float>::min() returns the smallest normalized positive value. To get the value that has no value lower than it, use std::numeric_limits<float>::lowest().
https://en.cppreference.com/w/cpp/types/numeric_limits/min
min is the smallest-magnitude positive normalized float, a very tiny positive number (about 1.17549e-38), not a negative number with large magnitude. Notice that the - is in the exponent, and this is scientific notation. e-38 means 38 zeros after the decimal point. Try it out on https://www.h-schmidt.net/FloatConverter/IEEE754.html to play with the bits in a binary float.
std::numeric_limits<float>::min() is the minimum magnitude normalized float, not -max. CppReference even has a note about this possibly being surprising.
Do you know why that was picked to be the value for min() rather than the lowest negative value? Seems to be an outlier with regards to all the other types.
Some of the sophistication in numeric_limits<T> like lowest and denorm_min is new in C++11. Most of the choice of what to define mostly followed C. Historical C valued economy and didn't define a lot of different names. (Smaller is better on ancient computers, and also less stuff in the global namespace which is all C had access to.)
Float types are normally1 symmetric around 0 (sign/magnitude representation), so C didn't have a separate named constant for the most-negative float / double / long double. Just FLT_MAX and FLT_MIN CPP macros. C doesn't have templates, so you know when you're writing FP code and can use a - on the appropriate constant if necessary.
If you're only going to have a few named constants, the three most interesting ones are:
FLT_EPSILON tells you about the available precision (mantissa bits): nextafter(1.0, +INF) - 1.0
FLT_MIN / FLT_MAX min (normalized) and max magnitudes of finite floats. This depends mostly on how many exponent bits a float has.
They're not quite symmetric around 1.0 for 2 reasons: all-ones mantissa in FLT_MAX, and gradual underflow (subnormals) taking up the lowest exponent-field (0 with bias), but FLT_MIN ignoring subnormals. FLT_MIN * FLT_MAX is about 3.99999976 for IEEE754 binary32 float. (You normally want to avoid subnormals for performance reasons, and so you have room for gradual underflow, so it makes sense that FLT_MIN isn't denorm_min)
(Fun fact: 0.0 is a special case of a subnormal: exponent field = 0 implying a mantissa of 0.xxx instead of 1.xxx).
Footnote 1: CppReference points out that C++11 std::numeric_limits<T>::lowest() could be different from -max for 3rd-party FP types, but isn't for standard C++ FP types.
lowest is what you wanted: the most-negative finite value. It's consistent across integer and FP types as being the most-negative value, so for example you could use it as an initializer for a templated search loop that uses std::min to find the lowest value in an array.
C++11 also introduced denorm_min, the minimum positive subnormal aka denormal value for FP types. In IEEE754, the object representation has all bits 0 except for a 1 in the low bit of the mantissa.
The float result for 1.0 + 1.17549e-38 (after rounding to the nearest float) is exactly 1.0. min is lower than std::numeric_limits<float>::epsilon so the entire change is lost to rounding error when added to 1.0.
So even if you did print the float with full precision (or as a hex float), it would be 1.0. But you're just printing with the default formatting for cout which rounds to some limited precision, like 6 decimal digits. https://en.cppreference.com/w/cpp/io/manip/setprecision
(An earlier version of the question included the numeric value of min ~= 1.17549e-38; this answer started out addressing that mixup and I haven't bothered to fully rewrite those parts).

Find float a to closest multiple of float b

C++ Scenario: I have two variables of type double a and b.
Goal: a should be set to the closest multiple of b that is smaller than a.
First approach: Use fmod() or remainder() to get r. Then do a = a - r.
I know that due to the representation of decimal numbers in memory fmod() or remainder() can never guarantee 100% accuracy. In my tests I found that I cannot use fmod() at all, as the variance of its results is too unpredictable (at least as far as I understand). There are many questions and discussions out there talking about this phenomenon.
So is there something I could do to still use fmod()?
With β€œsomething” I mean some trick similar to checking if a equals b by employing a value double
EPSILON = 0.005;
if (std::abs(a-b) < EPSILON)
std::cout << "equal" << '\n';
My second approach works but seems not to be very elegant. I am just subtracting b from a until there is nothing left to subtract:
double findRemainder(double x, double y) {
double rest;
if (y > x)
{
double temp = x;
x = y;
y = temp;
}
while (x > y)
{
rest = x - y;
x = x - y;
}
return rest;
}
int main()
{
typedef std::numeric_limits<double> dbl;
std::cout.precision(dbl::max_digits10);
double a = 13.78, b = 2.2, r = 0;
r = findRemainder(a, b);
return 0;
}
Any suggestions for me?
Preamble
The problem is impossible, both as stated and as intended.
Remainders are exact
This statement is incorrect: β€œfmod() or remainder() can never guarantee 100% accuracy.” If the floating-point format supports subnormal numbers (as IEEE-754 does), then fmod(x, y) and remainder are both exact; they produce a result with no rounding error (barring bugs in their implementation). The remainder, as defined for either of them, is always less than y and not more than x in magnitude. Therefore, it is always in a portion of the floating-point format that is at least as fine as y and as x, so all the bits needed for the real-arithmetic remainder can be represented in the floating-point remainder. So a correct implementation will return the exact remainder.
Multiples may not be representable
For simplicity of illustration, I will use IEEE-754 binary32, the format commonly used for float. The issues are the same for other formats. In this format, all integers with magnitude up to 224, 16,777,216, are representable. After that, due to the scaling by the floating-point exponent, the representable values increase by two: 16,777,218, 16,777,220, and so on. At 225, 33,554,432, they increase by four: 33,554,436, 33,554,440. At 226, 67,108,864, they increase by eight.
100,000,000 is representable, and so are 99,999,992 and 100,000,008. Now consider asking what multiple of 3 is the closest to 100,000,000. It is 99,999,999. But 99,999,999 is not representable in the binary32 format.
Thus, it is not always possible for a function to take two representable values, a and b, and return the greatest multiple of b that is less than a, using the same floating-point format. This is not because of any difficulty computing the multiple but simply because it is impossible to represent the true multiple in the floating-point format.
In fact, given the standard library, it is easy to compute the remainder; std::fmod(100000000.f, 3.f) is 1. But it is impossible to compute 100000000.f βˆ’ 1 in the binary32 format.
The intended question is impossible
The examples shown, 13.78 for a and 2.2 for b, suggest the desire is to produce a multiple for some floating-point numbers a and b that are the results of converting decimal numerals a and b to the floating-point format. However, once such conversions are performed, the original numbers cannot be known from the results a and b.
To see this, consider values for a of either 99,999,997 or 100,000,002 while b is 10. The greatest multiple of 10 less than 99,999,997 is 99,999,990, and the greatest multiple of 10 less than 100,000,002 is 100,000,000.
When either 99,999,997 or 100,000,002 is converted to the binary32 format (using the common method, round-to-nearest-ties-to-even), the result for a is 100,000,000. Converting b of course yields 10 for b.
Then a function that converts the greatest multiple of a that is less than b can return only one result. Even if this function uses extended precision (say binary64) so that it can return either 99,999,990 or 100,000,000 even though those are not representable in binary32, it has no way to distinguish them. Whether the original a is 99,999,997 or 100,000,002, the a given to the function is 100,000,000, so there is no way for it to know the original a and no way for it to decide which result to return.
Hmm,
there really is a problem of definition, because most multiples of a floating point won't be representable exactly, except maybe if the multiplier is a power of two.
Taking your example and Smalltalk notations (which does not really matter, I do it just because i can evaluate and verify the expressions I propose), the exact fractional representation of double precision 0.1 and 0.9 can be written:
(1+(1<<54)reciprocal) / 10 = 0.1.
(9+(1<<52)reciprocal) / 10 = 0.9.
<< is a bistshift, 1<<54 is 2 raised to the power of 54, and reciprocal is its inverse 2^-54.
As you can easily see:
(1+(1<<54)reciprocal) * 9 > (9+(1<<52)reciprocal)
That is, the exact multiple of 0.1 is greater than 0.9.
Thus, technically, the answer is 8*0.1 (which is exact in this lucky case)
(8+(1<<51)reciprocal) / 10 = 0.8.
What remainder does is to give the EXACT remainder of the division, so it is related to above computations somehow.
You can try it, you will find something like-2.77555...e-17, or exactly (1<<55) reciprocal. The negative part is indicating that nearest multiple is close to 0.9, but a bit below 0.9.
However, if your problem is to find the greatest <= 0.9, among the rounded to nearest multiple of 0.1, then your answer will be 0.9, because the rounded product is 0.1*9 = 0.9.
You have to first resolve that ambiguity. If ever, you are not interested in multiples of 0.1, but in multiples of (1/10), then it's again a different matter...

How does Excel successfully round floating point numbers even though they are imprecise?

For example, this blog says 0.005 is not exactly 0.005, but rounding that number yields the right result.
I have tried all kinds of rounding in C++ and it fails when rounding numbers to certain decimal places. For example, Round(x,y) rounds x to a multiple of y. So Round(37.785,0.01) should give you 37.79 and not 37.78.
I am reopening this question to ask the community for help. The problem is with the impreciseness of floating point numbers (37,785 is represented as 37.78499999999).
The question is how does Excel get around this problem?
The solution in this round() for float in C++ is incorrect for the above problem.
"Round(37.785,0.01) should give you 37.79 and not 37.78."
First off, there is no consensus that 37.79 rather than 37.78 is the "right" answer here? Tie-breakers are always a bit tough. While always rounding up in the case of a tie is a widely-used approach, it certainly is not the only approach.
Secondly, this isn't a tie-breaking situation. The numerical value in the IEEE binary64 floating point format is 37.784999999999997 (approximately). There are lots of ways to get a value of 37.784999999999997 besides a human typing in a value of 37.785 and happen to have that converted to that floating point representation. In most of these cases, the correct answer is 37.78 rather than 37.79.
Addendum
Consider the following Excel formulae:
=ROUND(37785/1000,2)
=ROUND(19810222/2^19+21474836/2^47,2)
Both cells will display the same value, 37.79. There is a legitimate argument over whether 37785/1000 should round to 37.78 or 37.79 with two place accuracy. How to deal with these corner cases is a bit arbitrary, and there is no consensus answer. There isn't even a consensus answer inside Microsoft: "the Round() function is not implemented in a consistent fashion among different Microsoft products for historical reasons." ( http://support.microsoft.com/kb/196652 ) Given an infinite precision machine, Microsoft's VBA would round 37.785 to 37.78 (banker's round) while Excel would yield 37.79 (symmetric arithmetic round).
There is no argument over the rounding of the latter formula. It is strictly less than 37.785, so it should round to 37.78, not 37.79. Yet Excel rounds it up. Why?
The reason has to do with how real numbers are represented in a computer. Microsoft, like many others, uses the IEEE 64 bit floating point format. The number 37785/1000 suffers from precision loss when expressed in this format. This precision loss does not occur with 19810222/2^19+21474836/2^47; it is an "exact number".
I intentionally constructed that exact number to have the same floating point representation as does the inexact 37785/1000. That Excel rounds this exact value up rather than down is the key to determining how Excel's ROUND() function works: It is a variant of symmetric arithmetic rounding. It rounds based on a comparison to the floating point representation of the corner case.
The algorithm in C++:
#include <cmath> // std::floor
// Compute 10 to some positive integral power.
// Dealing with overflow (exponent > 308) is an exercise left to the reader.
double pow10 (unsigned int exponent) {
double result = 1.0;
double base = 10.0;
while (exponent > 0) {
if ((exponent & 1) != 0) result *= base;
exponent >>= 1;
base *= base;
}
return result;
}
// Round the same way Excel does.
// Dealing with nonsense such as nplaces=400 is an exercise left to the reader.
double excel_round (double x, int nplaces) {
bool is_neg = false;
// Excel uses symmetric arithmetic round: Round away from zero.
// The algorithm will be easier if we only deal with positive numbers.
if (x < 0.0) {
is_neg = true;
x = -x;
}
// Construct the nearest rounded values and the nasty corner case.
// Note: We really do not want an optimizing compiler to put the corner
// case in an extended double precision register. Hence the volatile.
double round_down, round_up;
volatile double corner_case;
if (nplaces < 0) {
double scale = pow10 (-nplaces);
round_down = std::floor (x * scale);
corner_case = (round_down + 0.5) / scale;
round_up = (round_down + 1.0) / scale;
round_down /= scale;
}
else {
double scale = pow10 (nplaces);
round_down = std::floor (x / scale);
corner_case = (round_down + 0.5) * scale;
round_up = (round_down + 1.0) * scale;
round_down *= scale;
}
// Round by comparing to the corner case.
x = (x < corner_case) ? round_down : round_up;
// Correct the sign if needed.
if (is_neg) x = -x;
return x;
}
For very accurate arbitrary precision and rounding of floating point numbers to a fixed set of decimal places, you should take a look at a math library like GNU MPFR. While it's a C-library, the web-page I posted also links to a couple different C++ bindings if you want to avoid using C.
You may also want to read a paper entitled "What every computer scientist should know about floating point arithmetic" by David Goldberg at the Xerox Palo Alto Research Center. It's an excellent article demonstrating the underlying process that allows floating point numbers to be approximated in a computer that represents everything in binary data, and how rounding errors and other problems can creep up in FPU-based floating point math.
I don't know how Excel does it, but printing floating point numbers nicely is a hard problem: http://www.serpentine.com/blog/2011/06/29/here-be-dragons-advances-in-problems-you-didnt-even-know-you-had/
So your actual question seems to be, how to get correctly rounded floating point -> string conversions. By googling for those terms you'll get a bunch of articles, but if you're interested in something to use, most platforms provide reasonably competent implementations of sprintf()/snprintf(). So just use those, and if you find bugs, file a report to the vendor.
A function that takes a floating point number as argument and returns another floating point number, rounded exactly to a given number of decimal digits cannot be written, because there are many numbers with a finite decimal representation that have an infinite binary representation; one of the simplest examples is 0.1 .
To achieve what you want you must accept to use a different type as a result of your rounding function. If your immediate need is printing the number you can use a string and a formatting function: the problem becomes how to obtain exactly the formatting you expect. Otherwise if you need to store this number in order to perform exact calculations on it, for instance if you are doing accounting, you need a library that's capable of representing decimal numbers exactly. In this case the most common approach is to use a scaled representation: an integer for the value together with the number of decimal digits. Dividing the value by ten raised to the scale gives you the original number.
If any of these approaches is suitable, I'll try and expand my answer with practical suggestions.
Excel rounds numbers like this "correctly" by doing WORK. They started in 1985, with a fairly "normal" set of floating-point routines, and added some scaled-integer fake floating point, and they've been tuning those things and adding special cases ever since. The app DID used to have most of the same "obvious" bugs that everybody else did, it's just that it mostly had them a long time ago. I filed a couple myself, back when I was doing tech support for them in the early 90s.
I believe the following C# code rounds numbers as they are rounded in Excel. To exactly replicate the behavior in C++ you might need to use a special decimal type.
In plain English, the double-precision number is converted to a decimal and then rounded to fifteen significant digits (not to be confused with fifteen decimal places). The result is rounded a second time to the specified number of decimal places.
That might seem weird, but what you have to understand is that Excel always displays numbers that are rounded to 15 significant figures. If the ROUND() function weren't using that display value as a starting point, and used the internal double representation instead, then there would be cases where ROUND(A1,N) did not seem to correspond to the actual value in A1. That would be very confusing to a non-technical user.
The double which is closest to 37.785 has an exact decimal value of 37.784999999999996589394868351519107818603515625. (Any double can be represented precisely by a finite base ten decimal because one quarter, one eighth, one sixteenth, and so forth all have finite decimal expansions.) If that number were rounded directly to two decimal places, there would be no tie to break and the result would be 37.78. If you round to 15 significant figures first you get 37.7850000000000. If this is further rounded to two decimal places, then you get 37.79, so there is no real mystery after all.
// Convert to a floating decimal point number, round to fifteen
// significant digits, and then round to the number of places
// indicated.
static decimal SmartRoundDouble(double input, int places)
{
int numLeadingDigits = (int)Math.Log10(Math.Abs(input)) + 1;
decimal inputDec = GetAccurateDecimal(input);
inputDec = MoveDecimalPointRight(inputDec, -numLeadingDigits);
decimal round1 = Math.Round(inputDec, 15);
round1 = MoveDecimalPointRight(round1, numLeadingDigits);
decimal round2 = Math.Round(round1, places, MidpointRounding.AwayFromZero);
return round2;
}
static decimal MoveDecimalPointRight(decimal d, int n)
{
if (n > 0)
for (int i = 0; i < n; i++)
d *= 10.0m;
else
for (int i = 0; i > n; i--)
d /= 10.0m;
return d;
}
// The constructor for decimal that accepts a double does
// some rounding by default. This gets a more exact number.
static decimal GetAccurateDecimal(double r)
{
string accurateStr = r.ToString("G17", CultureInfo.InvariantCulture);
return Decimal.Parse(accurateStr, CultureInfo.InvariantCulture);
}
What you NEED is this :
double f = 22.0/7.0;
cout.setf(ios::fixed, ios::floatfield);
cout.precision(6);
cout<<f<<endl;
How it can be implemented (just a overview for rounding last digit)
:
long getRoundedPrec(double d, double precision = 9)
{
precision = (int)precision;
stringstream s;
long l = (d - ((double)((int)d)))* pow(10.0,precision+1);
int lastDigit = (l-((l/10)*10));
if( lastDigit >= 5){
l = l/10 +1;
}
return l;
}
Just as base-10 numbers must be rounded as they are converted to base-2, it is possible to round a number as it is converted from base-2 to base-10. Once the number has a base-10 representation it can be rounded again in a straightforward manner by looking at the digit to the right of the one you wish to round.
While there's nothing wrong with the above assertion, there's a much more pragmatic solution. The problem is that the binary representation tries to get as close as possible to the decimal number, even if that binary is less than the decimal. The amount of error is within [-0.5,0.5] least significant bits (LSB) of the true value. For rounding purposes you'd rather it be within [0,1] LSB so that the error is always positive, but that's not possible without changing all the rules of floating point math.
The one thing you can do is add 1 LSB to the value, so the error is within [0.5,1.5] LSB of the true value. This is less accurate overall, but only by a very tiny amount; when the value is rounded for representation as a decimal number it is much more likely to be rounded to a proper decimal number because the error is always positive.
To add 1 LSB to the value before rounding it, see the answers to this question. For example in Visual Studio C++ 2010 the procedure would be:
Round(_nextafter(37.785,37.785*1.1),0.01);
There are many ways to optimize the result of a floating-point value using statistical, numerical... algorithms
The easiest one is probably searching for repetitive 9s or 0s in the range of precision. If there are any, maybe those 9s are redundant, just round them up. But this may not work in many cases. Here's an example for a float with 6 digits of precision:
2.67899999 β†’ 2.679
12.3499999 β†’ 12.35
1.20000001 β†’ 1.2
Excel always limits the input range to 15 digits and rounds the output to maximum 15 digits so this might be one of the way Excel uses
Or you can include the precision along with the number. After each step, adjust the accuracy depend on the precision of operands. For example
1.113 β†’ 3 decimal digits
6.15634 β†’ 5 decimal digits
Since both number are inside the double's 16-17 digits precision range, their sum will be accurate to the larger of them, which is 5 digits. Similarly, 3+5 < 16, so their product will be precise to 8 decimal numbers
1.113 + 6.15634 = 7.26934 β†’ 5 decimal digits
1.113 * 6.15634 = 6.85200642 β†’ 8 decimal digits
But 4.1341677841 * 2.251457145 will only take double's accuracy because the real result exceed double's precision
Another efficient algorithm is Grisu but I haven't had an opportunity to try.
In 2010, Florian Loitsch published a wonderful paper in PLDI, "Printing floating-point numbers quickly and accurately with integers", which represents the biggest step in this field in 20 years: he mostly figured out how to use machine integers to perform accurate rendering! Why do I say "mostly"? Because although Loitsch's "Grisu3" algorithm is very fast, it gives up on about 0.5% of numbers, in which case you have to fall back to Dragon4 or a derivative
Here be dragons: advances in problems you didn’t even know you had
In fact I think Excel must combine many different methods to achieve the best result of all
Example When a Value Reaches Zero
In Excel 95 or earlier, enter the following into a new workbook:
A1: =1.333+1.225-1.333-1.225
Right-click cell A1, and then click Format Cells. On the Number tab, click Scientific under Category. Set the Decimal places to 15.
Rather than displaying 0, Excel 95 displays -2.22044604925031E-16.
Excel 97, however, introduced an optimization that attempts to correct for this problem. Should an addition or subtraction operation result in a value at or very close to zero, Excel 97 and later will compensate for any error introduced as a result of converting an operand to and from binary. The example above when performed in Excel 97 and later correctly displays 0 or 0.000000000000000E+00 in scientific notation.
Floating-point arithmetic may give inaccurate results in Excel
As mjfgates says, Excel does hard work to get this "right". The first thing to do when you try to reimplement this, is define what you mean by "right". Obvious solutions:
implement rational arithmetic
Slow but reliable.
implement a bunch of heuristics
Fast but tricky to get right (think "years of bug reports").
It really depends on your application.
Most decimal fractions can't be accurately represented in binary.
double x = 0.0;
for (int i = 1; i <= 10; i++)
{
x += 0.1;
}
// x should now be 1.0, right?
//
// it isn't. Test it and see.
One solution is to use BCD. It's old. But, it's also tried and true. We have a lot of other old ideas that we use every day (like using a 0 to represent nothing...).
Another technique uses scaling upon input/output. This has the advantage of nearly all math being integer math.

Detecting precision loss when converting from double to float

I am writing a piece of code in which i have to convert from double to float values. I am using boost::numeric_cast to do this conversion which will alert me of any overflow/underflow. However i am also interested in knowing if that conversion resulted in some precision loss or not.
For example
double source = 1988.1012;
float dest = numeric_cast<float>(source);
Produces dest which has value 1988.1
Is there any way available in which i can detect this kind of precision loss/rounding
You could cast the float back to a double and compare this double to the original - that should give you a fair indication as to whether there was a loss of precision.
float dest = numeric_cast<float>(source);
double residual = source - numeric_cast<double>(dest);
Hence, residual contains the "loss" you're looking for.
Look at these articles for single precision and double precision floats. First of all, floats have 8 bits for the exponent vs. 11 for a double. So anything bigger than 10^127 or smaller than 10^-126 in magnitude is going to be the overflow as you mentioned. For the float, you have 23 bits for the actual digits of the number, vs 52 bits for the double. So obviously, you have a lot more digits of precision for the double than float.
Say you have a number like: 1.1123. This number may not actually be encoded as 1.1123 because the digits in a floating point number are used to actually add up as fractions. For example, if your bits in the mantissa were 11001, then the value would be formed by 1 (implicit) + 1 * 1/2 + 1 * 1/4 + 0 * 1/8 + 0 * 1/16 + 1 * 1/32 + 0 * (64 + 128 + ...). So the exact value cannot be encoded unless you can add up these fractions in such a way that it's the exact number. This is rare. Therefore, there will almost always be a precision loss.
You're going to have a certain level of precision loss, as per Dave's answer. If, however, you want to focus on quantifying it and raising an exception when it exceeds a certain number, you will have to open up the floating point number itself and parse out the mantissa & exponent, then do some analysis to determine if you've exceeded your tolerance.
But, the good news, its usually the standard IEEE floating-point float. :-)