machine precision - c++

I wonder if there is something like eps to represent the value of machine precision in C++? Can I use it as the smallest positive number that a double can represent? Is it possible to use 1.0/eps as the max positive number that a double can represent? Where can I find eps in both C++ and C standard libraries?
Thanks and regards!
UPDATE:
For my purpose, I would like to compute a weight as reciprocal of a distance for something like inverse distance weighting interpolation (http://en.wikipedia.org/wiki/Inverse_distance_weighting).
double wgt = 0, wgt_tmp, result = 0;
for (int i = 0; i < num; i++)
{
wgt_tmp = 1.0/dist[i];
wgt += wgt_tmp;
result += wgt_tmp * values[i];
}
result /= wgt;
However, a distance can be 0, and I need the weight to stay usable in the computation. If exactly one distance dist[i] is 0, I would like its corresponding value values[i] to be dominant. If several distances are 0, I would like their values to contribute equally to the result. Any idea how to implement it?

Using #include <limits> you have
Small positive value = std::numeric_limits<float>::denorm_min()
Largest positive value = std::numeric_limits<float>::max()
Obviously this applies to other types as well.
See numeric_limits
And no, the inverse of the smallest positive value does not equal the largest.
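As a rough illustration of the point above for double (the type the question asks about), the following sketch collects the relevant limits and checks whether a reciprocal stays finite. The names DoubleLimits, double_limits and reciprocal_is_finite are made up for this example, and it assumes IEEE 754 doubles with subnormals enabled.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Summary of the double limits discussed above (hypothetical helper type).
struct DoubleLimits {
    double epsilon;     // gap between 1.0 and the next representable double
    double min_normal;  // smallest positive normalized value
    double denorm_min;  // smallest positive subnormal value
    double max;         // largest finite value
};

DoubleLimits double_limits() {
    using L = std::numeric_limits<double>;
    static_assert(L::is_iec559, "sketch assumes IEEE 754 doubles");
    return {L::epsilon(), L::min(), L::denorm_min(), L::max()};
}

// True when 1/x is still a finite double.
bool reciprocal_is_finite(double x) {
    assert(x != 0.0);
    return std::isfinite(1.0 / x);
}
```

With these, reciprocal_is_finite(double_limits().denorm_min) should come out false (the reciprocal overflows), while 1.0 / epsilon is only about 4.5e15, nowhere near max. So eps is neither the smallest positive double nor the reciprocal of the largest.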

Just looking for numeric limits information?
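The link shows how to find the epsilon, denormalized min, etc., using the C++ Standard Library. The C Standard Library provides the equivalents as macros in <float.h> (<cfloat> in C++): DBL_EPSILON, DBL_MIN, DBL_MAX and, since C11, DBL_TRUE_MIN for the smallest denormalized value, with FLT_ and LDBL_ variants for float and long double. You can also compute the epsilon yourself (the Wikipedia article on "machine epsilon" gives an example)...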
As for the algorithm, can't help you there, and this wasn't part of your original question, sorry.

This depends entirely on the precision you need from your numbers. The maximum value of a double is very large, but at that magnitude the absolute rounding error is enormous. If you need an absolute precision of 1e-3, for instance, you need at least 10 bits after the binary point, so the binary exponent must not exceed the number of mantissa bits minus 10. For a double that is 52 - 10 = 42, which leaves you with a maximum of about 4e12 (2^42) and a corresponding minimum of about 2.5e-13.
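For the inverse distance weighting part of the question, one possible approach avoids eps altogether: treat a zero distance as an exact hit and average the values of all exact hits. This is only a sketch of that idea, not an established recipe; the function name idw and the use of std::vector are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inverse distance weighting that treats zero distances as exact hits:
// if any dist[i] == 0, return the plain average of the matching values.
double idw(const std::vector<double>& dist, const std::vector<double>& values) {
    assert(dist.size() == values.size() && !dist.empty());
    double exact_sum = 0.0;          // sum of values whose distance is 0
    std::size_t exact_count = 0;     // how many distances are 0
    double wgt = 0.0, result = 0.0;  // usual weighted accumulation
    for (std::size_t i = 0; i < dist.size(); ++i) {
        if (dist[i] == 0.0) {
            exact_sum += values[i];
            ++exact_count;
        } else {
            double w = 1.0 / dist[i];
            wgt += w;
            result += w * values[i];
        }
    }
    if (exact_count > 0) return exact_sum / exact_count;
    return result / wgt;
}
```

A single zero distance then makes its value fully dominant, and several zero distances contribute equally, which is the behavior the question asks for. Whether very small nonzero distances should be treated the same way is a separate design decision.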

Related

Why does adding 1 to numeric_limits<float>::min() return 1?

How come subtracting 1 from float max returns a sensible value, but adding 1 to float min returns 1?
I thought that if you added or subtracted a value smaller than the epsilon for that particular magnitude, then nothing would happen and there would be no increase or decrease.
Here is the code I compiled with g++ with no flags and ran on x86_64.
#include <limits>
#include <iostream>
int main() {
float min = std::numeric_limits<float>::min() + 1;
float max = std::numeric_limits<float>::max() - 1;
std::cout << min << std::endl << max << std::endl;
return 0;
}
Outputs this:
1
3.40282e+38
I would expect it to output this:
-3.40282e+38
3.40282e+38
std::numeric_limits<float>::min() returns the smallest normalized positive value. To get the value that has no value lower than it, use std::numeric_limits<float>::lowest().
https://en.cppreference.com/w/cpp/types/numeric_limits/min
min is the smallest-magnitude positive normalized float, a very tiny positive number (about 1.17549e-38), not a negative number with large magnitude. Notice that the - is in the exponent, and this is scientific notation. e-38 means the value is scaled by 10^-38, so there are 37 zeros after the decimal point before the first nonzero digit. Try it out on https://www.h-schmidt.net/FloatConverter/IEEE754.html to play with the bits in a binary float.
std::numeric_limits<float>::min() is the minimum magnitude normalized float, not -max. CppReference even has a note about this possibly being surprising.
Do you know why that was picked to be the value for min() rather than the lowest negative value? Seems to be an outlier with regards to all the other types.
Some of the sophistication in numeric_limits<T> like lowest and denorm_min is new in C++11. The choice of what to define mostly followed C. Historical C valued economy and didn't define a lot of different names. (Smaller is better on ancient computers, and also less stuff in the global namespace which is all C had access to.)
Float types are normally (see footnote 1) symmetric around 0 (sign/magnitude representation), so C didn't have a separate named constant for the most-negative float / double / long double. Just FLT_MAX and FLT_MIN CPP macros. C doesn't have templates, so you know when you're writing FP code and can use a - on the appropriate constant if necessary.
If you're only going to have a few named constants, the three most interesting ones are:
FLT_EPSILON tells you about the available precision (mantissa bits): nextafter(1.0, +INF) - 1.0
FLT_MIN / FLT_MAX min (normalized) and max magnitudes of finite floats. This depends mostly on how many exponent bits a float has.
They're not quite symmetric around 1.0 for 2 reasons: all-ones mantissa in FLT_MAX, and gradual underflow (subnormals) taking up the lowest exponent-field (0 with bias), but FLT_MIN ignoring subnormals. FLT_MIN * FLT_MAX is about 3.99999976 for IEEE754 binary32 float. (You normally want to avoid subnormals for performance reasons, and so you have room for gradual underflow, so it makes sense that FLT_MIN isn't denorm_min)
(Fun fact: 0.0 is a special case of a subnormal: exponent field = 0 implying a mantissa of 0.xxx instead of 1.xxx).
Footnote 1: CppReference points out that C++11 std::numeric_limits<T>::lowest() could be different from -max for 3rd-party FP types, but isn't for standard C++ FP types.
lowest is what you wanted: the most-negative finite value. It's consistent across integer and FP types as being the most-negative value, so for example you could use it as an initializer for a templated search loop that uses std::min to find the lowest value in an array.
C++11 also introduced denorm_min, the minimum positive subnormal aka denormal value for FP types. In IEEE754, the object representation has all bits 0 except for a 1 in the low bit of the mantissa.
The float result for 1.0 + 1.17549e-38 (after rounding to the nearest float) is exactly 1.0. min is far smaller than std::numeric_limits<float>::epsilon() (about 1.19e-07), so the entire change is lost to rounding error when added to 1.0.
So even if you did print the float with full precision (or as a hex float), it would be 1.0. But you're just printing with the default formatting for cout which rounds to some limited precision, like 6 decimal digits. https://en.cppreference.com/w/cpp/io/manip/setprecision
(An earlier version of the question included the numeric value of min ~= 1.17549e-38; this answer started out addressing that mixup and I haven't bothered to fully rewrite those parts).
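The relationships described in this answer can be checked with a few lines. This is a minimal sketch assuming IEEE 754 binary32 float; the alias FL and the two helper functions are names invented for the example.

```cpp
#include <cassert>
#include <limits>

using FL = std::numeric_limits<float>;
static_assert(FL::is_iec559, "sketch assumes IEEE 754 binary32 float");

// Adding a value far below epsilon to 1.0f is lost to rounding.
bool min_vanishes_next_to_one() {
    float sum = 1.0f + FL::min();
    return sum == 1.0f;
}

// Product of the smallest normalized and the largest finite float.
float min_times_max() {
    return FL::min() * FL::max();
}
```

FL::lowest() is the value the question was expecting from min(): it equals -FL::max(). min_times_max() lands just below 4, matching the asymmetry described above.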

Unwanted division operator behavior, what should I do?

Problem description
During my fluid simulation, the physical time is marching as 0, 0.001, 0.002, ..., 4.598, 4.599, 4.6, 4.601, 4.602, .... Now I want to choose time = 0.1, 0.2, ..., 4.5, 4.6, ... from this time series and then do the further analysis. So I wrote the following code to judge if the fractpart hits zero.
But I was surprised to find that the following two ways of dividing give two different results. What should I do?
double param, fractpart, intpart;
double org = 4.6;
double ddd = 0.1;
// This is the correct one I need. I got intpart=46 and fractpart=0
// param = org*(1/ddd);
// This is not what I want. I got intpart=45 and fractpart=1
param = org/ddd;
fractpart = modf(param , &intpart);
Info<< "\n\nfractpart\t=\t"
<< fractpart
<< "\nAnd intpart\t=\t"
<< intpart
<< endl;
Why does it happen in this way?
And if you guys tolerate me a little bit, can I shout loudly: "Could C++ committee do something about this? Because this is confusing." :)
What is the best way to get a correct remainder to avoid the cut-off error effect? Is fmod a better solution? Thanks
In response to David Schwartz's answer:
double aTmp = 1;
double bTmp = 2;
double cTmp = 3;
double AAA = bTmp/cTmp;
double BBB = bTmp*(aTmp/cTmp);
Info<< "\n2/3\t=\t"
<< AAA
<< "\n2*(1/3)\t=\t"
<< BBB
<< endl;
And I got the same output for both:
2/3 = 0.666667
2*(1/3) = 0.666667
Floating point values cannot exactly represent every possible number, so your numbers are being approximated. This results in different results when used in calculations.
If you need to compare floating point numbers, you should always use a small epsilon value rather than testing for equality. In your case I would round to the nearest integer (not round down), subtract that from the original value, and compare the abs() of the result against an epsilon.
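The round-then-compare approach described above could look roughly like this. It is a sketch, not a drop-in solution: the function name is_near_multiple and the choice of tolerance are assumptions.

```cpp
#include <cassert>
#include <cmath>

// True when t lies within tol of an integer multiple of step.
bool is_near_multiple(double t, double step, double tol) {
    assert(step > 0.0 && tol >= 0.0);
    double q = t / step;              // may land just below or above an integer
    double nearest = std::round(q);   // round to nearest, not down
    return std::fabs(q - nearest) * step <= tol;
}
```

For the time series in the question, is_near_multiple(time, 0.1, 1e-6) accepts 4.6 and rejects 4.599 and 4.601. Any tolerance well below the 0.001 time step and well above the rounding error would serve.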
If the question is why the results differ, the simple answer is that they are different calculations. For a longer explanation, here are the actual representations of the numbers involved:
org: 4.5999999999999996 = 0x12666666666666 * 2^-50
ddd: 0.10000000000000001 = 0x1999999999999a * 2^-56
1/ddd: 10 = 0x14000000000000 * 2^-49
org * (1/ddd): 46 = 0x17000000000000 * 2^-47
org / ddd: 45.999999999999993 = 0x16ffffffffffff * 2^-47
You will see that neither input value is exactly represented in a double, each having been rounded up or down to the nearest value. org has been rounded down, because the next bit in the sequence would be 0. ddd has been rounded up, because the next bit in that sequence would be a 1.
Because of this, when mathematical operations are performed the rounding can either cancel, or accumulate, depending on the operation and how the original numbers have been rounded.
In this case, 1/0.1 happens to round neatly back to exactly 10.
Multiplying org by 10 happens to round up.
Dividing org by ddd happens to round down (I say 'happens to', but you're dividing a rounded-down number by a rounded-up number, so it's natural that the result is less).
Different inputs will round differently.
It's only a single bit of error, which can be easily ignored with even a tiny epsilon.
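To reproduce the two results side by side, a small sketch along these lines can be used. It assumes IEEE 754 doubles in the default round-to-nearest mode and no value-changing optimizations such as -ffast-math; the function names are invented for the example.

```cpp
#include <cassert>
#include <cmath>

// Integer part of org / ddd, as extracted by modf.
double int_part_of_quotient(double org, double ddd) {
    double intpart;
    std::modf(org / ddd, &intpart);
    return intpart;
}

// Integer part of org * (1 / ddd), as extracted by modf.
double int_part_of_reciprocal_product(double org, double ddd) {
    double intpart;
    std::modf(org * (1.0 / ddd), &intpart);
    return intpart;
}
```

Called with 4.6 and 0.1, the first function returns 45 and the second 46, matching the table above. Printing org / ddd with std::setprecision(17) shows the quotient sitting one bit below 46.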
If I understand your question correctly, it's this: Why, with limited-precision arithmetic, is X/Y not the same as X * (1/Y)?
And the reason is simple: Consider, for example, using six digits of decimal precision. While this is not what doubles actually do, the concept is precisely the same.
With six decimal digits, 1/3 is .333333. But 2/3 is .666667. So:
2 / 3 = .666667
2 * (1/3) = 2 * .333333 = .666666
That's just the nature of fixed-precision math. If you can't tolerate this behavior, don't use limited-precision types.
Hm, I'm not really sure what you want to achieve, but if you want to take a value and then refine it in steps of 1/1000, why not use integers instead of floats/doubles?
You would have a divisor, which is 1000, and integer values to iterate over, obtained by multiplying your original values by the divisor.
So you would get something like
double org = ... // comes from somewhere
int divisor = 1000;
// round to nearest (std::lround is in <cmath>); plain truncation could land one step low
int referenceValue = std::lround(org * divisor);
for (int step = referenceValue - 10; step < referenceValue + 10; ++step) {
// use (double) step / divisor to feed to your algorithm
}
You can't represent 4.6 precisely: http://www.binaryconvert.com/result_double.html?decimal=052046054
Use rounding before separating integer and fraction parts.
UPDATE
You may wish to use rational class from Boost library: http://www.boost.org/doc/libs/1_52_0/libs/rational/rational.html
CONCERNING YOUR TASK
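To find the required double, take precision into account. For example, to find 4.6, test for "closeness" to it:
double time;
...
double epsilon = 0.0005; // half the 0.001 time step, so the neighbouring steps do not match
if( std::fabs(time-4.6) <= epsilon ) { // std::fabs from <cmath>; a bare abs() may pick the int overload
// found!
}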

Can I trust a real-to-int conversion of the result of ceil()?

Suppose I have some code such as:
float a, b = ...; // both positive
int s1 = ceil(sqrt(a/b));
int s2 = ceil(sqrt(a/b)) + 0.1;
Is it ever possible that s1 != s2? My concern is when a/b is a perfect square. For example, perhaps a=100.0 and b=4.0, then the output of ceil should be 5.00000 but what if instead it is 4.99999?
Similar question: is there a chance that sqrt(100.0/4.0) evaluates to, say, 5.00001, and then ceil rounds it up to 6.00000?
I'd prefer to do this in integer math but the sqrt kinda screws that plan.
EDIT: suggestions on how to better implement this would be appreciated too! The a and b values are integer values, so actual code is more like: ceil(sqrt(float(a)/b))
EDIT: Based on levis501's answer, I think I will do this:
float a, b = ...; // both positive
int s = sqrt(a/b);
while (s*s*b < a) ++s;
Thank you all!
I don't think it's possible. Regardless of the value of sqrt(a/b), what it produces is some value N that we use as:
int s1 = ceil(N);
int s2 = ceil(N) + 0.1;
Since ceil always produces an integer value (albeit represented as a double), we will always have some value X, for which the first produces X.0 and the second X.1. Conversion to int will always truncate that .1, so both will result in X.
It might seem like there would be an exception if X was so large that X.1 overflowed the range of double. I don't see where this could be possible though. Except close to 0 (where overflow isn't a concern) the square root of a number will always be smaller than the input number. Therefore, before ceil(N)+0.1 could overflow, the a/b being used as an input in sqrt(a/b) would have to have overflowed already.
You may want to write an explicit function for your case. e.g.:
/* return the smallest positive integer whose square is at least x */
int isqrt(double x) {
int y1 = ceil(sqrt(x));
int y2 = y1 - 1;
if ((y2 * y2) >= x) return y2;
return y1;
}
This handles the odd case where sqrt(x) comes out slightly above the exact integer root because of rounding, so that ceil overshoots by one.
Equality of floating point numbers is indeed an issue, but IMHO not if we deal with integer numbers.
If you have the case of 100.0/4.0, it should evaluate to exactly 25.0, as 25.0 is exactly representable as a float, as opposed to e.g. 25.1.
Yes, it's entirely possible that s1 != s2. Why is that a problem, though?
It seems natural enough that s1 != (s1 + 0.1).
BTW, if you would prefer to have 5.00001 rounded to 5.00000 instead of 6.00000, use rint instead of ceil.
And to answer the actual question (in your comment) - you can use sqrt to get a starting point and then just find the correct square using integer arithmetic.
#include <cmath> // ceil, floor, sqrt
int min_dimension_greater_than(int items, int buckets)
{
double target = double(items) / buckets;
int min_square = ceil(target);
// sqrt only provides a starting guess; the loop corrects it in integer math
int dim = floor(sqrt(target));
int square = dim * dim;
while (square < min_square) {
dim += 1;
square = dim * dim;
}
return dim;
}
And yes, this can be improved a lot, it's just a quick sketch.
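Since the question's edit says a and b are really integers, the same idea can be written so that the final answer depends only on integer comparisons. This is a sketch under the assumption that s*s*b does not overflow long long; the name ceil_sqrt_ratio is made up.

```cpp
#include <cassert>
#include <cmath>

// Smallest s >= 0 with s*s*b >= a, i.e. ceil(sqrt(a/b)) for integers a >= 0, b > 0.
long long ceil_sqrt_ratio(long long a, long long b) {
    assert(a >= 0 && b > 0);
    // Floating point only supplies the first guess.
    long long s = static_cast<long long>(
        std::sqrt(static_cast<double>(a) / static_cast<double>(b)));
    while (s > 0 && (s - 1) * (s - 1) * b >= a) --s;  // guess was too high
    while (s * s * b < a) ++s;                        // guess was too low
    return s;
}
```

Because both correction loops use exact integer arithmetic, a slightly wrong sqrt result only costs an extra iteration and never changes the answer.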
s1 will always equal s2.
The C and C++ standards do not say much about the accuracy of math routines. Taken literally, it is impossible for the standard to be implemented, since the C standard says sqrt(x) returns the square root of x, but the square root of two cannot be exactly represented in floating point.
Implementing routines with good performance that always return a correctly rounded result (in round-to-nearest mode, this means the result is the representable floating-point number that is nearest to the exact result, with ties resolved in favor of a low zero bit) is a difficult research problem. Good math libraries target accuracy less than 1 ULP (so one of the two nearest representable numbers is returned), perhaps something slightly more than .5 ULP. (An ULP is the Unit of Least Precision, the value of the low bit given a particular value in the exponent field.) Some math libraries may be significantly worse than this. You would have to ask your vendor or check the documentation for more information.
So sqrt may be slightly off. If the exact square root is an integer (within the range in which integers are exactly representable in floating-point) and the library guarantees errors are less than 1 ULP, then the result of sqrt must be exactly correct, because any result other than the exact result is at least 1 ULP away.
Similarly, if the library guarantees errors are less than 1 ULP, then ceil must return the exact result, again because the exact result is representable and any other result would be at least 1 ULP away. Additionally, the nature of ceil is such that I would expect any reasonable math library to always return an integer, even if the rest of the library were not high quality.
As for overflow cases, if ceil(x) were beyond the range where all integers are exactly representable, then ceil(x)+.1 is closer to ceil(x) than it is to any other representable number, so the rounded result of adding .1 to ceil(x) should be ceil(x) in any system implementing the floating-point standard (IEEE 754). That is provided you are in the default rounding mode, which is round-to-nearest. It is possible to change the rounding mode to something like round-toward-infinity, which could cause ceil(x)+.1 to be an integer higher than ceil(x).

How does Excel successfully round floating point numbers even though they are imprecise?

For example, this blog says 0.005 is not exactly 0.005, but rounding that number yields the right result.
I have tried all kinds of rounding in C++ and it fails when rounding numbers to certain decimal places. For example, Round(x,y) rounds x to a multiple of y. So Round(37.785,0.01) should give you 37.79 and not 37.78.
I am reopening this question to ask the community for help. The problem is with the imprecision of floating point numbers (37.785 is represented as 37.78499999999).
The question is how does Excel get around this problem?
The solution in "round() for float in C++" is incorrect for the above problem.
"Round(37.785,0.01) should give you 37.79 and not 37.78."
First off, there is no consensus that 37.79 rather than 37.78 is the "right" answer here. Tie-breakers are always a bit tough. While always rounding up in the case of a tie is a widely-used approach, it certainly is not the only approach.
Secondly, this isn't a tie-breaking situation. The numerical value in the IEEE binary64 floating point format is 37.784999999999997 (approximately). There are lots of ways to get a value of 37.784999999999997 besides a human typing in 37.785 and having it converted to that floating point representation. In most of these cases, the correct answer is 37.78 rather than 37.79.
Addendum
Consider the following Excel formulae:
=ROUND(37785/1000,2)
=ROUND(19810222/2^19+21474836/2^47,2)
Both cells will display the same value, 37.79. There is a legitimate argument over whether 37785/1000 should round to 37.78 or 37.79 with two place accuracy. How to deal with these corner cases is a bit arbitrary, and there is no consensus answer. There isn't even a consensus answer inside Microsoft: "the Round() function is not implemented in a consistent fashion among different Microsoft products for historical reasons." ( http://support.microsoft.com/kb/196652 ) Given an infinite precision machine, Microsoft's VBA would round 37.785 to 37.78 (banker's round) while Excel would yield 37.79 (symmetric arithmetic round).
There is no argument over the rounding of the latter formula. It is strictly less than 37.785, so it should round to 37.78, not 37.79. Yet Excel rounds it up. Why?
The reason has to do with how real numbers are represented in a computer. Microsoft, like many others, uses the IEEE 64 bit floating point format. The number 37785/1000 suffers from precision loss when expressed in this format. This precision loss does not occur with 19810222/2^19+21474836/2^47; it is an "exact number".
I intentionally constructed that exact number to have the same floating point representation as does the inexact 37785/1000. That Excel rounds this exact value up rather than down is the key to determining how Excel's ROUND() function works: It is a variant of symmetric arithmetic rounding. It rounds based on a comparison to the floating point representation of the corner case.
The algorithm in C++:
#include <cmath> // std::floor
// Compute 10 to some positive integral power.
// Dealing with overflow (exponent > 308) is an exercise left to the reader.
double pow10 (unsigned int exponent) {
double result = 1.0;
double base = 10.0;
while (exponent > 0) {
if ((exponent & 1) != 0) result *= base;
exponent >>= 1;
base *= base;
}
return result;
}
// Round the same way Excel does.
// Dealing with nonsense such as nplaces=400 is an exercise left to the reader.
double excel_round (double x, int nplaces) {
bool is_neg = false;
// Excel uses symmetric arithmetic round: Round away from zero.
// The algorithm will be easier if we only deal with positive numbers.
if (x < 0.0) {
is_neg = true;
x = -x;
}
// Construct the nearest rounded values and the nasty corner case.
// Note: We really do not want an optimizing compiler to put the corner
// case in an extended double precision register. Hence the volatile.
double round_down, round_up;
volatile double corner_case;
if (nplaces < 0) {
// Rounding to tens, hundreds, ...: divide by the scale, then multiply back.
double scale = pow10 (-nplaces);
round_down = std::floor (x / scale);
corner_case = (round_down + 0.5) * scale;
round_up = (round_down + 1.0) * scale;
round_down *= scale;
}
else {
// Rounding to decimal places: multiply by the scale, then divide back.
double scale = pow10 (nplaces);
round_down = std::floor (x * scale);
corner_case = (round_down + 0.5) / scale;
round_up = (round_down + 1.0) / scale;
round_down /= scale;
}
// Round by comparing to the corner case.
x = (x < corner_case) ? round_down : round_up;
// Correct the sign if needed.
if (is_neg) x = -x;
return x;
}
For very accurate arbitrary precision and rounding of floating point numbers to a fixed set of decimal places, you should take a look at a math library like GNU MPFR. While it's a C-library, the web-page I posted also links to a couple different C++ bindings if you want to avoid using C.
You may also want to read a paper entitled "What every computer scientist should know about floating point arithmetic" by David Goldberg at the Xerox Palo Alto Research Center. It's an excellent article demonstrating the underlying process that allows floating point numbers to be approximated in a computer that represents everything in binary data, and how rounding errors and other problems can creep up in FPU-based floating point math.
I don't know how Excel does it, but printing floating point numbers nicely is a hard problem: http://www.serpentine.com/blog/2011/06/29/here-be-dragons-advances-in-problems-you-didnt-even-know-you-had/
So your actual question seems to be, how to get correctly rounded floating point -> string conversions. By googling for those terms you'll get a bunch of articles, but if you're interested in something to use, most platforms provide reasonably competent implementations of sprintf()/snprintf(). So just use those, and if you find bugs, file a report to the vendor.
A function that takes a floating point number as argument and returns another floating point number, rounded exactly to a given number of decimal digits cannot be written, because there are many numbers with a finite decimal representation that have an infinite binary representation; one of the simplest examples is 0.1 .
To achieve what you want you must accept to use a different type as a result of your rounding function. If your immediate need is printing the number you can use a string and a formatting function: the problem becomes how to obtain exactly the formatting you expect. Otherwise if you need to store this number in order to perform exact calculations on it, for instance if you are doing accounting, you need a library that's capable of representing decimal numbers exactly. In this case the most common approach is to use a scaled representation: an integer for the value together with the number of decimal digits. Dividing the value by ten raised to the scale gives you the original number.
If any of these approaches is suitable, I'll try and expand my answer with practical suggestions.
Excel rounds numbers like this "correctly" by doing WORK. They started in 1985, with a fairly "normal" set of floating-point routines, and added some scaled-integer fake floating point, and they've been tuning those things and adding special cases ever since. The app DID used to have most of the same "obvious" bugs that everybody else did, it's just that it mostly had them a long time ago. I filed a couple myself, back when I was doing tech support for them in the early 90s.
I believe the following C# code rounds numbers as they are rounded in Excel. To exactly replicate the behavior in C++ you might need to use a special decimal type.
In plain English, the double-precision number is converted to a decimal and then rounded to fifteen significant digits (not to be confused with fifteen decimal places). The result is rounded a second time to the specified number of decimal places.
That might seem weird, but what you have to understand is that Excel always displays numbers that are rounded to 15 significant figures. If the ROUND() function weren't using that display value as a starting point, and used the internal double representation instead, then there would be cases where ROUND(A1,N) did not seem to correspond to the actual value in A1. That would be very confusing to a non-technical user.
The double which is closest to 37.785 has an exact decimal value of 37.784999999999996589394868351519107818603515625. (Any double can be represented precisely by a finite base ten decimal because one quarter, one eighth, one sixteenth, and so forth all have finite decimal expansions.) If that number were rounded directly to two decimal places, there would be no tie to break and the result would be 37.78. If you round to 15 significant figures first you get 37.7850000000000. If this is further rounded to two decimal places, then you get 37.79, so there is no real mystery after all.
// Convert to a floating decimal point number, round to fifteen
// significant digits, and then round to the number of places
// indicated.
static decimal SmartRoundDouble(double input, int places)
{
int numLeadingDigits = (int)Math.Log10(Math.Abs(input)) + 1;
decimal inputDec = GetAccurateDecimal(input);
inputDec = MoveDecimalPointRight(inputDec, -numLeadingDigits);
decimal round1 = Math.Round(inputDec, 15);
round1 = MoveDecimalPointRight(round1, numLeadingDigits);
decimal round2 = Math.Round(round1, places, MidpointRounding.AwayFromZero);
return round2;
}
static decimal MoveDecimalPointRight(decimal d, int n)
{
if (n > 0)
for (int i = 0; i < n; i++)
d *= 10.0m;
else
for (int i = 0; i > n; i--)
d /= 10.0m;
return d;
}
// The constructor for decimal that accepts a double does
// some rounding by default. This gets a more exact number.
static decimal GetAccurateDecimal(double r)
{
string accurateStr = r.ToString("G17", CultureInfo.InvariantCulture);
return Decimal.Parse(accurateStr, CultureInfo.InvariantCulture);
}
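For C++, the two-step idea described in this answer (round to 15 significant digits, then round to the requested places) can be approximated without a decimal type by doing the second rounding on the decimal digits themselves. The following is only a sketch of that idea, not Excel's actual algorithm: the function name round_via_15_digits is hypothetical, and it assumes snprintf produces correctly rounded "%.14e" output.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: round x to `places` decimals by first rounding to
// 15 significant decimal digits, then rounding half away from zero.
double round_via_15_digits(double x, int places) {
    assert(places >= -22 && places <= 22);  // keeps the power of ten exact
    if (!std::isfinite(x) || x == 0.0) return x;

    // Step 1: 15 significant digits as text, e.g. "3.77850000000000e+01".
    char buf[40];
    std::snprintf(buf, sizeof buf, "%.14e", std::fabs(x));

    // Collect the 15 digits into an integer and read the decimal exponent.
    std::int64_t digits = 0;
    const char* p = buf;
    for (; *p != 'e'; ++p)
        if (*p >= '0' && *p <= '9') digits = digits * 10 + (*p - '0');
    int exp10 = std::atoi(p + 1);  // |x| is about digits * 10^(exp10 - 14)

    // Step 2: drop the digits to the right of the requested place.
    int drop = (14 - exp10) - places;
    if (drop <= 0) return x;                      // nothing to round away
    if (drop > 15) return std::copysign(0.0, x);  // far below the last place
    std::int64_t divisor = 1;
    for (int i = 0; i < drop; ++i) divisor *= 10;
    digits = (digits + divisor / 2) / divisor;    // half rounds away from zero

    // Rebuild the double as digits * 10^(-places).
    double scale = 1.0;
    for (int i = 0; i < std::abs(places); ++i) scale *= 10.0;
    double result = places >= 0 ? digits / scale : digits * scale;
    return std::copysign(result, x);
}
```

With this sketch, round_via_15_digits(37.785, 2) yields the double nearest 37.79, because the 15-digit text form of the stored value is 37.7850000000000 and the tie is then broken in decimal.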
What you NEED is this :
double f = 22.0/7.0;
cout.setf(ios::fixed, ios::floatfield);
cout.precision(6);
cout<<f<<endl;
How it can be implemented (just an overview of rounding the last digit):
long getRoundedPrec(double d, int precision = 9)
{
// Take the fractional part with one extra digit, then round on that digit.
long l = (d - ((double)((int)d)))* pow(10.0,precision+1);
int lastDigit = (l-((l/10)*10));
l = l/10; // drop the extra digit in both cases
if( lastDigit >= 5){
l = l + 1;
}
return l;
}
Just as base-10 numbers must be rounded as they are converted to base-2, it is possible to round a number as it is converted from base-2 to base-10. Once the number has a base-10 representation it can be rounded again in a straightforward manner by looking at the digit to the right of the one you wish to round.
While there's nothing wrong with the above assertion, there's a much more pragmatic solution. The problem is that the binary representation tries to get as close as possible to the decimal number, even if that binary is less than the decimal. The amount of error is within [-0.5,0.5] least significant bits (LSB) of the true value. For rounding purposes you'd rather it be within [0,1] LSB so that the error is always positive, but that's not possible without changing all the rules of floating point math.
The one thing you can do is add 1 LSB to the value, so the error is within [0.5,1.5] LSB of the true value. This is less accurate overall, but only by a very tiny amount; when the value is rounded for representation as a decimal number it is much more likely to be rounded to a proper decimal number because the error is always positive.
To add 1 LSB to the value before rounding it, see the answers to this question. For example in Visual Studio C++ 2010 the procedure would be:
Round(_nextafter(37.785,37.785*1.1),0.01);
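In standard C++11 the same nudge can be written with std::nextafter from <cmath>. This is a sketch of the technique described above, with the caveat the answer already mentions (it biases every value slightly away from zero); the name round_nudged is an assumption.

```cpp
#include <cassert>
#include <cmath>

// Round x to `places` decimals after nudging it one representable step away
// from zero, so a value stored just below a decimal tie rounds like the tie.
double round_nudged(double x, int places) {
    assert(places >= 0 && places <= 15);
    double scale = 1.0;
    for (int i = 0; i < places; ++i) scale *= 10.0;
    double nudged = std::nextafter(x, x < 0 ? -INFINITY : INFINITY);
    return std::round(nudged * scale) / scale;  // std::round: half away from zero
}
```

round_nudged(37.785, 2) gives the double nearest 37.79. Values that are genuinely just below a tie get rounded up as well, which is the accuracy trade-off described above.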
There are many ways to optimize the result of a floating-point value using statistical, numerical... algorithms
The easiest one is probably searching for repetitive 9s or 0s in the range of precision. If there are any, maybe those 9s are redundant, just round them up. But this may not work in many cases. Here's an example for a float with 6 digits of precision:
2.67899999 → 2.679
12.3499999 → 12.35
1.20000001 → 1.2
Excel always limits the input range to 15 digits and rounds the output to at most 15 digits, so this might be one of the methods Excel uses.
Or you can carry the precision along with the number. After each step, adjust the accuracy depending on the precision of the operands. For example
1.113 → 3 decimal digits
6.15634 → 5 decimal digits
Since both numbers are inside the double's 16-17 digit precision range, their sum will be accurate to the larger of the two precisions, which is 5 digits. Similarly, 3+5 < 16, so their product will be precise to 8 decimal digits
1.113 + 6.15634 = 7.26934 → 5 decimal digits
1.113 * 6.15634 = 6.85200642 → 8 decimal digits
But 4.1341677841 * 2.251457145 will only have double's accuracy, because the exact result exceeds double's precision
Another efficient algorithm is Grisu but I haven't had an opportunity to try.
In 2010, Florian Loitsch published a wonderful paper in PLDI, "Printing floating-point numbers quickly and accurately with integers", which represents the biggest step in this field in 20 years: he mostly figured out how to use machine integers to perform accurate rendering! Why do I say "mostly"? Because although Loitsch's "Grisu3" algorithm is very fast, it gives up on about 0.5% of numbers, in which case you have to fall back to Dragon4 or a derivative
Here be dragons: advances in problems you didn’t even know you had
In fact I think Excel must combine many different methods to achieve the best result of all
Example When a Value Reaches Zero
In Excel 95 or earlier, enter the following into a new workbook:
A1: =1.333+1.225-1.333-1.225
Right-click cell A1, and then click Format Cells. On the Number tab, click Scientific under Category. Set the Decimal places to 15.
Rather than displaying 0, Excel 95 displays -2.22044604925031E-16.
Excel 97, however, introduced an optimization that attempts to correct for this problem. Should an addition or subtraction operation result in a value at or very close to zero, Excel 97 and later will compensate for any error introduced as a result of converting an operand to and from binary. The example above when performed in Excel 97 and later correctly displays 0 or 0.000000000000000E+00 in scientific notation.
Floating-point arithmetic may give inaccurate results in Excel
As mjfgates says, Excel does hard work to get this "right". The first thing to do when you try to reimplement this is to define what you mean by "right". Obvious solutions:
Implement rational arithmetic: slow but reliable.
Implement a bunch of heuristics: fast but tricky to get right (think "years of bug reports").
It really depends on your application.
Most decimal fractions can't be accurately represented in binary.
double x = 0.0;
for (int i = 1; i <= 10; i++)
{
x += 0.1;
}
// x should now be 1.0, right?
//
// it isn't. Test it and see.
One solution is to use BCD. It's old. But, it's also tried and true. We have a lot of other old ideas that we use every day (like using a 0 to represent nothing...).
Another technique uses scaling upon input/output. This has the advantage of nearly all math being integer math.
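The loop above and the scaling technique can be compared directly. This is a minimal sketch; the function names are invented, and "one unit is one tenth" is just the scale chosen for the example.

```cpp
#include <cassert>
#include <cstdint>

// Ten additions of 0.1 in binary floating point.
double sum_tenths_double() {
    double x = 0.0;
    for (int i = 0; i < 10; ++i) x += 0.1;
    return x;
}

// The same sum in scaled integers: each unit is one tenth.
std::int64_t sum_tenths_scaled() {
    std::int64_t tenths = 0;
    for (int i = 0; i < 10; ++i) tenths += 1;  // add 0.1, stored as 1 tenth
    return tenths;                             // 10 tenths is exactly 1.0
}
```

The double version ends up slightly below 1.0 on IEEE 754 systems, while the scaled version is exact; the value only needs converting back (tenths / 10.0) for output.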

array, I/O file and standard deviation (c++)

double s_deviation(double data[],int cnt, double mean)
{
int i;
double sum= 0;
double sdeviation;
double x;
//x = mean(billy,a_size);
for(i=0; i<cnt; i++)
{
sum += ((data[i]) - (mean));
}
sdeviation = sqrt(sum/((double)cnt));
return sdeviation;
}
When I cout the result from this function, it gave me NaN.
I tested the value of (mean) and data[i] using
return data[i] and return mean
they are valid.
When I replaced mean with an actual number, the operation returned a finite number.
But with mean as a variable, it produced NaN.
I can't see anything wrong with my code at the moment.
Again, I am sure mean, data are getting the right number based on those tests.
Thank you
I'd guess that the value of mean is large relative to your data, so that some of the ((data[i]) - (mean)) values are negative, and so overall sum ends up being negative.
Then, when you try to compute sqrt(sum/((double)cnt)), you are taking the square root of a negative number. The mathematical result is a complex number, which is not representable by a double, so sqrt returns NaN.
However, the underlying problem is that your standard deviation algorithm is incorrect. You are supposed to sum the squares of the distances from the mean, not the distances themselves. Aside from making your computation correct, this also guarantees that sum is never negative, and so you can always get a real-valued square root.
I think You should have
for(i=0; i<cnt; i++)
{
sum += ((data[i]) - (mean)) * ((data[i]) - (mean));
}
In the version You have now sum should be 0, but due to some rounding errors it's most probably a small negative value.
You're taking the sqrt of a negative number (most likely), and that's because you're using the wrong formula for standard deviation.
Standard dev is not the sqrt of the average of (val-mean), it's the sqrt of the average SQUARE of (val-mean).
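Putting the correction from these answers into the original function gives something like the following. It is a sketch of the population standard deviation (dividing by cnt, as the question's code does); dividing by cnt - 1 instead would give the sample standard deviation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Population standard deviation: sqrt of the mean squared deviation.
double s_deviation(const double data[], std::size_t cnt, double mean) {
    assert(cnt > 0);
    double sum = 0.0;
    for (std::size_t i = 0; i < cnt; ++i) {
        double d = data[i] - mean;
        sum += d * d;  // squares are never negative, so sqrt cannot see a negative sum
    }
    return std::sqrt(sum / static_cast<double>(cnt));
}
```

For the data 2, 4, 4, 4, 5, 5, 7, 9 with mean 5, the squared deviations sum to 32, so the function returns sqrt(32 / 8) = 2.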