double to scientific notation conversion - precision error - c++

I'm writing a piece of code to convert double values to scientific notations upto a precision of 15 in C++. I know I can use standard libraries like sprintf with %e option to do this. But I would need to come out with my own solution.
I'm trying something like this.
double norm = 68600000;
if (norm)
{
while (norm >= 10.0)
{
norm /= 10.0;
exp++;
}
while (norm < 1.0)
{
norm *= 10.0;
exp--;
}
}
The result I get is
norm = 6.8599999999999994316;
exp = 7
The reason for loosing this precision I clarified from this question
Now I try to round the value to the precision of 15, which would result in
6.859 999 999 999 999
(its evident that since the 16th decimal point is less than 5 we get this result)
Expected answer: norm = 6.860 000 000 000 000, exp = 7
My question is, is there any better way for double to scientific notation conversion to the precision of 15(without using the standard libs), so that I would get exactly 6.86 when I round. If you have noticed the problem here is not with the rounding mechanism, but with the double to scientific notation conversion due to the precision loss related to machine epsilon

You could declare norm as a long double for some more precision. long double wiki Although there are some compiler specific issues to be aware of. Some compilers make long double synonymous with double.
Another way to go about solving this precision problem is to work with numbers in the form of strings and implement custom arithmetic operations for strings that would not be subject to machine epsilon.
For example:
int getEXP(string norm){ return norm.length() - 1; };
string norm = "68600000";
int exp = getEXP(norm); // returns 7
The next step would be to implement functions to insert a decimal character into the appropriate place in the norm string, and add whatever level of precision you'd like. No machine epsilon to worry about.

Related

How to round a floating point type to two decimals or more in C++? [duplicate]

How can I round a float value (such as 37.777779) to two decimal places (37.78) in C?
If you just want to round the number for output purposes, then the "%.2f" format string is indeed the correct answer. However, if you actually want to round the floating point value for further computation, something like the following works:
#include <math.h>
float val = 37.777779;
float rounded_down = floorf(val * 100) / 100; /* Result: 37.77 */
float nearest = roundf(val * 100) / 100; /* Result: 37.78 */
float rounded_up = ceilf(val * 100) / 100; /* Result: 37.78 */
Notice that there are three different rounding rules you might want to choose: round down (ie, truncate after two decimal places), rounded to nearest, and round up. Usually, you want round to nearest.
As several others have pointed out, due to the quirks of floating point representation, these rounded values may not be exactly the "obvious" decimal values, but they will be very very close.
For much (much!) more information on rounding, and especially on tie-breaking rules for rounding to nearest, see the Wikipedia article on Rounding.
Using %.2f in printf. It only print 2 decimal points.
Example:
printf("%.2f", 37.777779);
Output:
37.77
Assuming you're talking about round the value for printing, then Andrew Coleson and AraK's answer are correct:
printf("%.2f", 37.777779);
But note that if you're aiming to round the number to exactly 37.78 for internal use (eg to compare against another value), then this isn't a good idea, due to the way floating point numbers work: you usually don't want to do equality comparisons for floating point, instead use a target value +/- a sigma value. Or encode the number as a string with a known precision, and compare that.
See the link in Greg Hewgill's answer to a related question, which also covers why you shouldn't use floating point for financial calculations.
How about this:
float value = 37.777779;
float rounded = ((int)(value * 100 + .5) / 100.0);
printf("%.2f", 37.777779);
If you want to write to C-string:
char number[24]; // dummy size, you should take care of the size!
sprintf(number, "%.2f", 37.777779);
Always use the printf family of functions for this. Even if you want to get the value as a float, you're best off using snprintf to get the rounded value as a string and then parsing it back with atof:
#include <math.h>
#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
double dround(double val, int dp) {
int charsNeeded = 1 + snprintf(NULL, 0, "%.*f", dp, val);
char *buffer = malloc(charsNeeded);
snprintf(buffer, charsNeeded, "%.*f", dp, val);
double result = atof(buffer);
free(buffer);
return result;
}
I say this because the approach shown by the currently top-voted answer and several others here -
multiplying by 100, rounding to the nearest integer, and then dividing by 100 again - is flawed in two ways:
For some values, it will round in the wrong direction because the multiplication by 100 changes the decimal digit determining the rounding direction from a 4 to a 5 or vice versa, due to the imprecision of floating point numbers
For some values, multiplying and then dividing by 100 doesn't round-trip, meaning that even if no rounding takes place the end result will be wrong
To illustrate the first kind of error - the rounding direction sometimes being wrong - try running this program:
int main(void) {
// This number is EXACTLY representable as a double
double x = 0.01499999999999999944488848768742172978818416595458984375;
printf("x: %.50f\n", x);
double res1 = dround(x, 2);
double res2 = round(100 * x) / 100;
printf("Rounded with snprintf: %.50f\n", res1);
printf("Rounded with round, then divided: %.50f\n", res2);
}
You'll see this output:
x: 0.01499999999999999944488848768742172978818416595459
Rounded with snprintf: 0.01000000000000000020816681711721685132943093776703
Rounded with round, then divided: 0.02000000000000000041633363423443370265886187553406
Note that the value we started with was less than 0.015, and so the mathematically correct answer when rounding it to 2 decimal places is 0.01. Of course, 0.01 is not exactly representable as a double, but we expect our result to be the double nearest to 0.01. Using snprintf gives us that result, but using round(100 * x) / 100 gives us 0.02, which is wrong. Why? Because 100 * x gives us exactly 1.5 as the result. Multiplying by 100 thus changes the correct direction to round in.
To illustrate the second kind of error - the result sometimes being wrong due to * 100 and / 100 not truly being inverses of each other - we can do a similar exercise with a very big number:
int main(void) {
double x = 8631192423766613.0;
printf("x: %.1f\n", x);
double res1 = dround(x, 2);
double res2 = round(100 * x) / 100;
printf("Rounded with snprintf: %.1f\n", res1);
printf("Rounded with round, then divided: %.1f\n", res2);
}
Our number now doesn't even have a fractional part; it's an integer value, just stored with type double. So the result after rounding it should be the same number we started with, right?
If you run the program above, you'll see:
x: 8631192423766613.0
Rounded with snprintf: 8631192423766613.0
Rounded with round, then divided: 8631192423766612.0
Oops. Our snprintf method returns the right result again, but the multiply-then-round-then-divide approach fails. That's because the mathematically correct value of 8631192423766613.0 * 100, 863119242376661300.0, is not exactly representable as a double; the closest value is 863119242376661248.0. When you divide that back by 100, you get 8631192423766612.0 - a different number to the one you started with.
Hopefully that's a sufficient demonstration that using roundf for rounding to a number of decimal places is broken, and that you should use snprintf instead. If that feels like a horrible hack to you, perhaps you'll be reassured by the knowledge that it's basically what CPython does.
Also, if you're using C++, you can just create a function like this:
string prd(const double x, const int decDigits) {
stringstream ss;
ss << fixed;
ss.precision(decDigits); // set # places after decimal
ss << x;
return ss.str();
}
You can then output any double myDouble with n places after the decimal point with code such as this:
std::cout << prd(myDouble,n);
There isn't a way to round a float to another float because the rounded float may not be representable (a limitation of floating-point numbers). For instance, say you round 37.777779 to 37.78, but the nearest representable number is 37.781.
However, you can "round" a float by using a format string function.
You can still use:
float ceilf(float x); // don't forget #include <math.h> and link with -lm.
example:
float valueToRound = 37.777779;
float roundedValue = ceilf(valueToRound * 100) / 100;
In C++ (or in C with C-style casts), you could create the function:
/* Function to control # of decimal places to be output for x */
double showDecimals(const double& x, const int& numDecimals) {
int y=x;
double z=x-y;
double m=pow(10,numDecimals);
double q=z*m;
double r=round(q);
return static_cast<double>(y)+(1.0/m)*r;
}
Then std::cout << showDecimals(37.777779,2); would produce: 37.78.
Obviously you don't really need to create all 5 variables in that function, but I leave them there so you can see the logic. There are probably simpler solutions, but this works well for me--especially since it allows me to adjust the number of digits after the decimal place as I need.
Use float roundf(float x).
"The round functions round their argument to the nearest integer value in floating-point format, rounding halfway cases away from zero, regardless of the current rounding direction." C11dr §7.12.9.5
#include <math.h>
float y = roundf(x * 100.0f) / 100.0f;
Depending on your float implementation, numbers that may appear to be half-way are not. as floating-point is typically base-2 oriented. Further, precisely rounding to the nearest 0.01 on all "half-way" cases is most challenging.
void r100(const char *s) {
float x, y;
sscanf(s, "%f", &x);
y = round(x*100.0)/100.0;
printf("%6s %.12e %.12e\n", s, x, y);
}
int main(void) {
r100("1.115");
r100("1.125");
r100("1.135");
return 0;
}
1.115 1.115000009537e+00 1.120000004768e+00
1.125 1.125000000000e+00 1.129999995232e+00
1.135 1.134999990463e+00 1.139999985695e+00
Although "1.115" is "half-way" between 1.11 and 1.12, when converted to float, the value is 1.115000009537... and is no longer "half-way", but closer to 1.12 and rounds to the closest float of 1.120000004768...
"1.125" is "half-way" between 1.12 and 1.13, when converted to float, the value is exactly 1.125 and is "half-way". It rounds toward 1.13 due to ties to even rule and rounds to the closest float of 1.129999995232...
Although "1.135" is "half-way" between 1.13 and 1.14, when converted to float, the value is 1.134999990463... and is no longer "half-way", but closer to 1.13 and rounds to the closest float of 1.129999995232...
If code used
y = roundf(x*100.0f)/100.0f;
Although "1.135" is "half-way" between 1.13 and 1.14, when converted to float, the value is 1.134999990463... and is no longer "half-way", but closer to 1.13 but incorrectly rounds to float of 1.139999985695... due to the more limited precision of float vs. double. This incorrect value may be viewed as correct, depending on coding goals.
Code definition :
#define roundz(x,d) ((floor(((x)*pow(10,d))+.5))/pow(10,d))
Results :
a = 8.000000
sqrt(a) = r = 2.828427
roundz(r,2) = 2.830000
roundz(r,3) = 2.828000
roundz(r,5) = 2.828430
double f_round(double dval, int n)
{
char l_fmtp[32], l_buf[64];
char *p_str;
sprintf (l_fmtp, "%%.%df", n);
if (dval>=0)
sprintf (l_buf, l_fmtp, dval);
else
sprintf (l_buf, l_fmtp, dval);
return ((double)strtod(l_buf, &p_str));
}
Here n is the number of decimals
example:
double d = 100.23456;
printf("%f", f_round(d, 4));// result: 100.2346
printf("%f", f_round(d, 2));// result: 100.23
I made this macro for rounding float numbers.
Add it in your header / being of file
#define ROUNDF(f, c) (((float)((int)((f) * (c))) / (c)))
Here is an example:
float x = ROUNDF(3.141592, 100)
x equals 3.14 :)
Let me first attempt to justify my reason for adding yet another answer to this question. In an ideal world, rounding is not really a big deal. However, in real systems, you may need to contend with several issues that can result in rounding that may not be what you expect. For example, you may be performing financial calculations where final results are rounded and displayed to users as 2 decimal places; these same values are stored with fixed precision in a database that may include more than 2 decimal places (for various reasons; there is no optimal number of places to keep...depends on specific situations each system must support, e.g. tiny items whose prices are fractions of a penny per unit); and, floating point computations performed on values where the results are plus/minus epsilon. I have been confronting these issues and evolving my own strategy over the years. I won't claim that I have faced every scenario or have the best answer, but below is an example of my approach so far that overcomes these issues:
Suppose 6 decimal places is regarded as sufficient precision for calculations on floats/doubles (an arbitrary decision for the specific application), using the following rounding function/method:
double Round(double x, int p)
{
if (x != 0.0) {
return ((floor((fabs(x)*pow(double(10.0),p))+0.5))/pow(double(10.0),p))*(x/fabs(x));
} else {
return 0.0;
}
}
Rounding to 2 decimal places for presentation of a result can be performed as:
double val;
// ...perform calculations on val
String(Round(Round(Round(val,8),6),2));
For val = 6.825, result is 6.83 as expected.
For val = 6.824999, result is 6.82. Here the assumption is that the calculation resulted in exactly 6.824999 and the 7th decimal place is zero.
For val = 6.8249999, result is 6.83. The 7th decimal place being 9 in this case causes the Round(val,6) function to give the expected result. For this case, there could be any number of trailing 9s.
For val = 6.824999499999, result is 6.83. Rounding to the 8th decimal place as a first step, i.e. Round(val,8), takes care of the one nasty case whereby a calculated floating point result calculates to 6.8249995, but is internally represented as 6.824999499999....
Finally, the example from the question...val = 37.777779 results in 37.78.
This approach could be further generalized as:
double val;
// ...perform calculations on val
String(Round(Round(Round(val,N+2),N),2));
where N is precision to be maintained for all intermediate calculations on floats/doubles. This works on negative values as well. I do not know if this approach is mathematically correct for all possibilities.
...or you can do it the old-fashioned way without any libraries:
float a = 37.777779;
int b = a; // b = 37
float c = a - b; // c = 0.777779
c *= 100; // c = 77.777863
int d = c; // d = 77;
a = b + d / (float)100; // a = 37.770000;
That of course if you want to remove the extra information from the number.
this function takes the number and precision and returns the rounded off number
float roundoff(float num,int precision)
{
int temp=(int )(num*pow(10,precision));
int num1=num*pow(10,precision+1);
temp*=10;
temp+=5;
if(num1>=temp)
num1+=10;
num1/=10;
num1*=10;
num=num1/pow(10,precision+1);
return num;
}
it converts the floating point number into int by left shifting the point and checking for the greater than five condition.

C++ int64 * double == off by one

Below is the code I've tested in a 64-bit environment and 32-bit. The result is off by one precisely each time. The expected result is: 1180000000 with the actual result being 1179999999. I'm not sure exactly why and I was hoping someone could educate me:
#include <stdint.h>
#include <iostream>
using namespace std;
int main() {
double odds = 1.18;
int64_t st = 1000000000;
int64_t res = st * odds;
cout << "result: " << res << endl;
return 1;
}
I appreciate any feedback.
1.18, or 118 / 100 can't be exactly represented in binary, it will have repeating decimals. The same happens if you write 1 / 3 in decimal.
So let's go over a similar case in decimal, let's calculate (1 / 3) × 30000, which of course should be 10000:
odds = 1 / 3 and st = 30000
Since computers have only a limited precision we have to truncate this number to a limited number of decimals, let's say 6, so:
odds = 0.333333
0.333333 × 10000 = 9999.99. The cast (which in your program is implicit) will truncate this number to 9999.
There is no 100% reliable way to work around this. float and double just have only limited precision. Dealing with this is a hard problem.
Your program contains an implicit cast from double to an integer on the line int64_t res = st * odds;. Many compilers will warn you about this. It can be the source of bugs of the type you are describing. This cast, which can be explicitly written as (int64_t) some_double, rounds the number towards zero.
An alternative is rounding to the nearest integer with round(some_double);. That will—in this case—give the expected result.
First of all - 1.18 is not exactly representable in double. Mathematically the result of:
double odds = 1.18;
is 1.17999999999999993782751062099 (according to an online calculator).
So, mathematically, odds * st is 1179999999.99999993782751062099.
But in C++, odds * st is an expression with type double. So your compiler has two options for implementing this:
Do the computation in double precision
Do the computation in higher precision and then round the result to double
Apparently, doing the computation in double precision in IEEE754 results in exactly 1180000000.
However, doing it in long double precision produces something more like 1179999999.99999993782751062099
Converting this to double is now implementation-defined as to whether it selects the next-highest or next-lowest value, but I believe it is typical for the next-lowest to be selected.
Then converting this next-lowest result to integer will truncate the fractional part.
There is an interesting blog post here where the author describes the behaviour of GCC:
It uses long double intermediate precision for x86 code (due to the x87 FPUs long double registers)
It uses actual types for x64 code (because the SSE/SSE2 FPU supports this more naturally)
According to the C++11 standard you should be able to inspect which intermediate precision is being used by outputting FLT_EVAL_METHOD from <cfloat>. 0 would mean actual values, 2 would mean long double is being used.

double precision error when converting to scientific notation

I'm building a program to to convert double values in to scientific value format(mantissa, exponent). Then I noticed the below
369.7900000000000 -> 3.6978999999999997428
68600000 -> 6.8599999999999994316
I noticed the same pattern for several other values also. The maximum fractional error is
0.000 000 000 000 001 = 1*e-15
I know the inaccuracy in representing double values in a computer. Can this be concluded that the maximum fractional error we would get is 1*e-15? What is significant about this?
I went through most of the questions on floating point precision problem in stack overflow, but I didnt see any about the maximum fractional error in 64 bits.
To be clear on the computation I do, I have mentioned my code snippet as well
double norm = 68600000;
if (norm)
{
while (norm >= 10.0)
{
norm /= 10.0;
exp++;
}
while (norm < 1.0)
{
norm *= 10.0;
exp--;
}
}
Now I get
norm = 6.8599999999999994316;
exp = 7
The number you are getting is related to the machine epsilon for the double data type.
A double is 64 bits long, with 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa fraction. A double's value is given by
1.mmmmm... * (2^exp)
With only 52 bits for the mantissa, any double value below 2^-52 will be completely lost when added to 1.0 due to its small significance. In binary, 1.0 + 2^-52 would be
1.000...00 + 0.000...01 = 1.000.....01
Obviously anything lower would not change the value of 1.0. You can verify for yourself that 1.0 + 2^-53 == 1.0 in a program.
This number 2^-52 = 2.22e-16 is called the machine epsilon and is an upper bound on the relative error that occurs during one floating point arithmetic due to round-off error with double values.
Similarly, float has 23 bits in its mantissa and so its machine epsilon is 2^-23 = 1.19e-7.
The reason you are getting 1e-15 may be because errors accumulate as you perform many arithmetic operations, but I can't say because I don't know the exact calculations you are doing.
EDIT: I've looked into the relative error for your problem with 68600000.
First off, you may be interested to know that round-off error can change the result of your computation if you break it into steps:
686.0/10.0 = 68.59999999999999431566
686.0/10.0/10.0 = 6.85999999999999943157
686.0/100.0 = 6.86000000000000031974
In the first line, the closest double to 68.6 is lower than the actual value, but in the third line we see the closest double to 6.86 is greater.
If we look at the abosolute error e_abs = abs(v-v_approx) of your program, we see that it is
6.8600000 - 6.85999999999999943156581139192 ~= 5.684e-16
However, the relative error e_abs = abs( (v-v_approx)/ v) = abs(e_abs/v) would be
5.684e-16 / 6.86 ~= 8.286e-17
Which is indeed below our machine epsilon of 2.22e-16.
This is a famous paper you can read if you want to know all the details about floating point arithmetic.

Why float taking 0.699999 instead of 0.7 [duplicate]

This question already has answers here:
Floating point comparison [duplicate]
(5 answers)
Closed 9 years ago.
Here x is taking 0.699999 instead of 0.7 but y is taking 0.5 as assigned. Can you tell me what is the exact reason for this behavior.
#include<iostream>
using namespace std;
int main()
{
float x = 0.7;
float y = 0.5;
if (x < 0.7)
{
if (y < 0.5)
cout<<"2 is right"<<endl;
else
cout<<"1 is right"<<endl;
}
else
cout<<"0 is right"<<endl;
cin.get();
return 0;
}
There are lots of things on the internet about IEEE floating point.
0.5 = 1/2
so can be written exactly as a sum of powers of two
0.7 = 7/10 = 1/2 + 1/5 = 1/2 + 1/8 + a bit more... etc
The bit more can never be exactly a power of two, so you get the closest it can manage.
It is to do with how floating points are represented in memory. They have a limited number of bits (usually 32 for a float). This means there are a limited number of values that can be represented which means that many numbers from the infinite set of real numbers cannot be represented.
This website explains further
If you want to understand exactly why, then have a look at floating point representation of your machine (most probably it's IEEE 754, https://en.wikipedia.org/wiki/IEEE_floating_point ).
If you want to write robust and portable code, never compare floating-point values for equality. You should always compare them with some precision (e.g. instead of x==y you should write fabs(x-y) < eps where eps is say 1e-6).
floating point representation is approximate only as you cannot have precise representation of real, non-rational numbers on a computer.
`
when operating on floats, errros will in general accumulate.
however, there are some reals which can be represented exactly on a digital computer using it's native datatype for this purpose (*), 0.5 being one of them.
(*) meaning the format the floating point processing unit of the cpu operates on (standardized in ieee754). specialized libraries can represent integer and rational numbers exactly beyond the limits of the processor's internal formats. rounding errors may still occur when converting into a human-readable decimal expansion and the alternative also does not extend to irrational numbers (e.g. sqrt(3)). and, of course, these libraries comes at the cost of less speed.

How does Excel successfully round floating point numbers even though they are imprecise?

For example, this blog says 0.005 is not exactly 0.005, but rounding that number yields the right result.
I have tried all kinds of rounding in C++ and it fails when rounding numbers to certain decimal places. For example, Round(x,y) rounds x to a multiple of y. So Round(37.785,0.01) should give you 37.79 and not 37.78.
I am reopening this question to ask the community for help. The problem is with the impreciseness of floating point numbers (37,785 is represented as 37.78499999999).
The question is how does Excel get around this problem?
The solution in this round() for float in C++ is incorrect for the above problem.
"Round(37.785,0.01) should give you 37.79 and not 37.78."
First off, there is no consensus that 37.79 rather than 37.78 is the "right" answer here? Tie-breakers are always a bit tough. While always rounding up in the case of a tie is a widely-used approach, it certainly is not the only approach.
Secondly, this isn't a tie-breaking situation. The numerical value in the IEEE binary64 floating point format is 37.784999999999997 (approximately). There are lots of ways to get a value of 37.784999999999997 besides a human typing in a value of 37.785 and happen to have that converted to that floating point representation. In most of these cases, the correct answer is 37.78 rather than 37.79.
Addendum
Consider the following Excel formulae:
=ROUND(37785/1000,2)
=ROUND(19810222/2^19+21474836/2^47,2)
Both cells will display the same value, 37.79. There is a legitimate argument over whether 37785/1000 should round to 37.78 or 37.79 with two place accuracy. How to deal with these corner cases is a bit arbitrary, and there is no consensus answer. There isn't even a consensus answer inside Microsoft: "the Round() function is not implemented in a consistent fashion among different Microsoft products for historical reasons." ( http://support.microsoft.com/kb/196652 ) Given an infinite precision machine, Microsoft's VBA would round 37.785 to 37.78 (banker's round) while Excel would yield 37.79 (symmetric arithmetic round).
There is no argument over the rounding of the latter formula. It is strictly less than 37.785, so it should round to 37.78, not 37.79. Yet Excel rounds it up. Why?
The reason has to do with how real numbers are represented in a computer. Microsoft, like many others, uses the IEEE 64 bit floating point format. The number 37785/1000 suffers from precision loss when expressed in this format. This precision loss does not occur with 19810222/2^19+21474836/2^47; it is an "exact number".
I intentionally constructed that exact number to have the same floating point representation as does the inexact 37785/1000. That Excel rounds this exact value up rather than down is the key to determining how Excel's ROUND() function works: It is a variant of symmetric arithmetic rounding. It rounds based on a comparison to the floating point representation of the corner case.
The algorithm in C++:
#include <cmath> // std::floor
// Compute 10 to some positive integral power.
// Dealing with overflow (exponent > 308) is an exercise left to the reader.
double pow10 (unsigned int exponent) {
double result = 1.0;
double base = 10.0;
while (exponent > 0) {
if ((exponent & 1) != 0) result *= base;
exponent >>= 1;
base *= base;
}
return result;
}
// Round the same way Excel does.
// Dealing with nonsense such as nplaces=400 is an exercise left to the reader.
double excel_round (double x, int nplaces) {
bool is_neg = false;
// Excel uses symmetric arithmetic round: Round away from zero.
// The algorithm will be easier if we only deal with positive numbers.
if (x < 0.0) {
is_neg = true;
x = -x;
}
// Construct the nearest rounded values and the nasty corner case.
// Note: We really do not want an optimizing compiler to put the corner
// case in an extended double precision register. Hence the volatile.
double round_down, round_up;
volatile double corner_case;
if (nplaces < 0) {
double scale = pow10 (-nplaces);
round_down = std::floor (x * scale);
corner_case = (round_down + 0.5) / scale;
round_up = (round_down + 1.0) / scale;
round_down /= scale;
}
else {
double scale = pow10 (nplaces);
round_down = std::floor (x / scale);
corner_case = (round_down + 0.5) * scale;
round_up = (round_down + 1.0) * scale;
round_down *= scale;
}
// Round by comparing to the corner case.
x = (x < corner_case) ? round_down : round_up;
// Correct the sign if needed.
if (is_neg) x = -x;
return x;
}
For very accurate arbitrary precision and rounding of floating point numbers to a fixed set of decimal places, you should take a look at a math library like GNU MPFR. While it's a C-library, the web-page I posted also links to a couple different C++ bindings if you want to avoid using C.
You may also want to read a paper entitled "What every computer scientist should know about floating point arithmetic" by David Goldberg at the Xerox Palo Alto Research Center. It's an excellent article demonstrating the underlying process that allows floating point numbers to be approximated in a computer that represents everything in binary data, and how rounding errors and other problems can creep up in FPU-based floating point math.
I don't know how Excel does it, but printing floating point numbers nicely is a hard problem: http://www.serpentine.com/blog/2011/06/29/here-be-dragons-advances-in-problems-you-didnt-even-know-you-had/
So your actual question seems to be, how to get correctly rounded floating point -> string conversions. By googling for those terms you'll get a bunch of articles, but if you're interested in something to use, most platforms provide reasonably competent implementations of sprintf()/snprintf(). So just use those, and if you find bugs, file a report to the vendor.
A function that takes a floating point number as argument and returns another floating point number, rounded exactly to a given number of decimal digits cannot be written, because there are many numbers with a finite decimal representation that have an infinite binary representation; one of the simplest examples is 0.1 .
To achieve what you want you must accept to use a different type as a result of your rounding function. If your immediate need is printing the number you can use a string and a formatting function: the problem becomes how to obtain exactly the formatting you expect. Otherwise if you need to store this number in order to perform exact calculations on it, for instance if you are doing accounting, you need a library that's capable of representing decimal numbers exactly. In this case the most common approach is to use a scaled representation: an integer for the value together with the number of decimal digits. Dividing the value by ten raised to the scale gives you the original number.
If any of these approaches is suitable, I'll try and expand my answer with practical suggestions.
Excel rounds numbers like this "correctly" by doing WORK. They started in 1985, with a fairly "normal" set of floating-point routines, and added some scaled-integer fake floating point, and they've been tuning those things and adding special cases ever since. The app DID used to have most of the same "obvious" bugs that everybody else did, it's just that it mostly had them a long time ago. I filed a couple myself, back when I was doing tech support for them in the early 90s.
I believe the following C# code rounds numbers as they are rounded in Excel. To exactly replicate the behavior in C++ you might need to use a special decimal type.
In plain English, the double-precision number is converted to a decimal and then rounded to fifteen significant digits (not to be confused with fifteen decimal places). The result is rounded a second time to the specified number of decimal places.
That might seem weird, but what you have to understand is that Excel always displays numbers that are rounded to 15 significant figures. If the ROUND() function weren't using that display value as a starting point, and used the internal double representation instead, then there would be cases where ROUND(A1,N) did not seem to correspond to the actual value in A1. That would be very confusing to a non-technical user.
The double which is closest to 37.785 has an exact decimal value of 37.784999999999996589394868351519107818603515625. (Any double can be represented precisely by a finite base ten decimal because one quarter, one eighth, one sixteenth, and so forth all have finite decimal expansions.) If that number were rounded directly to two decimal places, there would be no tie to break and the result would be 37.78. If you round to 15 significant figures first you get 37.7850000000000. If this is further rounded to two decimal places, then you get 37.79, so there is no real mystery after all.
// Convert to a floating decimal point number, round to fifteen
// significant digits, and then round to the number of places
// indicated.
static decimal SmartRoundDouble(double input, int places)
{
int numLeadingDigits = (int)Math.Log10(Math.Abs(input)) + 1;
decimal inputDec = GetAccurateDecimal(input);
inputDec = MoveDecimalPointRight(inputDec, -numLeadingDigits);
decimal round1 = Math.Round(inputDec, 15);
round1 = MoveDecimalPointRight(round1, numLeadingDigits);
decimal round2 = Math.Round(round1, places, MidpointRounding.AwayFromZero);
return round2;
}
static decimal MoveDecimalPointRight(decimal d, int n)
{
if (n > 0)
for (int i = 0; i < n; i++)
d *= 10.0m;
else
for (int i = 0; i > n; i--)
d /= 10.0m;
return d;
}
// The constructor for decimal that accepts a double does
// some rounding by default. This gets a more exact number.
static decimal GetAccurateDecimal(double r)
{
string accurateStr = r.ToString("G17", CultureInfo.InvariantCulture);
return Decimal.Parse(accurateStr, CultureInfo.InvariantCulture);
}
What you NEED is this :
double f = 22.0/7.0;
cout.setf(ios::fixed, ios::floatfield);
cout.precision(6);
cout<<f<<endl;
How it can be implemented (just a overview for rounding last digit)
:
long getRoundedPrec(double d, double precision = 9)
{
precision = (int)precision;
stringstream s;
long l = (d - ((double)((int)d)))* pow(10.0,precision+1);
int lastDigit = (l-((l/10)*10));
if( lastDigit >= 5){
l = l/10 +1;
}
return l;
}
Just as base-10 numbers must be rounded as they are converted to base-2, it is possible to round a number as it is converted from base-2 to base-10. Once the number has a base-10 representation it can be rounded again in a straightforward manner by looking at the digit to the right of the one you wish to round.
While there's nothing wrong with the above assertion, there's a much more pragmatic solution. The problem is that the binary representation tries to get as close as possible to the decimal number, even if that binary is less than the decimal. The amount of error is within [-0.5,0.5] least significant bits (LSB) of the true value. For rounding purposes you'd rather it be within [0,1] LSB so that the error is always positive, but that's not possible without changing all the rules of floating point math.
The one thing you can do is add 1 LSB to the value, so the error is within [0.5,1.5] LSB of the true value. This is less accurate overall, but only by a very tiny amount; when the value is rounded for representation as a decimal number it is much more likely to be rounded to a proper decimal number because the error is always positive.
To add 1 LSB to the value before rounding it, see the answers to this question. For example in Visual Studio C++ 2010 the procedure would be:
Round(_nextafter(37.785,37.785*1.1),0.01);
There are many ways to optimize the result of a floating-point value using statistical, numerical... algorithms
The easiest one is probably searching for repetitive 9s or 0s in the range of precision. If there are any, maybe those 9s are redundant, just round them up. But this may not work in many cases. Here's an example for a float with 6 digits of precision:
2.67899999 → 2.679
12.3499999 → 12.35
1.20000001 → 1.2
Excel always limits the input range to 15 digits and rounds the output to maximum 15 digits so this might be one of the way Excel uses
Or you can include the precision along with the number. After each step, adjust the accuracy depend on the precision of operands. For example
1.113 → 3 decimal digits
6.15634 → 5 decimal digits
Since both number are inside the double's 16-17 digits precision range, their sum will be accurate to the larger of them, which is 5 digits. Similarly, 3+5 < 16, so their product will be precise to 8 decimal numbers
1.113 + 6.15634 = 7.26934 → 5 decimal digits
1.113 * 6.15634 = 6.85200642 → 8 decimal digits
But 4.1341677841 * 2.251457145 will only take double's accuracy because the real result exceed double's precision
Another efficient algorithm is Grisu but I haven't had an opportunity to try.
In 2010, Florian Loitsch published a wonderful paper in PLDI, "Printing floating-point numbers quickly and accurately with integers", which represents the biggest step in this field in 20 years: he mostly figured out how to use machine integers to perform accurate rendering! Why do I say "mostly"? Because although Loitsch's "Grisu3" algorithm is very fast, it gives up on about 0.5% of numbers, in which case you have to fall back to Dragon4 or a derivative
Here be dragons: advances in problems you didn’t even know you had
In fact I think Excel must combine many different methods to achieve the best result of all
Example When a Value Reaches Zero
In Excel 95 or earlier, enter the following into a new workbook:
A1: =1.333+1.225-1.333-1.225
Right-click cell A1, and then click Format Cells. On the Number tab, click Scientific under Category. Set the Decimal places to 15.
Rather than displaying 0, Excel 95 displays -2.22044604925031E-16.
Excel 97, however, introduced an optimization that attempts to correct for this problem. Should an addition or subtraction operation result in a value at or very close to zero, Excel 97 and later will compensate for any error introduced as a result of converting an operand to and from binary. The example above when performed in Excel 97 and later correctly displays 0 or 0.000000000000000E+00 in scientific notation.
Floating-point arithmetic may give inaccurate results in Excel
As mjfgates says, Excel does hard work to get this "right". The first thing to do when you try to reimplement this, is define what you mean by "right". Obvious solutions:
implement rational arithmetic
Slow but reliable.
implement a bunch of heuristics
Fast but tricky to get right (think "years of bug reports").
It really depends on your application.
Most decimal fractions can't be accurately represented in binary.
double x = 0.0;
for (int i = 1; i <= 10; i++)
{
x += 0.1;
}
// x should now be 1.0, right?
//
// it isn't. Test it and see.
One solution is to use BCD. It's old. But, it's also tried and true. We have a lot of other old ideas that we use every day (like using a 0 to represent nothing...).
Another technique uses scaling upon input/output. This has the advantage of nearly all math being integer math.