double precision error when converting to scientific notation - c++

I'm building a program to convert double values into scientific notation format (mantissa, exponent). Then I noticed the following:
369.7900000000000 -> 3.6978999999999997428
68600000 -> 6.8599999999999994316
I noticed the same pattern for several other values as well. The maximum fractional error is
0.000 000 000 000 001 = 1e-15
I know about the inaccuracy of representing double values in a computer. Can we conclude that the maximum fractional error we would get is 1e-15? What is significant about this?
I went through most of the questions on floating-point precision problems on Stack Overflow, but I didn't see any about the maximum fractional error in 64 bits.
To be clear about the computation I do, I have included my code snippet as well:
int exp = 0;
double norm = 68600000;
if (norm)
{
    while (norm >= 10.0)
    {
        norm /= 10.0;
        exp++;
    }
    while (norm < 1.0)
    {
        norm *= 10.0;
        exp--;
    }
}
Now I get
norm = 6.8599999999999994316;
exp = 7

The number you are getting is related to the machine epsilon for the double data type.
A double is 64 bits long, with 1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa fraction. A double's value is given by
1.mmmmm... * (2^exp)
With only 52 bits for the mantissa, any value at or below 2^-53 is completely lost when added to 1.0, because it rounds away due to its small significance. In binary, 1.0 + 2^-52 would be
1.000...00 + 0.000...01 = 1.000.....01
Anything smaller does not change the value of 1.0. You can verify for yourself that 1.0 + 2^-53 == 1.0 in a program.
This number 2^-52 = 2.22e-16 is called the machine epsilon, and it is an upper bound on the relative error introduced by round-off in a single floating-point operation on double values.
Similarly, float has 23 bits in its mantissa and so its machine epsilon is 2^-23 = 1.19e-7.
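You can check both constants directly; here is a minimal sketch (DBL_EPSILON and FLT_EPSILON come from <cfloat>):
#include <cfloat>
#include <cmath>
#include <iostream>

int main() {
    std::cout << std::boolalpha;
    std::cout << (1.0 + std::pow(2.0, -52) == 1.0) << '\n'; // false: 2^-52 survives
    std::cout << (1.0 + std::pow(2.0, -53) == 1.0) << '\n'; // true: rounds back to 1.0
    std::cout << DBL_EPSILON << '\n';                       // 2.22045e-16 == 2^-52
    std::cout << FLT_EPSILON << '\n';                       // 1.19209e-07 == 2^-23
}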
The reason you are getting 1e-15 may be that errors accumulate as you perform many arithmetic operations, but I can't say for certain because I don't know the exact calculations you are doing.
EDIT: I've looked into the relative error for your problem with 68600000.
First off, you may be interested to know that round-off error can change the result of your computation if you break it into steps:
686.0/10.0 = 68.59999999999999431566
686.0/10.0/10.0 = 6.85999999999999943157
686.0/100.0 = 6.86000000000000031974
In the first line, the closest double to 68.6 is lower than the actual value, but in the third line we see the closest double to 6.86 is greater.
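You can reproduce these three results with a few printf calls:
#include <cstdio>

int main() {
    std::printf("%.20f\n", 686.0 / 10.0);        // 68.59999999999999431566
    std::printf("%.20f\n", 686.0 / 10.0 / 10.0); // 6.85999999999999943157
    std::printf("%.20f\n", 686.0 / 100.0);       // 6.86000000000000031974
}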
If we look at the absolute error e_abs = abs(v - v_approx) of your program, we see that it is
6.8600000 - 6.85999999999999943156581139192 ~= 5.684e-16
However, the relative error e_rel = abs((v - v_approx) / v) = abs(e_abs / v) would be
5.684e-16 / 6.86 ~= 8.286e-17
which is indeed below our machine epsilon of 2.22e-16.
"What Every Computer Scientist Should Know About Floating-Point Arithmetic" is a famous paper you can read if you want to know all the details about floating-point arithmetic.

Related

comparison float rounding fails System.Math.RoundTo C++ XE7

I've been trying to round a float value to 4 decimal places without success.
float fconv = 1.0f;
float fdata = 39.934543423412f;
float fres = RoundTo(fdata*fconv, -4);
if (fres <= 39.9345f) { /* do something */ } // <-- unwanted behavior
Wanted result is 39.934500000000
Actual result is 39.934543423412
I've tried many methods, including Round a float to a given precision, without success.
I'm working on an AMD FX83xx, 64-bit. The program is built as 32-bit Debug using XE7.
Thanks
Your desired precision of 6 significant decimal digits is very near the precision limit for a float data type. The spacing between consecutive representable float values around 40 is 2^-18, about 3.81E-6, so there are only a couple of bits of difference between the 'best' value and what you're getting. This is possibly due to rounding that close to the limit, but I'm not sure.
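You can see that spacing yourself with std::nextafter; a minimal sketch:
#include <cmath>
#include <cstdio>

int main() {
    float x = 39.934543423412f;              // what the literal actually stores
    float next = std::nextafter(x, 100.0f);  // the adjacent representable float
    std::printf("%.10f\n%.10f\n", x, next);  // the two differ by about 3.8e-6
}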

double to scientific notation conversion - precision error

I'm writing a piece of code to convert double values to scientific notation up to a precision of 15 in C++. I know I can use standard library functions like sprintf with the %e option to do this, but I need to come up with my own solution.
I'm trying something like this.
int exp = 0;
double norm = 68600000;
if (norm)
{
    while (norm >= 10.0)
    {
        norm /= 10.0;
        exp++;
    }
    while (norm < 1.0)
    {
        norm *= 10.0;
        exp--;
    }
}
The result I get is
norm = 6.8599999999999994316;
exp = 7
The reason for losing this precision was clarified in this question.
Now I try to round the value to a precision of 15, which results in
6.859 999 999 999 999
(it's evident that since the 16th significant digit is less than 5, we get this result)
Expected answer: norm = 6.860 000 000 000 000, exp = 7
My question is: is there any better way to do the double-to-scientific-notation conversion to a precision of 15 (without using the standard libraries), so that I would get exactly 6.86 when I round? As you may have noticed, the problem here is not with the rounding mechanism but with the double-to-scientific-notation conversion, due to the precision loss related to machine epsilon.
You could declare norm as a long double for some more precision (long double wiki), although there are some compiler-specific issues to be aware of: some compilers make long double synonymous with double.
Another way to solve this precision problem is to work with numbers in the form of strings and implement custom arithmetic operations on strings, which are not subject to machine epsilon.
For example:
int getEXP(const std::string& norm) { return (int)norm.length() - 1; }

std::string norm = "68600000";
int exp = getEXP(norm); // returns 7
The next step would be to implement functions that insert a decimal character into the appropriate place in the norm string and add whatever level of precision you'd like (a sketch follows below). No machine epsilon to worry about.
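For example, a minimal sketch of such a function (the name and the all-digits input format are my assumptions):
#include <string>

// Hypothetical helper: turn a plain digit string like "68600000" into its
// mantissa form "6.8600000". Assumes a nonempty string of digits with no
// sign and no decimal point.
std::string toMantissaString(const std::string& digits) {
    std::string out;
    out += digits[0];
    out += '.';
    out += digits.substr(1);
    return out;
}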

Converting a decimal number in scientific notation to IEEE 754

I've read a few texts and threads showing how to convert from a decimal number to IEEE 754, but I am still confused as to how I can convert the number without expanding the decimal (which is represented in scientific notation).
The number I am particularly working with is 9.07 * 10^23, but any number would do; I will figure out how to do it for my particular example.
I'm assuming you want the result to be the floating-point number closest to the decimal number, and that you are using double-precision floating-point numbers.
For most numbers, there is a way to do it relatively quickly. Here's how it works in a nutshell.
You need to split the number into either a product or a fraction of numbers that have an exact representation as a floating-point number. The largest power of 10 that is exactly representable is 10^22. So, to get 9.07e+23 in floating-point form, we can write:
9.07e+23 = 907 * 10^21
According to the IEEE-754 standard, a single floating-point operation is guaranteed to be correctly rounded, so the above product, computed as a product of two double-precision floating-point numbers, will give the correctly rounded result.
If you were to use this in a conversion function, you would probably store the powers of 10 in an array, as in the sketch below.
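A minimal sketch of that table-driven fast path (the array name is my own; only powers up to 10^22 are exact):
#include <cstdio>

// Powers of 10 from 10^0 through 10^22, all exactly representable as doubles.
static const double kPow10[23] = {
    1e0,  1e1,  1e2,  1e3,  1e4,  1e5,  1e6,  1e7,  1e8,  1e9,  1e10,
    1e11, 1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19, 1e20, 1e21, 1e22
};

int main() {
    // 9.07e+23 = 907 * 10^21: one product of two exact doubles,
    // so the result is correctly rounded per IEEE-754.
    double v = 907.0 * kPow10[21];
    std::printf("%.17g\n", v);
}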
Note that you can't use this method for 9.07e-23. This number equals 907 / 10^23, so the denominator would be too large to be exactly representable. In this situation, and other dealings with very large or very small numbers, you have to use some form of high-precision arithmetic.
See Fast Path Decimal to Floating-Point Conversion for further details and examples.
Converting a number from a decimal string to binary IEEE is fairly straightforward if you know how to do IEEE floating-point addition and multiplication (or if you're using any basic programming language like C/C++).
There are a lot of different approaches to this, but the easiest is to evaluate 9.07 * 10^23 directly.
First, start with 9.07:
9.07 = 9 + 0 * 10^-1 + 7 * 10^-2
Now evaluate 10^23. This can be done by starting with 10 and using any powering algorithm.
Then multiply the results together.
Here's a simple implementation in C/C++:
double mantissa = 9;
mantissa += 0 / 10.;   // tenths digit
mantissa += 7 / 100.;  // hundredths digit

double exp = 1;
for (int i = 0; i < 23; i++) {
    exp *= 10;
}

double result = mantissa * exp;
Now, going backwards (IEEE to decimal) is a lot harder.
Again, there are a lot of different approaches. Here's the easiest one I can think of.
I'll use 1.0011101b * 2^40 as the example. (the mantissa is in binary)
First, convert the mantissa to decimal: (this should be easy, since there's no exponent)
1.0011101b * 2^40 = 1.22656 * 2^40
Now, "scale" the number such that the binary exponent vanishes. This is done by multiplying by an appropriate power of 10 to "get rid" of the binary exponent.
1.22656 * 2^40 = 1.22656 * (2^40 * 10^-12) * 10^12
= 1.22656 * (1.09951) * 10^12
= 1.34861 * 10^12
So the answer is:
1.0011101b * 2^40 = 1.34861 * 10^12
In this example, 10^12 was needed to "scale away" the 2^40. Determining the power of 10 that is needed is simply the integer part of
(power of 2) * log(2)/log(10)
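Here is the whole worked example as a minimal C++ sketch (std::ldexp builds m * 2^40 directly):
#include <cmath>
#include <cstdio>

int main() {
    // 1.0011101b = 1 + 1/8 + 1/16 + 1/32 + 1/128 = 1.2265625
    double m = 1.0 + 1.0/8 + 1.0/16 + 1.0/32 + 1.0/128;
    double v = std::ldexp(m, 40);                           // m * 2^40

    int p10 = (int)(40 * std::log(2.0) / std::log(10.0));   // = 12
    double scaled = v / std::pow(10.0, p10);
    std::printf("%.5f * 10^%d\n", scaled, p10);             // 1.34862 * 10^12
}
The full-precision mantissa gives 1.34862; the worked example shows 1.34861 only because it truncated the mantissa to 1.22656.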

How does Excel successfully round floating point numbers even though they are imprecise?

For example, this blog says 0.005 is not exactly 0.005, but rounding that number yields the right result.
I have tried all kinds of rounding in C++ and it fails when rounding numbers to certain decimal places. For example, Round(x,y) rounds x to a multiple of y. So Round(37.785,0.01) should give you 37.79 and not 37.78.
I am reopening this question to ask the community for help. The problem is with the imprecision of floating-point numbers (37.785 is represented as 37.78499999999...).
The question is how does Excel get around this problem?
The solution in this round() for float in C++ is incorrect for the above problem.
"Round(37.785,0.01) should give you 37.79 and not 37.78."
First off, there is no consensus that 37.79 rather than 37.78 is the "right" answer here. Tie-breaking is always a bit tough. While always rounding up in the case of a tie is a widely used approach, it certainly is not the only one.
Secondly, this isn't a tie-breaking situation. The numerical value in the IEEE binary64 floating-point format is 37.784999999999997 (approximately). There are lots of ways to get a value of 37.784999999999997 besides a human typing in 37.785 and happening to have it converted to that floating-point representation. In most of these cases, the correct answer is 37.78 rather than 37.79.
Addendum
Consider the following Excel formulae:
=ROUND(37785/1000,2)
=ROUND(19810222/2^19+21474836/2^47,2)
Both cells will display the same value, 37.79. There is a legitimate argument over whether 37785/1000 should round to 37.78 or 37.79 with two place accuracy. How to deal with these corner cases is a bit arbitrary, and there is no consensus answer. There isn't even a consensus answer inside Microsoft: "the Round() function is not implemented in a consistent fashion among different Microsoft products for historical reasons." ( http://support.microsoft.com/kb/196652 ) Given an infinite precision machine, Microsoft's VBA would round 37.785 to 37.78 (banker's round) while Excel would yield 37.79 (symmetric arithmetic round).
There is no argument over the rounding of the latter formula. It is strictly less than 37.785, so it should round to 37.78, not 37.79. Yet Excel rounds it up. Why?
The reason has to do with how real numbers are represented in a computer. Microsoft, like many others, uses the IEEE 64 bit floating point format. The number 37785/1000 suffers from precision loss when expressed in this format. This precision loss does not occur with 19810222/2^19+21474836/2^47; it is an "exact number".
I intentionally constructed that exact number to have the same floating point representation as does the inexact 37785/1000. That Excel rounds this exact value up rather than down is the key to determining how Excel's ROUND() function works: It is a variant of symmetric arithmetic rounding. It rounds based on a comparison to the floating point representation of the corner case.
The algorithm in C++:
#include <cmath>  // std::floor

// Compute 10 to some positive integral power.
// Dealing with overflow (exponent > 308) is an exercise left to the reader.
double pow10 (unsigned int exponent) {
    double result = 1.0;
    double base = 10.0;
    while (exponent > 0) {
        if ((exponent & 1) != 0) result *= base;
        exponent >>= 1;
        base *= base;
    }
    return result;
}

// Round the same way Excel does.
// Dealing with nonsense such as nplaces=400 is an exercise left to the reader.
double excel_round (double x, int nplaces) {
    bool is_neg = false;

    // Excel uses symmetric arithmetic round: Round away from zero.
    // The algorithm will be easier if we only deal with positive numbers.
    if (x < 0.0) {
        is_neg = true;
        x = -x;
    }

    // Construct the nearest rounded values and the nasty corner case.
    // Note: We really do not want an optimizing compiler to put the corner
    // case in an extended double precision register. Hence the volatile.
    double round_down, round_up;
    volatile double corner_case;
    if (nplaces < 0) {
        double scale = pow10 (-nplaces);
        round_down  = std::floor (x / scale);
        corner_case = (round_down + 0.5) * scale;
        round_up    = (round_down + 1.0) * scale;
        round_down *= scale;
    }
    else {
        double scale = pow10 (nplaces);
        round_down  = std::floor (x * scale);
        corner_case = (round_down + 0.5) / scale;
        round_up    = (round_down + 1.0) / scale;
        round_down /= scale;
    }

    // Round by comparing to the corner case.
    x = (x < corner_case) ? round_down : round_up;

    // Correct the sign if needed.
    if (is_neg) x = -x;

    return x;
}
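A quick sanity check of the function (my own usage example, not part of the original answer; compile it together with pow10() and excel_round() above):
#include <cstdio>

int main() {
    std::printf("%.2f\n", excel_round(37.785, 2));   // 37.79, matching Excel
    std::printf("%.2f\n", excel_round(-37.785, 2));  // -37.79: away from zero
    std::printf("%.0f\n", excel_round(12345.0, -2)); // 12300
}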
For very accurate arbitrary-precision arithmetic and rounding of floating-point numbers to a fixed number of decimal places, you should take a look at a math library like GNU MPFR. While it's a C library, the web page I posted also links to a couple of different C++ bindings if you want to avoid using C.
You may also want to read a paper entitled "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by David Goldberg at the Xerox Palo Alto Research Center. It's an excellent article demonstrating the underlying process that allows floating-point numbers to be approximated in a computer that represents everything in binary, and how rounding errors and other problems can creep into FPU-based floating-point math.
I don't know how Excel does it, but printing floating point numbers nicely is a hard problem: http://www.serpentine.com/blog/2011/06/29/here-be-dragons-advances-in-problems-you-didnt-even-know-you-had/
So your actual question seems to be: how do you get correctly rounded floating-point-to-string conversions? Googling those terms will turn up a bunch of articles, but if you're interested in something to use, most platforms provide reasonably competent implementations of sprintf()/snprintf(). So just use those, and if you find bugs, file a report with the vendor.
A function that takes a floating-point number as an argument and returns another floating-point number, rounded exactly to a given number of decimal digits, cannot be written, because many numbers with a finite decimal representation have an infinite binary representation; one of the simplest examples is 0.1.
To achieve what you want you must accept to use a different type as a result of your rounding function. If your immediate need is printing the number you can use a string and a formatting function: the problem becomes how to obtain exactly the formatting you expect. Otherwise if you need to store this number in order to perform exact calculations on it, for instance if you are doing accounting, you need a library that's capable of representing decimal numbers exactly. In this case the most common approach is to use a scaled representation: an integer for the value together with the number of decimal digits. Dividing the value by ten raised to the scale gives you the original number.
If any of these approaches is suitable, I'll try and expand my answer with practical suggestions.
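As an illustration of the scaled representation from the previous paragraph (the struct and names are my own sketch):
#include <cstdio>

// A scaled decimal: an exact integer count of hundredths (scale = 2),
// so two-place accounting arithmetic stays exact.
struct ScaledDecimal {
    long long value; // 3779 represents 37.79
    int scale;       // number of decimal digits, fixed at 2 here
};

int main() {
    ScaledDecimal a{3779, 2}, b{221, 2};      // 37.79 and 2.21
    ScaledDecimal sum{a.value + b.value, 2};  // exact: 4000 hundredths
    std::printf("%lld.%02lld\n", sum.value / 100, sum.value % 100); // 40.00
}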
Excel rounds numbers like this "correctly" by doing WORK. They started in 1985 with a fairly "normal" set of floating-point routines, added some scaled-integer fake floating point, and have been tuning those things and adding special cases ever since. The app did have most of the same "obvious" bugs that everybody else did; it's just that it mostly had them a long time ago. I filed a couple myself, back when I was doing tech support for them in the early 90s.
I believe the following C# code rounds numbers as they are rounded in Excel. To exactly replicate the behavior in C++ you might need to use a special decimal type.
In plain English, the double-precision number is converted to a decimal and then rounded to fifteen significant digits (not to be confused with fifteen decimal places). The result is rounded a second time to the specified number of decimal places.
That might seem weird, but what you have to understand is that Excel always displays numbers that are rounded to 15 significant figures. If the ROUND() function weren't using that display value as a starting point, and used the internal double representation instead, then there would be cases where ROUND(A1,N) did not seem to correspond to the actual value in A1. That would be very confusing to a non-technical user.
The double which is closest to 37.785 has an exact decimal value of 37.784999999999996589394868351519107818603515625. (Any double can be represented precisely by a finite base ten decimal because one quarter, one eighth, one sixteenth, and so forth all have finite decimal expansions.) If that number were rounded directly to two decimal places, there would be no tie to break and the result would be 37.78. If you round to 15 significant figures first you get 37.7850000000000. If this is further rounded to two decimal places, then you get 37.79, so there is no real mystery after all.
// Convert to a floating decimal point number, round to fifteen
// significant digits, and then round to the number of places
// indicated.
static decimal SmartRoundDouble(double input, int places)
{
    int numLeadingDigits = (int)Math.Log10(Math.Abs(input)) + 1;
    decimal inputDec = GetAccurateDecimal(input);
    inputDec = MoveDecimalPointRight(inputDec, -numLeadingDigits);

    decimal round1 = Math.Round(inputDec, 15);
    round1 = MoveDecimalPointRight(round1, numLeadingDigits);

    decimal round2 = Math.Round(round1, places, MidpointRounding.AwayFromZero);
    return round2;
}

static decimal MoveDecimalPointRight(decimal d, int n)
{
    if (n > 0)
        for (int i = 0; i < n; i++)
            d *= 10.0m;
    else
        for (int i = 0; i > n; i--)
            d /= 10.0m;
    return d;
}

// The constructor for decimal that accepts a double does
// some rounding by default. This gets a more exact number.
static decimal GetAccurateDecimal(double r)
{
    string accurateStr = r.ToString("G17", CultureInfo.InvariantCulture);
    return Decimal.Parse(accurateStr, CultureInfo.InvariantCulture);
}
What you NEED is this:
double f = 22.0 / 7.0;
cout.setf(ios::fixed, ios::floatfield);
cout.precision(6);
cout << f << endl;
How it can be implemented (just an overview for rounding the last digit):
#include <cmath> // std::pow

// Returns the fractional part of d as an integer with the given number of
// decimal digits, rounding the last digit half-up. Assumes d >= 0.
long getRoundedPrec(double d, int precision = 9)
{
    // Shift the fractional part one digit past the target precision,
    // then use that extra digit to decide whether to round up.
    double frac = d - static_cast<long>(d);
    long l = static_cast<long>(frac * std::pow(10.0, precision + 1));
    int lastDigit = static_cast<int>(l % 10);
    l /= 10;
    if (lastDigit >= 5) {
        ++l;
    }
    return l;
}
Just as base-10 numbers must be rounded as they are converted to base-2, it is possible to round a number as it is converted from base-2 to base-10. Once the number has a base-10 representation it can be rounded again in a straightforward manner by looking at the digit to the right of the one you wish to round.
While there's nothing wrong with the above assertion, there's a much more pragmatic solution. The problem is that the binary representation tries to get as close as possible to the decimal number, even if that binary value is less than the decimal one. The amount of error is within [-0.5, 0.5] least significant bits (LSB) of the true value. For rounding purposes you'd rather it be within [0, 1] LSB, so that the error is always positive, but that's not possible without changing all the rules of floating-point math.
The one thing you can do is add 1 LSB to the value, so the error is within [0.5,1.5] LSB of the true value. This is less accurate overall, but only by a very tiny amount; when the value is rounded for representation as a decimal number it is much more likely to be rounded to a proper decimal number because the error is always positive.
To add 1 LSB to the value before rounding it, see the answers to this question. For example in Visual Studio C++ 2010 the procedure would be:
Round(_nextafter(37.785,37.785*1.1),0.01);
There are many ways to improve the result of a floating-point computation using statistical, numerical, and other algorithms.
The easiest is probably to search for runs of repeated 9s or 0s near the limit of the precision. If there are any, those 9s may be spurious, so round them away. This may not work in many cases, though. Here are examples for a float with 6 digits of precision:
2.67899999 → 2.679
12.3499999 → 12.35
1.20000001 → 1.2
Excel always limits the input to 15 significant digits and rounds the output to a maximum of 15 significant digits, so this might be one of the ways Excel does it.
Alternatively, you can carry the precision along with the number and, after each step, adjust the accuracy depending on the precision of the operands. For example:
1.113 → 3 decimal digits
6.15634 → 5 decimal digits
Since both numbers are inside the double's 15-17 significant-digit range, their sum will be accurate to the larger of the two scales, which is 5 decimal digits. Similarly, 3 + 5 < 16, so their product will be precise to 8 decimal digits:
1.113 + 6.15634 = 7.26934 → 5 decimal digits
1.113 * 6.15634 = 6.85200642 → 8 decimal digits
But 4.1341677841 * 2.251457145 will only be accurate to double precision, because the exact result exceeds what a double can represent.
Another efficient algorithm is Grisu, but I haven't had an opportunity to try it:
In 2010, Florian Loitsch published a wonderful paper in PLDI, "Printing floating-point numbers quickly and accurately with integers", which represents the biggest step in this field in 20 years: he mostly figured out how to use machine integers to perform accurate rendering! Why do I say "mostly"? Because although Loitsch's "Grisu3" algorithm is very fast, it gives up on about 0.5% of numbers, in which case you have to fall back to Dragon4 or a derivative
Here be dragons: advances in problems you didn’t even know you had
In fact, I think Excel must combine many different methods to achieve the best overall result.
Example When a Value Reaches Zero
In Excel 95 or earlier, enter the following into a new workbook:
A1: =1.333+1.225-1.333-1.225
Right-click cell A1, and then click Format Cells. On the Number tab, click Scientific under Category. Set the Decimal places to 15.
Rather than displaying 0, Excel 95 displays -2.22044604925031E-16.
Excel 97, however, introduced an optimization that attempts to correct for this problem. Should an addition or subtraction operation result in a value at or very close to zero, Excel 97 and later will compensate for any error introduced as a result of converting an operand to and from binary. The example above when performed in Excel 97 and later correctly displays 0 or 0.000000000000000E+00 in scientific notation.
Floating-point arithmetic may give inaccurate results in Excel
As mjfgates says, Excel does hard work to get this "right". The first thing to do when you try to reimplement this is to define what you mean by "right". Obvious solutions:
- Implement rational arithmetic: slow but reliable.
- Implement a bunch of heuristics: fast but tricky to get right (think "years of bug reports").
It really depends on your application.
Most decimal fractions can't be accurately represented in binary.
double x = 0.0;
for (int i = 1; i <= 10; i++)
{
    x += 0.1;
}
// x should now be 1.0, right?
//
// It isn't. Test it and see.
One solution is to use BCD. It's old. But, it's also tried and true. We have a lot of other old ideas that we use every day (like using a 0 to represent nothing...).
Another technique uses scaling on input/output, as in the sketch below. This has the advantage that nearly all of the math is integer math.
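For instance, scaling the 0.1 loop above by ten turns it into exact integer math (a minimal sketch):
#include <cstdio>

int main() {
    // Work in tenths as integers; convert to double only for display.
    long tenths = 0;
    for (int i = 1; i <= 10; i++) {
        tenths += 1; // each step adds 0.1, scaled by 10
    }
    std::printf("%.1f\n", tenths / 10.0); // exactly 1.0
}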

Detecting precision loss when converting from double to float

I am writing a piece of code in which I have to convert from double to float values. I am using boost::numeric_cast to do this conversion, which will alert me to any overflow/underflow. However, I am also interested in knowing whether the conversion resulted in precision loss.
For example
double source = 1988.1012;
float dest = numeric_cast<float>(source);
This produces dest with the value 1988.1.
Is there any way I can detect this kind of precision loss/rounding?
You could cast the float back to a double and compare that double to the original; that should give you a fair indication as to whether there was a loss of precision.
float dest = numeric_cast<float>(source);
double residual = source - numeric_cast<double>(dest);
Hence, residual contains the "loss" you're looking for.
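Put together as a complete program (using static_cast in place of numeric_cast so the sketch compiles without Boost):
#include <iostream>

int main() {
    double source = 1988.1012;
    float dest = static_cast<float>(source);  // numeric_cast<float> in the original
    double residual = source - static_cast<double>(dest);
    if (residual != 0.0)
        std::cout << "precision lost: " << residual << '\n';
}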
Look at these articles on single-precision and double-precision floats. First of all, a float has 8 bits for the exponent vs. 11 for a double, so anything bigger in magnitude than about 2^128 (3.4e38) or smaller than about 2^-126 (1.2e-38) is going to overflow or underflow, as you mentioned. A float has 23 bits for the actual digits of the number, vs. 52 bits for the double, so you obviously have a lot more digits of precision with the double than with the float.
Say you have a number like 1.1123. This number may not actually be encoded as exactly 1.1123, because the digits in a floating-point number add up as binary fractions. For example, if the bits in the mantissa were 11001, then the value would be formed by 1 (implicit) + 1 * 1/2 + 1 * 1/4 + 0 * 1/8 + 0 * 1/16 + 1 * 1/32 + 0 * (1/64 + 1/128 + ...). The exact value cannot be encoded unless these fractions add up to exactly that number, which is rare. Therefore, there will almost always be some precision loss.
You're going to have a certain level of precision loss, as per Dave's answer. If, however, you want to quantify it and raise an exception when it exceeds a certain amount, you will have to open up the floating-point number itself and parse out the mantissa and exponent, then do some analysis to determine whether you've exceeded your tolerance.
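One way to pull the number apart without bit manipulation is std::frexp, which splits a value into mantissa * 2^exponent (a minimal sketch of the idea, not a full tolerance check):
#include <cmath>
#include <cstdio>

int main() {
    double source = 1988.1012;
    float dest = static_cast<float>(source);

    int de = 0, fe = 0;
    double dm = std::frexp(source, &de);                     // source = dm * 2^de
    double fm = std::frexp(static_cast<double>(dest), &fe);  // dest   = fm * 2^fe
    std::printf("double: %.17g * 2^%d\n", dm, de);
    std::printf("float : %.17g * 2^%d\n", fm, fe);
    // Comparing dm and fm (when de == fe) quantifies the rounding error.
}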
But the good news is that it's usually the standard IEEE floating-point format. :-)