Unexpected loss of precision when dividing doubles - c++

I have a function getSlope which takes 4 doubles as parameters and returns another double calculated from them in the following way:
double QSweep::getSlope(double a, double b, double c, double d){
double slope;
slope=(d-b)/(c-a);
return slope;
}
The problem is that when calling this function with arguments for example:
getSlope(2.71156, -1.64161, 2.70413, -1.72219);
the returned result is:
10.8557
and this is not a good result for my computations.
I have calculated the slope using Mathematica and the result for the slope for the same parameters is:
10.8452
or with more digits for precision:
10.845222072678331.
The result returned by my program is not good in my further computations.
Moreover, I do not understand how the program returns 10.8557 when the correct value is 10.845222072678331 (supposing that this is the approximate result of the division).
How can I get the good result for my division?
thank you in advance,
madalina
I print the result using the command line:
std::cout<<slope<<endl;
It may be that my parameters are not good, as I read them from another program (which computes a graph). After reading these parameters from this graph I just displayed them to see their values, but maybe the displayed vectors do not have the same internal precision as the calculated values. I do not know, it is really strange. Some numerical errors appear.
When the graph from which I am reading my parameters is computed, some numerical libraries written in C++ (with templates) are used. No OpenGL is used for this computation.
thank you,
madalina

I've tried with float instead of double and I get 10.845110 as a result. It still looks better than madalina's result.
EDIT:
I think I know why you get these results. If you get the a, b, c and d parameters from somewhere else and print them, you see rounded values. If you then put those rounded values into Mathematica (or calc ;) ) you will get a different result.
I tried changing a little bit one of your parameters. When I did:
double c = 2.7041304;
I get 10.845806. I only added 0.0000004 to c!
So I think your "errors" aren't errors. Print a, b, c and d with better precision and then put them to Mathematica.

The following code:
#include <iostream>
using namespace std;
double getSlope(double a, double b, double c, double d){
double slope;
slope=(d-b)/(c-a);
return slope;
}
int main( ) {
double s = getSlope(2.71156, -1.64161, 2.70413, -1.72219);
cout << s << endl;
}
gives a result of 10.8452 with g++. How are you printing out the result in your code?

Could it be that you use DirectX or OpenGL in your project? If so they can turn off double precision and you will get strange results.
You can check your precision settings with
std::sqrt(x) * std::sqrt(x)
The result has to be pretty close to x.
I met this problem a long time ago and spent a month checking all the formulas. But then I found
D3DCREATE_FPU_PRESERVE

The problem here is that (c-a) is small, so the rounding errors inherent in floating point operations are magnified in this example. A general solution is to rework your equation so that you're not dividing by a small number; I'm not sure how you would do it here, though.
EDIT:
Neil is right in his comment to this question. I computed the answer in VB using Doubles and got the same answer as Mathematica.

The results you are getting are consistent with 32bit arithmetic. Without knowing more about your environment, it's not possible to advise what to do.
Assuming the code shown is what's running, ie you're not converting anything to strings or floats, then there isn't a fix within C++. It's outside of the code you've shown, and depends on the environment.
As Patrick McDonald and Treb both brought up the accuracy of your inputs and the error on a-c, I thought I'd take a look at that. One technique to look at rounding errors is interval arithmetic, which makes explicit the upper and lower bounds that a value represents (they are implicit in floating point numbers, and are fixed to the precision of the representation). By treating each value as an upper and lower bound, and by extending the bounds by the error in the representation (approx x * 2 ^ -53 for a double value x), you get a result which gives the lower and upper bounds on the accuracy of a value, taking into account worst case precision errors.
For example, if you have a value in the range [1.0, 2.0] and subtract from it a value in the range [0.0, 1.0], then the result must lie in the range [below(0.0),above(2.0)] as the minimum result is 1.0-1.0 and the maximum is 2.0-0.0. below and above are equivalent to floor and ceiling, but for the next representable value rather than for integers.
Using intervals which represent worst-case double rounding:
getSlope(
a = [2.7115599999999995262:2.7115600000000004144],
b = [-1.6416099999999997916:-1.6416100000000002357],
c = [2.7041299999999997006:2.7041300000000005888],
d = [-1.7221899999999998876:-1.7221900000000003317])
(d-b) = [-0.080580000000000526206:-0.080579999999999665783]
(c-a) = [-0.0074300000000007129439:-0.0074299999999989383218]
to double precision [10.845222072677243474:10.845222072679954195]
So although c-a is small compared to c or a, it is still large compared to double rounding, so even with the worst imaginable double precision rounding you could trust that value to be precise to 12 figures - 10.8452220727. You've lost a few figures off double precision, but you're still working to more than your input's significance.
But if the inputs were only accurate to the number of significant figures shown, then rather than being the double value 2.71156 +/- eps, the input range would be [2.711555,2.711565], so you get the result:
getSlope(
a = [2.711555:2.711565],
b = [-1.641615:-1.641605],
c = [2.704125:2.704135],
d = [-1.722195:-1.722185])
(d-b) = [-0.08059:-0.08057]
(c-a) = [-0.00744:-0.00742]
to specified accuracy [10.82930108:10.86118598]
which is a much wider range.
But you would have to go out of your way to track the accuracy in the calculations, and the rounding errors inherent in floating point are not significant in this example - it's precise to 12 figures with the worst case double precision rounding.
On the other hand, if your inputs are only known to 6 figures, it doesn't actually matter whether you get 10.8557 or 10.8452. Both are within [10.82930108:10.86118598].

Better print out the arguments, too. If you are, as I guess, transferring parameters in decimal notation, you will lose precision on each and every one of them. The problem is that 1/5 is an infinite series in binary, so e.g. 0.2 becomes 0.001100110011.... Also, digits are chopped when converting a binary float to a textual representation in decimal.
Next to that, sometimes the compiler chooses speed over precision. This should be a documented compiler switch.

Patrick seems to be right about (c-a) being the main cause:
d-b = -1,72219 - (-1,64161) = -0,08058
c-a = 2,70413 - 2,71156 = -0,00743
S = (d-b)/(c-a)= -0,08058 / -0,00743 = 10,845222
You start out with six digits of precision; through the subtraction you get a reduction to three and four digits. My best guess is that you lose additional precision because the number -0,00743 cannot be represented exactly in a double. Try using intermediate variables with a bigger precision, like this:
double QSweep::getSlope(double a, double b, double c, double d)
{
double slope;
long double temp1, temp2;
temp1 = (d-b);
temp2 = (c-a);
slope = temp1/temp2;
return slope;
}

While the academic discussion going on is great for learning about the limitations of programming languages, you may find the simplest solution to the problem is a data structure for arbitrary precision arithmetic.
This will have some overhead, but you should be able to find something with fairly guaranteeable accuracy.

Related

Very large differences using float and double

#include <iostream>
using namespace std;
int main() {
int steps=1000000000;
float s = 0;
for (int i=1;i<(steps+1);i++){
s += (i/2.0) ;
}
cout << s << endl;
}
Declaring s as float: 9.0072e+15
Declaring s as double: 2.5e+17 (same result as implementing it in Julia)
I understand double has double the precision of float, but float should still handle numbers up to 10^38.
I did read similar topics where results were not the same, but in those cases the differences were very small; here the difference is 25x.
I also add that using long double instead gives me the same result as double. If the matter is the precision, I would have expected to have something a bit different.
The problem is the lack of precision: https://en.wikipedia.org/wiki/Floating_point
After 100 million numbers you are adding 1e8 to 1e16 (or at least numbers of that magnitude), but single precision numbers are only accurate to 7 digits - so it is the same as adding 0 to 1e16; that's why your result is considerably lower for float.
Prefer double over float in most cases.
This is a problem with floating point precision! The infinitely many real numbers cannot possibly be represented in the finite memory of a computer. Floats, in general, are just approximations of the numbers they are meant to represent.
For more details, please check the following documentation:
https://softwareengineering.stackexchange.com/questions/101163/what-causes-floating-point-rounding-errors
You didn't mention what type of floating point numbers you are using, but I'm going to assume that you use IEEE 754, or similar.
I understand double has double precision
To be more precise with the terminology, double uses twice as many bits. That's not double the number of representable values, it's 4294967296 times as many representable values, despite being named "double precision".
but float should still handle numbers up to 10^38.
Float can handle a few numbers up to that magnitude. But that doesn't mean that float values in that range are precise. For example, 3.4028235E+38 can be represented as a single precision float. How large would you imagine the difference between it and the previous value representable by float is? Is it the machine epsilon? Perhaps 0.1? Maybe 1? No. The difference is about 2E+31.
Now, your numbers aren't quite in that range. But they're outside the continuous range of integers that can be precisely represented by float. The highest value in that range happens to be 16777216 (2^24), or about 1.7E+7, which is way less than 2.5E+17. So every addition beyond that range can add some error to the result. You perform a billion additions, so those errors add up.
Conclusions:
Understand that single precision is way less precise than double precision.
Avoid long sequences of calculations where precision errors can accumulate.

Rounding in C++ and round-tripping numbers

I have a class that internally represents some quantity in fixed point as 32-bit integer with somewhat arbitrary denominator (it is neither power of 2 nor power of 10).
For communicating with other applications the quantity is converted to plain old double on output and back on input. As code inside the class it looks like:
int32_t quantity;
double GetValue() { return double(quantity) / DENOMINATOR; }
void SetValue(double x) { quantity = x * DENOMINATOR; }
Now I need to ensure that if I output some value as double and read it back, I will always get the same value back. I.e. that
x.SetValue(x.GetValue());
will never change x.quantity (x is arbitrary instance of the class containing the above code).
The double representation has more digits of precision, so it should be possible. But it will almost certainly not be the case with the simplistic code above.
What rounding do I need to use and
How can I find the critical would-be corner cases to test that the rounding is indeed correct?
Any 32-bit integer will be represented exactly when you convert it to a double, but when you divide and then multiply by an arbitrary value you will get a similar value, not necessarily exactly the same one. You should lose at most one bit per operation, which means your double will be almost the same prior to casting back to an int.
However, since int casts are truncations, you will get the wrong result when very minor errors turn 2.000 into 1.999, so what you need to do is round prior to casting back.
You can use std::lround() for this if you have C++11, else you can write your own rounding function.
You probably don't care about fairness much here, so the common int(doubleVal+0.5) will work for positives. If as seems likely, you have negatives, try this:
int round(double d) { return d<0?d-0.5:d+0.5; }
The problem you describe is the same problem which exists with converting between binary and decimal representation just with different bases. At least it exists if you want to have the double representation to be a good approximation of the original value (otherwise you could just multiply the 32 bit value you have with your fixed denominator and store the result in a double).
Assuming you want the double representation be a good approximation of your actual value the conversions are nontrivial! The conversion from your internal representation to double can be done using Dragon4 ("How to print floating point numbers accurately", Steele & White) or Grisu ("How to print floating point numbers quickly and accurately", Loitsch; I'm not sure if this algorithm is independent from the base, though). The reverse can be done using Bellerophon ("How to read floating point numbers accurately", Clinger). These algorithms aren't entirely trivial, though...

C++ determining if a number is an integer

I have a program in C++ where I divide two numbers, and I need to know if the answer is an integer or not. What I am using is:
if(fmod(answer,1) == 0)
I also tried this:
if(floor(answer)==answer)
The problem is that answer usually is a 5 digit number, but with many decimals. For example, answer can be: 58696.000000000000000025658 and the program considers that an integer.
Is there any way I can make this work?
I am dividing double a/double b= double answer
(sometimes there are more than 30 decimals)
Thanks!
EDIT:
a and b are numbers in the thousands (about 100,000) which are then raised to powers of 2 and 3, added together and divided (according to a complicated formula). So I am plugging in various a and b values and looking at the answer. I will only keep the a and b values that make the answer an integer. An example of what I got for one of the answers was: 218624 which my program above considered to be an integer, but it really was: 218624.00000000000000000056982 So I need a code that can distinguish integers with more than 20-30 decimals.
You can use std::modf from <cmath>:
double integral;
if(std::modf(answer, &integral) == 0.0)
The integral part of answer is stored in integral and the return value of std::modf is the fractional part of answer with the same sign as answer.
The usual solution is to check if the number is within a very short distance of an integer, like this:
bool isInteger(double a){
double b=round(a),epsilon=1e-9; //some small range of error
return (a<=b+epsilon && a>=b-epsilon);
}
This is needed because floating point numbers have limited precision, and numbers that indeed are integers may not be represented perfectly. For example, the following would fail if we do a direct comparison:
double d=sqrt(2); //square root of 2
double answer=2.0/(d*d); //2 divided by 2
Here, answer actually holds the value 0.99999..., so we cannot compare that to an integer, and we cannot check if the fractional part is close to 0.
In general, since the floating point representation of a number can be either a bit smaller or a bit bigger than the actual number, it is not good to check if the fractional part is close to 0. It may be a number like 0.99999999 or 0.000001 (or even their negatives), these are all possible results of a precision loss. That's also why I'm checking both sides (+epsilon and -epsilon). You should adjust that epsilon variable to fit your needs.
Also, keep in mind that the precision of a double is close to 15 digits. You may also use a long double, which may give you some extra digits of precision (or not, it is up to the compiler), but even that only gets you around 18 digits. If you need more precision than that, you will need to use an external library, like GMP.
Floating point numbers are stored in memory using a very different bit format than integers. Because of this, comparing them for equality is not likely to work effectively. Instead, you need to test if the difference is smaller than some epsilon:
const double EPSILON = 0.00000000000000000001; // adjust for whatever precision is useful for you
double remainder = std::fmod(numer, denom);
if(std::fabs(0.0 - remainder) < EPSILON)
{
//...
}
Alternatively, if you want to include values that are close to integers (based on your desired precision), you can modify the if condition slightly (since the remainder returned by std::fmod will be in the range [0, 1)):
if (std::fabs(std::round(d) - d) < EPSILON)
{
// ...
}
Floating point numbers are generally precise to about 15 digits (as a double), but as they are stored as a mantissa (fraction) and an exponent, many decimal fractions cannot be stored exactly, and the result of a calculation that should be a whole number may be off by a tiny amount. For example,
double d = 2.0; // exact here, but a computed "2.0" might actually be 1.9999999999999998
Because of this, you need to compare the difference of what you expect to some very small number that encompasses the precision you desire (we will call this value, epsilon):
double d = 2.0;
bool test = std::fabs(2 - d) < epsilon; // will return true
So when you are trying to compare the remainder from std::fmod, you need to check it against the difference from 0.0 (not for actual equality to 0.0), which is what is done above.
Also, the std::fabs call prevents you from having to do 2 checks by asserting that the value will always be positive.
If you desire a precision that is greater than 15-18 decimal places, you cannot use double or long double; you will need to use a high precision floating point library.

How to check if float is a whole number?

The printf function's %g is able to show the whole number 3 if the float is 3.00, and will show 3.01 if the float's value isn't a round number float.
How would you do this yourself through some code, without formatting the number as a string?
There isn't really a simple answer
Integral values do have exact representations in the float and double formats. So, if it's really already integral, you can use:
f == floor(f)
However, if your value is the result of a calculation which at one point involved any sort of non-zero fractional part, then you will need to be concerned that you may have something very close to an integer but which isn't really, exactly, to-the-last-bit the same. You probably want to consider that to be integral.
One way this might be done:
fabs(f - round(f)) < 0.000001
And while we are on the subject, for the purists, we should note that int i = f; truncates toward zero and rint(f) rounds according to the current FPU rounding mode, whereas round(f) (see round(3)) rounds half-way cases away from zero.

C++ Precision digits for float vs double

So I understand the difference for floats vs doubles in C++.
A double is simply a double precision floating point number, whereas a float is a single precision one (a double has twice the size).
My question is, how come floats are represented as '300000011920928955078125e-24' (for the value 3.0) but a double will simply be shown as 3.0?
Why wouldn't a double display all the trailing digits? It has a higher precision but also still suffers from the same finite precision as floating point, so I'm not sure why it doesn't also show up like that.
That's because many values cannot be represented precisely in floating point (either single or double precision), but double can represent something much much closer. If you print with more decimal places shown, you'll almost certainly see the error.
The value 0.3 cannot be represented precisely. Additionally, the result of a calculation might pick up error from the operands. "0.3 * 10" will magnify the error in the result, so it will not be precisely 3.0.
The best analogue is trying to display "1/3" in decimal. You might write it 0.333333, or 0.333333333333. If you multiply those by 3, you'll get 0.999999 or 0.999999999999. Displaying those on a calculator screen (which has a fixed number of digits), the first will be shown with the error, while the second will get rounded up.
Edit: Code to demonstrate this:
#include <stdio.h>
int main()
{
float f = 0.3;
double d = 0.3;
printf("%.50lf %.50lf\n", f, d);
printf("%.10lf %.10lf\n", f, d);
}
Displays:
0.30000001192092895507812500000000000000000000000000 0.29999999999999998889776975374843459576368300000000
0.3000000119 0.3000000000