I'm calculating with the double type and need 10 decimal places of precision, and of course I expect the result to have the same 10 decimal places. However, it doesn't work: I only get 6 decimal places in VS2008. Is there a reason for this, or any idea how to fix it?
double dMag = 10;
double dPixelSize = 4.4;
double PixelWidth = 1024/2;
double PixelHeight = 768/2;
double umtomm = 1000;
double origin_PosX = 813.227696;
double origin_PosY = 676.195748;
double PosX = (origin_PosX - PixelWidth) * dPixelSize / dMag / umtomm;
double PosY = (origin_PosY - PixelHeight ) * dPixelSize / dMag / umtomm;
When I check PosX and PosY, the results are "0.132540, 0.128566" respectively. I expect the results to be "0.13254018624000002, 0.12856612912" respectively.
thanks
From what I gather you are confusing a few things...
The precision of double is more than just 6 decimal places. I strongly recommend reading this wikipedia article which explains how the underlying double datatype stores your number and explains its precision.
The precision that is printed is limited to 6 decimal places by default. In order to fix that you need to learn how to use the format string.
In your comments you mentioned using
tmp1.Format("%f,%f\r\n", PosX, PosY); file.Write(tmp1,lstrlen(tmp1));
to print the strings.
Try changing this line to
tmp1.Format("%.10f,%.10f\r\n", PosX, PosY); file.Write(tmp1,lstrlen(tmp1));
and you will see that it will print 10 digits after the decimal point.
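For code that writes through C++ streams instead of CString::Format, the same idea can be sketched with std::fixed and std::setprecision. This is only a minimal sketch: the helper names compute_pos and format_fixed are made up for illustration and are not part of the original code.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: same arithmetic as PosX/PosY in the question.
double compute_pos(double origin, double half_extent) {
    const double dMag = 10;
    const double dPixelSize = 4.4;
    const double umtomm = 1000;
    return (origin - half_extent) * dPixelSize / dMag / umtomm;
}

// Hypothetical helper: format a double with a fixed number of decimals.
std::string format_fixed(double value, int decimals) {
    std::ostringstream out;
    out << std::fixed << std::setprecision(decimals) << value;
    return out.str();
}
```

Calling format_fixed(compute_pos(813.227696, 1024 / 2), 10) should give 0.1325401862, while asking for 6 decimals reproduces the 0.132540 seen in the question. The stored double is the same in both cases; only the printed text changes.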
I am trying to round a decimal to 2 digits. To do so, I am multiplying my decimal by 100 and using the floor() function (because I need to round down so that fractions of a penny - I am working with finances - are not counted), then dividing that number by 100.
In the case of the issue, I am doing 86/100 and it is returning 85.99999999999 and I am not sure why.
#include <cmath>

int main()
{
    double balance = 133.45;
    double apr = 7.8;
    double newBalance = balance * pow((1+((apr/100)/1)), (1));
    double interest = newBalance - balance;
    double month = interest / 12;
    // round down to whole cents, then convert back to currency units
    double total = floor(month * 100) / 100;
    return 0;
}
I have looked through other sources online, and I haven't been able to find out why.
Is it an edge case with the floor() function? This is probably a dumb question, but I couldn't find an answer.
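The likely cause is not floor() itself: 0.86 has no exact binary representation, so 86 / 100 is stored as the nearest double, which lies slightly below 0.86. One possible way around this, sketched here under the assumption that amounts are non-negative and fit in a 64-bit integer, is to keep the rounded-down result as a whole number of cents instead of dividing by 100 again. The helper name floor_to_cents is hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: round an amount in currency units down to whole cents.
// Assumes the amount is non-negative and small enough to fit in 64 bits.
std::int64_t floor_to_cents(double amount) {
    return static_cast<std::int64_t>(std::floor(amount * 100.0));
}
```

For display, the whole units are cents / 100 and the remainder is cents % 100, both exact integer operations. Note that the input is still a double, so a value meant to be exactly on a cent boundary can land just below it; doing the whole calculation in integer cents avoids that as well.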
I guess I'm hitting a precision issue with my c++ program. And I don't understand why I'm getting different results in my values.
res equals to 1321.0000001192093 if I write:
float sy = -0.207010582f;
double res = -1512.*((double)sy - (2. / 3.));
but res2 equals to 1320.9999999839999 if I write:
double res2 = -1512.*(-0.207010582 - (2. / 3.));
Why is even syd different from syd2 when I write this:
double syd = -0.207010582f;
double syd2 = -0.207010582000000000;
Can somebody give me a hand, to cast my float into a double properly and to understand what's going on ?
-0.207010582f is a decimal floating-point literal. But your computer doesn't use decimal floating point, it uses binary floating point. So the value of that literal will be rounded to float precision.
Similarly, -0.207010582 is rounded to double precision. While that's closer, it still is not equal to -0.207010582 decimal.
Since double has more precision than float, you will not lose precision by casting from float to double. Any rounding will have happened earlier.
Single-Precision
As others have said, float sy = -0.207010582f; initializes a single-precision (32-bit) floating point variable from a single-precision floating point literal.
This will be treated (in storage and calculations) as the nearest representable number in that format. This number is -0.20701058208942413330078125
Your code is then effectively float sy = -0.20701058208942413330078125;
You can confirm that this is the nearest representable value by looking at the adjacent single-precision floating point numbers.
-0.20701059699058532714843750 // std::nextafter( sy, std::numeric_limits<float>::lowest() )
-0.20701058208942413330078125 // sy
-0.20701056718826293945312500 // std::nextafter( sy, std::numeric_limits<float>::max() )
Double-Precision
Exactly the same occurs with double-precision floating point numbers; it's just that their increased resolution means the differences are smaller.
E.g. double dy = -0.207010582; actually represents the value -0.20701058199999999853702092877938412129878997802734375
Similarly, the adjacent values that can be represented are:
-0.2070105820000000262925965444082976318895816802978515625 // std::nextafter( dy, std::numeric_limits<double>::lowest() )
-0.2070105819999999985370209287793841212987899780273437500 // dy
-0.2070105819999999707814453131504706107079982757568359375 // std::nextafter( dy, std::numeric_limits<double>::max() )
Single to Double Conversion
All single precision floating point values are exactly representable in double-precision. Hence, nothing is lost in conversions from single to double precision.
All the above assumes IEEE754 floating-point representation.
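The claims above can be checked with a small sketch, assuming IEEE 754 float and double. The helper names widen and is_nearest_float are made up for illustration.

```cpp
#include <cmath>
#include <limits>

// Widening float -> double never changes the value.
double widen(float value) {
    return static_cast<double>(value);
}

// True if `candidate` is at least as close to `target` as both of its
// neighbouring float values.
bool is_nearest_float(float candidate, double target) {
    const float below = std::nextafter(candidate, std::numeric_limits<float>::lowest());
    const float above = std::nextafter(candidate, std::numeric_limits<float>::max());
    const double err = std::fabs(widen(candidate) - target);
    return err <= std::fabs(widen(below) - target) &&
           err <= std::fabs(widen(above) - target);
}
```

With float sy = -0.207010582f, widen(sy) converts back to sy without loss, yet it compares unequal to the double literal -0.207010582. The difference comes from the rounding of the literal to float, not from the cast.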
I have problem of converting a double (say N) to p/q form (rational form), for this I have the following strategy :
Multiply double N by a large number say $k = 10^{10}$
then p = y*k and q = k
Take g = gcd(p,q) and find p = p/g and q = q/g
when N = 8.2 , Answer is correct if we solve using pen and paper, but as 8.2 is represented as 8.19999999 in N (double), it causes problem in its rational form conversion.
I tried it doing other way as : (I used a large no. 10^k instead of 100)
if(abs(y*100 - round(y*100)) < 0.000001) y = round(y*100)/100
But this approach also doesn't give right representation all the time.
Is there any way I could carry out the equivalent conversion from double to p/q ?
Floating point arithmetic is very difficult. As has been mentioned in the comments, part of the difficulty is that you need to represent your numbers in binary.
For example, the number 0.125 can be represented exactly in binary:
0.125 = 2^-3 = 0b0.001
But the number 0.12 cannot.
Truncated to 11 binary places:
0.12 = 0b0.00011110101
If this is converted back to a decimal then the error becomes obvious:
0b0.00011110101 = 0.11962890625
So if you write:
double a = 0.2;
What the machine actually does is find the closest binary representation of 0.2 that it can hold within a double data type. This is an approximation since, like 0.12 above, 0.2 cannot be exactly represented in binary.
One possible approach is to define an 'epsilon' which determines how close your number can be to the nearest representable binary floating point.
Here is a good article on floating points:
https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
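The 'epsilon' idea can be sketched as follows. Note that the technique used here, continued-fraction convergents, is a substitution not named in the answer above, and the function name approximate_ratio as well as the defaults for eps and max_den are illustrative assumptions.

```cpp
#include <cmath>
#include <cstdint>
#include <utility>

// Sketch: approximate x by p/q using continued-fraction convergents.
// Assumes x is finite with a modest magnitude (say |x| < 1e9) and
// max_den >= 1, so the 64-bit integers cannot overflow.
std::pair<std::int64_t, std::int64_t> approximate_ratio(
        double x, double eps = 1e-9, std::int64_t max_den = 1000000) {
    std::int64_t p_prev = 0, p = 1;  // numerators of the two previous convergents
    std::int64_t q_prev = 1, q = 0;  // denominators of the two previous convergents
    double rest = x;
    for (int i = 0; i < 64; ++i) {
        const double term = std::floor(rest);
        // After the first term, a huge term means the denominator would explode.
        if (i > 0 && term > static_cast<double>(max_den)) break;
        const std::int64_t a = static_cast<std::int64_t>(term);
        const std::int64_t q_next = a * q + q_prev;
        if (q_next > max_den) break;
        const std::int64_t p_next = a * p + p_prev;
        p_prev = p; q_prev = q;
        p = p_next; q = q_next;
        // Stop as soon as the convergent is within eps of x.
        if (std::fabs(x - static_cast<double>(p) / static_cast<double>(q)) <= eps) break;
        const double frac = rest - term;
        if (frac == 0.0) break;
        rest = 1.0 / frac;
    }
    return {p, q};
}
```

For x = 8.2 this yields 41/5, which is the pen-and-paper answer, even though the stored double is not exactly 8.2. The trade-off is that the result is the simplest fraction within eps of the double, not the exact value the double holds.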
have problem of converting a double (say N) to p/q form
... when N = 8.2
A typical double cannot encode 8.2 exactly. Instead the closest representable double is about
8.19999999999999928945726423989981412887573...
8.20000000000000106581410364015027880668640... // next closest
When code does
double N = 8.2;
It will be the 8.19999999999999928945726423989981412887573... that is converted into rational form.
Converting a double to p/q form:
Multiply double N by a large number say $k = 10^{10}$
This may overflow the double. The first step should be to determine whether the double is large, in which case it is a whole number.
Do not multiply by some power of 10, as double almost certainly uses a binary encoding. Multiplication by 10, 100, etc. may introduce round-off error.
C implementations of double overwhelmingly use a binary encoding, so that FLT_RADIX == 2.
Then every finite double x has a significand that is a fraction of some integer over some power of 2: a binary fraction of DBL_MANT_DIG digits (credit: @Richard Critten). This is often 53 binary digits.
Determine the exponent of the double. If large enough or x == 0.0, the double is a whole number.
Otherwise, scale a numerator and denominator by DBL_MANT_DIG. While the numerator is even, halve both the numerator and denominator. As the denominator is a power-of-2, no other prime values are needed for simplification consideration.
#include <float.h>
#include <math.h>
#include <stdio.h>

void form_ratio(double x) {
    double numerator = x;
    double denominator = 1.0;
    if (isfinite(numerator) && x != 0.0) {
        int expo;
        frexp(numerator, &expo);
        if (expo < DBL_MANT_DIG) {
            expo = DBL_MANT_DIG - expo;
            numerator = ldexp(numerator, expo);
            denominator = ldexp(1.0, expo);
            while (fmod(numerator, 2.0) == 0.0 && denominator > 1.0) {
                numerator /= 2.0;
                denominator /= 2.0;
            }
        }
    }
    int pre = DBL_DECIMAL_DIG;
    printf("%.*g --> %.*g/%.*g\n", pre, x, pre, numerator, pre, denominator);
}

int main(void) {
    form_ratio(123456789012.0);
    form_ratio(42.0);
    form_ratio(1.0 / 7);
    form_ratio(867.5309);
}
Output
123456789012 --> 123456789012/1
42 --> 42/1
0.14285714285714285 --> 2573485501354569/18014398509481984
867.53089999999997 --> 3815441248019913/4398046511104
I am writing a program for class that simply calculates distance between two coordinate points (x,y).
differenceofx1 = x1 - x2;
differenceofy1 = y1 - y2;
squareofx1 = differenceofx1 * differenceofx1;
squareofy1 = differenceofy1 * differenceofy1;
distance1 = sqrt(squareofx1 - squareofy1);
When I calculate the distance, it works. However there are some situations such as the result being a square root of a non-square number, or the difference of x1 and x2 / y1 and y2 being negative due to the input order, that it just gives a distance of 0.00000 when the distance is clearly more than 0. I am using double for all the variables, should I use float instead for the negative possibility or does double do the same job? I set the precision to 8 as well but I don't understand why it wouldn't calculate properly?
I am sorry for the simplicity of the question, I am a bit more than a beginner.
You are using the distance formula wrong
it should be
distance1 = sqrt(squareofx1 + squareofy1);
instead of
distance1 = sqrt(squareofx1 - squareofy1);
With the wrong formula, if squareofx1 is less than squareofy1 the argument of sqrt is negative, and a negative number has no real square root, so the result is not a valid distance.
First, your formula is incorrect; change it to distance1 = sqrt(squareofx1 + squareofy1), as @fefe mentioned. By the way, the whole calculation can be written in one line of code:
distance1 = sqrt((x1-x2)*(x1-x2) + (y1-y2)*(y1-y2));
No need for variables like differenceofx1, differenceofy1, squareofx1, squareofy1 unless you are using the results stored in these variables again in your program.
Second, double gives you more precision than float. If you need more than 6-7 significant digits, use double; otherwise float works too. Both types handle negative values equally well.
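Wrapped in a function, the corrected formula might look like the sketch below. The name point_distance is made up for illustration.

```cpp
#include <cmath>

// Euclidean distance between (x1, y1) and (x2, y2).
// Squaring removes the sign, so the order of the two points does not matter.
double point_distance(double x1, double y1, double x2, double y2) {
    const double dx = x1 - x2;
    const double dy = y1 - y2;
    return std::sqrt(dx * dx + dy * dy);
}
```

For the points (0, 0) and (3, 4) this gives 5 whichever point is passed first.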
I've found an interesting floating point problem. I have to calculate several square roots in my code, and the expression is like this:
sqrt(1.0 - pow(pos,2))
where pos goes from -1.0 to 1.0 in a loop. The -1.0 is fine for pow, but when pos=1.0, I get an -nan. Doing some tests, using gcc 4.4.5 and icc 12.0, the output of
1.0 - pow(pos,2) = -1.33226763e-15
and
1.0 - pow(1.0,2) = 0
or
poss = 1.0
1.0 - pow(poss,2) = 0
Clearly the first one is going to cause problems, being negative. Does anyone know why pow is returning a number that makes the difference smaller than 0? The full offending code is below:
#include <cassert>
#include <cmath>
#include <iomanip>
#include <iostream>
using namespace std;

int main() {
    double n_max = 10;
    double a = -1.0;
    double b = 1.0;
    int divisions = int(5 * n_max);
    assert (!(b == a));
    double interval = b - a;
    double delta_theta = interval / divisions;
    double delta_thetaover2 = delta_theta / 2.0;
    double pos = a;
    //for (int i = 0; i < divisions - 1; i++) {
    for (int i = 0; i < divisions+1; i++) {
        cout<<sqrt(1.0 - pow(pos, 2)) <<setw(20)<<pos<<endl;
        if(isnan(sqrt(1.0 - pow(pos, 2)))){
            cout<<"Danger Will Robinson!"<<endl;
            cout<< sqrt(1.0 - pow(pos,2))<<endl;
            cout<<"pos "<<setprecision(9)<<pos<<endl;
            cout<<"pow(pos,2) "<<setprecision(9)<<pow(pos, 2)<<endl;
            cout<<"delta_theta "<<delta_theta<<endl;
            cout<<"1 - pow "<< 1.0 - pow(pos,2)<<endl;
            double poss = 1.0;
            cout<<"1- poss "<<1.0 - pow(poss,2)<<endl;
        }
        pos += delta_theta;  // rounding error accumulates here on every pass
    }
    return 0;
}
When you keep incrementing pos in a loop, rounding errors accumulate, and in your case the final value ends up slightly greater than 1.0. Instead, calculate pos by multiplication on each iteration so that it only carries a minimal amount of rounding error.
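A minimal sketch of that suggestion, with hypothetical helper names grid_point and half_circle, could look like this. The clamp in half_circle is an extra safety net that is not required by the multiplication approach itself.

```cpp
#include <algorithm>
#include <cmath>

// i-th of (divisions + 1) evenly spaced points on [a, b], computed directly
// from i so that no error is carried over from earlier iterations.
double grid_point(double a, double b, int divisions, int i) {
    return a + (b - a) * i / divisions;
}

// sqrt(1 - pos^2), clamping negative arguments to 0 before the square root.
double half_circle(double pos) {
    return std::sqrt(std::max(0.0, 1.0 - pos * pos));
}
```

With a = -1.0, b = 1.0 and divisions = 50, the last point grid_point(a, b, 50, 50) is exactly 1.0, so half_circle returns 0 there instead of NaN.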
The problem is that floating point calculations are not exact, so 1 - pos^2 may come out as a small negative number when pos is nominally 1, which makes the sqrt computation invalid.
Consider capping your result:
double x = 1. - pow(pos, 2.);
result = sqrt(x < 0 ? 0 : x);
or
result = sqrt(fabs(x) < 1e-12 ? 0 : x);
setprecision(9) is going to cause rounding. Use a debugger to see what the value really is. Short of that, at least set the precision to as many digits as the type can hold (17 significant digits are enough to identify any double uniquely).
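One way to print every digit that matters is to take the precision from std::numeric_limits. This is a minimal sketch; the helper name format_full is hypothetical.

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Format a double with max_digits10 significant digits (17 for IEEE 754
// double), which is enough to read the exact same value back.
std::string format_full(double value) {
    std::ostringstream out;
    out << std::setprecision(std::numeric_limits<double>::max_digits10) << value;
    return out.str();
}
```

For example, format_full(0.1) gives 0.10000000000000001, which makes it visible that the stored value is not exactly 0.1.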
You will almost always have rounding errors when calculating with doubles, because the double type has only about 15 to 17 significant decimal digits (a 53-bit significand, 52 bits of which are stored explicitly), and many decimal numbers cannot be converted to binary floating point without rounding. The IEEE standard goes to a lot of effort to keep those errors low, but in principle it cannot always succeed. For a thorough introduction see this document
In your case, you should calculate pos on each loop iteration and round to 14 or fewer digits. That should give you a clean 0 for the sqrt.
You can calc pos inside the loop as
pos = round(a + interval * i / divisions, 14);
with round defined as
double round(double r, int digits)
{
    // 10^digits, so that r*multiplier shifts the wanted digits left of the point
    double multiplier = pow(10.0, digits);
    return floor(r*multiplier + 0.5)/multiplier;
}