Loss of precision when casting float to double - c++

I guess I'm hitting a precision issue in my C++ program, and I don't understand why I'm getting different results.
res equals 1321.0000001192093 if I write:
float sy = -0.207010582f;
double res = -1512.*((double)sy - (2. / 3.));
but res2 equals 1320.9999999839999 if I write:
double res2 = -1512.*(-0.207010582 - (2. / 3.));
Why is syd even different from syd2 when I write this:
double syd = -0.207010582f;
double syd2 = -0.207010582000000000;
Can somebody help me cast my float to a double properly, and help me understand what's going on?

-0.207010582f is a decimal floating-point literal. But your computer doesn't use decimal floating point, it uses binary floating point. So the value of that literal will be rounded to float precision.
Similarly, -0.207010582 is rounded to double precision. While that's closer, it still is not equal to -0.207010582 decimal.
Since double has more precision than float, you will not lose precision by casting from float to double. Any rounding will have happened earlier.
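To see that the widening conversion itself is lossless, here is a minimal sketch. The function names are made up for illustration, and IEEE-754 float and double are assumed. It round-trips a float through double and recomputes both expressions from the question, so the only difference left is which literal was rounded.

```cpp
// Hypothetical helper: true when widening f to double and narrowing it back
// yields the same float, i.e. the cast itself lost nothing (f must not be NaN).
bool widening_is_lossless(float f) {
    double widened = static_cast<double>(f);
    return static_cast<float>(widened) == f;
}

// The expression from the question, starting from a float literal.
double res_from_float() {
    float sy = -0.207010582f;   // the literal is rounded to float precision here
    return -1512.0 * (static_cast<double>(sy) - (2.0 / 3.0));
}

// The same expression, starting from a double literal.
double res_from_double() {
    double sy = -0.207010582;   // the literal is rounded to double precision here
    return -1512.0 * (sy - (2.0 / 3.0));
}
```

The two results differ by roughly 1.35e-7, which is the float rounding error of the literal (about 8.9e-11) multiplied by 1512. If the value really needs double accuracy, it has to be written or computed as a double from the start.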

Single-Precision
As others have said, float sy = -0.207010582f; initializes a single-precision (32-bit) floating point variable from a single-precision floating point literal.
This will be treated (in storage and calculations) as the nearest representable number in that format. This number is -0.20701058208942413330078125
Your code is then effectively float sy = -0.20701058208942413330078125;
You can confirm that this is the nearest representable value by looking at the adjacent single-precision floating point numbers.
-0.20701059699058532714843750 // std::nextafter( sy, std::numeric_limits<float>::lowest() )
-0.20701058208942413330078125 // sy
-0.20701056718826293945312500 // std::nextafter( sy, std::numeric_limits<float>::max() )
Double-Precision
Exactly the same occurs with double-precision floating point numbers; their increased resolution just means the differences are smaller.
e.g. double dy = -0.207010582; actually represents the value -0.20701058199999999853702092877938412129878997802734375
Similarly, the adjacent values that can be represented are:
-0.2070105820000000262925965444082976318895816802978515625 // std::nextafter( dy, std::numeric_limits<double>::lowest() )
-0.2070105819999999985370209287793841212987899780273437500 // dy
-0.2070105819999999707814453131504706107079982757568359375 // std::nextafter( dy, std::numeric_limits<double>::max() )
Single to Double Conversion
All single precision floating point values are exactly representable in double-precision. Hence, nothing is lost in conversions from single to double precision.
All the above assumes IEEE754 floating-point representation.
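The neighbour listings above can be reproduced with std::nextafter. The following is a small sketch, with gap_above being a name invented here, that measures the distance from a value to the next representable value above it, again assuming IEEE-754 types.

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable value above it.
// Works for float and double; x is assumed finite and below the type's max.
template <typename T>
T gap_above(T x) {
    return std::nextafter(x, std::numeric_limits<T>::max()) - x;
}
```

For the value in the question, gap_above(-0.207010582f) is 2^-26 (about 1.49e-8), while gap_above(-0.207010582) is 2^-55 (about 2.78e-17), which matches the spacing of the neighbours listed for sy and dy.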

Related

How to increase accuracy of floating point second derivative calculation?

I've written a simple program to calculate the first and second derivative of a function, using function pointers. My program computes the correct answers (more or less), but for some functions, the accuracy is less than I would like.
This is the function I am differentiating:
float f1(float x) {
return (x * x);
}
These are the derivative functions, using the central finite difference method:
// Function for calculating the first derivative.
float first_dx(float (*fx)(float), float x) {
float h = 0.001;
float dfdx;
dfdx = (fx(x + h) - fx(x - h)) / (2 * h);
return dfdx;
}
// Function for calculating the second derivative.
float second_dx(float (*fx)(float), float x) {
float h = 0.001;
float d2fdx2;
d2fdx2 = (fx(x - h) - 2 * fx(x) + fx(x + h)) / (h * h);
return d2fdx2;
}
Main function:
int main() {
pc.baud(9600);
float x = 2.0;
pc.printf("**** Function Pointers ****\r\n");
pc.printf("Value of f(%f): %f\r\n", x, f1(x));
pc.printf("First derivative: %f\r\n", first_dx(f1, x));
pc.printf("Second derivative: %f\r\n\r\n", second_dx(f1, x));
}
This is the output from the program:
**** Function Pointers ****
Value of f(2.000000): 4.000000
First derivative: 3.999948
Second derivative: 1.430511
I'm happy with the accuracy of the first derivative, but I believe the second derivative is too far off (it should be equal to ~2.0).
I have a basic understanding of how floating point numbers are represented and why they are sometimes inaccurate, but how can I make this second derivative result more accurate? Could I be using something better than the central finite difference method, or is there a way I can get better results with the current method?
The accuracy can be increased by choosing a type which has more precision. float is currently defined as an IEEE-754 32-bit number, giving you a precision of about 7.2 significant decimal digits.
What you want is the 64-bit counterpart: double, with about 15.95 significant decimal digits.
That should be sufficient for your calculation. It is also worth mentioning Boost's implementation, which offers a quadruple-precision (128-bit) floating point number.
Finally, the GNU Multiple Precision Arithmetic Library offers types with arbitrary precision.
Go analytical. ;-) Probably not an option, given "with the current method".
Use double instead of float.
Vary the epsilon (h), and combine the results in some way. For example you could try 0.00001, 0.000001, 0.0000001 and average them. In fact, you'd want the result with the smallest h that doesn't overflow/underflow. But it's not clear how to detect overflow and underflow.
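As a rough estimate of why float struggles here: the numerator of the second-difference formula carries a rounding error on the order of eps * |f(x)|, and that error is divided by h*h. With float (eps about 1.2e-7), f(x) about 4 and h = 0.001, this gives an error near 0.5, which is in line with the 1.43 that was printed. By the same estimate, shrinking h further would make the float result worse, not better. The sketch below is one way to try the suggestions above. The names second_derivative and suggested_step are invented for this example, and the eps^(1/4) step is only a common rule of thumb for this formula, not a guarantee.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Central-difference second derivative, generic over the floating type T.
template <typename T, typename F>
T second_derivative(F f, T x, T h) {
    return (f(x - h) - T(2) * f(x) + f(x + h)) / (h * h);
}

// Rule-of-thumb step size (an assumption, not a guarantee): about eps^(1/4),
// scaled by the magnitude of x so the step stays meaningful for large x.
template <typename T>
T suggested_step(T x) {
    T eps = std::numeric_limits<T>::epsilon();
    return std::pow(eps, T(0.25)) * std::max(std::abs(x), T(1));
}
```

With double and the original h = 0.001, second_derivative of x*x at 2.0 comes out within about 1e-6 of 2. If the code must stay in float, calling it with suggested_step(2.0f), which is roughly 0.037, should give a much closer result than h = 0.001 does.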

Precision issues when converting a decimal number to its rational equivalent

I have a problem converting a double (say N) to p/q form (rational form). My strategy is as follows:
Multiply the double N by a large number, say $k = 10^{10}$
Then p = N*k and q = k
Take g = gcd(p,q) and set p = p/g and q = q/g
When N = 8.2, the answer is correct if we solve it with pen and paper, but since 8.2 is stored as 8.19999999... in N (a double), the conversion to rational form goes wrong.
I also tried it another way (using a large power of ten, 10^k, in place of 100):
if(abs(y*100 - round(y*100)) < 0.000001) y = round(y*100)/100
But this approach doesn't give the right representation all the time either.
Is there any way I could carry out the equivalent conversion from double to p/q?
Floating point arithmetic is very difficult. As has been mentioned in the comments, part of the difficulty is that you need to represent your numbers in binary.
For example, the number 0.125 can be represented exactly in binary:
0.125 = 2^-3 = 0b0.001
But the number 0.12 cannot.
To 11 binary places:
0.12 = 0b0.00011110101
If this is converted back to a decimal then the error becomes obvious:
0b0.00011110101 = 0.11962890625
So if you write:
double a = 0.2;
What the machine actually does is find the closest binary representation of 0.2 that it can hold within a double data type. This is an approximation since as we saw above, 0.2 cannot be exactly represented in binary.
One possible approach is to define an 'epsilon' which determines how close your number can be to the nearest representable binary floating point.
Here is a good article on floating points:
https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
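One way to put the epsilon idea into practice is a continued-fraction expansion. This is a different technique from the multiply-by-a-power-of-ten strategy in the question. The sketch below, where Ratio and approximate_ratio are names invented for this example and the default eps and max_den are arbitrary choices, returns the first convergent that lies within eps of the input.

```cpp
#include <cmath>
#include <cstdint>

struct Ratio {
    std::int64_t p;  // numerator
    std::int64_t q;  // denominator
};

// Sketch: walk the continued-fraction convergents of x and stop at the first
// one within eps of x, or when the denominator would exceed max_den.
// Assumes x is finite and small enough that the numerators fit in int64_t.
Ratio approximate_ratio(double x, double eps = 1e-9,
                        std::int64_t max_den = 1000000) {
    double whole = std::floor(x);
    double frac = x - whole;
    std::int64_t p_prev = 1, q_prev = 0;                // previous convergent
    std::int64_t p = static_cast<std::int64_t>(whole);  // current convergent
    std::int64_t q = 1;

    while (frac != 0.0 &&
           std::fabs(x - static_cast<double>(p) / static_cast<double>(q)) > eps) {
        double inv = 1.0 / frac;
        double term = std::floor(inv);                   // next partial quotient
        if (term > static_cast<double>(max_den)) break;  // denominator too large
        frac = inv - term;
        std::int64_t a = static_cast<std::int64_t>(term);
        std::int64_t p_next = a * p + p_prev;
        std::int64_t q_next = a * q + q_prev;
        if (q_next > max_den) break;
        p_prev = p;  q_prev = q;
        p = p_next;  q = q_next;
    }
    return {p, q};
}
```

With these defaults, approximate_ratio(8.2) yields 41/5 and approximate_ratio(0.125) yields 1/8. The result is the simple fraction the double is close to, not the exact value stored in the double.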
have problem of converting a double (say N) to p/q form
... when N = 8.2
A typical double cannot encode 8.2 exactly. Instead the closest representable double is about
8.19999999999999928945726423989981412887573...
8.20000000000000106581410364015027880668640... // next closest
When code does
double N = 8.2;
It will be the 8.19999999999999928945726423989981412887573... that is converted into rational form.
Converting a double to p/q form:
Multiply double N by a large number say $k = 10^{10}$
This may overflow the double. The first step should be to determine whether the double is large, in which case it is a whole number.
Do not multiply by some power of 10, as double almost certainly uses a binary encoding. Multiplication by 10, 100, etc. may introduce round-off error.
C implementations of double overwhelmingly use a binary encoding, so that FLT_RADIX == 2.
Then every finite double x has a significand that is a fraction of some integer over some power of 2: a binary fraction of DBL_MANT_DIG digits (credit: @Richard Critten). This is often 53 binary digits.
Determine the exponent of the double. If large enough or x == 0.0, the double is a whole number.
Otherwise, scale a numerator and denominator by DBL_MANT_DIG. While the numerator is even, halve both the numerator and denominator. As the denominator is a power-of-2, no other prime values are needed for simplification consideration.
#include <float.h>
#include <math.h>
#include <stdio.h>
void form_ratio(double x) {
double numerator = x;
double denominator = 1.0;
if (isfinite(numerator) && x != 0.0) {
int expo;
frexp(numerator, &expo);
if (expo < DBL_MANT_DIG) {
expo = DBL_MANT_DIG - expo;
numerator = ldexp(numerator, expo);
denominator = ldexp(1.0, expo);
while (fmod(numerator, 2.0) == 0.0 && denominator > 1.0) {
numerator /= 2.0;
denominator /= 2.0;
}
}
}
int pre = DBL_DECIMAL_DIG;
printf("%.*g --> %.*g/%.*g\n", pre, x, pre, numerator, pre, denominator);
}
int main(void) {
form_ratio(123456789012.0);
form_ratio(42.0);
form_ratio(1.0 / 7);
form_ratio(867.5309);
}
Output
123456789012 --> 123456789012/1
42 --> 42/1
0.14285714285714285 --> 2573485501354569/18014398509481984
867.53089999999997 --> 3815441248019913/4398046511104

Precision of acos function in c++

I am trying to calculate the distance between two points, using the acos() function in the process, but I am not getting a precise result when the distance is small.
float distance_between(dest& point1,dest point2) {
float EARTH_RADIUS = 6371.0;//in km
float point1_lat_in_radians = point1.lat*(PI/180);
float point2_lat_in_radians = point2.lat*(PI/180);
float point1_long_in_radians = point1.lon*(PI/180);
float point2_long_in_radians = point2.lon*(PI/180);
float res = acos( sin( point1_lat_in_radians ) * sin( point2_lat_in_radians ) + cos( point1_lat_in_radians ) * cos( point2_lat_in_radians ) * cos( point2_long_in_radians - point1_long_in_radians) ) * EARTH_RADIUS;
cout<<res<<endl;
res = round(res*100)/100;
return res;
}
I am checking the distance between the following coordinates:
52.378281 4.900070 and 52.379141 4.880590
52.373634 4.890289 and 52.379141 4.880590
The result is 0 in both cases. I know the distance is small, but is there a way to get a precise distance like 0.xxx?
Use double instead of float to get more precision.
That way you are going to use this prototype:
double acos (double x);
A must read is the Difference between float and double question. From there we have:
As the name implies, a double has 2x the precision of float.
The C and C++ standards do not specify the representation of float,
double and long double. It is possible that all three are implemented as
IEEE double-precision. Nevertheless, for most architectures (gcc,
MSVC; x86, x64, ARM) float is indeed an IEEE single-precision
floating point number (binary32), and double is an IEEE
double-precision floating point number (binary64).
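The likely reason the float version prints 0: for points about 1 km apart, the cosine of the central angle is about 1 - 2e-8, and the spacing of float values just below 1 is about 6e-8, so the argument to acos rounds to exactly 1. The sketch below redoes the question's formula in double. It also shows the haversine formula, which is a different technique from the one in the question and is often preferred for short distances. The function names and constants are chosen here for illustration.

```cpp
#include <algorithm>
#include <cmath>

constexpr double kPi = 3.14159265358979323846;
constexpr double kEarthRadiusKm = 6371.0;  // same radius as in the question

double to_radians(double degrees) { return degrees * (kPi / 180.0); }

// Spherical law of cosines, as in the question, but entirely in double.
double distance_law_of_cosines(double lat1, double lon1,
                               double lat2, double lon2) {
    double p1 = to_radians(lat1), p2 = to_radians(lat2);
    double dlon = to_radians(lon2 - lon1);
    double c = std::sin(p1) * std::sin(p2) +
               std::cos(p1) * std::cos(p2) * std::cos(dlon);
    return std::acos(std::min(1.0, c)) * kEarthRadiusKm;  // clamp rounding overshoot
}

// Haversine formula: avoids taking acos of a value very close to 1.
double distance_haversine(double lat1, double lon1,
                          double lat2, double lon2) {
    double p1 = to_radians(lat1), p2 = to_radians(lat2);
    double s_lat = std::sin((p2 - p1) / 2.0);
    double s_lon = std::sin(to_radians(lon2 - lon1) / 2.0);
    double a = s_lat * s_lat + std::cos(p1) * std::cos(p2) * s_lon * s_lon;
    return 2.0 * kEarthRadiusKm * std::asin(std::min(1.0, std::sqrt(a)));
}
```

For the two coordinate pairs in the question, both functions give roughly 1.3 km and 0.9 km respectively.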

How to calculate double variable with 10 decimal precision in C++

I want to do calculations with double values to 10 decimal places, and of course I expect the result to have the same 10-decimal precision. However, it doesn't work: I only get 6 decimal places in VS2008. Is there a reason for this, or any idea how to fix it?
double dMag = 10;
double dPixelSize = 4.4;
double PixelWidth = 1024/2;
double PixelHeight = 768/2;
double umtomm = 1000;
double origin_PosX = 813.227696;
double origin_PosY = 676.195748;
double PosX = (origin_PosX - PixelWidth) * dPixelSize / dMag / umtomm;
double PosY = (origin_PosY - PixelHeight ) * dPixelSize / dMag / umtomm;
If I check PosX and PosY, then results are "0.132540, 0.128566" respectively. I expect that results are "0.13254018624000002, 0.12856612912" respectively.
thanks
From what I gather you are confusing a few things...
The precision of double is more than just 6 decimal places. I strongly recommend reading this Wikipedia article, which explains how the underlying double datatype stores your number and what precision it has.
The precision that is printed is limited to 6 decimal places by default. In order to fix that you need to learn how to use the format string.
In your comments you mentioned using
tmp1.Format("%f,%f\r\n", PosX, PosY); file.Write(tmp1,lstrlen(tmp1));
to print the strings.
Try changing this line to
tmp1.Format("%.10f,%.10f\r\n", PosX, PosY); file.Write(tmp1,lstrlen(tmp1));
and you will see that it will print 10 digits after the decimal point.
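The same idea in standard C++ streams, for code that does not use the Format call shown above: std::fixed together with std::setprecision controls how many digits are printed after the decimal point. This is a minimal sketch, and format_fixed is a name invented for this example.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Format v with a fixed number of digits after the decimal point.
std::string format_fixed(double v, int decimals) {
    std::ostringstream out;
    out << std::fixed << std::setprecision(decimals) << v;
    return out.str();
}
```

With PosX computed as in the question, format_fixed(PosX, 6) gives "0.132540" and format_fixed(PosX, 10) gives "0.1325401862". The stored value is the same in both cases, only the printed form changes.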

In C++, what is the keyword used to refer to a 32-bit floating point value?

float
This is almost always a 32-bit IEEE floating-point type.
float
here's an example:
float var = 0.0f;
Notice the lowercase f to indicate the literal should be interpreted as a 32-bit floating point number.
float - 32 bits
double - 64 bits
Currently, the IEEE-754 32-bit floating point is represented by the keyword float.
float myVar = 0.8;
myVar = 4.0f;
For 64-bit floating point values, there's double:
double myVar = 0.8;
myVar = 4.0;
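Since the standard does not guarantee these widths, code that depends on them can check at compile time. The sketch below uses a helper named is_ieee, invented for this example, and assumes that "32-bit float" means IEEE-754 binary32 with a 24-bit significand and that "64-bit double" means binary64 with a 53-bit significand.

```cpp
#include <climits>
#include <limits>

// True when T reports IEEE-754 (IEC 559) conformance and has the given
// total width and significand width, both in bits.
template <typename T>
constexpr bool is_ieee(int total_bits, int significand_bits) {
    return std::numeric_limits<T>::is_iec559 &&
           static_cast<int>(sizeof(T) * CHAR_BIT) == total_bits &&
           std::numeric_limits<T>::digits == significand_bits;
}

// Compilation stops here on a platform where the usual assumption fails.
static_assert(is_ieee<float>(32, 24), "float is not IEEE-754 binary32 here");
static_assert(is_ieee<double>(64, 53), "double is not IEEE-754 binary64 here");
```

On the common platforms mentioned above (x86, x64, ARM with gcc or MSVC) both assertions are expected to pass.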