C++ float vs double cout setprecision oddities(newbie) - c++

Can anyone explain why these two variables, which appear to hold the same value, produce different output when I use setprecision()?
#include <iostream>
#include <iomanip>
int main()
{
float a=98.765;
double b = 98.765;
//std::cout<<std::setprecision(2)<<a<<std::endl;
std::cout<<std::fixed;
std::cout<<std::setprecision(2)<<a<<std::endl;
std::cout<<std::setprecision(2)<<b<<std::endl;
}
The output for a will be 98.76 while the output for b will be 98.77.

Those variables don't have the same value. When you shoehorn the literal double of 98.765 into the float, it has to do a best fit, and some precision is lost.
You can see this quite easily if you change the precision to 50, you'll also see that not even the double can represent that value exactly:
98.76499938964843750000000000000000000000000000000000
98.76500000000000056843418860808014869689941406250000
The important thing is that the float value lies just below 98.765, so it rounds down to 98.76, while the double value lies just above 98.765, so it rounds up to 98.77.
See also the IEEE754 online converter.

Related

Loss of precision while working with double

Can we work with big numbers up to 10^308?
How can I calculate the 11^105 using just double?
The answer of (11^105) is:
22193813979407164354224423199022080924541468040973950575246733562521125229836087036788826138225193142654907051
Is it possible to get the correct result of 11^105?
As far as I know, double can handle 10^308, which is much bigger than 11^105.
I know that this code is wrong:
#include <iostream>
#include <cstdio>
#include <cmath>
#include <iomanip>
using namespace std;
int main()
{
double n, p, x;
cin >> n >> p;
//scanf("%lf %lf", &n,&p);
x = exp(log((double)n)*p);
//printf("%lf\n", x);
cout << x <<endl;
return 0;
}
Thanks.
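double usually has 11 bits for the exponent (-1022 to 1023 for normalized values), 52 bits for the fraction and 1 bit for the sign. That gives roughly 15 to 17 significant decimal digits, so the 110 digits of 11^105 cannot all be represented accurately.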
For more explanation, see IEEE 754 on Wikipedia
Double can hold very large results, but not with high precision. In contrast to fixed-point numbers, double is a floating-point type. This means that, with the same number of significant digits, a double can shift the radix point to cover very different magnitudes, which is why you see such a large range.
For your purpose, you need some home cooked big num library, or you can find one readily available and written by someone else.
BTW my home cooked recipe gives a different answer for 11^105
Confirmed with this haskell code

c++ precision issue in storing floating point numbers

I'm doing some mathematical calculations.
I'm losing precision, but I need extreme precision.
I used the code below to check the precision issue.
Is there any way to get the precision I need?
#include <iostream>
#include <stdlib.h>
#include <cstdio>
#include <sstream>
#include <iomanip>
using namespace std;
int main(int argc,char** arvx)
{
float f = 1.00000001;
cout << "f: " <<std::setprecision(20)<< f <<endl;
return 0;
}
Output is
f: 1
If you truly want precise representation of these sorts of numbers (ie, with very small fractional components many places beyond the decimal point), then floating point types like float or even the much more precise double may still not give you the exact results you are looking for in all circumstances. Floating point types can only approximate some values with small fractional components.
You may need to use some sort of high precision fixed point C++ type in order to get exact representation of very small fractions in your values, and resulting accurate calculated results when you perform mathematical operations on such numbers. The following question/answers may provide you with some useful pointers: C++ fixed point library?
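In C++, float holds only about 6 to 7 significant decimal digits (counted from the first non-zero digit, not from the decimal point), so
float f = 1.00000001;
is stored as exactly 1, whereas
float f = 1.000001;
is still distinguishable from 1.
If you need more precise calculations, use double.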

Why is 10000000000000000 != 10000000000000000?

To begin with, take a look at the following code in Visual Studio using C++:
float a = 10000000000000000.0;
float b = a - 10000000000000000.0;
When printing them out, it turns out:
a = 10000000272564224.000000
b = 272564224.000000
And when viewing them in Watch under Debug, it turns out:
-Name- -Value- -Type-
a 1.0000000e+016 float
b 2.7256422e+008 float
Pre-question: I am sure that 10000000000000000.0 is within the range of float. Why can't we get the correct values of a and b using float?
Followup-question:
For the pre-question, based on all the great answers below, I now know the reason: a 32-bit float has an accuracy of about 7 digits, so beyond the first 6-7 digits, all bets are off. That's why the math doesn't work out and printing looks wrong for these large numbers. I have to use double for more accuracy. So why does float claim to handle large numbers when we cannot trust it?
The huge number you are using is indeed within the "range" of float, but not all its digits are within the "accuracy" of float. A 32-bit float has an accuracy of about 7 digits, so beyond the first 6-7 digits, all bets are off. That's why the math doesn't work out, and printing looks "wrong" when you use these large numbers. If you want more accuracy, use double. For more, see http://en.wikipedia.org/wiki/Floating_point#IEEE_754:_floating_point_in_modern_computers
A float holds about 6-7 significant decimal digits (23 bits for the fraction), so any number with more digits is just an approximation. That is where the seemingly random trailing digits come from.
For more about floating point format precision: http://en.wikipedia.org/wiki/Single-precision_floating-point_format
For the updated question:
You should never use a floating-point format when exact precision is required. We can't simply give the type more memory: handling numbers with a very large number of decimal places needs a very large amount of memory, so more complicated methods are used instead (for example, storing the number as a string and processing its characters one by one).
To avoid the problem here, use double, which gives about 15-17 significant decimal digits (52 bits for the fraction), or long double for even more precision where the platform provides it.
#include <stdio.h>
int main()
{
double a = 10000000000000000.0;
double b = a - 10000000000000000.0;
printf("%f\n%f", a, b);
}
Example: http://ideone.com/rJN1QI
Your confusion is caused by implicit conversions and lack of accuracy of float.
Let me fill in the implicit conversions for you:
float a = (float)10000000000000000.0;
float b = (float)((double)a - 10000000000000000.0);
This converts the literal double to float, and the closest it can get is 10000000272564224. And then the subtraction is performed using double, not float, so the second 10000000000000000.0 does not lose precision.
We can use the nextafter function to get a better idea of the precision of floating-point types. nextafter takes two arguments; it returns the adjacent representable number to its first argument, in the direction of its second argument.
The value 10000000000000000.0 (or 1.0e16) is well within the range of representable values of type float, but that value itself cannot be represented exactly.
Here's a small program that illustrates the issue:
#include <math.h>
#include <stdio.h>
int main()
{
float a = 10000000000000000.0;
double d_a = 10000000000000000.0;
printf(" %20.2f\n", nextafterf(a, 0.0f));
printf("a = %20.2f\n", a);
printf(" %20.2f\n", nextafterf(a, 1.0e30f));
putchar('\n');
printf(" %20.2f\n", nextafter(d_a, 0.0));
printf("d_a = %20.2f\n", d_a);
printf(" %20.2f\n", nextafter(d_a, 1.0e30));
putchar('\n');
}
and here's its output on my system:
9999999198822400.00
a = 10000000272564224.00
10000001346306048.00
9999999999999998.00
d_a = 10000000000000000.00
10000000000000002.00
If you use type float, the closest you can get to 10000000000000000.00 is 10000000272564224.00.
But in your second declaration:
float b = a - 10000000000000000.0
the subtraction is done in type double; the constant 10000000000000000.0 is already of type double, and a is promoted to double to match. So this takes the poor approximation of 1.0e16 that's stored in a, and subtracts from it the much better approximation (in fact it's exact) that can be represented in type double.

Convert double to mpf_class precisely

What is the correct way to initialize GMP floating point variables (mpf_t or mpf_class, does not matter) from double?
Code:
#include <iostream>
#include <gmpxx.h>
int main()
{
double d=0.1;
//1024 bits is more than 300 decimal digits
mpf_set_default_prec(1024);
mpf_class m(d);
//after initializing mpf_class variable, set default output precision
std::cout.precision(50);
std::cout.setf(std::ios_base::scientific);
std::cout << m << std::endl;
return 0;
}
The output is:
1.00000000000000005551115123125782702118158340454102e-01
It would be okay if I printed d directly, but in the m variable all 300 decimal digits of the mantissa are trusted! I use GMP for an iterative numerical method, so these non-zero digits introduce errors and make the method converge slowly.
If I initialize m as mpf_class m("0.1");, the output is:
1.00000000000000000000000000000000000000000000000000e-01
So the problem is not in operator<< overload for mpf_class. The problem exists not only for initializing, but for assigning too.
At present I use the following:
mpf_class linsys::DtoGMP(const double& arg)
{
char buf[512];
sprintf(buf,"%.15le\n",arg);
return mpf_class(buf);
}
for correct conversion.
Is there a faster and/or more native way to do it?
My OS is OpenSUSE 12.1, compiler: gcc 4.6.2
If you print out the double with that same precision, you should see the same strange-looking number. That's simply because 0.1 can't be accurately represented in floating point. The mpf_class is accurately reproducing the value stored in the double. It's the double that isn't matching your expectations.
There's probably a way to specify a precision to gmp or some way to round the input. I'm not sure where to look though.
Edit
mpf_class has a constructor with a precision parameter: http://www.gnu.org/software/gmp/manual/html_node/C---Interface-Floats.html
You may use this method
mpf_class a;
double d=0.1;
a=static_cast<mpf_class>(d*10)/static_cast<mpf_class>(10);
This method can be used if you know how many decimal places the double value has.

Using a long double or just a double for calculating pi?

I'm calculating pi using a long winded formula. I'm trying to get more familiar with floating point numbers etc. I have a working program that uses doubles. The problem with my code is:
If I use a double, pi is only accurate to the 7th decimal place. I can't get it to be any more accurate.
If I use a long double, pi is accurate up to the 9th decimal place however the code takes much longer to run. If I check for precision for less than 0.00000001 using a long double, pi returns a value of 9.4246775. I assume that this is due to the long double.
My question is what is the most accurate variable type? How could I change my code to improve the precision of pi?
Here is my code:
#include <iomanip>
#include <cstdlib>
#include <iostream>
#include <cmath>
using namespace std;
int main()
{
double arctan;
double pi;
double precision;
double previous=0;
int y=3;
int loopcount=0;
cout<<"Start\n";
arctan=1-(pow(1,y)/y);
do
{
y=y+2;
arctan=arctan+(pow(1,y)/y);
y=y+2;
arctan=arctan-(pow(1,y)/y);
pi=4*(arctan);
// cout<<"Pi is: ";
// cout<<setprecision(12)<<pi<<endl;
precision=(pi*(pow(10,10)/10));
loopcount++;
if(precision-previous<0.000000001)
break;
previous=precision;
}
while(true);
cout<<"Pi is:"<<endl;
cout<<setprecision(11)<<pi<<endl;
cout<<"Times looped:"<<endl;
cout<<loopcount<<endl;
return 0;
}
You can get the max limits of doubles/long doubles from std::numeric_limits
#include <iostream>
#include <limits>
int main()
{
std::cout << " Double::digits10: " << std::numeric_limits<double>::digits10 << "\n";
std::cout << "Long Double::digits10: " << std::numeric_limits<long double>::digits10 << "\n";
}
On my machine this gives:
Double::digits10: 15
Long Double::digits10: 18
So I expect long double to be accurate to 18 digits.
The definition of this term can be found here:
http://www.cplusplus.com/reference/std/limits/numeric_limits/
Standard quote: 18.3.2 Numeric limits [limits]
Also Note: As the comment is way down in the above list:
@sarnold is incorrect in his assertions about pow() (though mysteriously he has two silly people up-voting his comment without checking). What he states applies only to C. In C++, <cmath> provides overloads of pow() for the different floating-point types. See: http://www.cplusplus.com/reference/clibrary/cmath/pow/ and, in the standard, 26.4.7 complex value operations [complex.value.ops]
The predefined floating-point type with the greatest precision is long double.
There are three predefined floating-point types:
float has at least 6 decimal digits of precision
double has at least 10, and at least as many as float
long double has at least 10, and at least as many as double
These are minimum requirements; any or all of these types could have more precision.
If you need more precision than long double can provide, you might look at GMP, which supports arbitrary precision (at considerable expense in speed and memory usage).
Or, you could just hard-code the digits of PI and see what happens. ^_^
http://www.joyofpi.com/pi.html