C++ pow unusual type conversion - c++

When I directly output std::pow(10,2), I get 100, while (long)(pow(10,2)) gives 99. Can someone explain this, please?
cout<<pow(10,2)<<endl;
cout<<(long)(pow(10,2))<<endl;
The code is basically this in the main function.
The compiler is mingw32-g++.exe -std=c++11 using CodeBlocks
Windows 8.1 if that helps

Floating point numbers are approximations. Occasionally you get a number that can be exactly represented, but don't count on it. 100 is exactly representable, but in this case the value pow returned was not exactly 100. Something injected an approximation and ruined it for everybody.
When converting from a floating point type to an integer, the integer cannot hold any fractional values so they are unceremoniously dropped. There is no implicit rounding off, the fraction is discarded. 99.9 converts to 99. 99 with a million 9s after it is 99.
So before converting from a floating point type to an integer, round the number, then convert. Unless discarding the fraction is what you want to do.
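As a minimal sketch of the difference between the two conversions, the helpers below (hypothetical names, not part of any library) contrast a plain cast with std::lround, which rounds to the nearest integer before returning a long:

```cpp
#include <cmath>

// Truncating conversion: the fractional part is simply dropped.
long truncate_to_long(double x) {
    return static_cast<long>(x);
}

// Rounding conversion: std::lround rounds to the nearest integer
// (halfway cases away from zero) and returns a long.
long round_to_long(double x) {
    return std::lround(x);
}
```

With a value a hair below 100, such as 99.99999999999, truncate_to_long gives 99 while round_to_long gives 100.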
cout, and most output routines, politely and silently round floating point values before printing, so if there is a bit of an approximation the user isn't bothered with it.
This inexactness is also why you shouldn't directly compare floating point values. X probably isn't exactly pi, but it might be close enough for your computations, so you perform the comparison with an epsilon, a fudge factor, to tell if you are close enough.
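One possible shape for such an epsilon comparison is sketched below. The function name and both default tolerances are assumptions chosen for illustration; the right tolerances depend on the computation.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: true when a and b differ by at most abs_eps
// (useful near zero) or by at most rel_eps relative to the larger magnitude.
bool nearly_equal(double a, double b,
                  double rel_eps = 1e-9, double abs_eps = 1e-12) {
    const double diff = std::fabs(a - b);
    if (diff <= abs_eps) {
        return true;
    }
    return diff <= rel_eps * std::max(std::fabs(a), std::fabs(b));
}
```

For example, nearly_equal(0.1 + 0.2, 0.3) is true even though a direct == comparison of the two doubles typically is not.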
What I find amusing, and burned a lot of time trying to sort out, is that the OP would not have even seen this problem if not for using namespace std;.
(long)pow(10,2) provides the expected result of 100. (long)std::pow(10,2) does not. Some difference in the path from 10,2 to 100 taken by pow and std::pow results in slightly different results. By pulling the entire std namespace into their file, OP accidentally shot themselves in the foot.
Why is that?
Up at the top of the file we have using namespace std;. This means the compiler is not just considering double pow(double, double) when looking for pow overloads; it can also call std::pow, and std::pow is a nifty little template that makes sure that, when it is called with datatypes other than float and double, the right conversions take place and everything ends up the same type.
(long)(pow(10,2))
Does not match
double pow(double, double)
as well as it matches a template instantiation of
double std::pow(int, int)
Which, near as I can tell resolves down to
return pow(double(10), double(2));
after some template voodoo.
What the difference between
pow(double(10), double(2))
and
pow(10, 2)
with an implied conversion from int to double on the call to pow is, I do not know. Call in the language lawyers because it's something subtle.
If this is purely a rounding issue then
auto tempa = std::pow(10, 2);
should be vulnerable because tempa should be exactly what std::pow returns
cout << tempa << endl;
cout << (long) tempa << endl;
and the output should be
100
99
I get
100
100
So immediately casting the return of std::pow(10, 2) into a long is different from storing and then casting. Weird. auto tempa is not exactly what std::pow returns or there is something else going on that is too deep for me.

These are the std::pow overloads:
float pow( float base, float exp );
double pow( double base, double exp );
long double pow( long double base, long double exp );
float pow( float base, int iexp );//(until C++11)
double pow( double base, int iexp );//(until C++11)
long double pow( long double base, int iexp ); //(until C++11)
Promoted pow( Arithmetic1 base, Arithmetic2 exp ); //(since C++11)
But your strange behaviour is MinGW's weirdness about double storage and how the Windows run-time doesn't like it. I'm assuming Windows is seeing something like 99.9999, and when that is cast to an integral type the fraction is truncated.
int a = 3/2; // a is = 1
mingw uses the Microsoft C run-time libraries and their implementation of printf does not support the 'long double' type. As a work-around, you could cast to 'double' and pass that to printf instead.
Therefore, you need double double:
On the x86 architecture, most C compilers implement long double as the 80-bit extended precision type supported by x86 hardware (sometimes stored as 12 or 16 bytes to maintain data structure alignment), as specified in the C99 / C11 standards (IEC 60559 floating-point arithmetic (Annex F)). An exception is Microsoft Visual C++ for x86, which makes long double a synonym for double.[2] The Intel C++ compiler on Microsoft Windows supports extended precision, but requires the /Qlong‑double switch for long double to correspond to the hardware's extended precision format.[3]

Related

Codeblocks compiler giving wrong output compared to Online compiler [duplicate]

Consider the following piece of code:
#include <iostream>
#include <cmath>
int main() {
int i = 23;
int j = 1;
int base = 10;
int k = 2;
i += j * pow(base, k);
std::cout << i << std::endl;
}
It outputs "122" instead of "123". Is it a bug in g++ 4.7.2 (MinGW, Windows XP)?
std::pow() works with floating point numbers, which do not have infinite precision, and probably the implementation of the Standard Library you are using implements pow() in a (poor) way that makes this lack of infinite precision become relevant.
However, you could easily define your own version that works with integers. In C++11, you can even make it constexpr (so that the result could be computed at compile-time when possible):
constexpr int int_pow(int b, int e)
{
return (e == 0) ? 1 : b * int_pow(b, e - 1);
}
Here is a live example.
Tail-recursive form (credits to Dan Nissenbaum):
constexpr int int_pow(int b, int e, int res = 1)
{
return (e == 0) ? res : int_pow(b, e - 1, b * res);
}
All the other answers so far miss or dance around the one and only problem in the question:
The pow in your C++ implementation is poor quality. It returns an inaccurate answer when there is no need to.
Get a better C++ implementation, or at least replace the math functions in it. The one pointed to by Pascal Cuoq is good.
Not with mine at least:
$ g++ --version | head -1
g++ (GCC) 4.7.2 20120921 (Red Hat 4.7.2-2)
$ ./a.out
123
IDEone is also running version 4.7.2 and gives 123.
Signatures of pow() from http://www.cplusplus.com/reference/cmath/pow/
double pow ( double base, double exponent );
long double pow ( long double base, long double exponent );
float pow ( float base, float exponent );
double pow ( double base, int exponent );
long double pow ( long double base, int exponent );
You should set double base = 10.0; and double i = 23.0.
If you simply write
#include <iostream>
#include <cmath>
int main() {
int i = 23;
int j = 1;
int base = 10;
int k = 2;
i += j * pow(base, k);
std::cout << i << std::endl;
}
what do you think is pow supposed to refer to? The C++ standard does not even guarantee that after including cmath you'll have a pow function at global scope.
Keep in mind that all the overloads are at least in the std namespace. There are pow functions that take an integer exponent and there are pow functions that take floating point exponents. It is quite possible that your C++ implementation only declares the C pow function at global scope. This function takes a floating point exponent. The thing is that this function is likely to have a couple of approximation and rounding errors. For example, one possible way of implementing that function is:
double pow(double base, double power)
{
return exp(log(base)*power);
}
It's quite possible that pow(10.0,2.0) yields something like 99.99999999992543453265 due to rounding and approximation errors. Combined with the fact that floating point to integer conversion yields the number before the decimal point, this explains your result of 122, because 23+99=122.
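The arithmetic can be replayed without depending on any particular pow implementation. In the sketch below, add_truncated is a hypothetical helper and the pow result is passed in as a parameter; the value 99.99999999992543 is an illustrative assumption, not a measured output of MinGW's library.

```cpp
// Mimics `i += j * pow(base, k)`: the sum is computed in double and
// then truncated toward zero when it is stored back into the int.
int add_truncated(int i, int j, double pow_result) {
    i = static_cast<int>(i + j * pow_result);
    return i;
}
```

add_truncated(23, 1, 99.99999999992543) returns 122, while add_truncated(23, 1, 100.0) returns 123.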
Try using an overload of pow which takes an integer exponent and/or do some proper rounding from float to int. The overload taking an integer exponent might give you the exact result for 10 to the 2nd power.
Edit:
As you pointed out, trying to use the std::pow(double,int) overload also seems to yield a value slightly less than 100. I took the time to check the ISO standards and the libstdc++ implementation to see that starting with C++11 the overloads taking integer exponents have been dropped as a result of resolving defect report 550. Enabling C++0x/C++11 support actually removes the overloads in the libstdc++ implementation, which could explain why you did not see any improvement.
Anyhow, it is probably a bad idea to rely on the accuracy of such a function, especially if a conversion to integer is involved. A slight error towards zero will obviously make a big difference if you expect a floating point value that is an integer (like 100) and then convert it to an int-type value. So my suggestion would be to write your own pow function that takes all integers, or to take special care with respect to the double->int conversion using your own round function, so that a slight error towards zero does not change the result.
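A minimal sketch of the "all integers" suggestion follows. The name ipow and the choice of exponentiation by squaring are assumptions made for illustration (this is not the answer's own code), and overflow is not detected.

```cpp
#include <cassert>

// Hypothetical helper: integer power by repeated squaring.
// Assumes exp >= 0 and that the result fits in long long.
long long ipow(long long base, int exp) {
    assert(exp >= 0);
    long long result = 1;
    while (exp > 0) {
        if (exp & 1) {
            result *= base;      // this bit of the exponent is set
        }
        exp >>= 1;
        if (exp > 0) {
            base *= base;        // skip the final squaring, which is never used
        }
    }
    return result;
}

// The question's computation, done entirely in integer arithmetic.
int question_example() {
    int i = 23;
    const int j = 1;
    i += j * static_cast<int>(ipow(10, 2));
    return i;
}
```

Because no floating point value is involved, question_example() returns 123 regardless of how the library implements pow.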
Your problem is not a bug in gcc, that's absolutely certain. It may be a bug in the implementation of pow, but I think your problem is really simply the fact that you are using pow, which gives an imprecise floating point result (because it is implemented as something like exp(power * log(base)), and log(base) is never going to be absolutely accurate [unless base is a power of e]).

C++ - Difference between float and double? [duplicate]

I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?
Huge difference.
As the name implies, a double has 2x the precision of float[1]. In general a double has 15 decimal digits of precision, while float has 7.
Here's how the number of digits are calculated:
double has 52 mantissa bits + 1 hidden bit: log(2^53)÷log(10) = 15.95 digits
float has 23 mantissa bits + 1 hidden bit: log(2^24)÷log(10) = 7.22 digits
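The same figures can be reproduced from std::numeric_limits, whose digits member reports the significand width in bits including the hidden bit. The function name decimal_digits is a hypothetical helper, and the 53/24 values assume IEEE binary64 and binary32.

```cpp
#include <cmath>
#include <limits>

// Decimal digits carried by the binary significand of T:
// number of significand bits times log10(2).
template <typename T>
double decimal_digits() {
    return std::numeric_limits<T>::digits * std::log10(2.0);
}
```

On an IEEE platform, decimal_digits<double>() is about 15.95 and decimal_digits<float>() is about 7.22.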
This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.
float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.7g\n", b); // prints 9.000023
while
double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.15g\n", b); // prints 8.99999999999996
Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.
During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
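The factorial case is easy to check. In this sketch (factorial is a hypothetical helper), the product exceeds the float range well before n reaches 60, while double still holds 60!, which is roughly 8.32e81.

```cpp
#include <cmath>

// Computes n! in floating-point type T by repeated multiplication.
template <typename T>
T factorial(int n) {
    T result = 1;
    for (int i = 2; i <= n; ++i) {
        result *= static_cast<T>(i);
    }
    return result;
}
```

std::isinf(factorial<float>(60)) is true on an IEEE platform, whereas factorial<double>(60) is finite.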
Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.
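A minimal sketch of Kahan summation in C++ is shown here next to the plain += loop for comparison. The function names are hypothetical, and the code assumes it is compiled without options such as -ffast-math, which allow the compiler to optimize the compensation away.

```cpp
#include <vector>

// Kahan (compensated) summation: the low-order bits lost in each
// addition are captured in `comp` and fed back into the next step.
float kahan_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    float comp = 0.0f;
    for (float v : values) {
        const float y = v - comp;   // apply the correction from the last step
        const float t = sum + y;    // low-order bits of y may be lost here
        comp = (t - sum) - y;       // recover what was lost
        sum = t;
    }
    return sum;
}

// Plain accumulation, for comparison.
float naive_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    for (float v : values) {
        sum += v;
    }
    return sum;
}
```

Applied to 729 copies of 1.f / 81, the case from the example above, kahan_sum stays within a few float ulps of 9.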
[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed a IEEE single-precision floating point number (binary32), and double is a IEEE double-precision floating point number (binary64).
Here is what the standard C99 (ISO-IEC 9899 6.2.5 §10) or C++2003 (ISO-IEC 14882-2003 3.9.1 §8) standards say:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The C++ standard adds:
The value representation of floating-point types is implementation-defined.
I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic that covers the IEEE floating-point standard in depth. You'll learn about the representation details and you'll realize there is a tradeoff between magnitude and precision. The precision of the floating point representation increases as the magnitude decreases, hence floating point numbers between -1 and 1 are those with the most precision.
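The tradeoff between magnitude and precision can be observed directly by measuring the gap to the next representable value. The helper name spacing_above is an assumption for illustration; std::nextafter and std::numeric_limits are standard.

```cpp
#include <cmath>
#include <limits>

// Gap between x and the next representable float above it.
float spacing_above(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

At 1.0f the gap equals std::numeric_limits<float>::epsilon(); it is larger at 1000.0f and smaller at 0.001f, so the absolute resolution is finest for small magnitudes.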
Given a quadratic equation: x^2 − 4.0000000 x + 3.9999999 = 0, the exact roots to 10 significant digits are r1 = 2.000316228 and r2 = 1.999683772.
Using float and double, we can write a test program:
#include <stdio.h>
#include <math.h>
void dbl_solve(double a, double b, double c)
{
double d = b*b - 4.0*a*c;
double sd = sqrt(d);
double r1 = (-b + sd) / (2.0*a);
double r2 = (-b - sd) / (2.0*a);
printf("%.5f\t%.5f\n", r1, r2);
}
void flt_solve(float a, float b, float c)
{
float d = b*b - 4.0f*a*c;
float sd = sqrtf(d);
float r1 = (-b + sd) / (2.0f*a);
float r2 = (-b - sd) / (2.0f*a);
printf("%.5f\t%.5f\n", r1, r2);
}
int main(void)
{
float fa = 1.0f;
float fb = -4.0000000f;
float fc = 3.9999999f;
double da = 1.0;
double db = -4.0000000;
double dc = 3.9999999;
flt_solve(fa, fb, fc);
dbl_solve(da, db, dc);
return 0;
}
Running the program gives me:
2.00000 2.00000
2.00032 1.99968
Note that the numbers aren't large, but still you get cancellation effects using float.
(In fact, the above is not the best way of solving quadratic equations using either single- or double-precision floating-point numbers, but the answer remains unchanged even if one uses a more stable method.)
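For reference, one commonly used more stable form computes the larger-magnitude root first and derives the other from the product of the roots, c/a. The sketch below uses a hypothetical function name and assumes a != 0, a non-negative discriminant, and that b and the discriminant are not both zero. It avoids cancellation in -b ± sd, but not in the discriminant itself, which is where float loses this particular equation.

```cpp
#include <cmath>
#include <utility>

// Cancellation-avoiding quadratic roots: b and the square root are
// combined with matching signs, so they never subtract.
std::pair<double, double> stable_roots(double a, double b, double c) {
    const double d = b * b - 4.0 * a * c;
    const double q = -0.5 * (b + std::copysign(std::sqrt(d), b));
    return {q / a, c / q};
}
```

For a = 1, b = -4, c = 3.9999999 this returns the pair of roots near 2.000316228 and 1.999683772.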
A double is 64 bits and single precision (float) is 32 bits.
The double has a bigger mantissa (the bits that hold the significant digits of the real number).
Any inaccuracies will be smaller in the double.
I just ran into an error that took me forever to figure out and that can potentially give you a good example of float precision.
#include <iostream>
#include <iomanip>
int main(){
for(float t=0;t<1;t+=0.01){
std::cout << std::fixed << std::setprecision(6) << t << std::endl;
}
}
The output is
0.000000
0.010000
0.020000
0.030000
0.040000
0.050000
0.060000
0.070000
0.080000
0.090000
0.100000
0.110000
0.120000
0.130000
0.140000
0.150000
0.160000
0.170000
0.180000
0.190000
0.200000
0.210000
0.220000
0.230000
0.240000
0.250000
0.260000
0.270000
0.280000
0.290000
0.300000
0.310000
0.320000
0.330000
0.340000
0.350000
0.360000
0.370000
0.380000
0.390000
0.400000
0.410000
0.420000
0.430000
0.440000
0.450000
0.460000
0.470000
0.480000
0.490000
0.500000
0.510000
0.520000
0.530000
0.540000
0.550000
0.560000
0.570000
0.580000
0.590000
0.600000
0.610000
0.620000
0.630000
0.640000
0.650000
0.660000
0.670000
0.680000
0.690000
0.700000
0.710000
0.720000
0.730000
0.740000
0.750000
0.760000
0.770000
0.780000
0.790000
0.800000
0.810000
0.820000
0.830000
0.839999
0.849999
0.859999
0.869999
0.879999
0.889999
0.899999
0.909999
0.919999
0.929999
0.939999
0.949999
0.959999
0.969999
0.979999
0.989999
0.999999
As you can see after 0.83, the precision runs down significantly.
However, if I set up t as double, such an issue won't happen.
It took me five hours to realize this minor error, which ruined my program.
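If float has to stay, one way to keep the error from building up is to derive each value from an integer counter instead of repeatedly adding the step. This is a sketch under that assumption; sample_points is a hypothetical helper.

```cpp
#include <vector>

// Derives each t from an integer counter, so the rounding error of one
// iteration is not carried into the next.
std::vector<float> sample_points(int steps, float step) {
    std::vector<float> ts;
    ts.reserve(steps);
    for (int i = 0; i < steps; ++i) {
        ts.push_back(i * step);
    }
    return ts;
}
```

Each element of sample_points(100, 0.01f) carries the rounding error of a single multiplication only, rather than the accumulated error of up to 99 additions.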
There are three floating point types:
float
double
long double
A simple Venn diagram (not reproduced here) shows the sets of values of the types: float is a subset of double, which is a subset of long double.
The size of the numbers involved in the floating-point calculations is not the most relevant thing. It's the calculation that is being performed that is relevant.
In essence, if you're performing a calculation and the result is an irrational number or recurring decimal, then there will be rounding errors when that number is squashed into the finite size data structure you're using. Since double is twice the size of float then the rounding error will be a lot smaller.
The tests may specifically use numbers which would cause this kind of error and therefore tested that you'd used the appropriate type in your code.
Type float, 32 bits long, has a precision of 7 digits. While it may store values with very large or very small range (+/- 3.4 * 10^38 or * 10^-38), it has only 7 significant digits.
Type double, 64 bits long, has a bigger range (*10^+/-308) and 15 digits precision.
Type long double is nominally 80 bits, though a given compiler/OS pairing may store it as 12-16 bytes for alignment purposes. The long double has an exponent that is just ridiculously huge and should have 19 digits precision. Microsoft, in their infinite wisdom, limits long double to 8 bytes, the same as plain double.
Generally speaking, just use type double when you need a floating point value/variable. Literal floating point values used in expressions will be treated as doubles by default, and most of the math functions that return floating point values return doubles. You'll save yourself many headaches and typecastings if you just use double.
Floats have less precision than doubles. Although you already know, read What WE Should Know About Floating-Point Arithmetic for better understanding.
When using floating point numbers you cannot trust that your local tests will be exactly the same as the tests that are done on the server side. The environment and the compiler are probably different on your local system and where the final tests are run. I have seen this problem many times before in some TopCoder competitions, especially if you try to compare two floating point numbers.
The built-in comparison operations differ as in when you compare 2 numbers with floating point, the difference in data type (i.e. float or double) may result in different outcomes.
If one works with embedded processing, eventually the underlying hardware (e.g. FPGA or some specific processor / microcontroller model) will have float implemented optimally in hardware whereas double will use software routines. So if the precision of a float is enough to handle the needs, the program will execute several times faster with float than with double. As noted in other answers, beware of accumulation errors.
Quantitatively, as other answers have pointed out, the difference is that type double has about twice the precision, and three times the range, as type float (depending on how you count).
But perhaps even more important is the qualitative difference. Type float has good precision, which will often be good enough for whatever you're doing. Type double, on the other hand, has excellent precision, which will almost always be good enough for whatever you're doing.
The upshot, which is not nearly as well known as it should be, is that you should almost always use type double. Unless you have some particularly special need, you should almost never use type float.
As everyone knows, "roundoff error" is often a problem when you're doing floating-point work. Roundoff error can be subtle, and difficult to track down, and difficult to fix. Most programmers don't have the time or expertise to track down and fix numerical errors in floating-point algorithms — because unfortunately, the details end up being different for every different algorithm. But type double has enough precision such that, much of the time, you don't have to worry.
You'll get good results anyway. With type float, on the other hand, alarming-looking issues with roundoff crop up all the time.
And the thing that's not necessarily different between type float and double is execution speed. On most of today's general-purpose processors, arithmetic operations on type float and double take more or less exactly the same amount of time. Everything's done in parallel, so you don't pay a speed penalty for the greater range and precision of type double. That's why it's safe to make the recommendation that you should almost never use type float: Using double shouldn't cost you anything in speed, and it shouldn't cost you much in space, and it will almost definitely pay off handsomely in freedom from precision and roundoff error woes.
(With that said, though, one of the "special needs" where you may need type float is when you're doing embedded work on a microcontroller, or writing code that's optimized for a GPU. On those processors, type double can be significantly slower, or practically nonexistent, so in those cases programmers do typically choose type float for speed, and maybe pay for it in precision.)
Unlike an int (whole number), a float has a decimal point, and so does a double.
But the difference between the two is that a double is twice as detailed as a float, meaning that it can have double the amount of numbers after the decimal point.

What is the difference between float and double?

I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?
Huge difference.
As the name implies, a double has 2x the precision of float[1]. In general a double has 15 decimal digits of precision, while float has 7.
Here's how the number of digits are calculated:
double has 52 mantissa bits + 1 hidden bit: log(2^53)÷log(10) = 15.95 digits
float has 23 mantissa bits + 1 hidden bit: log(2^24)÷log(10) = 7.22 digits
This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.
float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.7g\n", b); // prints 9.000023
while
double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.15g\n", b); // prints 8.99999999999996
Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.
During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.
[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed an IEEE single-precision floating point number (binary32), and double is an IEEE double-precision floating point number (binary64).
Here is what the C99 (ISO-IEC 9899 6.2.5 §10) and C++2003 (ISO-IEC 14882-2003 3.9.1 §8) standards say:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The C++ standard adds:
The value representation of floating-point types is implementation-defined.
I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic that covers the IEEE floating-point standard in depth. You'll learn about the representation details and you'll realize there is a tradeoff between magnitude and precision. The absolute precision of the floating point representation increases as the magnitude decreases (adjacent representable values lie closer together near zero), hence floating point numbers between -1 and 1 are the most finely spaced.
Given a quadratic equation: x^2 − 4.0000000 x + 3.9999999 = 0, the exact roots to 10 significant digits are, r1 = 2.000316228 and r2 = 1.999683772.
Using float and double, we can write a test program:
#include <stdio.h>
#include <math.h>
void dbl_solve(double a, double b, double c)
{
double d = b*b - 4.0*a*c;
double sd = sqrt(d);
double r1 = (-b + sd) / (2.0*a);
double r2 = (-b - sd) / (2.0*a);
printf("%.5f\t%.5f\n", r1, r2);
}
void flt_solve(float a, float b, float c)
{
float d = b*b - 4.0f*a*c;
float sd = sqrtf(d);
float r1 = (-b + sd) / (2.0f*a);
float r2 = (-b - sd) / (2.0f*a);
printf("%.5f\t%.5f\n", r1, r2);
}
int main(void)
{
float fa = 1.0f;
float fb = -4.0000000f;
float fc = 3.9999999f;
double da = 1.0;
double db = -4.0000000;
double dc = 3.9999999;
flt_solve(fa, fb, fc);
dbl_solve(da, db, dc);
return 0;
}
Running the program gives me:
2.00000 2.00000
2.00032 1.99968
Note that the numbers aren't large, but still you get cancellation effects using float.
(In fact, the above is not the best way of solving quadratic equations using either single- or double-precision floating-point numbers, but the answer remains unchanged even if one uses a more stable method.)
A double is 64 bits and single precision (float) is 32 bits.
The double has a bigger mantissa (the bits that hold the significant digits of the number).
Any inaccuracies will be smaller in the double.
I just ran into an error that took me forever to figure out and that can potentially give you a good example of float precision.
#include <iostream>
#include <iomanip>
int main(){
for(float t=0;t<1;t+=0.01){
std::cout << std::fixed << std::setprecision(6) << t << std::endl;
}
}
The output is
0.000000
0.010000
0.020000
0.030000
0.040000
0.050000
0.060000
0.070000
0.080000
0.090000
0.100000
0.110000
0.120000
0.130000
0.140000
0.150000
0.160000
0.170000
0.180000
0.190000
0.200000
0.210000
0.220000
0.230000
0.240000
0.250000
0.260000
0.270000
0.280000
0.290000
0.300000
0.310000
0.320000
0.330000
0.340000
0.350000
0.360000
0.370000
0.380000
0.390000
0.400000
0.410000
0.420000
0.430000
0.440000
0.450000
0.460000
0.470000
0.480000
0.490000
0.500000
0.510000
0.520000
0.530000
0.540000
0.550000
0.560000
0.570000
0.580000
0.590000
0.600000
0.610000
0.620000
0.630000
0.640000
0.650000
0.660000
0.670000
0.680000
0.690000
0.700000
0.710000
0.720000
0.730000
0.740000
0.750000
0.760000
0.770000
0.780000
0.790000
0.800000
0.810000
0.820000
0.830000
0.839999
0.849999
0.859999
0.869999
0.879999
0.889999
0.899999
0.909999
0.919999
0.929999
0.939999
0.949999
0.959999
0.969999
0.979999
0.989999
0.999999
As you can see, after 0.83 the accumulated rounding error becomes visible in the printed digits.
However, if I set up t as double, such an issue won't happen.
It took me five hours to realize this minor error, which ruined my program.
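There are three floating point types:
float
double
long double
A simple Venn diagram (image not shown) illustrates the sets of values of the types: the values of float are a subset of the values of double, which in turn are a subset of the values of long double.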
The size of the numbers involved in the floating-point calculations is not the most relevant thing. It's the calculation that is being performed that is relevant.
In essence, if you're performing a calculation and the result is an irrational number or recurring decimal, then there will be rounding errors when that number is squashed into the finite size data structure you're using. Since double is twice the size of float then the rounding error will be a lot smaller.
The tests may specifically use numbers which would cause this kind of error and therefore tested that you'd used the appropriate type in your code.
Type float, 32 bits long, has a precision of 7 digits. While it may store values with a very large or very small magnitude (roughly 3.4 * 10^38 down to about 10^-38, positive or negative), it has only 7 significant digits.
Type double, 64 bits long, has a bigger range (*10^+/-308) and 15 digits precision.
Type long double is nominally 80 bits, though a given compiler/OS pairing may store it as 12-16 bytes for alignment purposes. The long double has an exponent that is just ridiculously huge and should have 19 digits precision. Microsoft, in their infinite wisdom, limits long double to 8 bytes, the same as plain double.
Generally speaking, just use type double when you need a floating point value/variable. Literal floating point values used in expressions will be treated as doubles by default, and most of the math functions that return floating point values return doubles. You'll save yourself many headaches and typecastings if you just use double.
Floats have less precision than doubles. You may already know this, but read What Every Computer Scientist Should Know About Floating-Point Arithmetic for a better understanding.
When using floating point numbers you cannot trust that your local tests will be exactly the same as the tests that are done on the server side. The environment and the compiler are probably different on your local system and where the final tests are run. I have seen this problem many times before in some TopCoder competitions, especially if you try to compare two floating point numbers.
The built-in comparison operators can also behave differently depending on the data type: comparing the same two numbers as float or as double may give different outcomes.
If one works with embedded processing, eventually the underlying hardware (e.g. FPGA or some specific processor / microcontroller model) will have float implemented optimally in hardware whereas double will use software routines. So if the precision of a float is enough to handle the needs, the program will execute several times faster with float than with double. As noted in other answers, beware of accumulation errors.
Quantitatively, as other answers have pointed out, the difference is that type double has about twice the precision, and three times the range, as type float (depending on how you count).
But perhaps even more important is the qualitative difference. Type float has good precision, which will often be good enough for whatever you're doing. Type double, on the other hand, has excellent precision, which will almost always be good enough for whatever you're doing.
The upshot, which is not nearly as well known as it should be, is that you should almost always use type double. Unless you have some particularly special need, you should almost never use type float.
As everyone knows, "roundoff error" is often a problem when you're doing floating-point work. Roundoff error can be subtle, and difficult to track down, and difficult to fix. Most programmers don't have the time or expertise to track down and fix numerical errors in floating-point algorithms — because unfortunately, the details end up being different for every different algorithm. But type double has enough precision such that, much of the time, you don't have to worry.
You'll get good results anyway. With type float, on the other hand, alarming-looking issues with roundoff crop up all the time.
And the thing that's not necessarily different between type float and double is execution speed. On most of today's general-purpose processors, arithmetic operations on type float and double take more or less exactly the same amount of time. Everything's done in parallel, so you don't pay a speed penalty for the greater range and precision of type double. That's why it's safe to make the recommendation that you should almost never use type float: Using double shouldn't cost you anything in speed, and it shouldn't cost you much in space, and it will almost definitely pay off handsomely in freedom from precision and roundoff error woes.
(With that said, though, one of the "special needs" where you may need type float is when you're doing embedded work on a microcontroller, or writing code that's optimized for a GPU. On those processors, type double can be significantly slower, or practically nonexistent, so in those cases programmers do typically choose type float for speed, and maybe pay for it in precision.)
Unlike an int (whole number), a float has a decimal point, and so does a double.
But the difference between the two is that a double is about twice as detailed as a float, meaning that it can hold roughly twice as many significant digits (about 15 versus about 7).