Notation of double floating point values in C and C++

What's the notation for double-precision floating-point values in C/C++?
Does .5 represent a double or a float value?
I'm pretty sure 2.0f is parsed as a float and 2.0 as a double, but what about .5?
http://c.comsci.us/etymology/literals.html

It's double. Suffix it with f to get float.
And here is a link to the reference documentation: http://en.cppreference.com/w/cpp/language/floating_literal
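If you want to verify this at compile time rather than take it on faith, here is a minimal C++ sketch (std::is_same_v and message-less static_assert require C++17):

#include <type_traits>

static_assert(std::is_same_v<decltype(.5),   double>);      // unsuffixed literals are double
static_assert(std::is_same_v<decltype(2.0),  double>);
static_assert(std::is_same_v<decltype(2.0f), float>);       // f suffix -> float
static_assert(std::is_same_v<decltype(2.0L), long double>); // L suffix -> long double

int main() {}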

Technically, initializing a float with a double constant can lead to a different result (i.e. accumulate two round-off errors) than initializing it with a float constant.
Here is an example:
#include <stdio.h>

int main() {
    double d = 8388609.499999999068677425384521484375;
    float f1 = 8388609.499999999068677425384521484375f;
    float f2 = 8388609.499999999068677425384521484375;
    float f3 = (float) d;
    printf("f1=%f f2=%f f3=%f\n", f1, f2, f3);
}
with gcc 4.2.1 i686 I get
f1=8388609.000000 f2=8388610.000000 f3=8388610.000000
The constant, written exactly in base 2, is:
100000000000000000000001.011111111111111111111111111111
This base-2 representation requires 54 bits, but a double only has 53. So when converted to double, the value is rounded to the nearest double, ties to even, thus to:
100000000000000000000001.10000000000000000000000000000
This representation requires 25 bits, but a float only has 24, so if you convert this double to a float, another rounding occurs to the nearest float, ties to even, thus to:
100000000000000000000010.
If you convert the first number directly to a float, the single rounding is different:
100000000000000000000001.
As we can see, when initializing f2, gcc converts the decimal representation to a double, then to a float (it would be interesting to check whether this behaviour is mandated by a standard).
Though, as this is a specially crafted number, most of the time you shouldn't encounter such a difference.

Related

What is the need for the suffix 'f' when defining a variable of float type in C++? [duplicate]

Can there be a difference in bit representation between a direct assignment of a floating point literal, float x = 3.2f;, and a double implicitly converted to a float, float x2 = 3.2;?
I.e. is
#define EQUAL(FLOAT_LITERAL)\
FLOAT_LITERAL##f == static_cast<float>(FLOAT_LITERAL)
EQUAL(3.2) && EQUAL(55.6200093490) // etc ...
true for all floating point literals?
I ask this question because clang and gcc do not complain about narrowing conversions if the numbers are in the value range of float:
Warning is enabled with -Wnarrowing:
float f {3.422222222222222222222222222222246454}; // no warning/error although it definitely loses precision
float f2 {static_cast<double>(std::numeric_limits<float>::max()) + 1.0}; // no warning/error
float f3 {3.5e38}; // error: narrowing conversion of '3.5e+38' from 'double' to 'float' inside { } [-Wnarrowing]
It is great that the compiler does actual range checks, but is that sufficient?
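For reference, a runnable harness around the question's EQUAL macro; the counterexample literal comes from the answers below, assuming IEEE-754 float/double with correctly rounded conversions:

#include <cstdio>

#define EQUAL(FLOAT_LITERAL) \
    (FLOAT_LITERAL##f == static_cast<float>(FLOAT_LITERAL))

int main() {
    std::printf("%d\n", EQUAL(3.2));          // 1: both conversion paths agree here
    std::printf("%d\n", EQUAL(7.038531e-26)); // 0: double rounding changes the result
}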
Assuming IEEE 754, with float as 32-bit binary (binary32) and double as 64-bit binary (binary64):
There are decimal fractions that round differently, under IEEE 754 round-to-nearest rules, when converted directly from decimal to float than when first converted from decimal to double and then from double to float.
For example, consider 1.0000000596046447753906250000000000000000000000000001.
The shorter value 1.000000059604644775390625 is exactly representable as a double and is exactly halfway between 1.0 and 1.00000011920928955078125, the value of the smallest float greater than 1.0. The value 1.0000000596046447753906250000000000000000000000000001 rounds up to 1.00000011920928955078125 if converted directly to float, because it is greater than that midpoint. If it is first converted to the 64-bit double, round-to-nearest takes it to the midpoint 1.000000059604644775390625, and round-half-even then rounds down to 1.0.
The answer given by Patricia is correct. But we generally don't type such numbers, so maybe it's not a problem... unless it happens with some shorter decimal literals?
I illustrated that once in the comments following the answer to Count number of digits after `.` in floating point numbers?
The decimal value 7.038531e-26 is approximately 0x1.5C87FAFFFFFFFCE4F6700...p-21; the nearest double is 0x1.5C87FB0000000p-21, and the nearest float is 0x1.5C87FAp-21.
Note that 0x1.5C87FA0000000p-21 is the nearest double to 7.038530691851209e-26.
So yes, there can be a double-rounding problem (rounding off twice in the same direction) with a relatively short literal...
float x = 7.038531e-26f; and float y = 7.038531e-26; should be two different numbers if the compiler rounds the literals correctly.
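A minimal runnable check of that claim, assuming IEEE-754 binary32/binary64 and correctly rounded literal conversion:

#include <cstdio>

int main() {
    float x = 7.038531e-26f; // decimal -> float, one rounding
    float y = 7.038531e-26;  // decimal -> double -> float, two roundings
    std::printf("x == y: %d\n", x == y); // prints 0 on such platforms
}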
Is literal double-to-float conversion equal to a float literal?
Usually yes, but not always.
Converting to double and then to float (versus converting directly to float) potentially incurs double-rounding trouble.
The problem is seen with any coded value that happens to lie near the halfway point between two adjacent float values (even values without a fractional part).
Occurrence: with random constants and typical float/double, about 1 in 2^30.

Is there any difference between using floating point casts vs floating point suffixes in C and C++?

Is there a difference between this (using floating point literal suffixes):
float MY_FLOAT = 3.14159265358979323846264338328f; // f suffix
double MY_DOUBLE = 3.14159265358979323846264338328; // no suffix
long double MY_LONG_DOUBLE = 3.14159265358979323846264338328L; // L suffix
vs this (using floating point casts):
float MY_FLOAT = (float)3.14159265358979323846264338328;
double MY_DOUBLE = (double)3.14159265358979323846264338328;
long double MY_LONG_DOUBLE = (long double)3.14159265358979323846264338328;
in C and C++?
Note: the same would go for function calls:
void my_func(long double value);
my_func(3.14159265358979323846264338328L);
// vs
my_func((long double)3.14159265358979323846264338328);
// etc.
Related:
What's the C++ suffix for long double literals?
https://en.cppreference.com/w/cpp/language/floating_literal
The default is double. Assuming IEEE 754 floating point, double is a strict superset of float, and thus you will never lose precision by not specifying f. EDIT: this is only true when specifying values that can be represented exactly by a float. If rounding occurs, it might not be strictly true due to rounding twice; see Eric Postpischil's answer. So you should also use the f suffix for floats.
This example is also problematic:
long double MY_LONG_DOUBLE = (long double)3.14159265358979323846264338328;
This first gives a double constant, which is then converted to long double. But because you started with a double, you have already lost precision that will never come back. Therefore, if you want to use full precision in long double constants, you must use the L suffix:
long double MY_LONG_DOUBLE = 3.14159265358979323846264338328L; // L suffix
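A small sketch of the difference: on platforms where long double is wider than double (e.g. the x86 80-bit extended format), the two lines below print different tails, while on implementations where long double has the same representation as double (e.g. MSVC) they match:

#include <cstdio>

int main() {
    long double direct  = 3.14159265358979323846264338328L;             // one rounding, to long double
    long double via_dbl = (long double)3.14159265358979323846264338328; // rounded to double first
    std::printf("direct  = %.21Lg\n", direct);
    std::printf("via_dbl = %.21Lg\n", via_dbl);
}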
There is a difference between using a suffix and a cast; 8388608.5000000009f and (float) 8388608.5000000009 have different values in common C implementations. This code:
#include <stdio.h>

int main(void)
{
    float x = 8388608.5000000009f;
    float y = (float) 8388608.5000000009;
    printf("%.9g - %.9g = %.9g.\n", x, y, x - y);
}
prints “8388609 - 8388608 = 1.” in Apple Clang 11.0 and other implementations that use correct rounding with IEEE-754 binary32 for float and binary64 for double. (The C standard permits implementations to use methods other than IEEE-754 correct rounding, so other C implementations may have different results.)
The reason is that (float) 8388608.5000000009 contains two rounding operations. With the suffix, 8388608.5000000009f is converted directly to float, so the portion that must be discarded in order to fit in a float, .5000000009, is directly examined in order to see whether it is greater than .5 or not. It is, so the result is rounded up to the next representable value, 8388609.
Without the suffix, 8388608.5000000009 is first converted to double. When the portion that must be discarded, .0000000009, is considered, it is found to be less than ½ the low bit at the point of truncation. (The value of the low bit there is .00000000186264514923095703125, and half of it is .000000000931322574615478515625.) So the result is rounded down, and we have 8388608.5 as a double. When the cast rounds this to float, the portion that must be discarded is .5, which is exactly halfway between the representable numbers 8388608 and 8388609. The rule for breaking ties rounds it to the value with the even low bit, 8388608.
(Another example is “7.038531e-26”; (float) 7.038531e-26 is not equal to 7.038531e-26f. This is the only such numeral with fewer than eight significant digits when float is binary32 and double is binary64, except of course “-7.038531e-26”.)
While you do not lose precision if you omit the f in a float constant, there can be surprises in so doing.
Consider this:
#include <stdio.h>

#define DCN 0.1
#define FCN 0.1f

int main(void)
{
    float f = DCN;
    printf("DCN\t%s\n", f > DCN ? "more" : "not-more");
    float g = FCN;
    printf("FCN\t%s\n", g > FCN ? "more" : "not-more");
    return 0;
}
This (compiled with gcc 9.1.1) produces the output
DCN more
FCN not-more
The explanation is that in f > DCN the compiler takes DCN to have type double and so promotes f to a double, and
(double)(float)0.1 > 0.1
is true, because 0.1 rounded to float and then widened back to double is slightly greater than 0.1 rounded directly to double.
Personally, on the (rare) occasions when I need float constants, I always use an 'f' suffix.

C++ - Difference between float and double? [duplicate]

I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?
Huge difference.
As the name implies, a double has 2x the precision of float[1]. In general a double has 15 decimal digits of precision, while float has 7.
Here's how the number of digits is calculated:
double has 52 mantissa bits + 1 hidden bit: log(2^53) ÷ log(10) ≈ 15.95 digits
float has 23 mantissa bits + 1 hidden bit: log(2^24) ÷ log(10) ≈ 7.22 digits
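These figures can also be queried from the implementation; std::numeric_limits brackets the 7.22 and 15.95 computed above (digits10 is the number of decimal digits guaranteed to survive a decimal-to-binary-to-decimal round trip, max_digits10 the number needed to distinguish every binary value):

#include <cstdio>
#include <limits>

int main() {
    std::printf("float : digits10 = %d, max_digits10 = %d\n",
                std::numeric_limits<float>::digits10,
                std::numeric_limits<float>::max_digits10);   // typically 6 and 9
    std::printf("double: digits10 = %d, max_digits10 = %d\n",
                std::numeric_limits<double>::digits10,
                std::numeric_limits<double>::max_digits10);  // typically 15 and 17
}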
This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.
float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++i)
    b += a;
printf("%.7g\n", b); // prints 9.000023
while
double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++i)
    b += a;
printf("%.15g\n", b); // prints 8.99999999999996
Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.
During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.
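For reference, a minimal sketch of the Kahan algorithm mentioned above, applied to the 1/81 example (note that aggressive optimization flags such as -ffast-math may optimize the correction away):

#include <cstdio>

float kahan_sum(const float *data, int n) {
    float sum = 0.0f;
    float c = 0.0f;                // running compensation for lost low-order bits
    for (int i = 0; i < n; ++i) {
        float y = data[i] - c;     // re-inject the part lost last time
        float t = sum + y;         // low-order bits of y may be lost here
        c = (t - sum) - y;         // algebraically zero; captures exactly what was lost
        sum = t;
    }
    return sum;
}

int main() {
    float a[729];
    for (int i = 0; i < 729; ++i) a[i] = 1.0f / 81;
    std::printf("%.7g\n", kahan_sum(a, 729)); // much closer to 9 than the naive loop
}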
[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed an IEEE single-precision floating point number (binary32), and double is an IEEE double-precision floating point number (binary64).
Here is what the standard C99 (ISO-IEC 9899 6.2.5 §10) or C++2003 (ISO-IEC 14882-2003 3.1.9 §8) standards say:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The C++ standard adds:
The value representation of floating-point types is implementation-defined.
I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic that covers the IEEE floating-point standard in depth. You'll learn about the representation details and you'll realize there is a tradeoff between magnitude and precision. The precision of the floating point representation increases as the magnitude decreases, hence floating point numbers between -1 and 1 are those with the most precision.
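One way to see that tradeoff concretely: the gap between adjacent floats (one unit in the last place) grows with magnitude. A small sketch using std::nextafterf:

#include <cstdio>
#include <cmath>
#include <limits>

int main() {
    const float xs[] = {0.5f, 1.0f, 1024.0f, 1.0e8f, 1.0e16f};
    const float inf = std::numeric_limits<float>::infinity();
    for (float x : xs)
        // nextafterf(x, inf) is the smallest float strictly greater than x
        std::printf("gap above %g is %g\n", x, std::nextafterf(x, inf) - x);
}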
Given the quadratic equation x^2 − 4.0000000x + 3.9999999 = 0, the exact roots to 10 significant digits are r1 = 2.000316228 and r2 = 1.999683772.
Using float and double, we can write a test program:
#include <stdio.h>
#include <math.h>

void dbl_solve(double a, double b, double c)
{
    double d = b*b - 4.0*a*c;
    double sd = sqrt(d);
    double r1 = (-b + sd) / (2.0*a);
    double r2 = (-b - sd) / (2.0*a);
    printf("%.5f\t%.5f\n", r1, r2);
}

void flt_solve(float a, float b, float c)
{
    float d = b*b - 4.0f*a*c;
    float sd = sqrtf(d);
    float r1 = (-b + sd) / (2.0f*a);
    float r2 = (-b - sd) / (2.0f*a);
    printf("%.5f\t%.5f\n", r1, r2);
}

int main(void)
{
    float fa = 1.0f;
    float fb = -4.0000000f;
    float fc = 3.9999999f;
    double da = 1.0;
    double db = -4.0000000;
    double dc = 3.9999999;
    flt_solve(fa, fb, fc);
    dbl_solve(da, db, dc);
    return 0;
}
Running the program gives me:
2.00000 2.00000
2.00032 1.99968
Note that the numbers aren't large, but still you get cancellation effects using float.
(In fact, the above is not the best way of solving quadratic equations using either single- or double-precision floating-point numbers, but the answer remains unchanged even if one uses a more stable method.)
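For completeness, a sketch of one such stable formulation (an assumption on my part, not from the answer above): compute the root whose numerator adds magnitudes rather than cancelling, then recover the other root from the identity r1 * r2 = c/a:

#include <cstdio>
#include <cmath>

void stable_solve(double a, double b, double c)
{
    double sd = std::sqrt(b*b - 4.0*a*c);
    // b and copysign(sd, b) have the same sign, so no cancellation here:
    double q = -0.5 * (b + std::copysign(sd, b));
    std::printf("%.10g\t%.10g\n", q / a, c / q);
}

int main()
{
    stable_solve(1.0, -4.0000000, 3.9999999); // prints 2.000316228  1.999683772
}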
A double is 64 bits and single precision (float) is 32 bits.
The double has a bigger mantissa (the significand, which holds the significant bits of the real number).
Any inaccuracies will be smaller in the double.
I just ran into an error that took me forever to figure out, and it can potentially give you a good example of float precision.
#include <iostream>
#include <iomanip>

int main() {
    for (float t = 0; t < 1; t += 0.01) {
        std::cout << std::fixed << std::setprecision(6) << t << std::endl;
    }
}
The output is
0.000000
0.010000
0.020000
0.030000
0.040000
0.050000
0.060000
0.070000
0.080000
0.090000
0.100000
0.110000
0.120000
0.130000
0.140000
0.150000
0.160000
0.170000
0.180000
0.190000
0.200000
0.210000
0.220000
0.230000
0.240000
0.250000
0.260000
0.270000
0.280000
0.290000
0.300000
0.310000
0.320000
0.330000
0.340000
0.350000
0.360000
0.370000
0.380000
0.390000
0.400000
0.410000
0.420000
0.430000
0.440000
0.450000
0.460000
0.470000
0.480000
0.490000
0.500000
0.510000
0.520000
0.530000
0.540000
0.550000
0.560000
0.570000
0.580000
0.590000
0.600000
0.610000
0.620000
0.630000
0.640000
0.650000
0.660000
0.670000
0.680000
0.690000
0.700000
0.710000
0.720000
0.730000
0.740000
0.750000
0.760000
0.770000
0.780000
0.790000
0.800000
0.810000
0.820000
0.830000
0.839999
0.849999
0.859999
0.869999
0.879999
0.889999
0.899999
0.909999
0.919999
0.929999
0.939999
0.949999
0.959999
0.969999
0.979999
0.989999
0.999999
As you can see, after 0.83 the precision drops significantly.
However, if I declare t as a double, the issue doesn't happen.
It took me five hours to track down this minor error, which ruined my program.
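One common fix, sketched here: derive t from an integer counter instead of accumulating. t += 0.01f compounds the representation error of 0.01 (which has no exact binary form) across iterations, while i * 0.01f incurs only a couple of roundings per value:

#include <cstdio>

int main() {
    for (int i = 0; i < 100; ++i) {
        float t = i * 0.01f;      // error does not accumulate across iterations
        std::printf("%.6f\n", t); // prints the intended values to 6 places
    }
}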
There are three floating point types:
float
double
long double
A simple Venn diagram illustrates the relationship between the sets of values of the types: the values of float are a subset of the values of double, which are a subset of the values of long double. [Venn diagram omitted]
The size of the numbers involved in floating-point calculations is not the most relevant thing. It's the calculation being performed that is relevant.
In essence, if you're performing a calculation and the result is an irrational number or recurring decimal, then there will be rounding errors when that number is squashed into the finite size data structure you're using. Since double is twice the size of float then the rounding error will be a lot smaller.
The tests may specifically use numbers which would cause this kind of error and therefore tested that you'd used the appropriate type in your code.
Type float, 32 bits long, has a precision of 7 digits. While it may store values with a very large or very small magnitude (±3.4 × 10^38 down to about 10^-38), it has only 7 significant digits.
Type double, 64 bits long, has a bigger range (about 10^±308) and 15 digits of precision.
Type long double is nominally 80 bits, though a given compiler/OS pairing may store it as 12-16 bytes for alignment purposes. The long double has an exponent that is just ridiculously huge and should have 19 digits of precision. Microsoft, in their infinite wisdom, limits long double to 8 bytes, the same as plain double.
Generally speaking, just use type double when you need a floating point value/variable. Literal floating point values used in expressions will be treated as doubles by default, and most of the math functions that return floating point values return doubles. You'll save yourself many headaches and typecastings if you just use double.
Floats have less precision than doubles. Even if you already know this, read What Every Computer Scientist Should Know About Floating-Point Arithmetic for a better understanding.
When using floating point numbers, you cannot trust that your local tests will be exactly the same as the tests done on the server side. The environment and the compiler are probably different on your local system and where the final tests are run. I have seen this problem many times in TopCoder competitions, especially if you try to compare two floating point numbers.
The built-in comparison operations differ: when you compare two floating-point numbers, the difference in data type (i.e. float or double) may produce different outcomes.
If you work with embedded processing, eventually the underlying hardware (e.g. an FPGA or some specific processor/microcontroller model) will have float implemented optimally in hardware whereas double will use software routines. So if the precision of a float is enough to handle the needs, the program will execute several times faster with float than with double. As noted in other answers, beware of accumulated errors.
Quantitatively, as other answers have pointed out, the difference is that type double has about twice the precision, and three times the range, as type float (depending on how you count).
But perhaps even more important is the qualitative difference. Type float has good precision, which will often be good enough for whatever you're doing. Type double, on the other hand, has excellent precision, which will almost always be good enough for whatever you're doing.
The upshot, which is not nearly as well known as it should be, is that you should almost always use type double. Unless you have some particularly special need, you should almost never use type float.
As everyone knows, "roundoff error" is often a problem when you're doing floating-point work. Roundoff error can be subtle, and difficult to track down, and difficult to fix. Most programmers don't have the time or expertise to track down and fix numerical errors in floating-point algorithms — because unfortunately, the details end up being different for every different algorithm. But type double has enough precision such that, much of the time, you don't have to worry.
You'll get good results anyway. With type float, on the other hand, alarming-looking issues with roundoff crop up all the time.
And the thing that's not necessarily different between type float and double is execution speed. On most of today's general-purpose processors, arithmetic operations on type float and double take more or less exactly the same amount of time. Everything's done in parallel, so you don't pay a speed penalty for the greater range and precision of type double. That's why it's safe to make the recommendation that you should almost never use type float: Using double shouldn't cost you anything in speed, and it shouldn't cost you much in space, and it will almost definitely pay off handsomely in freedom from precision and roundoff error woes.
(With that said, though, one of the "special needs" where you may need type float is when you're doing embedded work on a microcontroller, or writing code that's optimized for a GPU. On those processors, type double can be significantly slower, or practically nonexistent, so in those cases programmers do typically choose type float for speed, and maybe pay for it in precision.)
Unlike an int (whole number), a float has a decimal point, and so does a double.
But the difference between the two is that a double is twice as detailed as a float, meaning that it can store roughly twice as many digits after the decimal point.

Why is 10000000000000000 != 10000000000000000?

To begin with, take a look at the following code in Visual Studio using C++:
float a = 10000000000000000.0;
float b = a - 10000000000000000.0;
When printing them out, it turns out:
a = 10000000272564224.000000
b = 272564224.000000
And when viewing them in Watch under Debug, it turns out:
-Name- -Value- -Type-
a 1.0000000e+016 float
b 2.7256422e+008 float
Pre-question: I am sure that 10000000000000000.0 is within the range of float. Why is it that we cannot get correct values for a and b using float?
Follow-up question:
For the pre-question, based on all the great answers below: I know that the reason is that a 32-bit float has an accuracy of about 7 digits, so beyond the first 6-7 digits, all bets are off. That's why the math doesn't work out, and printing looks wrong for these large numbers. I have to use double for more accuracy. So why does float claim to be able to handle large numbers when at the same time we cannot trust it?
The huge number you are using is indeed within the "range" of float, but not all its digits are within the "accuracy" of float. A 32-bit float has an accuracy of about 7 digits, so beyond the first 6-7 digits, all bets are off. That's why the math doesn't work out, and printing looks "wrong" when you use these large numbers. If you want more accuracy, use double. For more, see http://en.wikipedia.org/wiki/Floating_point#IEEE_754:_floating_point_in_modern_computers
A float holds about 6-7 decimal digits (23 bits for the fraction), so any number with more significant digits is only an approximation, which leads to that seemingly random number.
For more about floating point format precision: http://en.wikipedia.org/wiki/Single-precision_floating-point_format
For the updated question:
You should never use a floating point format when exact precision is required. We can't just request a larger piece of memory: handling numbers with a very large number of decimal places needs a very large amount of memory, so more complicated methods are used instead (for example, using a string format and then processing the characters successively).
To avoid that problem, use double, which gives about 15-17 decimal digits (52 bits for the fraction), or long double for even more precision.
#include <stdio.h>

int main()
{
    double a = 10000000000000000.0;
    double b = a - 10000000000000000.0;
    printf("%f\n%f\n", a, b);
}
Example: http://ideone.com/rJN1QI
Your confusion is caused by implicit conversions and lack of accuracy of float.
Let me fill in the implicit conversions for you:
float a = (float)10000000000000000.0;
float b = (float)((double)a - 10000000000000000.0);
This converts the literal double to float, and the closest it can get is 10000000272564224. The subtraction is then performed in double, not float, so the second 10000000000000000.0 does not lose precision.
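A small sketch of that difference, assuming IEEE-754: forcing the subtraction into float (by suffixing the second literal) makes the result come out as 0, because both operands round to the very same float:

#include <cstdio>

int main() {
    float a = 10000000000000000.0f;                 // nearest float: 10000000272564224
    float sub_in_double = a - 10000000000000000.0;  // subtraction done in double
    float sub_in_float  = a - 10000000000000000.0f; // subtraction done in float
    std::printf("%f\n%f\n", sub_in_double, sub_in_float); // 272564224.000000, then 0.000000
}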
We can use the nextafter function to get a better idea of the precision of floating-point types. nextafter takes two arguments; it returns the adjacent representable number to its first argument, in the direction of its second argument.
The value 10000000000000000.0 (or 1.0e16) is well within the range of representable values of type float, but that value itself cannot be represented exactly.
Here's a small program that illustrates the issue:
#include <math.h>
#include <stdio.h>

int main()
{
    float a = 10000000000000000.0;
    double d_a = 10000000000000000.0;

    printf("      %20.2f\n", nextafterf(a, 0.0f));
    printf("a   = %20.2f\n", a);
    printf("      %20.2f\n", nextafterf(a, 1.0e30f));
    putchar('\n');

    printf("      %20.2f\n", nextafter(d_a, 0.0));
    printf("d_a = %20.2f\n", d_a);
    printf("      %20.2f\n", nextafter(d_a, 1.0e30));
    putchar('\n');
}
and here's its output on my system:
9999999198822400.00
a = 10000000272564224.00
10000001346306048.00
9999999999999998.00
d_a = 10000000000000000.00
10000000000000002.00
If you use type float, the closest you can get to 10000000000000000.00 is 10000000272564224.00.
But in your second declaration:
float b = a - 10000000000000000.0;
the subtraction is done in type double; the constant 10000000000000000.0 is already of type double, and a is promoted to double to match. So this takes the poor approximation of 1.0e16 that's stored in a, and subtracts from it the much better approximation (in fact it's exact) that can be represented in type double.
