Multiplying floats and keep/get double precision accuracy

Multiplying floats and keep/get double precision accuracy - c++

I have a function that takes floats, I'm doing some computation with them, and I'd like to keep as much accuracy as possible in the returned result. I read that when you multiply two floats, you double the number of significant digits.
So when two floats get multiplied, for example float e, f; and I do double g = e * f, when do the bits get truncated?
In my example function below, do I need casting, and if yes, where? This is in a tight inner loop, if I put static_cast<double>(x) around each variable a b c d where it's used, I get 5-10% slowdown. But I suspect I don't need to cast each variable separately, and only in some locations, if at all? Or does returning a double here do not give me any gain anyway and I can as well just return a float?
double func(float a, float b, float c, float d) {
return (a - b) * c + (a - c) * b;
}

When you multiply two floats without casting, the result is calculated with float precision (i.e. truncated) and then converted to double.
To calculate the result in double, you need to cast at least one operand to double first. Then the entire calculation will be done in double (and all float values will be converted). However, that will create the same slowdown. The slowdown is likely because converting a number from float to double is not entirely trivial (different bit size and range of exponent and mantisa).
If I'd be doing that and have control over the function definition, I'd pass all the arguments as double (I generally use double everywhere, on modern computers the speed difference between calculating in float vs double is negligible, only issues could be memory throughput and cache performance when operating on large arrays of values).
Btw. the case important for precision actually isn't the multiplication, but the addition/subtraction - that is where the precision can make a big difference. Consider adding/subtracting 1e+6 and 1e-3.

Meaning is more important than 5-10% slowdown. What I'd do:
double func_impl(double a, double b, double c, double d) {
return (a - b) * c + (a - c) * b;
}
double func(float a, float b, float c, float d) {
return func_impl(a, b, c, d);
}
I'd choose this even if it's a bit slower, because it expresses the idea that you want double precision in your calculations well and just need the floats on the interface; while it keeps the body of your function separate from the casting (the latter being done in one step).

Related

C++ - Difference between float and double? [duplicate]

I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?

Huge difference.
As the name implies, a double has 2x the precision of float[1]. In general a double has 15 decimal digits of precision, while float has 7.
Here's how the number of digits are calculated:
double has 52 mantissa bits + 1 hidden bit: log(253)÷log(10) = 15.95 digits
float has 23 mantissa bits + 1 hidden bit: log(224)÷log(10) = 7.22 digits
This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.
float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.7g\n", b); // prints 9.000023
while
double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.15g\n", b); // prints 8.99999999999996
Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.
During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.
[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed a IEEE single-precision floating point number (binary32), and double is a IEEE double-precision floating point number (binary64).

Here is what the standard C99 (ISO-IEC 9899 6.2.5 §10) or C++2003 (ISO-IEC 14882-2003 3.1.9 §8) standards say:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The C++ standard adds:
The value representation of floating-point types is implementation-defined.
I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic that covers the IEEE floating-point standard in depth. You'll learn about the representation details and you'll realize there is a tradeoff between magnitude and precision. The precision of the floating point representation increases as the magnitude decreases, hence floating point numbers between -1 and 1 are those with the most precision.

Given a quadratic equation: x2 − 4.0000000 x + 3.9999999 = 0, the exact roots to 10 significant digits are, r1 = 2.000316228 and r2 = 1.999683772.
Using float and double, we can write a test program:
#include <stdio.h>
#include <math.h>
void dbl_solve(double a, double b, double c)
{
double d = b*b - 4.0*a*c;
double sd = sqrt(d);
double r1 = (-b + sd) / (2.0*a);
double r2 = (-b - sd) / (2.0*a);
printf("%.5f\t%.5f\n", r1, r2);
}
void flt_solve(float a, float b, float c)
{
float d = b*b - 4.0f*a*c;
float sd = sqrtf(d);
float r1 = (-b + sd) / (2.0f*a);
float r2 = (-b - sd) / (2.0f*a);
printf("%.5f\t%.5f\n", r1, r2);
}
int main(void)
{
float fa = 1.0f;
float fb = -4.0000000f;
float fc = 3.9999999f;
double da = 1.0;
double db = -4.0000000;
double dc = 3.9999999;
flt_solve(fa, fb, fc);
dbl_solve(da, db, dc);
return 0;
}
Running the program gives me:
2.00000 2.00000
2.00032 1.99968
Note that the numbers aren't large, but still you get cancellation effects using float.
(In fact, the above is not the best way of solving quadratic equations using either single- or double-precision floating-point numbers, but the answer remains unchanged even if one uses a more stable method.)

A double is 64 and single precision
(float) is 32 bits.
The double has a bigger mantissa (the integer bits of the real number).
Any inaccuracies will be smaller in the double.

I just ran into a error that took me forever to figure out and potentially can give you a good example of float precision.
#include <iostream>
#include <iomanip>
int main(){
for(float t=0;t<1;t+=0.01){
std::cout << std::fixed << std::setprecision(6) << t << std::endl;
}
}
The output is
0.000000
0.010000
0.020000
0.030000
0.040000
0.050000
0.060000
0.070000
0.080000
0.090000
0.100000
0.110000
0.120000
0.130000
0.140000
0.150000
0.160000
0.170000
0.180000
0.190000
0.200000
0.210000
0.220000
0.230000
0.240000
0.250000
0.260000
0.270000
0.280000
0.290000
0.300000
0.310000
0.320000
0.330000
0.340000
0.350000
0.360000
0.370000
0.380000
0.390000
0.400000
0.410000
0.420000
0.430000
0.440000
0.450000
0.460000
0.470000
0.480000
0.490000
0.500000
0.510000
0.520000
0.530000
0.540000
0.550000
0.560000
0.570000
0.580000
0.590000
0.600000
0.610000
0.620000
0.630000
0.640000
0.650000
0.660000
0.670000
0.680000
0.690000
0.700000
0.710000
0.720000
0.730000
0.740000
0.750000
0.760000
0.770000
0.780000
0.790000
0.800000
0.810000
0.820000
0.830000
0.839999
0.849999
0.859999
0.869999
0.879999
0.889999
0.899999
0.909999
0.919999
0.929999
0.939999
0.949999
0.959999
0.969999
0.979999
0.989999
0.999999
As you can see after 0.83, the precision runs down significantly.
However, if I set up t as double, such an issue won't happen.
It took me five hours to realize this minor error, which ruined my program.

There are three floating point types:
float
double
long double
A simple Venn diagram will explain about:
The set of values of the types

The size of the numbers involved in the float-point calculations is not the most relevant thing. It's the calculation that is being performed that is relevant.
In essence, if you're performing a calculation and the result is an irrational number or recurring decimal, then there will be rounding errors when that number is squashed into the finite size data structure you're using. Since double is twice the size of float then the rounding error will be a lot smaller.
The tests may specifically use numbers which would cause this kind of error and therefore tested that you'd used the appropriate type in your code.

Type float, 32 bits long, has a precision of 7 digits. While it may store values with very large or very small range (+/- 3.4 * 10^38 or * 10^-38), it has only 7 significant digits.
Type double, 64 bits long, has a bigger range (*10^+/-308) and 15 digits precision.
Type long double is nominally 80 bits, though a given compiler/OS pairing may store it as 12-16 bytes for alignment purposes. The long double has an exponent that just ridiculously huge and should have 19 digits precision. Microsoft, in their infinite wisdom, limits long double to 8 bytes, the same as plain double.
Generally speaking, just use type double when you need a floating point value/variable. Literal floating point values used in expressions will be treated as doubles by default, and most of the math functions that return floating point values return doubles. You'll save yourself many headaches and typecastings if you just use double.

Floats have less precision than doubles. Although you already know, read What WE Should Know About Floating-Point Arithmetic for better understanding.

When using floating point numbers you cannot trust that your local tests will be exactly the same as the tests that are done on the server side. The environment and the compiler are probably different on you local system and where the final tests are run. I have seen this problem many times before in some TopCoder competitions especially if you try to compare two floating point numbers.

The built-in comparison operations differ as in when you compare 2 numbers with floating point, the difference in data type (i.e. float or double) may result in different outcomes.

If one works with embedded processing, eventually the underlying hardware (e.g. FPGA or some specific processor / microcontroller model) will have float implemented optimally in hardware whereas double will use software routines. So if the precision of a float is enough to handle the needs, the program will execute some times faster with float then double. As noted on other answers, beware of accumulation errors.

Quantitatively, as other answers have pointed out, the difference is that type double has about twice the precision, and three times the range, as type float (depending on how you count).
But perhaps even more important is the qualitative difference. Type float has good precision, which will often be good enough for whatever you're doing. Type double, on the other hand, has excellent precision, which will almost always be good enough for whatever you're doing.
The upshot, which is not nearly as well known as it should be, is that you should almost always use type double. Unless you have some particularly special need, you should almost never use type float.
As everyone knows, "roundoff error" is often a problem when you're doing floating-point work. Roundoff error can be subtle, and difficult to track down, and difficult to fix. Most programmers don't have the time or expertise to track down and fix numerical errors in floating-point algorithms — because unfortunately, the details end up being different for every different algorithm. But type double has enough precision such that, much of the time, you don't have to worry.
You'll get good results anyway. With type float, on the other hand, alarming-looking issues with roundoff crop up all the time.
And the thing that's not necessarily different between type float and double is execution speed. On most of today's general-purpose processors, arithmetic operations on type float and double take more or less exactly the same amount of time. Everything's done in parallel, so you don't pay a speed penalty for the greater range and precision of type double. That's why it's safe to make the recommendation that you should almost never use type float: Using double shouldn't cost you anything in speed, and it shouldn't cost you much in space, and it will almost definitely pay off handsomely in freedom from precision and roundoff error woes.
(With that said, though, one of the "special needs" where you may need type float is when you're doing embedded work on a microcontroller, or writing code that's optimized for a GPU. On those processors, type double can be significantly slower, or practically nonexistent, so in those cases programmers do typically choose type float for speed, and maybe pay for it in precision.)

Unlike an int (whole number), a float have a decimal point, and so can a double.
But the difference between the two is that a double is twice as detailed as a float, meaning that it can have double the amount of numbers after the decimal point.

Conversion from float to double in c++

I wrote a code to calculate error function or double erf(double x). It uses lots of constants in calculations which uses double as well. However, the requirement is to write a code float format or float erf(float). I have to maintain 6 decimals accuracy (typically for float).
When I converted erf(x) into float erf( double x), the results are still the same and accurate. However when I convert x to float or float erf(float x) I am getting some significant errors in small values of x.
Is there a way to convert float to double for x so that precision is still maintained within the code of erf(x)? My intuition tells me that my erf code is good only for double value numbers.

You can't convert from float to double and except that float will have the same precision of double.
With double you get double the precision of a float
Note that in C++ you have erf: http://en.cppreference.com/w/cpp/numeric/math/erf

Inside float erf(float x) you can cast the value of x to double at points where precision exceeding float is required.
float demoA(float x)
{
return x*x*x-1;
}
float demoB(float x)
{
return static_cast<double>(x)*x*x - 1;
}
In this case, demoB will return a much better value than demoA if the paramerter is close to one. The conversion of the first operator of the multiplication to double is enough, because it causes promotion of the other operand.

Division of double with a float

I have a doubt about precision and speed in division between double and float.
e.g:
double a;
a=myfun(); //returns a number with lots of decimals
float b=5.0;
double result=a/b;
Would the result change if b would be double?
Does it take more time to compute if they are not doubles (because of changing the size of the float for fitting the double size)?

The time difference between conversion from float to double or double to float is really negligible
Check out this link it will surely help you.

Would the result change if b would be double?
Since the value is 0.5, the result should not change. If it was a different value, it might change, because double has better precision then float.
Does it take more time to compute if they are not doubles?
Yes, it does. But the time to convert from float to double can be really neglected.

have you tried doing this? b is converted to double during division anyway. Floating point divisions are expensive, and time taken for divisions of float is slightly faster.

How do I set precision while using type double / float in C++

In my previous question Comparing a double and int, without casting or conversion, I found out how the the difference between two doubles was tripping the comparison.
I came accross the method setprecision(), which will help display all the numbers after decimal.
So, the difference of 6.15 and 3.15 was found to be : 3.00000000000000044408920985006
Now, when it gets compared with 3, it returns a result saying it is greater than 3.
How do I force it to take only a limited number of digits?
When I used 6.1 and 3.1, the difference was : 2.99999999999999955591079014994
How should I make the precision so that we know that it is actually equal to 3, and not less than.

Hopefully you should be knowing that floating/double cannot be exactly represented in binary and truncation happens because of the recurring decimal. Your comparison with a float/double with an integer will always fail.
Even your using setprecision will not work because its a method of iomanip to set the precision of the display and not the value being stored.
The portable way of comparing doubles is not to use the '==' operator but to do something like
bool Compare(double a,double b) {
std::fabs(a - b) < std::numeric_limits<double>::epsilon();
}
You can use this to compare double with float and or integer. You can also write a similar compare function for float
bool Compare(float a,float b) {
std::fabs(a - b) < std::numeric_limits<float>::epsilon();
}

In a comment on your other question, you were already pointed towards this great paper on floating-point numbers. It's well worth a read.
With reference to your particular question, a standard way is to define a tolerance with which comparison between doubles is to be made. For example if you have two doubles a and b, and wish to determine whether a is larger than b within a tolerance of eps (another double), you might do something like:
if (a - b > eps) {
// a is greater than b
} else {
// a is not greater than b
}
Alternatively, if you want to know that a is equal to b within the tolerance specified by eps, you might do something of this sort:
if (std::abs(a - b) <= eps) {
// a and b are equal within the set tolerance
} else {
// a and b are not equal within the set tolerance
}
As pointed out by others, C++ comes with some helpful functions out of the box for performing these sorts of comparisons. Look at std::abs, std::numeric_limits, and this nice post on SO.

Here's a comparison function that determines if two numbers are within one LSB of each other.
bool Compare(double a, double b)
{
return (a <= std::nextafter(b, abs(1.1*b))) && (b <= std::nextafter(a, abs(1.1*a)));
}
std::nextafter is new to C++11, but versions are available in earlier compilers. See Generate next largest or smallest representable floating point number without bit twiddling

setprecision lets you select how many digits you spit out to a stream. It does not decide the number of digits to be considered. For rounding purposes use one of the rounding functions from <cmath>.

The no of digits of precision in float and double actually depends on the size of them respectively.That is why float generally has less precision than double.
You can use
std::cout<<std::setprecision(desired no);
float a=(desired no);
Now you have sucessfully set the precision of float.The same can be done to other appropriate data type including double as well.
Warning do not set precision greater than what a data type has to offer.Double has 15 to 18 digits of precision while float has only 6 to 9 digits of precision.

What is the difference between float and double?

I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?

Huge difference.
As the name implies, a double has 2x the precision of float[1]. In general a double has 15 decimal digits of precision, while float has 7.
Here's how the number of digits are calculated:
double has 52 mantissa bits + 1 hidden bit: log(253)÷log(10) = 15.95 digits
float has 23 mantissa bits + 1 hidden bit: log(224)÷log(10) = 7.22 digits
This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.
float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.7g\n", b); // prints 9.000023
while
double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
b += a;
printf("%.15g\n", b); // prints 8.99999999999996
Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.
During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.
Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.
[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed a IEEE single-precision floating point number (binary32), and double is a IEEE double-precision floating point number (binary64).

Here is what the standard C99 (ISO-IEC 9899 6.2.5 §10) or C++2003 (ISO-IEC 14882-2003 3.1.9 §8) standards say:
There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.
The C++ standard adds:
The value representation of floating-point types is implementation-defined.
I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic that covers the IEEE floating-point standard in depth. You'll learn about the representation details and you'll realize there is a tradeoff between magnitude and precision. The precision of the floating point representation increases as the magnitude decreases, hence floating point numbers between -1 and 1 are those with the most precision.

Given a quadratic equation: x2 − 4.0000000 x + 3.9999999 = 0, the exact roots to 10 significant digits are, r1 = 2.000316228 and r2 = 1.999683772.
Using float and double, we can write a test program:
#include <stdio.h>
#include <math.h>
void dbl_solve(double a, double b, double c)
{
double d = b*b - 4.0*a*c;
double sd = sqrt(d);
double r1 = (-b + sd) / (2.0*a);
double r2 = (-b - sd) / (2.0*a);
printf("%.5f\t%.5f\n", r1, r2);
}
void flt_solve(float a, float b, float c)
{
float d = b*b - 4.0f*a*c;
float sd = sqrtf(d);
float r1 = (-b + sd) / (2.0f*a);
float r2 = (-b - sd) / (2.0f*a);
printf("%.5f\t%.5f\n", r1, r2);
}
int main(void)
{
float fa = 1.0f;
float fb = -4.0000000f;
float fc = 3.9999999f;
double da = 1.0;
double db = -4.0000000;
double dc = 3.9999999;
flt_solve(fa, fb, fc);
dbl_solve(da, db, dc);
return 0;
}
Running the program gives me:
2.00000 2.00000
2.00032 1.99968
Note that the numbers aren't large, but still you get cancellation effects using float.
(In fact, the above is not the best way of solving quadratic equations using either single- or double-precision floating-point numbers, but the answer remains unchanged even if one uses a more stable method.)

A double is 64 and single precision
(float) is 32 bits.
The double has a bigger mantissa (the integer bits of the real number).
Any inaccuracies will be smaller in the double.

I just ran into a error that took me forever to figure out and potentially can give you a good example of float precision.
#include <iostream>
#include <iomanip>
int main(){
for(float t=0;t<1;t+=0.01){
std::cout << std::fixed << std::setprecision(6) << t << std::endl;
}
}
The output is
0.000000
0.010000
0.020000
0.030000
0.040000
0.050000
0.060000
0.070000
0.080000
0.090000
0.100000
0.110000
0.120000
0.130000
0.140000
0.150000
0.160000
0.170000
0.180000
0.190000
0.200000
0.210000
0.220000
0.230000
0.240000
0.250000
0.260000
0.270000
0.280000
0.290000
0.300000
0.310000
0.320000
0.330000
0.340000
0.350000
0.360000
0.370000
0.380000
0.390000
0.400000
0.410000
0.420000
0.430000
0.440000
0.450000
0.460000
0.470000
0.480000
0.490000
0.500000
0.510000
0.520000
0.530000
0.540000
0.550000
0.560000
0.570000
0.580000
0.590000
0.600000
0.610000
0.620000
0.630000
0.640000
0.650000
0.660000
0.670000
0.680000
0.690000
0.700000
0.710000
0.720000
0.730000
0.740000
0.750000
0.760000
0.770000
0.780000
0.790000
0.800000
0.810000
0.820000
0.830000
0.839999
0.849999
0.859999
0.869999
0.879999
0.889999
0.899999
0.909999
0.919999
0.929999
0.939999
0.949999
0.959999
0.969999
0.979999
0.989999
0.999999
As you can see after 0.83, the precision runs down significantly.
However, if I set up t as double, such an issue won't happen.
It took me five hours to realize this minor error, which ruined my program.

There are three floating point types:
float
double
long double
A simple Venn diagram will explain about:
The set of values of the types

The size of the numbers involved in the float-point calculations is not the most relevant thing. It's the calculation that is being performed that is relevant.
In essence, if you're performing a calculation and the result is an irrational number or recurring decimal, then there will be rounding errors when that number is squashed into the finite size data structure you're using. Since double is twice the size of float then the rounding error will be a lot smaller.
The tests may specifically use numbers which would cause this kind of error and therefore tested that you'd used the appropriate type in your code.

Type float, 32 bits long, has a precision of 7 digits. While it may store values with very large or very small range (+/- 3.4 * 10^38 or * 10^-38), it has only 7 significant digits.
Type double, 64 bits long, has a bigger range (*10^+/-308) and 15 digits precision.
Type long double is nominally 80 bits, though a given compiler/OS pairing may store it as 12-16 bytes for alignment purposes. The long double has an exponent that just ridiculously huge and should have 19 digits precision. Microsoft, in their infinite wisdom, limits long double to 8 bytes, the same as plain double.
Generally speaking, just use type double when you need a floating point value/variable. Literal floating point values used in expressions will be treated as doubles by default, and most of the math functions that return floating point values return doubles. You'll save yourself many headaches and typecastings if you just use double.

Floats have less precision than doubles. Although you already know, read What WE Should Know About Floating-Point Arithmetic for better understanding.

When using floating point numbers you cannot trust that your local tests will be exactly the same as the tests that are done on the server side. The environment and the compiler are probably different on you local system and where the final tests are run. I have seen this problem many times before in some TopCoder competitions especially if you try to compare two floating point numbers.

The built-in comparison operations differ as in when you compare 2 numbers with floating point, the difference in data type (i.e. float or double) may result in different outcomes.

If one works with embedded processing, eventually the underlying hardware (e.g. FPGA or some specific processor / microcontroller model) will have float implemented optimally in hardware whereas double will use software routines. So if the precision of a float is enough to handle the needs, the program will execute some times faster with float then double. As noted on other answers, beware of accumulation errors.

Unlike an int (whole number), a float have a decimal point, and so can a double.
But the difference between the two is that a double is twice as detailed as a float, meaning that it can have double the amount of numbers after the decimal point.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js