Conversion from float to double in C++

I wrote code to calculate the error function, double erf(double x). It uses lots of constants in its calculations, and those are double as well. However, the requirement is to write the code in float format, i.e. float erf(float). I have to maintain 6 decimal digits of accuracy (typical for float).
When I converted erf(x) into float erf(double x), the results were still the same and accurate. However, when I convert x to float, i.e. float erf(float x), I get significant errors for small values of x.
Is there a way to convert float to double for x so that precision is still maintained within the code of erf(x)? My intuition tells me that my erf code is good only for double value numbers.

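You can't convert from float to double and expect the float value to gain the precision of a double.
A double carries roughly twice the precision of a float: with IEEE 754, 53 significand bits (about 15 to 16 decimal digits) versus 24 bits (about 7 decimal digits).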
Note that in C++ you have erf: http://en.cppreference.com/w/cpp/numeric/math/erf

Inside float erf(float x) you can cast the value of x to double at points where precision exceeding float is required.
float demoA(float x)
{
return x*x*x-1;
}
float demoB(float x)
{
return static_cast<double>(x)*x*x - 1;
}
In this case, demoB will return a much better value than demoA if the parameter is close to one. Converting the first operand of the multiplication to double is enough, because it causes promotion of the other operands.
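A minimal sketch to measure the difference between the two functions above. demoA and demoB are copied from the answer; demoReference, demoAError and demoBError are hypothetical helpers added only for illustration, and the sketch assumes a platform where float expressions are evaluated in float precision (the common case on x86-64 and ARM64).

```cpp
#include <cmath>

// demoA and demoB are copied from the answer above.
float demoA(float x) { return x * x * x - 1; }
float demoB(float x) { return static_cast<double>(x) * x * x - 1; }

// Hypothetical helper: reference value computed entirely in double.
double demoReference(float x)
{
    const double xd = x;
    return xd * xd * xd - 1.0;
}

// Hypothetical helpers: absolute error of each variant against the reference.
double demoAError(float x) { return std::fabs(demoA(x) - demoReference(x)); }
double demoBError(float x) { return std::fabs(demoB(x) - demoReference(x)); }
```

For x = 1.000244140625f (that is, 1 + 2^-12), working through the rounding by hand under the assumption above gives an error of about 6e-8 for demoA and about 1.5e-11 for demoB, on a result of roughly 7.3e-4.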

Related

Is there any difference between using floating point casts vs floating point suffixes in C and C++?

Is there a difference between this (using floating point literal suffixes):
float MY_FLOAT = 3.14159265358979323846264338328f; // f suffix
double MY_DOUBLE = 3.14159265358979323846264338328; // no suffix
long double MY_LONG_DOUBLE = 3.14159265358979323846264338328L; // L suffix
vs this (using floating point casts):
float MY_FLOAT = (float)3.14159265358979323846264338328;
double MY_DOUBLE = (double)3.14159265358979323846264338328;
long double MY_LONG_DOUBLE = (long double)3.14159265358979323846264338328;
in C and C++?
Note: the same would go for function calls:
void my_func(long double value);
my_func(3.14159265358979323846264338328L);
// vs
my_func((long double)3.14159265358979323846264338328);
// etc.
Related:
What's the C++ suffix for long double literals?
https://en.cppreference.com/w/cpp/language/floating_literal
The default is double. Assuming IEEE754 floating point, double is a strict superset of float, and thus you will never lose precision by not specifying f. EDIT: this is only true when specifying values that can be represented by float. If rounding occurs this might not be strictly true due to having rounding twice, see Eric Postpischil's answer. So you should also use the f suffix for floats.
This example is also problematic:
long double MY_LONG_DOUBLE = (long double)3.14159265358979323846264338328;
This first gives a double constant which is then converted to long double. But because you started with a double you have already lost precision that will never come back. Therefore, if you want to use full precision in long double constants you must use the L suffix:
long double MY_LONG_DOUBLE = 3.14159265358979323846264338328L; // L suffix
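The effect described above can be checked with a short sketch. The names kPiSuffix, kPiCast, kLongDoubleIsWider and suffixDiffersFromCast are hypothetical and exist only for this illustration. Whether the two spellings differ depends on the platform: on some implementations long double has the same format as double, in which case nothing is lost.

```cpp
#include <cfloat>

// The literal from the question, once with the L suffix and once cast from double.
const long double kPiSuffix = 3.14159265358979323846264338328L;
const long double kPiCast = (long double)3.14159265358979323846264338328;

// True when long double carries more significand bits than double on this platform.
const bool kLongDoubleIsWider = LDBL_MANT_DIG > DBL_MANT_DIG;

// Hypothetical helper: reports whether the two spellings produced different values.
bool suffixDiffersFromCast() { return kPiSuffix != kPiCast; }
```

On a platform where kLongDoubleIsWider is true, suffixDiffersFromCast() is expected to return true, because the cast version only holds the digits that survived the rounding to double.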
There is a difference between using a suffix and a cast; 8388608.5000000009f and (float) 8388608.5000000009 have different values in common C implementations. This code:
#include <stdio.h>
int main(void)
{
float x = 8388608.5000000009f;
float y = (float) 8388608.5000000009;
printf("%.9g - %.9g = %.9g.\n", x, y, x-y);
}
prints “8388609 - 8388608 = 1.” in Apple Clang 11.0 and other implementations that use correct rounding with IEEE-754 binary32 for float and binary64 for double. (The C standard permits implementations to use methods other than IEEE-754 correct rounding, so other C implementations may have different results.)
The reason is that (float) 8388608.5000000009 contains two rounding operations. With the suffix, 8388608.5000000009f is converted directly to float, so the portion that must be discarded in order to fit in a float, .5000000009, is directly examined in order to see whether it is greater than .5 or not. It is, so the result is rounded up to the next representable value, 8388609.
Without the suffix, 8388608.5000000009 is first converted to double. When the portion that must be discarded, .0000000009, is considered, it is found to be less than ½ the low bit at the point of truncation. (The value of the low bit there is .00000000186264514923095703125, and half of it is .000000000931322574615478515625.) So the result is rounded down, and we have 8388608.5 as a double. When the cast rounds this to float, the portion that must be discarded is .5, which is exactly halfway between the representable numbers 8388608 and 8388609. The rule for breaking ties rounds it to the value with the even low bit, 8388608.
(Another example is “7.038531e-26”; (float) 7.038531e-26 is not equal to 7.038531e-26f. This is the only such numeral with fewer than eight significant digits when float is binary32 and double is binary64, except of course “-7.038531e-26”.)
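The low-bit values quoted in the explanation above can be reproduced with std::nextafter. This is only a sketch: doubleSpacingAbove and floatSpacingAbove are hypothetical helper names, and the numbers assume IEEE-754 binary64 for double and binary32 for float.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: distance from x to the next larger double.
double doubleSpacingAbove(double x)
{
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

// Hypothetical helper: distance from x to the next larger float.
float floatSpacingAbove(float x)
{
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

Under those assumptions, doubleSpacingAbove(8388608.5) is 2^-29, the .00000000186264514923095703125 quoted above, and floatSpacingAbove(8388608.0f) is 1, which is why .5 is an exact tie when the double is rounded to float.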
While you do not lose precision if you omit the f in a float constant, there can be surprises in so doing.
Consider this:
#include <stdio.h>
#define DCN 0.1
#define FCN 0.1f
int main( void)
{
float f = DCN;
printf( "DCN\t%s\n", f > DCN ? "more" : "not-more");
float g = FCN;
printf( "FCN\t%s\n", g > FCN ? "more" : "not-more");
return 0;
}
This (compiled with gcc 9.1.1) produces the output
DCN more
FCN not-more
The explanation is that in f > DCN the compiler takes DCN to have type double and so promotes f to a double, and
(double)(float)0.1 > 0.1
Personally on the (rare) occasions when I need float constants, I always use a 'f' suffix.

Multiplying floats and keep/get double precision accuracy

I have a function that takes floats, I'm doing some computation with them, and I'd like to keep as much accuracy as possible in the returned result. I read that when you multiply two floats, you double the number of significant digits.
So when two floats get multiplied, for example float e, f; and I do double g = e * f, when do the bits get truncated?
In my example function below, do I need casting, and if yes, where? This is in a tight inner loop; if I put static_cast<double>(x) around each variable a b c d where it's used, I get a 5-10% slowdown. But I suspect I don't need to cast each variable separately, only in some locations, if at all? Or does returning a double here not give me any gain anyway, so that I can just as well return a float?
double func(float a, float b, float c, float d) {
return (a - b) * c + (a - c) * b;
}
When you multiply two floats without casting, the result is calculated with float precision (i.e. rounded to float) and only then converted to double.
To calculate the result in double, you need to cast at least one operand to double first. Then the entire calculation will be done in double (and all float values will be converted). However, that will create the same slowdown. The slowdown is likely because converting a number from float to double is not entirely trivial (different bit size and range of exponent and mantissa).
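A small sketch of where the rounding happens. The function names productRoundedInFloat and productInDouble are hypothetical, and the sketch assumes the compiler evaluates float expressions in float (FLT_EVAL_METHOD == 0, the usual case on x86-64 and ARM64).

```cpp
// Multiplies in float; the product is rounded to float before it is widened to double.
double productRoundedInFloat(float e, float f)
{
    return e * f;
}

// Multiplies in double; the cast promotes f as well, so no float rounding takes place.
double productInDouble(float e, float f)
{
    return static_cast<double>(e) * f;
}
```

With e = f = 1.000244140625f (1 + 2^-12), the exact product is 1 + 2^-11 + 2^-24. That value needs 25 significand bits, so productRoundedInFloat loses the last term, while productInDouble keeps it.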
If I needed double precision here and had control over the function definition, I'd pass all the arguments as double. (I generally use double everywhere; on modern computers the speed difference between calculating in float and in double is negligible, and the only issues could be memory throughput and cache performance when operating on large arrays of values.)
Btw. the case important for precision actually isn't the multiplication, but the addition/subtraction - that is where the precision can make a big difference. Consider adding/subtracting 1e+6 and 1e-3.
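The 1e+6 and 1e-3 case mentioned above can be sketched as follows. roundTripInFloat and roundTripInDouble are hypothetical names chosen for this illustration; the behaviour described assumes IEEE-754 binary32 and binary64.

```cpp
// Adds small to big and subtracts big again, entirely in float.
float roundTripInFloat(float big, float small)
{
    const float sum = big + small;  // rounded to the nearest float
    return sum - big;
}

// Same computation carried out in double.
double roundTripInDouble(double big, double small)
{
    const double sum = big + small;
    return sum - big;
}
```

Near 1e6 neighbouring floats are 0.0625 apart, so adding 1e-3 changes nothing and roundTripInFloat(1e6f, 1e-3f) gives 0. In double the small term survives and roundTripInDouble(1e6, 1e-3) comes back as approximately 0.001.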
Meaning is more important than a 5-10% slowdown. What I'd do:
double func_impl(double a, double b, double c, double d) {
return (a - b) * c + (a - c) * b;
}
double func(float a, float b, float c, float d) {
return func_impl(a, b, c, d);
}
I'd choose this even if it's a bit slower, because it expresses the idea that you want double precision in your calculations well and just need the floats on the interface; while it keeps the body of your function separate from the casting (the latter being done in one step).

How does Clojure on the JVM convert from a float to a double with trailing garbage digits?

I run:
(double (float 3.14159))
and I get:
3.141590118408203
If I run it down to:
(double (float 3.141))
I get:
3.1410000324249268
By what method is this conversion happening?
This is related to the general precision problem with floating-point numbers: binary floating-point values do not map directly to decimal numbers, so the standard decimal display of floats and doubles is rounded. In other words, the float and the double here have exactly the same value, but they are rounded differently when displayed.
=> (= (double (float 3.14159)) (float 3.14159))
true
On the JVM, Why converting from float to double changes the value? and Convert float to double without losing precision may help.
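The same effect can be reproduced outside the JVM. The sketch below uses C++ rather than Clojure, with hypothetical helpers widen and formatDigits, and assumes float is IEEE-754 binary32. It shows that the widening itself is exact and that only the number of printed digits changes.

```cpp
#include <cstdio>
#include <string>

// Widening a float to double is exact: the double holds the same value.
double widen(float f)
{
    return static_cast<double>(f);
}

// Hypothetical helper: formats value with the given number of significant digits.
std::string formatDigits(double value, int significantDigits)
{
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%.*g", significantDigits, value);
    return buffer;
}
```

formatDigits(widen(3.14159f), 6) gives 3.14159, while formatDigits(widen(3.14159f), 16) gives 3.141590118408203. The extra digits are not garbage; they are the exact value the float held all along, which a float-sized display simply never shows.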

Notation of double floating point values in C and C++

What's the notation for double precision floating point values in C/C++?
.5 is representing a double or a float value?
I'm pretty sure 2.0f is parsed as a float and 2.0 as a double but what about .5?
http://c.comsci.us/etymology/literals.html
It's double. Suffix it with f to get float.
And here is the link to reference document: http://en.cppreference.com/w/cpp/language/floating_literal
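The literal types can also be checked by the compiler itself. A minimal C++11 sketch, with hypothetical constant names:

```cpp
#include <type_traits>

// Each flag is evaluated at compile time from the type of the literal.
constexpr bool kPlainLiteralIsDouble = std::is_same<decltype(.5), double>::value;
constexpr bool kFSuffixIsFloat = std::is_same<decltype(.5f), float>::value;
constexpr bool kLSuffixIsLongDouble = std::is_same<decltype(.5L), long double>::value;

static_assert(kPlainLiteralIsDouble, ".5 without a suffix has type double");
static_assert(kFSuffixIsFloat, ".5f has type float");
static_assert(kLSuffixIsLongDouble, ".5L has type long double");
```

If any of these assumptions about the literal types were wrong, the translation unit would fail to compile.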
Technically, initializing a float with a double constant can lead to a different result (i.e. accumulate two rounding errors) than initializing it with a float constant.
Here is an example:
#include <stdio.h>
int main() {
double d=8388609.499999999068677425384521484375;
float f1=8388609.499999999068677425384521484375f;
float f2=8388609.499999999068677425384521484375;
float f3=(float) d;
printf("f1=%f f2=%f f3=%f\n",f1,f2,f3);
}
With gcc 4.2.1 on i686 I get
f1=8388609.000000 f2=8388610.000000 f3=8388610.000000
In base 2, the constant is exactly:
100000000000000000000001.011111111111111111111111111111
This base 2 representation requires 54 bits, but a double only has 53. So when converted to double, it is rounded to the nearest double, ties to even, thus to:
100000000000000000000001.10000000000000000000000000000
This base 2 representation requires 25 bits, but a float only has 24, so if you convert this double to a float, another rounding occurs to the nearest float, ties to even, thus to:
100000000000000000000010.
If you convert the first number directly to a float, the single rounding is different:
100000000000000000000001.
As we can see, when initializing f2, gcc converts the decimal representation to a double, then to a float (it would be interesting to check whether this behaviour is determined by a standard).
However, as this is a specially crafted number, most of the time you shouldn't encounter such a difference.

Division of double with a float

I have a question about precision and speed in a division between a double and a float.
e.g:
double a;
a=myfun(); //returns a number with lots of decimals
float b=5.0;
double result=a/b;
Would the result change if b would be double?
Does it take more time to compute if they are not doubles (because of changing the size of the float for fitting the double size)?
The time needed to convert from float to double, or from double to float, is negligible.
Check out this link it will surely help you.
Would the result change if b would be double?
Since the value is 5.0, which is exactly representable in both float and double, the result should not change. If it were a value that float cannot represent exactly, the result might change, because double has better precision than float.
Does it take more time to compute if they are not doubles?
Yes, it does. But the time to convert from float to double is negligible.
Have you tried doing this? b is converted to double during the division anyway. Floating-point divisions are expensive, and a division of floats is slightly faster than a division of doubles.
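Both points from the answers above can be sketched in a few lines: the float operand is promoted to double before the division, and the result only changes when the divisor is not exactly representable as a float. The names divideByFloat and divideByDouble are hypothetical, and the values assume IEEE-754 float and double.

```cpp
#include <type_traits>

// The float operand is converted to double before dividing, so the result type is double.
static_assert(std::is_same<decltype(1.0 / 5.0f), double>::value,
              "double / float yields double");

// Divisor stored as float, as in the question.
double divideByFloat(double a, float b) { return a / b; }

// Divisor stored as double.
double divideByDouble(double a, double b) { return a / b; }
```

divideByFloat(1.0, 5.0f) equals divideByDouble(1.0, 5.0), because 5.0 is the same value in both types. divideByFloat(1.0, 0.1f) differs from divideByDouble(1.0, 0.1), because 0.1f and 0.1 are different approximations of one tenth.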