In Fortran 90 (using gfortran on Mac OS X) if I assign a value to a double-precision variable without explicitly tacking on a kind, the precision doesn't "take." What I mean is, if I run the following program:
program sample_dp
   implicit none
   integer, parameter :: sp = kind(1.0)
   integer, parameter :: dp = kind(1.0d0)
   real(sp) :: a = 0.
   real(dp) :: b = 0., c = 0., d = 0.0_dp
   ! assign values
   a = 0.12345678901234567890
   b = 0.12345678901234567890
   c = DBLE(0.12345678901234567890)
   d = 0.12345678901234567890_dp
   write(*,101) a, b, c, d
101 format(1x, 'Single precision: ', T27, F17.15, / &
           1x, 'Double precision: ', T27, F17.15, / &
           1x, 'Double precision (DBLE): ', T27, F17.15, / &
           1x, 'Double precision (_dp): ', T27, F17.15)
end program sample_dp
I get the result:
 Single precision:        0.123456791043282
 Double precision:        0.123456791043282
 Double precision (DBLE): 0.123456791043282
 Double precision (_dp):  0.123456789012346
The single precision result starts rounding off at the 8th decimal as expected, but only the double precision variable I assigned explicitly with _dp keeps all 16 digits of precision. This seems odd: I would expect (I'm relatively new to Fortran) a double precision variable to hold a double precision value automatically. Is there a better way to assign double precision values, or do I have to tag every constant explicitly as above?
A real constant which isn't marked as double precision is assumed to be single precision. Just because you later assign it to a double precision variable, or convert it to double precision, that doesn't mean the value will 'magically' become double precision: the compiler doesn't look ahead to see how the value will be used.
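A minimal sketch of the same point (the printed digits are what gfortran typically shows; they may vary slightly by platform):

program literal_kind
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   real(dp) :: b
   ! The right-hand side is evaluated on its own, as a default (single)
   ! precision constant; only then is the result converted for the assignment.
   b = 0.12345678901234567890
   print *, b   ! approx. 0.12345679104328156 -- only single-precision digits
   b = 0.12345678901234567890_dp
   print *, b   ! approx. 0.12345678901234568 -- full double precision
end program literal_kind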
Several questions link here, so it is worth stating some details more explicitly, with examples, especially for beginners.
As stated by MRAB in his correct answer, an expression is always evaluated without regard to the context in which it appears, so
0.12345678901234567890
is a default (single) precision floating-point literal, no matter where it appears. The same holds for floating-point numbers in exponential form:
0.12345678901234567890E0
which is also a default precision number.
If one wants to use a double precision constant, one can use D instead of E in the above form. Even if such a double precision constant is assigned to a default precision variable, it is first treated as a double precision number and only then converted to default precision.
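For example, in the question's program one could equally write:

b = 0.12345678901234567890D0   ! a double precision constant, all digits kept
b = 0.12345678901234567890E0   ! a default (single) precision constant, digits lost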
The notation you are using in your question (a kind suffix with named kind constants) is more general and more modern, but the principle is the same:
0.12345678901234567890_sp
is a number of kind sp and
0.12345678901234567890_dp
is a number of kind dp, and it does not matter where they appear.
As your example shows, it is not only about assignment. In the line
c = DBLE(0.12345678901234567890)
the number 0.12345678901234567890 is first evaluated as a default precision constant. It is then converted to double precision by DBLE, but only after some of the digits have already been lost. Finally, this new double precision number is assigned to c.
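If DBLE is to be used at all, its argument must itself already be a double precision constant, for example:

c = DBLE(0.12345678901234567890D0)   ! nothing is lost; the conversion is a no-op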
Related
Is there a difference between this (using floating point literal suffixes):
float MY_FLOAT = 3.14159265358979323846264338328f; // f suffix
double MY_DOUBLE = 3.14159265358979323846264338328; // no suffix
long double MY_LONG_DOUBLE = 3.14159265358979323846264338328L; // L suffix
vs this (using floating point casts):
float MY_FLOAT = (float)3.14159265358979323846264338328;
double MY_DOUBLE = (double)3.14159265358979323846264338328;
long double MY_LONG_DOUBLE = (long double)3.14159265358979323846264338328;
in C and C++?
Note: the same would go for function calls:
void my_func(long double value);
my_func(3.14159265358979323846264338328L);
// vs
my_func((long double)3.14159265358979323846264338328);
// etc.
Related:
What's the C++ suffix for long double literals?
https://en.cppreference.com/w/cpp/language/floating_literal
The default is double. Assuming IEEE 754 floating point, double is a strict superset of float, so you might think you will never lose precision by not specifying f. EDIT: this is only true for values that can be represented exactly by float. If rounding occurs, rounding twice can change the result, see Eric Postpischil's answer. So you should use the f suffix for float constants as well.
This example is also problematic:
long double MY_LONG_DOUBLE = (long double)3.14159265358979323846264338328;
This first gives a double constant, which is then converted to long double. But because you started with a double, you have already lost precision that will never come back. Therefore, if you want to use full precision in long double constants you must use the L suffix:
long double MY_LONG_DOUBLE = 3.14159265358979323846264338328L; // L suffix
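A small sketch of the difference (the printed digits assume a platform such as x86 where long double is 80-bit extended precision, i.e. wider than double; exact output varies):

#include <stdio.h>

int main(void)
{
    long double with_suffix = 3.14159265358979323846264338328L;
    long double via_double  = (long double)3.14159265358979323846264338328;
    printf("suffix: %.21Lg\n", with_suffix);   /* e.g. 3.14159265358979323851 */
    printf("cast:   %.21Lg\n", via_double);    /* e.g. 3.14159265358979311600 */
}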
There is a difference between using a suffix and a cast; 8388608.5000000009f and (float) 8388608.5000000009 have different values in common C implementations. This code:
#include <stdio.h>

int main(void)
{
    float x = 8388608.5000000009f;
    float y = (float) 8388608.5000000009;
    printf("%.9g - %.9g = %.9g.\n", x, y, x-y);
}
prints “8388609 - 8388608 = 1.” in Apple Clang 11.0 and other implementations that use correct rounding with IEEE-754 binary32 for float and binary64 for double. (The C standard permits implementations to use methods other than IEEE-754 correct rounding, so other C implementations may have different results.)
The reason is that (float) 8388608.5000000009 contains two rounding operations. With the suffix, 8388608.5000000009f is converted directly to float, so the portion that must be discarded in order to fit in a float, .5000000009, is directly examined in order to see whether it is greater than .5 or not. It is, so the result is rounded up to the next representable value, 8388609.
Without the suffix, 8388608.5000000009 is first converted to double. When the portion that must be discarded, .0000000009, is considered, it is found to be less than ½ the low bit at the point of truncation. (The value of the low bit there is .00000000186264514923095703125, and half of it is .000000000931322574615478515625.) So the result is rounded down, and we have 8388608.5 as a double. When the cast rounds this to float, the portion that must be discarded is .5, which is exactly halfway between the representable numbers 8388608 and 8388609. The rule for breaking ties rounds it to the value with the even low bit, 8388608.
(Another example is “7.038531e-26”; (float) 7.038531e-26 is not equal to 7.038531e-26f. This is the only such numeral with fewer than eight significant digits when float is binary32 and double is binary64, except of course “-7.038531e-26”.)
While you do not lose precision if you omit the f in a float constant, there can be surprises in so doing.
Consider this:
#include <stdio.h>

#define DCN 0.1
#define FCN 0.1f

int main( void)
{
    float f = DCN;
    printf( "DCN\t%s\n", f > DCN ? "more" : "not-more");
    float g = FCN;
    printf( "FCN\t%s\n", g > FCN ? "more" : "not-more");
    return 0;
}
This (compiled with gcc 9.1.1) produces the output
DCN more
FCN not-more
The explanation is that in f > DCN the compiler takes DCN to have type double and so promotes f to double, making the comparison effectively
(double)(float)0.1 > 0.1
which is true, because rounding 0.1 to float rounds it up (to about 0.10000000149), so the promoted float is larger than the double value of 0.1.
Personally, on the (rare) occasions when I need float constants, I always use an 'f' suffix.
What I'm doing is very straightforward. Here are the relevant declarations:
USE, INTRINSIC :: ISO_Fortran_env, dp=>REAL64 !modern DOUBLE PRECISION
REAL(dp), PARAMETER :: G_H2_alpha = 1.57D+04, G_H2_beta = 5.3D+03, G_H2_gamma = 4.5D+03
REAL(dp) :: E_total_alpha, E_total_beta, E_total_gamma, P_H2_sed
Usage:
P_H2_sed = G_H2_alpha * E_total_alpha + G_H2_beta * E_total_beta * G_H2_gamma * E_total_gamma
where E_total_alpha, E_total_beta, and E_total_gamma are just running dp totals inside various loops. I ask for the nearest integer with NINT(P_H2_sed) and get -2147483648, which looks like mixed-mode arithmetic. P_H2_sed itself prints as 2529548272025.4888, so I would expect NINT to return 2529548272025. I didn't think it was possible to get this kind of result from an intrinsic function; I haven't seen anything like it since my days with the old F77 compiler. I'm clearly doing something wrong, but what?
NINT, by default, returns an integer of the default kind, which is usually equivalent to int32.
An integer of this kind cannot represent a number as large as 2529548272025. The maximum representable value is 2^31-1, that is 2147483647. What you are getting instead is the lowest representable value, -2147483648 (the bit pattern with only the sign bit set), a typical symptom of overflow.
To get a result of another kind from NINT, pass the optional argument named kind, like this: NINT(P_H2_sed, kind=int64).
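A minimal sketch (int64, like REAL64, comes from ISO_FORTRAN_ENV):

program nint_kind
   use, intrinsic :: iso_fortran_env, only: dp => real64, int64
   implicit none
   real(dp) :: p = 2529548272025.4888_dp
   print *, nint(p)              ! overflows the default integer kind (processor dependent; -2147483648 here)
   print *, nint(p, kind=int64)  ! prints 2529548272025
end program nint_kind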
To begin with, take a look at the following code in Visual Studio using C++:
float a = 10000000000000000.0;
float b = a - 10000000000000000.0;
When printing them out, it turns out:
a = 10000000272564224.000000
b = 272564224.000000
And when viewing them in Watch under Debug, it turns out:
Name   Value            Type
a      1.0000000e+016   float
b      2.7256422e+008   float
Pre-question: I am sure that 10000000000000000.0 is within the range of float. Why is it that we cannot get the correct a and b using float?
Follow-up question:
Based on all the great answers below, I now know the reason: a 32-bit float has an accuracy of about 7 digits, so beyond the first 6-7 digits all bets are off. That's why the math doesn't work out and the printed values look wrong for these large numbers; I have to use double for more accuracy. So why does float claim to be able to handle large numbers when at the same time we cannot trust it?
The huge number you are using is indeed within the "range" of float, but not all its digits are within the "accuracy" of float. A 32-bit float has an accuracy of about 7 digits, so beyond the first 6-7 digits, all bets are off. That's why the math doesn't work out, and printing looks "wrong" when you use these large numbers. If you want more accuracy, use double. For more, see http://en.wikipedia.org/wiki/Floating_point#IEEE_754:_floating_point_in_modern_computers
A float holds about 6-7 significant decimal digits (23 bits for the fraction), so any number with more digits is only an approximation, which leads to the seemingly random value you saw.
For more about floating point format precision: http://en.wikipedia.org/wiki/Single-precision_floating-point_format
For the updated question:
You should never use a floating-point format when exact precision is required. We can't just ask for more memory: handling numbers with a very large number of digits needs a very large amount of memory, so more complicated methods are used instead (for example, storing the number as a string and processing the characters one by one).
To avoid this problem, use double, which gives about 15-17 significant decimal digits (52 bits for the fraction), or long double for even more precision.
#include <stdio.h>

int main()
{
    double a = 10000000000000000.0;
    double b = a - 10000000000000000.0;
    printf("%f\n%f", a, b);
}
Example: http://ideone.com/rJN1QI
Your confusion is caused by implicit conversions and lack of accuracy of float.
Let me fill in the implicit conversions for you:
float a = (float)10000000000000000.0;
float b = (float)((double)a - 10000000000000000.0);
This converts the double literal to float, and the closest it can get is 10000000272564224. Then the subtraction is performed using double, not float, so the second 10000000000000000.0 does not lose precision.
We can use the nextafter function to get a better idea of the precision of floating-point types. nextafter takes two arguments; it returns the adjacent representable number to its first argument, in the direction of its second argument.
The value 10000000000000000.0 (or 1.0e16) is well within the range of representable values of type float, but that value itself cannot be represented exactly.
Here's a small program that illustrates the issue:
#include <math.h>
#include <stdio.h>

int main()
{
    float a = 10000000000000000.0;
    double d_a = 10000000000000000.0;

    printf("      %20.2f\n", nextafterf(a, 0.0f));
    printf("a   = %20.2f\n", a);
    printf("      %20.2f\n", nextafterf(a, 1.0e30f));
    putchar('\n');
    printf("      %20.2f\n", nextafter(d_a, 0.0));
    printf("d_a = %20.2f\n", d_a);
    printf("      %20.2f\n", nextafter(d_a, 1.0e30));
    putchar('\n');
}
and here's its output on my system:
       9999999198822400.00
a   = 10000000272564224.00
      10000001346306048.00

       9999999999999998.00
d_a = 10000000000000000.00
      10000000000000002.00
If you use type float, the closest you can get to 10000000000000000.00 is 10000000272564224.00.
But in your second declaration:
float b = a - 10000000000000000.0;
the subtraction is done in type double; the constant 10000000000000000.0 is already of type double, and a is promoted to double to match. So this takes the poor approximation of 1.0e16 that's stored in a, and subtracts from it the much better approximation (in fact it's exact) that can be represented in type double.
I have to convert some code from Fortran, but I don't know what this statement means:
var1 = 10.D00
Can someone explain to me what it means?
It's just 10.0 in scientific notation with double precision (that's what the D stands for).
See: http://www.fortran.com/F77_std/rjcnf0001-sh-4.html#sh-4.2.1:
4.5.1 Double Precision Exponent.
The form of a double precision exponent is the letter D followed by an optionally signed integer constant. A double precision exponent denotes a power of ten. Note that the form and interpretation of a double precision exponent are identical to those of a real exponent, except that the letter D is used instead of the letter E.
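For instance, a minimal sketch:

program d_exponent
   implicit none
   double precision :: var1
   var1 = 10.D00   ! the same as 10.0d0, i.e. 10. x 10**0 in double precision
   print *, var1   ! prints 10.000000000000000
end program d_exponent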
I'm translating some Fortran to our C# app and I'm trying to figure out what a bit of Fortran means at the top of a function.
DOUBLE PRECISION INF, DMIN, D12
DATA INF/1.D+300/
What would the value of INF be?
The D means "x 10^n" (a power-of-ten exponent), commonly known as the e in 1.e+300 in C#, but for double precision.
The DOUBLE PRECISION statement just defines 3 variables to be of type double.
The DATA statement
DATA X/Y/
translates to
X = Y;
in C#. Hence you get
double INF = 1.e+300, DMIN, D12;
Since INF is so large, I believe it is meant to be "infinity"; in that case it's better to use the real IEEE infinity (double INF = double.PositiveInfinity, ...).
The code is declaring a constant called INF (i.e. infinity) with the value 10^300. You would want to substitute double.PositiveInfinity or double.MaxValue.
The code is in the style of FORTRAN IV or FORTRAN 77 rather than Fortran 90/95/2003.
Double Precision declares the variables to be double the precision of a regular real. I'm not sure that the FORTRAN standards of that era were extremely precise about what that meant, since there was a greater variety of numeric hardware then. Today, it will virtually always obtain an 8-byte real. The Data statement initializes the variable INF. The use of "D" in the constant "1.D+300", instead of E, is old FORTRAN to specify that the constant is double precision.
The Fortran (>=90) way of obtaining the largest positive double is:
INF = huge (1.0D+0)
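If an actual IEEE infinity is wanted rather than the largest finite value, the IEEE modules from Fortran 2003 provide it (a sketch, assuming a compiler that supports them):

program inf_example
   use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_positive_inf
   implicit none
   double precision :: big, inf
   big = huge(1.0d0)                          ! largest finite double, about 1.7976931348623157d308
   inf = ieee_value(1.0d0, ieee_positive_inf) ! actual IEEE +Infinity
   print *, big
   print *, inf
end program inf_example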
The value would be 1.0e300, but I'm sure what is intended is that it be set to the largest double value that can be expressed on the current CPU. So in C# that would be double.PositiveInfinity rather than some hard-coded value.