My outputs are all NaN, and the runtime error message is "IEEE_INVALID_FLAG". I debugged the code in gdb and found that the line where IEEE_INVALID_FLAG first occurs is line 281:
Program received signal SIGFPE, Arithmetic exception.
0x000055555555c830 in calcu () at SIMPLE-2D.f:281
281 & +(1.-URFU)*U(I,J)
Line 281 is the last line of the update expression for U(I,J); the complete statement is:
U(I,J)=URFU/APU(I,J)*
& (AEEU(I,J)*U(I+2,J)+AEU(I,J)*U(I+1,J)
& +AWWU(I,J)*U(I-2,J)+AWU(I,J)*U(I-1,J)
& +ANNU(I,J)*U(I,J+2)+ANU(I,J)*U(I,J+1)
& +ASSU(I,J)*U(I,J-2)+ASU(I,J)*U(I,J-1)
& +(P(I,J)-P(I+1,J))*DY)
& +(1.-URFU)*U(I,J)
I=1:79, J=1:80. AEEU, AEU, ... are 79×80 matrices.
Could anyone give me some idea about this error? Many thanks!
Most of the time, NaNs result from invalid operations involving infinities or zeros, e.g., Infinity - Infinity, 0 × Infinity, or 0/0. As suggested by the error output, you also have both overflow and underflow. Overflow happens when a value's magnitude is too large for the type (the exponent is positive and too large), and underflow happens when the magnitude is too small (the exponent is negative and too large in magnitude). Try changing your code to use double precision reals. In FORTRAN 77, this can be achieved using the DOUBLE PRECISION type:
DOUBLE PRECISION URFU
DOUBLE PRECISION U(NI,NJ)
(with NI and NJ standing for the actual array bounds, since FORTRAN 77 has no deferred-shape declarations). In modern Fortran, you can use something like this:
INTEGER, PARAMETER :: dp = KIND(1.D0)
REAL(KIND=dp) :: URFU
REAL(KIND=dp), ALLOCATABLE :: U(:,:)
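Note also that real literals stay single precision unless given a kind suffix such as 1.0_dp. A minimal, self-contained sketch of the kind-parameter pattern (the names here are illustrative, not taken from the original code):
program dp_demo
  implicit none
  integer, parameter :: dp = kind(1.d0)
  real(dp) :: urfu, u

  urfu = 0.8_dp                     ! kind-suffixed literal keeps double precision
  u = 1.0_dp
  u = urfu*u + (1.0_dp - urfu)*u    ! same shape as the under-relaxation update
  print *, u, tiny(u), huge(u)
end program dp_demo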
I am a university lecturer, and I will teach the Numerical Methods course this semester using Fortran 90/95 as the programming language. The course begins with the representation of numbers, and I would like to discuss the limits of the numbers that can be represented with REAL(4), REAL(8) and REAL(16). I intend to use the following code on OnlineGDB (so that students won't have to install anything on their computers, which may be a pain in times of remote learning):
Program declare_reals
   implicit none
   real(kind = 4) :: a_huge, a_tiny ! single precision ; default if kind not specified
   !real(4) :: a ! Equivalent to real(kind = 4) :: a
   a_huge = huge(a_huge)
   print*, "Max positive for real(4) : ", a_huge
   a_tiny = tiny(a_tiny)
   print*, "Min positive for real(4) : ", a_tiny
   print*
End Program declare_reals
With this code, I get
Max positive for real(4) : 3.40282347E+38
Min positive for real(4) : 1.17549435E-38
However, if I write a_tiny = tiny(a_tiny)/2.0, the output becomes
Min positive for real(4) : 5.87747175E-39
Looking at the documentation for gfortran (which OnlineGDB uses as the f95 compiler), I had the impression that anything below tiny(x) could result in an underflow and zero would show instead of a non-zero number. Could anyone help me understand what is happening here? If tiny(x) doesn't yield the smallest positive representable number, what is being shown due to the function call?
The Fortran Standard states the following about a real value:
The model set for real x is defined by

x = s × b^e × Σ_{k=1}^{p} f_k × b^(−k),   or   x = 0,

where b and p are integers exceeding one; each f_k is a non-negative integer less than b, with f_1 nonzero; s is +1 or −1; and e is an integer that lies between some integer maximum e_max and some integer minimum e_min inclusively. For x = 0, its exponent e and digits f_k are defined to be zero. The integer parameters b, p, e_min, and e_max determine the set of model floating-point numbers.
Real values which satisfy this definition are referred to as model numbers or normal floating point numbers. The floating point numbers your system can represent, i.e. the machine-representable numbers, are a superset of the model numbers. They may, but need not, include the values with f_1 zero, also known as subnormal floating point numbers, which are there to fill the underflow gap around zero.
The Fortran functions tiny(x), huge(x), epsilon(x), spacing(x) are all defined for model numbers.
The value of tiny(x) is given by b^(e_min − 1), which for a single-precision floating-point number (binary32) is 2^(−126) and is the smallest model (normal) number. When your system follows IEEE754, the machine-representable numbers will also contain the subnormal numbers. The smallest positive subnormal number is given by tiny(x)*epsilon(x), which in binary32 is 2^(−126) × 2^(−23). This explains why you can divide tiny(x) by two, i.e. the transition from normal to subnormal.
# smallest normal number
0 00000001 00000000000000000000000 (binary) = 0080 0000 (hex) = 2^(−126) ≈ 1.1754943508 × 10^(−38)
# smallest subnormal number
0 00000000 00000000000000000000001 (binary) = 0000 0001 (hex) = 2^(−126) × 2^(−23) ≈ 1.4012984643 × 10^(−45)
Note: when you divide tiny(x)*epsilon(x) by two, gfortran returns an arithmetic underflow error.
Ref: values taken from Wikipedia: Single precision floating-point format
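A small sketch to see the transition yourself (assuming gfortran's defaults, i.e. no -ffpe-trap and gradual underflow enabled):
program subnormal_demo
  implicit none
  real :: t
  t = tiny(t)            ! smallest normal single-precision number, 2**(-126)
  print *, t             ! ~1.17549435E-38
  print *, t/2.0         ! already subnormal, ~5.87747175E-39
  print *, t*epsilon(t)  ! smallest subnormal, 2**(-149), ~1.40129846E-45
end program subnormal_demo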
I have the following Fortran code, shown below. I am trying to change the length of the do loop when I change the value of n. When I try to compile, I get the error:
‘a’ argument of ‘floor’ intrinsic at (1) must be REAL. But when I change q and w to be defined as REAL, I get another error message. How can I fix this? q and w are clearly integers when I use floor(...).
subroutine boundrycon(n,bc,u,v)
!input
integer :: n,bc
!output
real(8) :: u(n+2,n+2), v(n+2,n+2)
!lokale
integer :: j,i,w,q
n=30
q=floor(n/2)
w=(floor(n/2)+floor(n/6))
do j=q,w
u(q,j)=0.0;
v(q+1,j)=-v(q,j);
u(w,j)=0.0;
v(w+1,j)=-v(w,j);
end do
do i=q,w
v(i,q)=0.0;
u(i,q)=-u(i,q+1);
u(i,w+1)=-u(i,w);
v(i,w)=0;
end do
end subroutine boundrycon
Many people have already pointed this out in the comments to your question, but here it is again as an answer:
In Fortran, if you do a division of two integer values, the result is an integer value.
6/3 = 2
If the numerator is not evenly divisible by the denominator, then the remainder is dropped:
7/3 = 2
Let's look at your code:
q=floor(n/2)
It first evaluates n/2 which, since both n and 2 are integers, is such an integer division. As mentioned before, this result is an integer.
This integer is then passed as the argument to floor. But floor expects a floating point argument (or, as Fortran calls it: a REAL). Hence the error message:
"[The] argument of floor ... must be REAL."
So, the easiest way to get what you want is to just remove the floor altogether, since the integer division does exactly what you want:
q = n/2 ! Integer Division
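Applied to the assignments in the question, that would read (a sketch; for the non-negative n used here, integer division truncating toward zero is the same as flooring):
q = n/2
w = n/2 + n/6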
If you do need a floating point division, that is, if you want the division of two integers to produce a real result, you have to convert at least one of them to floating point before the division:
print *, 3/2 ! wrong, prints 1
print *, real(3)/2 ! right
print *, 3/2.0 ! right
print *, (3 * 1.0) / 2 ! right
print *, real(3/2) ! wrong, prints 1.0
I am testing some very simple equivalence errors when precision is an issue and was hoping to perform the operations in extended double precision (so that I knew what the answer would be in ~19 digits) and then perform the same operations in double precision (where there would be roundoff error in the 16th digit), but somehow my double precision arithmetic is maintaining 19 digits of accuracy.
When I perform the operations in extended double, then hardcode the numbers into another Fortran routine, I get the expected errors, but is there something strange going on when I assign an extended double precision variable to a double precision variable here?
program code_gen
implicit none
integer, parameter :: Edp = selected_real_kind(17)
integer, parameter :: dp = selected_real_kind(8)
real(kind=Edp) :: alpha10, x10, y10, z10
real(kind=dp) :: alpha8, x8, y8, z8
real(kind = dp) :: pi_dp = 3.1415926535897932384626433832795028841971693993751058209749445
integer :: iter
integer :: niters = 10
print*, 'tiny(x10) = ', tiny(x10)
print*, 'tiny(x8) = ', tiny(x8)
print*, 'epsilon(x10) = ', epsilon(x10)
print*, 'epsilon(x8) = ', epsilon(x8)
do iter = 1,niters
x10 = rand()
y10 = rand()
z10 = rand()
alpha10 = x10*(y10+z10)
x8 = x10
x8 = x8 - pi_dp
x8 = x8 + pi_dp
y8 = y10
y8 = y8 - pi_dp
y8 = y8 + pi_dp
z8 = z10
z8 = z8 - pi_dp
z8 = z8 + pi_dp
alpha8 = alpha10
write(*, '(a, es30.20)') 'alpha8 .... ', x8*(y8+z8)
write(*, '(a, es30.20)') 'alpha10 ... ', alpha10
if( alpha8 .gt. x8*(y8+z8) ) then
write(*, '(a)') 'ERROR(.gt.)'
elseif( alpha8 .lt. x8*(y8+z8) ) then
write(*, '(a)') 'ERROR(.lt.)'
endif
enddo
end program code_gen
where rand() is the gfortran function found here.
If we are speaking about only one precision type (take, for example, double), then we can denote machine epsilon as E16 which is approximately 2.22E-16. If we take a simple addition of two Real numbers, x+y, then the resulting machine expressed number is (x+y)*(1+d1) where abs(d1) < E16. Likewise, if we then multiply that number by z, the resulting value is really (z*((x+y)*(1+d1))*(1+d2)) which is nearly (z*(x+y)*(1+d1+d2)) where abs(d1+d2) < 2*E16. If we now move to extended double precision, then the only thing that changes is that E16 turns to E20 and has a value of around 1.08E-19.
My hope was to perform the analysis in extended double precision so that I could compare two numbers which should be equal but show that, on occasion, roundoff error will cause comparisons to fail. By assigning x8=x10, I was hoping to create a double precision 'version' of the extended double precision value x10, where only the first ~16 digits of x8 conform to the values of x10, but upon printing out the values, it shows that all 20 digits are the same and the expected double precision roundoff error is not occurring as I would expect.
It should also be noted that before this attempt, I wrote a program which actually writes another program where the values of x, y, and z are 'hardcoded' to 20 decimal places. In this version of the program, the comparisons of .gt. and .lt. failed as expected, but I am not able to duplicate the same failures by casting an extended double precision value as a double precision variable.
In an attempt to further 'perturb' the double precision values and add roundoff error, I have added, then subtracted, pi from my double precision variables, which should leave the remaining variables with some double precision roundoff error, but I am still not seeing that in the final result.
As the gfortran documentation you link states, the function result of rand is a default real value (single precision). Such a value can be represented exactly by each of your other real types.
That is, x10=rand() assigns a single precision value to the extended precision variable x10. It does so exactly. This same value now stored in x10 is assigned to the double precision variable x8, but this remains exactly representable as double precision.
There is sufficient precision in the single-as-double that the calculations using double and extended types return the same value. [See the note at the end of this answer.]
If you wish to see real effects of loss of precision, then start by using an extended or double precision value. For example, rather than using rand (returning a single precision value), use the intrinsic random_number
call random_number(x10)
(which has the advantage of being standard Fortran). Unlike a function, which in (nearly) all cases returns a value of a fixed type regardless of how that value is then used, this subroutine gives you a precision corresponding to its argument. You will (hopefully) see much the same as you did in your "hard-coded" experiment.
Alternatively, as agentp commented, it may be more intuitive to start with a double precision value
call random_number(x8); x10=x8 ! x8 and x10 have the precision of double precision
call random_number(y8); y10=y8
call random_number(z8); z10=z8
and perform the calculations from that starting point: those extra bits will then start to show.
In summary, when you do x8=x10 you are getting the first few bits of x8 corresponding to those of x10, but many of those bits and those that follow in x10 are all zero.
When it comes to your pi_dp perturbation, you are again assigning a single precision (this time a literal constant) value to a double precision variable. Just having all those digits doesn't make it anything other than a default real literal. You can specify a different kind of literal with a _Edp suffix, as described in other answers.
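For instance, with the dp kind parameter from the question, the literal can be given that kind explicitly (a sketch; an _Edp suffix would similarly produce an extended-precision literal):
real(kind=dp) :: pi_dp = 3.1415926535897932384626433832795028842_dp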
Finally, one also then has to worry about what the compiler does with regards to optimization.
My thesis is that starting from the single precision value, the calculations performed are representable exactly in both double and extended precision (with the same values). For other calculations, or from a starting point with more bits set, or representations (for example, on some systems or with other compilers the numeric type with kind selected_real_kind(17) may have quite different characteristics such as a different radix) that needn't be the case.
This was largely based on guessing and hoping that it explained the observation. Fortunately, there are ways to test the idea. As we're talking about IEEE arithmetic, we can consider the inexact flag: if that flag isn't raised during the computation, we can be happy.
With gfortran there is the compilation option -ffpe-trap=inexact, which makes the raising of the inexact flag trap. With gfortran 5.0 the intrinsic module ieee_exceptions is supported, which can be used in a portable/standard manner.
You can consider this flag for further experimentation: if it is raised then you can expect to see differences between the two precisions.
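A minimal sketch of checking the flag with the intrinsic module (names and structure are illustrative; it requires a compiler with IEEE_EXCEPTIONS support, e.g. gfortran 5 or later):
program check_inexact
  use, intrinsic :: ieee_exceptions
  implicit none
  integer, parameter :: dp = kind(1.d0)
  real(dp) :: x, y
  logical :: flag

  call random_number(x)                      ! runtime value, avoids constant folding
  call ieee_set_flag(ieee_inexact, .false.)  ! clear the inexact flag
  y = x / 3.0_dp                             ! usually not exactly representable
  call ieee_get_flag(ieee_inexact, flag)
  if (flag) then
     print *, 'inexact was raised: results may differ between precisions'
  else
     print *, 'the computation was exact in double precision'
  end if
end program check_inexact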
Below is the code I've tested in both 64-bit and 32-bit environments. The result is off by exactly one each time. The expected result is 1180000000, but the actual result is 1179999999. I'm not sure exactly why, and I was hoping someone could educate me:
#include <stdint.h>
#include <iostream>
using namespace std;
int main() {
    double odds = 1.18;
    int64_t st = 1000000000;
    int64_t res = st * odds;
    cout << "result: " << res << endl;
    return 1;
}
I appreciate any feedback.
1.18, or 118/100, can't be exactly represented in binary; it will have repeating digits. The same happens if you write 1/3 in decimal.
So let's go over a similar case in decimal, let's calculate (1 / 3) × 30000, which of course should be 10000:
odds = 1 / 3 and st = 30000
Since computers have only a limited precision we have to truncate this number to a limited number of decimals, let's say 6, so:
odds = 0.333333
0.333333 × 30000 = 9999.99. The cast (which in your program is implicit) will truncate this number to 9999.
There is no 100% reliable way to work around this. float and double just have only limited precision. Dealing with this is a hard problem.
Your program contains an implicit cast from double to an integer on the line int64_t res = st * odds;. Many compilers will warn you about this. It can be the source of bugs of the type you are describing. This cast, which can be explicitly written as (int64_t) some_double, rounds the number towards zero.
An alternative is rounding to the nearest integer with round(some_double); in this case, that will give the expected result.
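A minimal sketch of that fix, using std::llround from <cmath>:
#include <cstdint>
#include <cmath>
#include <iostream>

int main() {
    double odds = 1.18;
    int64_t st = 1000000000;
    // round to the nearest integer instead of truncating toward zero
    int64_t res = std::llround(st * odds);
    std::cout << "result: " << res << std::endl;  // 1180000000
    return 0;
}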
First of all - 1.18 is not exactly representable in double. Mathematically the result of:
double odds = 1.18;
is 1.17999999999999993782751062099 (according to an online calculator).
So, mathematically, odds * st is 1179999999.99999993782751062099.
But in C++, odds * st is an expression with type double. So your compiler has two options for implementing this:
Do the computation in double precision
Do the computation in higher precision and then round the result to double
Apparently, doing the computation in double precision in IEEE754 results in exactly 1180000000.
However, doing it in long double precision produces something more like 1179999999.99999993782751062099
Converting this to double is now implementation-defined as to whether it selects the next-highest or next-lowest value, but I believe it is typical for the next-lowest to be selected.
Then converting this next-lowest result to integer will truncate the fractional part.
There is an interesting blog post here where the author describes the behaviour of GCC:
It uses long double intermediate precision for x86 code (due to the x87 FPU's long double registers)
It uses actual types for x64 code (because the SSE/SSE2 FPU supports this more naturally)
According to the C++11 standard you should be able to inspect which intermediate precision is being used by outputting FLT_EVAL_METHOD from <cfloat>. 0 would mean actual values, 2 would mean long double is being used.
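For example, a quick check could look like this (a sketch):
#include <cfloat>
#include <iostream>

int main() {
    // 0: operations use the precision of their operand types
    // 1: float operations are evaluated as double
    // 2: all operations are evaluated as long double (typical for x87)
    std::cout << "FLT_EVAL_METHOD = " << FLT_EVAL_METHOD << std::endl;
    return 0;
}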
I would have expected the numeric values computed by Fortran and C++ to be far more similar. However, from what I am experiencing, the calculated numbers start to diverge after only a few decimal digits. I came across this problem while porting some legacy code from the former language to the latter. The original Fortran 77 code...
INTEGER M, ROUND
DOUBLE PRECISION NUMERATOR, DENOMINATOR
M = 2
ROUND = 1
NUMERATOR=5./((M-1+(1.3**M))**1.8)
DENOMINATOR = 0.7714+0.2286*(ROUND**3.82)
WRITE (*, '(F20.15)') NUMERATOR/DENOMINATOR
STOP
... outputs 0.842201471328735, while its C++ equivalent...
int m = 2;
int round = 1;
long double numerator = 5.0 / pow((m-1)+pow(1.3, m), 1.8);
long double denominator = 0.7714 + 0.2286 * pow(round, 3.82);
std::cout << std::setiosflags(std::ios::fixed) << std::setprecision(15)
<< numerator/denominator << std::endl;
exit(1);
... returns 0.842201286195064. That is, the computed values are equal only up to the sixth decimal. Although not particularly a Fortran advocate, I feel inclined to consider its results as the 'correct' ones, given its legitimate reputation as a number cruncher. However, I am intrigued about the cause of this difference between the computed values. Does anyone know what the reason for this discrepancy could be?
In Fortran, by default, floating point literals are single precision, whereas in C/C++ they are double precision.
Thus, in your Fortran code, the expression for calculating NUMERATOR is done in single precision; it is only converted to double precision when assigning the final result to the NUMERATOR variable.
And the same thing for the expression calculating the value that is assigned to the DENOMINATOR variable.
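For illustration, a sketch of the Fortran code with the literals promoted to double precision via D exponents; its output should then agree with the C++ value to roughly double precision accuracy:
      INTEGER M, ROUND
      DOUBLE PRECISION NUMERATOR, DENOMINATOR
      M = 2
      ROUND = 1
      NUMERATOR = 5.D0/((M-1+(1.3D0**M))**1.8D0)
      DENOMINATOR = 0.7714D0 + 0.2286D0*(ROUND**3.82D0)
      WRITE (*, '(F20.15)') NUMERATOR/DENOMINATOR
      STOP
      END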