I have compiled the simple program given in this LINK to test the fpe compiler option on my Mac M1.
Using ifort fpe.f90 -fpe0 -fp-model strict -g, the output is
Underflow: 0.1E-29 * 0.1E-09 = 0.1E-39
Overflow: 0.1E+31 * 0.1E+31 = Infinity
Div-by-zero: -0.1E+31 / 0.0E+00 = -Infinity
Invalid: 0.0E+00 / 0.0E+00 = NaN
So apparently the option does not work. The same code, compiled the same way on my Linux machine, gives the expected results.
I would like to know the reason for this behavior.
This is the code I compiled:
IMPLICIT NONE
real*4 res_uflow, res_oflow
real*4 res_dbyz, res_inv
real*4 small, big, zero, scale
small = 1.0e-30
big = 1.0e30
zero = 0.0
scale = 1.0e-10
! IEEE underflow condition (Underflow Raised)
res_uflow = small * scale
write(6,100)"Underflow: ",small, " *", scale, " = ", res_uflow
! IEEE overflow condition (Overflow Raised)
res_oflow = big * big
write(6,100)"Overflow: ", big, " *", big, " = ", res_oflow
! IEEE divide-by-zero condition (Divide by Zero Raised)
res_dbyz = -big / zero
write(6,100)"Div-by-zero: ", -big, " /", zero, " = ", res_dbyz
! IEEE invalid condition (Invalid Raised)
res_inv = zero / zero
write(6,100)"Invalid: ", zero, " /", zero, " = ", res_inv
100 format(A14,E8.1,A2,E8.1,A2,E10.1)
end
I know the option takes effect at run time: it enables a sort of floating-point exception handling in Fortran.
As for the second question, I have installed ifort on my Mac M1 and it works.
This question has appeared a thousand times on different platforms. However, I still need to understand something.
Here is a complete example:
#include <iostream>
#include <iomanip>
#include <cmath>

// Function to differentiate
double f(double var, void *params){
    (void)(params);
    return pow(var, 1.5);
}

// Naive central-difference derivative
double derivative1(double var, double f(double, void*), double h){
    return (f(var + h, NULL) - f(var - h, NULL)) / 2.0 / h;
}

// Richardson 5-point rule
double gderivative(double var, double f(double, void*), double h0){
    return (4.0 * derivative1(var, f, h0) - derivative1(var, f, 2.0 * h0)) / 3.0;
}

int main(void){
    for (int i = 10; i > 0; i--){
        double h0 = pow(10, -i);
        double x = 2.0;
        double exact = 1.5 * sqrt(x);
        double test1 = derivative1(x, f, h0);
        double gtest = gderivative(x, f, h0);
        std::cout << "h0 = " << h0 << std::endl;
        std::cout << "Exact = " << std::scientific << std::setprecision(15) << exact << std::endl;
        std::cout << "Naive step = " << std::setprecision(15) << test1 << ", diff = " << std::setprecision(15) << exact - test1 << ", percent error = " << (exact - test1) / exact * 100.0 << std::endl;
        std::cout << "Richardson = " << std::setprecision(15) << gtest << ", diff = " << std::setprecision(15) << exact - gtest << ", percent error = " << (exact - gtest) / exact * 100.0 << std::endl;
    }
    return 0;
}
The output is
h0 = 1e-10
Exact = 2.121320343559643e+00
Naive step = 2.121318676273631e+00, diff = 1.667286011475255e-06, percent error = 7.859661632610832e-05
Richardson = 2.121318306199290e+00, diff = 2.037360352868944e-06, percent error = 9.604208808228318e-05
h0 = 1.000000000000000e-09
Exact = 2.121320343559643e+00
Naive step = 2.121320674675076e+00, diff = -3.311154328500265e-07, percent error = -1.560893119491818e-05
Richardson = 2.121320748689944e+00, diff = -4.051303013064000e-07, percent error = -1.909802555452698e-05
h0 = 1.000000000000000e-08
Exact = 2.121320343559643e+00
Naive step = 2.121320341608168e+00, diff = 1.951474537520426e-09, percent error = 9.199339191957163e-08
Richardson = 2.121320341608168e+00, diff = 1.951474537520426e-09, percent error = 9.199339191957163e-08
h0 = 1.000000000000000e-07
Exact = 2.121320343559643e+00
Naive step = 2.121320341608168e+00, diff = 1.951474537520426e-09, percent error = 9.199339191957163e-08
Richardson = 2.121320340868019e+00, diff = 2.691623368633600e-09, percent error = 1.268843424240664e-07
h0 = 1.000000000000000e-06
Exact = 2.121320343559643e+00
Naive step = 2.121320343606570e+00, diff = -4.692690680485612e-11, percent error = -2.212155601454860e-09
Richardson = 2.121320343643577e+00, diff = -8.393419292929138e-11, percent error = -3.956695799581460e-09
h0 = 1.000000000000000e-05
Exact = 2.121320343559643e+00
Naive step = 2.121320343584365e+00, diff = -2.472244631235299e-11, percent error = -1.165427295665677e-09
Richardson = 2.121320343595468e+00, diff = -3.582467655860455e-11, percent error = -1.688791448560268e-09
h0 = 1.000000000000000e-04
Exact = 2.121320343559643e+00
Naive step = 2.121320343340116e+00, diff = 2.195266191051815e-10, percent error = 1.034858406801534e-08
Richardson = 2.121320343561791e+00, diff = -2.148059508044753e-12, percent error = -1.012604963020456e-10
h0 = 1.000000000000000e-03
Exact = 2.121320343559643e+00
Naive step = 2.121320321462283e+00, diff = 2.209735949776359e-08, percent error = 1.041679516479040e-06
Richardson = 2.121320343559311e+00, diff = 3.317346397579968e-13, percent error = 1.563812088849040e-11
h0 = 1.000000000000000e-02
Exact = 2.121320343559643e+00
Naive step = 2.121318133840577e+00, diff = 2.209719065504601e-06, percent error = 1.041671557157002e-04
Richardson = 2.121320343601055e+00, diff = -4.141265108614789e-11, percent error = -1.952211093995174e-09
h0 = 1.000000000000000e-01
Exact = 2.121320343559643e+00
Naive step = 2.121099269013200e+00, diff = 2.210745464426012e-04, percent error = 1.042155406248691e-02
Richardson = 2.121320759832334e+00, diff = -4.162726914280768e-07, percent error = -1.962328286210455e-05
I believe the standard GSL gsl_deriv_central employs the Richardson procedure.
Now the common argument for choosing h0 is that, theoretically, it should be as small as possible to improve the precision of the derivative; numerically, however, it must not be so small that we hit floating-point round-off, which ruins the precision. So it is often said that the optimal choice should be somewhere around 1e-6 to 1e-8. My questions are:
What is the optimal choice for h0 for a generic derivative?
Do I have to check case by case? Often it might not be possible to have an exact result to check against. What should one do in that case?
Now in this particular case it seems like the best choice for the naive step is h0 = 1e-05, whereas for Richardson it is h0 = 1e-03. This confuses me, since these are not that small.
Any suggestions for other good options (an easy algorithm or library) that are efficient as well as precise (double)?
This is quite as expected. The central difference has a second-order truncation error O(h^2) relative to the exact derivative. The function evaluation has an error of magnitude mu, the machine epsilon (for mildly scaled test examples), so the evaluation error of the central difference is of magnitude mu/h. The overall error, roughly h^2 + mu/h, is smallest when these two contributions are about equal; thus h = mu^(1/3) gives h ≈ 1e-5, with an error of about 1e-10.
The same calculation for the Richardson extrapolation gives an error order of O(h^4) relative to the exact derivative, resulting in h = mu^(1/5) ≈ 1e-3 as the optimal step size, with an error of about 1e-12.
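For concreteness, here is a minimal sketch (assuming a mildly scaled function, so the absolute evaluation error is on the order of the machine epsilon) that computes these theoretical optima directly from the machine epsilon:

#include <cfloat>   // DBL_EPSILON
#include <cmath>    // std::cbrt, std::pow
#include <cstdio>

int main() {
    const double mu = DBL_EPSILON;            // ~2.2e-16
    // Central difference: total error ~ C*h^2 + mu/h, minimal near h ~ mu^(1/3).
    const double h_central = std::cbrt(mu);   // ~6e-6
    // Richardson (4th order): total error ~ C*h^4 + mu/h, minimal near h ~ mu^(1/5).
    const double h_rich = std::pow(mu, 0.2);  // ~1e-3
    std::printf("optimal h (central)    ~ %.1e, error ~ %.1e\n", h_central, mu / h_central);
    std::printf("optimal h (Richardson) ~ %.1e, error ~ %.1e\n", h_rich, mu / h_rich);
}

The printed numbers are order-of-magnitude estimates only; the constants multiplying h^2 and h^4 depend on the higher derivatives of f, so the true optimum shifts somewhat from problem to problem.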
[Figure: loglog plot of the errors of both methods over a larger sample of step sizes]
In practical applications you would want to scale h so that the indicated sizes are relative to the size of x. The exact optimum depends also on the magnitudes of the derivatives of the function, if they grow too wildly.
To get more precise values for the derivatives, you would need a higher or multi-precision data type, or employ automatic/algorithmic differentiation, where the first and second derivatives can be evaluated with the same precision as the function values.
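To illustrate the automatic-differentiation remark, here is a minimal forward-mode (dual-number) sketch for the f(x) = x^1.5 of the question; the Dual type and its pow overload are ad-hoc names invented for this example, not part of any library:

#include <cmath>
#include <iostream>

// Minimal forward-mode dual number: carries a value and its derivative.
struct Dual { double val, der; };

// Overload only what f needs; d/dx x^p = p*x^(p-1).
Dual pow(Dual x, double p) {
    return { std::pow(x.val, p), p * std::pow(x.val, p - 1.0) * x.der };
}

// Same function as in the question, written over duals.
Dual f(Dual x) { return pow(x, 1.5); }

int main() {
    Dual x{2.0, 1.0};   // seed the derivative d(x)/dx = 1
    Dual y = f(x);
    std::cout.precision(15);
    std::cout << "f(2)  = " << y.val << "\n"
              << "f'(2) = " << y.der << " (exact: " << 1.5 * std::sqrt(2.0) << ")\n";
}

Because the derivative is propagated algebraically rather than by differencing, there is no step size to tune, and the result is accurate to roughly the same precision as the function value itself.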
Personal opinion warning
In my experience, I find it better to use a step size that is small compared to the variable it affects.
For example, I usually use something like this:
auto dfdx(std::function<double (double)> f, double x, double h0=1e-5){
    h0 = std::abs(x)*h0 + h0; // this is 1e-5 if x<<1e-5, and |x|*1e-5 otherwise
    return ( f(x+h0) - f(x-h0) )/2/h0;
}
This should work, since finite differences are motivated by the Taylor expansion. That is, as long as h0 is small compared to the scale on which f varies (and, with the scaling above, small compared to x), the finite difference should be a good approximation.
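For completeness, a self-contained usage sketch (it repeats the helper above so it compiles on its own; the test points are arbitrary):

#include <cmath>
#include <functional>
#include <iostream>

// Central difference with a step that is relative to |x| (as above).
double dfdx(std::function<double (double)> f, double x, double h0 = 1e-5) {
    h0 = std::abs(x) * h0 + h0;
    return (f(x + h0) - f(x - h0)) / 2 / h0;
}

int main() {
    auto f = [](double x) { return std::pow(x, 1.5); };
    std::cout.precision(15);
    // The relative step keeps the accuracy comparable for small and large arguments.
    std::cout << dfdx(f, 2.0)   << "  (exact " << 1.5 * std::sqrt(2.0)   << ")\n";
    std::cout << dfdx(f, 2.0e6) << "  (exact " << 1.5 * std::sqrt(2.0e6) << ")\n";
}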
The outputs of the two calculations below are supposed to be the same, but even after taking machine precision into account they come out unequal. What would be a way to get them to compare equal?
#include <iostream>
#include <limits>
#include <cmath>

bool definitelyGreaterThan(double a, double b, double epsilon)
{
    return (a - b) > ((std::fabs(a) < std::fabs(b) ? std::fabs(b) : std::fabs(a)) * epsilon);
}

bool definitelyLessThan(double a, double b, double epsilon)
{
    return (b - a) > ((std::fabs(a) < std::fabs(b) ? std::fabs(b) : std::fabs(a)) * epsilon);
}

int main()
{
    double fig1, fig2;
    double m1 = 235.60242, m2 = 126.734781;
    double n1 = 4.2222, n2 = 2.1111;
    double p1 = 1.245, p2 = 2.394;

    fig1 = (m1/m2) * (n1/n2) - p1 * 6.0 / p2;

    m1 = 1.2*m1, m2 = 1.2*m2; // both scaled equally, numerator and denominator
    n1 = 2.0*n1, n2 = 2.0*n2;
    p1 = 3.0*p1, p2 = 3.0*p2;

    fig2 = (m1/m2) * (n1/n2) - p1 * 6.0 / p2; // same as above

    double epsilon = std::numeric_limits<double>::epsilon();
    std::cout << "\n fig1 " << fig1 << " fig2 " << fig2 << " difference " << fig1 - fig2 << " epl " << epsilon;
    std::cout << "\n if(fig1 < fig2) " << definitelyLessThan(fig1, fig2, epsilon)
              << "\n if(fig1 > fig2) " << definitelyGreaterThan(fig1, fig2, epsilon) << "\n";
}
with the output:
fig1 0.597738 fig2 0.597738 difference 8.88178e-16 epl 2.22045e-16
if(fig1 < fig2) 0
if(fig1 > fig2) 1
The difference between the two numbers is greater than the machine epsilon.
The key question is whether there is any universal method to deal with such issues, or whether the solution has to be application dependent.
There are two things to consider:
First, the possible rounding error (introduced by limited machine precision) scales with the number of operations used to calculate the value. For example, storing the result of m1/m2 might introduce some rounding error (depending on the actual value), and multiplying this value by something also multiplies that rounding error and possibly adds another one on top of it (by storing that result).
Second, floating point values are not spaced uniformly (they use an exponent-and-mantissa format): the bigger a value actually is, the bigger the difference between that value and the next representable value (and therefore also the possible rounding error). std::numeric_limits<T>::epsilon() only states the difference between 1.0 and the next representable value, so if your value is not exactly 1.0, this epsilon does not exactly represent the machine precision (meaning the difference between your value and the next representable one).
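A small sketch of that second point, printing the distance from a double to the next representable double at a few (arbitrarily chosen) magnitudes:

#include <cmath>
#include <cstdio>
#include <initializer_list>

int main() {
    // The spacing between adjacent doubles grows with the magnitude of the value.
    for (double v : {1.0, 1000.0, 1.0e6, 1.0e12}) {
        std::printf("value %-8.0e next representable double is %.3e away\n",
                    v, std::nextafter(v, 2.0 * v) - v);
    }
}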
So, to answer your question: the solution is to select an application-dependent, reasonable maximum rounding error that is allowed for two values to still be considered equal. Since this allowed rounding error depends both on the expected values and on the number of operations (as well as on what is acceptable for the application itself, of course), a truly universal solution is not possible.
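For illustration only, a minimal sketch of such a comparison, with the tolerance scaled by the operands' magnitude and by an application-chosen factor (the factor 8 below is an arbitrary placeholder, not a recommendation):

#include <cmath>
#include <cstdio>
#include <limits>

// Nearly-equal test with a tolerance proportional to the operands' magnitude.
// 'ulps' is an application-chosen bound on the accumulated rounding error.
bool nearlyEqual(double a, double b, double ulps = 8.0) {
    const double eps = std::numeric_limits<double>::epsilon();
    const double scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= ulps * eps * scale;
}

int main() {
    double fig1 = 0.597738;         // roughly the value from the question
    double fig2 = fig1 + 8.9e-16;   // mimics the observed difference
    std::printf("nearly equal? %d\n", nearlyEqual(fig1, fig2));  // prints 1
}

Applied to the fig1/fig2 values from the question, a tolerance like this reports them as equal, whereas the bare epsilon used in the question does not.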
I've implemented code to find the polar coordinates of a point in 2D space. If the point lies in the 1st or 2nd quadrant, 0 <= theta <= pi, and if it lies in the 3rd or 4th quadrant, -pi <= theta <= 0.
module thetalib
contains
  real function comp_theta( x1, x2)
    implicit none
    real , intent(in) :: x1, x2
    real :: x1p, x2p
    real :: x1_c=0.0, x2_c=0.0
    real :: pi=4*atan(1.0)

    x1p = x1 - x1_c
    x2p = x2 - x2_c

    ! - Patch
    !if ( x1p == 0 .and. x2p /= 0 ) then
    !  comp_theta = sign(pi/2.0, x2p)
    !else
    !  comp_theta = atan ( x2p / x1p )
    !endif

    comp_theta = atan( x2p / x1p)

    if ( x1p >= 0.0 .and. x2p >= 0.0 ) then
      comp_theta = comp_theta
    elseif ( x1p < 0 .and. x2p >= 0.0 ) then
      comp_theta = pi + comp_theta
    elseif( x1p < 0.0 .and. x2p < 0.0 ) then
      comp_theta = -1* (pi - comp_theta)
    elseif ( x1p >= 0.0 .and. x2p < 0.0 ) then
      comp_theta = comp_theta
    endif

    return
  end function comp_theta
end module thetalib

program main
  use thetalib
  implicit none

  ! Quadrant 1
  print *, "(0.00, 1.00): ", comp_theta(0.00, 1.00)
  print *, "(1.00, 0.00): ", comp_theta(1.00, 0.00)
  print *, "(1.00, 1.00): ", comp_theta(1.00, 1.00)

  ! Quadrant 2
  print *, "(-1.00, 1.00): ", comp_theta(-1.00, 1.00)
  print *, "(-1.00, 0.00): ", comp_theta(-1.00, 0.00)

  ! Quadrant 3
  print *, "(-1.00, -1.00): ", comp_theta(-1.00, -1.00)

  ! Quadrant 4
  print *, "(0.00, -1.00): ", comp_theta(0.00, -1.00)
  print *, "(1.00, -1.00): ", comp_theta(1.00, -1.00)
end program main
In the function thetalib::comp_theta, when there is a division by zero and the numerator is positive, Fortran evaluates it to -infinity, and when the numerator is negative, it evaluates it to +infinity (see the output):
(0.00, 1.00): -1.570796
(1.00, 0.00): 0.0000000E+00
(1.00, 1.00): 0.7853982
(-1.00, 1.00): 2.356194
(-1.00, 0.00): 3.141593
(-1.00, -1.00): -2.356194
(0.00, -1.00): 1.570796
(1.00, -1.00): -0.7853982
This baffled me. I've also implemented the patch you see above to work around it. To investigate further, I set up a small test:
program main
  implicit none
  real :: x1, x2

  x1 = 0.0 - 0.0 ! Reflecting the x1p = x1 - x1_c computation
  x2 = 1.0
  write(*,*) "x2/x1=", x2/x1

  x2 = -1.0
  write(*,*) "x2/x1=", x2/x1
end program main
This evaluates to:
x2/x1= Infinity
x2/x1= -Infinity
My Fortran version:
$ ifort --version
ifort (IFORT) 19.0.1.144 20181018
Copyright (C) 1985-2018 Intel Corporation. All rights reserved.
And I have three questions:
Why are there signed infinite values?
How are the signs determined?
Why does infinity take the signs shown in outputs for both thetalib::comp_theta and the test program?
That there are signed infinite values follows from the compiler supporting IEEE arithmetic with the real type.
For motivation, consider real non-zero numerator and denominator. If these are both of the same sign then the quotient is a real (finite) positive number. If they are of opposite sign the quotient is a real (finite) negative number.
Consider the limit 1/x as x tends to zero from below. For any strictly negative value of x the value is negative. For continuity considerations the limit can be taken to be negative infinity.
So, when the numerator is non-zero, the quotient will be positive infinity if the numerator and denominator are of the same sign, and negative infinity if they are of opposite sign. Recall also that the zero denominator may itself be signed.
If you wish to examine the number, to see whether it is finite you can use the procedure IEEE_IS_FINITE of the intrinsic module ieee_arithmetic. Further, that module has the procedure IEEE_CLASS which provides useful information about its argument. Among other things:
whether it is a positive or negative normal number;
whether it is a positive or negative infinite value;
whether it is a positive or negative zero.
You can also try checking whether the number equals itself: a NaN compares unequal to itself, so this test catches NaN (note that infinities, by contrast, do compare equal to themselves).
EX: if ( x2x1 .eq. x2x1 ) then the value is not a NaN.
It may also help to look at the bit-level representation: an IEEE infinity has the exponent field all ones and a zero mantissa, and only the sign bit distinguishes +Infinity from -Infinity (a pattern with all bits set to 1 is actually a NaN, not -Infinity). The sign of a quotient is always the exclusive-or of the signs of the operands, which is why the sign of the zero in the denominator determines whether you get +Infinity or -Infinity. More goes on in the hardware, but the details are not needed here.
Here is a little test code:
#include <algorithm>
#include <iostream>

int main() {
    float zeroF = 0.f;
    float naNF = 0.f / zeroF;
    float minimumF = std::min(1.0f, naNF);
    std::cout << "MinimumF " << minimumF << std::endl;

    double zeroD = 0.0;
    double naND = 0.0 / zeroD;
    double minimumD = std::min(1.0, naND);
    std::cout << "MinimumD " << minimumD << std::endl;
}
I executed the code on VS2013.
On the precise model (/fp:precise) the outputs are always "1";
On the fast model (/fp:fast) the outputs are "NaN" (-1.#IND) if optimization is enabled (/O2) and "1" if optimization is disabled (/Od).
First, what should the right output be according to IEEE 754?
(I read the docs and googled different articles, like "What is the rationale for all comparisons returning false for IEEE754 NaN values?", and it seems that the right output should be NaN and not 1, but maybe I am wrong.)
Secondly, how does the fast-model optimization change the output so drastically?
In my C++ code I have these multiplications, with all variables of type double[]:
f1[0] = (f1_rot[0] * xu[0]) + (f1_rot[1] * yu[0]);
f1[1] = (f1_rot[0] * xu[1]) + (f1_rot[1] * yu[1]);
f1[2] = (f1_rot[0] * xu[2]) + (f1_rot[1] * yu[2]);
f2[0] = (f2_rot[0] * xu[0]) + (f2_rot[1] * yu[0]);
f2[1] = (f2_rot[0] * xu[1]) + (f2_rot[1] * yu[1]);
f2[2] = (f2_rot[0] * xu[2]) + (f2_rot[1] * yu[2]);
corresponding to these values
Force Rot1 : -5.39155e-07, -3.66312e-07
Force Rot2 : 4.04383e-07, -1.51852e-08
xu: 0.786857, 0.561981, 0.255018
yu: 0.534605, -0.82715, 0.173264
F1: -6.2007e-07, -4.61782e-16, -2.00963e-07
F2: 3.10073e-07, 2.39816e-07, 1.00494e-07
This multiplication in particular produces the wrong value -4.61782e-16 instead of 1.04745e-13:
f1[1] = (f1_rot[0] * xu[1]) + (f1_rot[1] * yu[1]);
I hand-verified the other multiplications on a calculator and they all seem to produce the correct values.
This code is compiled with Open MPI, and the above result is from a single-processor run; running on multiple processors gives different values, for example 40 processors produce 1.66967e-13 for the F1[1] multiplication.
Is this some kind of MPI bug, or a type-precision problem? And why does it work okay for the other multiplications?
Your problem is an obvious result of what is called catastrophic cancellation (loss of significance when two nearly opposite quantities are summed):
As we know, a double-precision float carries about 16 significant decimal digits.
f1[1] = (f1_rot[0] * xu[1]) + (f1_rot[1] * yu[1])
= -3.0299486605499998e-07 + 3.0299497080000003e-07
= 1.0474500005332475e-13
This is what we obtain with the numbers you have given in your example.
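To see the cancellation concretely, here is a small sketch that evaluates the same expression once in float and once in double, using the six-digit values printed in the question (so the double result is only the value those rounded inputs imply):

#include <cstdio>

int main() {
    // Six-digit values as printed in the question.
    float  f1_rot0f = -5.39155e-07f, f1_rot1f = -3.66312e-07f;
    float  xu1f = 0.561981f,         yu1f = -0.82715f;
    double f1_rot0d = -5.39155e-07,  f1_rot1d = -3.66312e-07;
    double xu1d = 0.561981,          yu1d = -0.82715;

    // The two products are ~3.03e-07 in magnitude and nearly cancel:
    // the sum is ~1.05e-13, about six orders of magnitude smaller.
    float  sumF = f1_rot0f * xu1f + f1_rot1f * yu1f;
    double sumD = f1_rot0d * xu1d + f1_rot1d * yu1d;

    // In float, the ~1e-7 relative rounding of each ~3e-7 product is ~3e-14,
    // comparable to the result itself, so few if any digits survive.
    std::printf("float : %.6e\n", sumF);
    std::printf("double: %.6e\n", sumD);
}

Whether that double value matches the "true" result depends entirely on how accurate the six-digit inputs were in the first place, which is the point being made here.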
Notice that (-7) - (-13) = 6, which corresponds to the number of significant decimals in the values you give in your example (e.g. -5.39155e-07, -3.66312e-07: each mantissa has a precision of about 6 decimals). This suggests that single-precision floats were used here.
I am sure that in your actual calculations the precision of the numbers is higher; that is why you find a more precise result.
Anyway, if you use single-precision floats, you cannot expect better precision than this. With double precision you can get up to about 16 significant decimals. You should not trust the difference between two numbers unless, relative to them, it is larger than the machine epsilon:
Single-precision floats: |a - b| / |b| >= ~1e-7
Double-precision floats: |a - b| / |b| >= ~4e-16
For further information, see these examples ... or the table in this article ...