Slow Float Multiplication in Fortran - fortran

I have a very basic Fortran question as I am a Fortran beginner. I'm using the following function:
real(8) function matchingformulaStk_c(ij,retwealth,iz,ctr,StockReturnM,stkEshare,stkNshare)
IMPLICIT NONE
integer, INTENT(IN) :: ij,iz,ctr
real(8), intent(in) :: StockReturnM,retwealth,stkEshare,stkNshare
real(8) :: R_existing, R_new
real(8), parameter :: p_vest = 0.95d0, match_level = 0.5d0, R401k = 1.02d0
R_existing = stkEshare*StockReturnM+(1d0-stkEshare)*R401k
R_new = stkNshare*StockReturnM+(1d0-stkNshare)*R401k
matchingformulaStk_c = R_existing*retwealth + R_new*ygrid(ij,iz)*((1+match_level*p_vest)*tau_grid(ctr))
end function matchingformulaStk_c
When I profile this function in a larger chunk of code I'm using with Intel VTune, I find that the line R_new = stkNshare*StockReturnM+(1d0-stkNshare)*R401k takes around 20x in CPU time as long as the line R_existing = stkEshare*StockReturnM+(1d0-stkEshare)*R401k.
Is there any obvious reason why this could be the case or is this going to be specific to my application? If there are any resources you could point me to as to figuring out how to speed up code hotspots like this one, that would be much appreciated.

Related

Dummy argument not agreeing with actual argument when passing function

I'm trying to implement Newton's method but I'm getting a confusing error message. In my code you'll see I called external with f1 and f2 which I assumes tells the computer to look for the function but it's treating them as variables based on the error message. I've read the stack overflow posts similar to my issue but none of the solutions seem to work. I've tried with and without the external but the issue still persists. Hoping someone could see what I'm missing.
implicit none
contains
subroutine solve(f1,f2,x0,n, EPSILON)
implicit none
real(kind = 2), external:: f1, f2
real (kind = 2), intent(in):: x0, EPSILON
real (kind = 2):: x
integer, intent(in):: n
integer:: iteration
x = x0
do while (abs(f1(x))>EPSILON)
iteration = iteration + 1
print*, iteration, x, f1(x)
x = x - (f1(x)/f2(x))
if (iteration >= n) then
print*, "No Convergence"
stop
end if
end do
print*, iteration, x
end subroutine solve
end module newton
Program Lab10
use newton
implicit none
integer, parameter :: n = 1000 ! maximum iteration
real(kind = 2), parameter :: EPSILON = 1.d-3
real(kind = 2):: x0, x
x0 = 3.0d0
call solve(f(x),fp(x),x0,n, EPSILON)
contains
real (kind = 2) function f(x) ! this is f(x)
implicit none
real (kind = 2), intent(in)::x
f = x**2.0d0-1.0d0
end function f
real (kind = 2) function fp(x) ! This is f'(x)
implicit none
real (kind = 2), intent(in)::x
fp = 2.0d0*x
end function fp
end program Lab10```
You seem to be passing function results to your subroutine and not the functions themselves. Remove (x) when calling solve() and the problem will be resolved. But more importantly, this code is a prime example of how to not program in Fortran. The attribute external is deprecated and you better provide an explicit interface. In addition, what is the meaning of kind = 2. Gfortran does not even comprehend it. Even if it comprehends the kind, it is not portable. Here is a correct portable modern implementation of the code,
module newton
use iso_fortran_env, only: RK => real64
implicit none
abstract interface
pure function f_proc(x) result(result)
import RK
real(RK), intent(in) :: x
real(RK) :: result
end function f_proc
end interface
contains
subroutine solve(f1,f2,x0,n, EPSILON)
procedure(f_proc) :: f1, f2
real(RK), intent(in) :: x0, EPSILON
integer, intent(in) :: n
real(RK) :: x
integer :: iteration
x = x0
do while (abs(f1(x))>EPSILON)
iteration = iteration + 1
print*, iteration, x, f1(x)
x = x - (f1(x)/f2(x))
if (iteration >= n) then
print*, "No Convergence"
stop
end if
end do
print*, iteration, x
end subroutine solve
end module newton
Program Lab10
use newton
integer, parameter :: n = 1000 ! maximum iteration
real(RK), parameter :: EPSILON = 1.e-3_RK
real(RK) :: x0, x
x0 = 3._RK
call solve(f,fp,x0,n, EPSILON)
contains
pure function f(x) result(result) ! this is f(x)
real (RK), intent(in) :: x
real (RK) :: result
result = x**2 - 1._RK
end function f
pure function fp(x) result(result) ! This is f'(x)
real (RK), intent(in) :: x
real (RK) :: result
result = 2 * x
end function fp
end program Lab10
If you expect to pass nonpure functions to the subroutine solve(), then remove the pure attribute. Note the use of real64 to declare 64-bit (double precision) real kind. Also notice how I have used _RK suffix to assign 64-bit precision to real constants. Also, notice I changed the exponents from real to integer as it is multiplication is more efficient than exponentiation computationally. I hope this answer serves more than merely the solution to Lab10.

Using the routine directive on a Fortran90 intrinsic

I have given OpenAcc access to the subroutine called "lang_force", and within that subroutine I call the Fortran90 intrinsic RANDOM_NUMBER().
subroutine lang_force(damp, temp, noise)
!$acc routine(lang_force)
implicit none
double precision, intent(in) :: damp, temp
double precision, dimension(1:3), intent(out) :: noise
double precision :: kb, a1, theta, phi, mag1, pi,r,s
integer :: i,j,k
double precision :: x,y,z
kb = 1.3806E-23
!kb = 1.0
pi=4.D0*DATAN(1.D0)
call random_number(a1)
call random_number(r)
call random_number(s)
mag1 = sqrt(-2.0*log(a1))
theta = r*pi
phi = 2.0*s*pi
x = mag1*cos(phi)*sin(theta)
y = mag1*sin(theta)*sin(phi)
z = mag1*cos(theta)
noise(1) = sqrt(2.0*kb*temp*damp)*x
noise(2) = sqrt(2.0*kb*temp*damp)*y
noise(3) = sqrt(2.0*kb*temp*damp)*z
end subroutine lang_force
When compiled with the latest version of pdf90, it tells me that it needs access to RANDOM_NUMBER(). How do I declare a routine directive to such a fortran90 intrinsic subroutine?
Not all Fortran intrinsics are supported within device code including RANDOM_NUMBER.
RANDOM_NUMBER in particular is not thread safe since all threads would share the same state. Instead, you need to use cuRand for which we ship examples with the compilers under the "2020/examples/CUDA-Libraries/cuRAND/test_rand_oacc_ftn" directory.
While this is in C, I wrote about is in more detail on the NVIDIA user forums:
https://forums.developer.nvidia.com/t/random-numbers-on-device-generation/135748/2

Using MKL to solve a non-linear system of equations with an objective function stored in another module

I'm trying to use the MKL trust region algorithm to solve a nonlinear system of equations in a Fortran program. I started from the example provided online (ex_nlsqp_f90_x.f90 https://software.intel.com/en-us/node/501498) and everything works correctly. Now, because I have to use this in a much bigger program, I need the user defined objective function to be loaded from a separate module. Hence, I split the example into 2 separate files, but I'm not able to make it compile correctly.
So here is the code for module which contains user defined data structure and the objective function
module modFun
implicit none
private
public my_data, extended_powell
type :: my_data
integer a
integer sum
end type my_data
contains
subroutine extended_powell (m, n, x, f, user_data)
implicit none
integer, intent(in) :: m, n
real*8 , intent(in) :: x(n)
real*8, intent(out) :: f(m)
type(my_data) :: user_data
integer i
user_data%sum = user_data%sum + user_data%a
do i = 1, n/4
f(4*(i-1)+1) = x(4*(i-1)+1) + 10.0 * x(4*(i-1)+2)
f(4*(i-1)+2) = 2.2360679774998 * (x(4*(i-1)+3) - x(4*(i-1)+4))
f(4*(i-1)+3) = ( x(4*(i-1)+2) - 2.0 * x(4*(i-1)+3) )**2
f(4*(i-1)+4) = 3.1622776601684 * (x(4*(i-1)+1) - x(4*(i-1)+4))**2
end do
end subroutine extended_powell
end module modFun
and here the portion of the main program calling it
include 'mkl_rci.f90'
program EXAMPLE_EX_NLSQP_F90_X
use MKL_RCI
use MKL_RCI_type
use modFun
! user's objective function
! n - number of function variables
! m - dimension of function value
integer n, m
parameter (n = 4)
parameter (m = 4)
! precisions for stop-criteria (see manual for more details)
real*8 eps(6)
real*8 x(n)
real*8 fjac(m*n)
! number of iterations
integer fun
! Additional users data
type(my_data) :: m_data
m_data%a = 1
m_data%sum = 0
rs = 0.0
fun = djacobix(extended_powell,n,m,fjac,x,eps(1),%val(loc(m_data)))
end program EXAMPLE_EX_NLSQP_F90_X
Also djacobix code
INTERFACE
INTEGER FUNCTION DJACOBIX(fcn, n, m, fjac, x, eps, user_data)
USE, INTRINSIC :: ISO_C_BINDING
INTEGER, INTENT(IN) :: n
INTEGER, INTENT(IN) :: m
DOUBLE PRECISION, INTENT(IN) :: eps
DOUBLE PRECISION, INTENT(IN), DIMENSION(*) :: x
DOUBLE PRECISION, INTENT(OUT), DIMENSION(m, *) :: fjac
INTEGER(C_INTPTR_T) :: user_data
INTERFACE
SUBROUTINE fcn(m, n, x, f, user_data)
USE, INTRINSIC :: ISO_C_BINDING
INTEGER, INTENT(IN) :: n
INTEGER, INTENT(IN) :: m
DOUBLE PRECISION, INTENT(IN), DIMENSION(*) :: x
DOUBLE PRECISION, INTENT(OUT), DIMENSION(*) :: f
INTEGER(C_INTPTR_T), INTENT(IN) :: user_data
END SUBROUTINE
END INTERFACE
END FUNCTION
END INTERFACE
When i compile the following errors are generated:
mpiifort -g -t -mkl -I/apps/rhel6/intel/composer_xe_2015.3.187/mkl/include/intel64/lp64 -c modFun.f90
mpiifort -g -t -mkl -I/apps/rhel6/intel/composer_xe_2015.3.187/mkl/include/intel64/lp64 -c main.f90
main.f90(30): error #7065: The characteristics of dummy argument 5 of the associated actual procedure differ from the characteristics of dummy argument 5 of the dummy procedure. [EXTENDED_POWELL]
fun = djacobix(extended_powell,n,m,fjac,x,eps(1),%val(loc(m_data)))
-------------------^
I have the feeling I have to create an interface to override the check on the m_data, but I can't figure out where and how. Can anyone help me with this problem providing a working example?
I guess the reason is that the function djacobix passes the pointer instead of the true value of variable user_data.
You can check the manual at https://software.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/nonlinear-optimization-problem-solvers/jacobian-matrix-calculation-routines/jacobix.html where a sentence shows that "You need to declare fcn as extern in the calling program."

OpenBLAS slower than intrinsic function dot_product

I need make a dot product in Fortran. I can do with the intrinsic function dot_product from Fortran or use ddot from OpenBLAS. The problem is the ddot is slower. This is my code:
With BLAS:
program VectorBLAS
! time VectorBlas.e = 0.30s
implicit none
double precision, dimension(3) :: b
double precision :: result
double precision, external :: ddot
integer, parameter :: LargeInt_K = selected_int_kind (18)
integer (kind=LargeInt_K) :: I
DO I = 1, 10000000
b(:) = 3
result = ddot(3, b, 1, b, 1)
END DO
end program VectorBLAS
With dot_product
program VectorModule
! time VectorModule.e = 0.19s
implicit none
double precision, dimension (3) :: b
double precision :: result
integer, parameter :: LargeInt_K = selected_int_kind (18)
integer (kind=LargeInt_K) :: I
DO I = 1, 10000000
b(:) = 3
result = dot_product(b, b)
END DO
end program VectorModule
The two codes are compiled using:
gfortran file_name.f90 -lblas -o file_name.e
What am I doing wrong? BLAS not have to be faster?
While BLAS, and especially the optimized versions, are generally faster for larger arrays, the built-in functions are faster for smaller sizes.
This is especially visible from the linked source code of ddot, where additional work is spent on further functionality (e.g., different increments). For small array lengths, the work done here outweighs the performance gain of the optimizations.
If you make your vectors (much) larger, the optimized version should be faster.
Here is an example to illustrate this:
program test
use, intrinsic :: ISO_Fortran_env, only: REAL64
implicit none
integer :: t1, t2, rate, ttot1, ttot2, i
real(REAL64), allocatable :: a(:),b(:),c(:)
real(REAL64), external :: ddot
allocate( a(100000), b(100000), c(100000) )
call system_clock(count_rate=rate)
ttot1 = 0 ; ttot2 = 0
do i=1,1000
call random_number(a)
call random_number(b)
call system_clock(t1)
c = dot_product(a,b)
call system_clock(t2)
ttot1 = ttot1 + t2 - t1
call system_clock(t1)
c = ddot(100000,a,1,b,1)
call system_clock(t2)
ttot2 = ttot2 + t2 - t1
enddo
print *,'dot_product: ', real(ttot1)/real(rate)
print *,'BLAS, ddot: ', real(ttot2)/real(rate)
end program
The BLAS routines are quite a bit faster here:
OMP_NUM_THREADS=1 ./a.out
dot_product: 0.145999998
BLAS, ddot: 0.100000001

Sympy - generate fortran code with one dimensional array

I was trying to use sympy - codegen to get a fortran code. It works fine except for one annoying thing for which I can`t find a solution. I simplified my example to this:
bar = Matrix([x*x,y*z,z*y])
result = codegen(('foo', bar), 'f95', 'project')
the results is:
subroutine foo(x, y, z, out_6551735094710235777)
implicit none
REAL*8, intent(in) :: x
REAL*8, intent(in) :: y
REAL*8, intent(in) :: z
REAL*8, intent(out), dimension(1:3, 1:1) :: out_6551735094710235777
out_6551735094710235777(1, 1) = x**2
out_6551735094710235777(2, 1) = y*z
out_6551735094710235777(3, 1) = y*z
end subroutine
The output of the routine is a two dimensional array. Is there a way to make it single dimensional - for this case the second dimension is one anyway. If I generate the C code for this example I am getting a one dimensional vector by default. Why for fortran it is different?
Also, how to define a name for an output for case like this instead of the auto generated one?