I am assessing the performance of a Fortran 90 code. Running the code through Intel's Advisor program I see that loops with the following style are not getting vectorized. An example of the loop structure is shown in the Subroutine and module files described below.
The code is being compiled with Intel's Compiler 19.0.3
-O3 optimization turned on
Subroutine SampleProblem
Use GlobalVariables
Implicit None
Integer :: ND, K, LP, L
Real :: AVTMP
! Sample of loop structure that is no vectorized
DO ND=1,NDM
DO K=1,KS
DO LP=1,LLWET(K,ND)
L = LKWET(LP,K,ND)
AVTMP = AVMX*HPI(L)
ENDDO
ENDDO
ENDDO
End Subroutine SampleProblem
LLWET and LKWET are allocatable arrays declared in a module 'GlobalVariables'. Something like:
Module GlobalVariables
Implicit None
! Variable declarations
REAL :: AVMX
INTEGER :: NDM
REAL,ALLOCATABLE,DIMENSION(:) :: HPI
INTEGER,ALLOCATABLE,DIMENSION(:,:) :: LLWET
INTEGER,ALLOCATABLE,DIMENSION(:,:,:) :: LKWET
End Module GlobalVariables
I don't see why this loop would not get vectorized by the compiler. There are many loops like his all over the code and none of them get vectorized, per the reported results of Intel's Advisor. I have tried forcing vecotoriztion with a !$SIMD block around the loop.
Related
I'm attempting to parallelize some Fortran 90 code using OpenACC, where a parallelized loop calls a sequential routine. When I attempt to run the code using the PGI Fortran compiler (2020.4), I obtain an error message saying that reference argument passing prevents parallelization.
My understanding is that this is likely because one routine exists on the Host while the other is on the Device, but I'm unclear on where I might be missing a pragma that would lead to this outcome.
The basic structure of the calling routine is:
subroutine OuterRoutine(F,G,X,Y)
real(wp), dimension(:,:), intent(IN) :: X
real(wp), dimension(:,:), intent(IN) :: Y
real(wp), dimension(1,PT), intent(OUT) :: F
real(wp), dimension(N_p,PT), intent(OUT) :: G
! Local Variables
integer :: t, i, j
!$acc data copyin(X,Y), copyout(F,G)
!$acc parallel loop
do t = 1,PT,1
!$acc loop collapse(2) reduction(+:intr)
do i = 1,N_int-1,1
do j = 1,N_int-1,1
G(i,j) = intgrdJ2(X(i,j),X(j,i),Y(i,j),Y(j,i),t)
end do
end do
!$acc end loop
!$acc end parallel loop
!$acc end data
end subroutine OuterRoutine
And the function being called is:
function intgrdJ2(z,mu,p,q,t)
!$acc routine seq
real(wp), intent(IN) :: z, mu, p, q
integer, intent(IN) :: t
real(wp) :: intgrdJ2
! Local Variables
real(wp) :: mu2
real(wp), dimension(N_p) :: nu_m2, psi_m2
integer :: i
mu2 = (mu*fh_pdf(z,mu,p))/f_pdf(z,mu,p)
do i = 1,N_p,1
nu_m2(i) = interpValue(mu2,mugrid,nu_knots(:,i,t))
psi_m2(i) = interpValue(mu2,mugrid,psi_knots(:,i,t))
end do
intgrdJ2 = nu_m2(i)*psi_m2(i)
end function intgrdJ2
The routines interpValue, fh_pdf, and f_pdf are all contained in a used module, and denoted as !$acc routine seq. The variables mugrid, nu_knots, and psi_knots are all module-level variables, which are copied-in to the Device prior to calling OuterRoutine.
When I run the code, I get this sort of output from the compiler:
intgrdj2:
576, Generating acc routine seq
Generating Tesla code
593, Reference argument passing prevents parallelization: mu2
Where 593 refers to the "nu_m2(i) = ..." line.
My understanding is that since the variable mu2 is a scalar declared inside of the sequential routine, each thread should have it's own copy of the variable, and I don't need to explicitly declare it to be private when I declare the data region. From reading this post it seems that the problem may be related to where the routines are located (Host vs Device). However, it seems as though all of the relevant pieces should be on the device because I'm specifying that routines are sequential.
As a first-time OpenACC user, any explanations about what I might be overlooking would be greatly appreciated!
My understanding is that since the variable mu2 is a scalar declared
inside of the sequential routine, each thread should have it's own
copy of the variable, and I don't need to explicitly declare it to be
private when I declare the data region
This is true in most cases. But what's likely happening here is that since Fortran by default passes variables by reference, the compiler must assume that it's reference can be taken by a module variable. Unlikely, but possible.
The typical way to fix this is to pass the scalar by value, i.e. add the "value" attribute to the argument declaration in "interpValue". Alternately, you can explicitly privatize "mu2" by adding "!$acc loop seq private(mu2)" on the "i" loop.
Now the message may just be indicating that the compiler can't auto-parallelize this loop. But since it's in a sequential routine, that wouldn't matter and you can safely ignore the message. Though, I don't have the full context so can't be 100% certain of this.
I am looking for a pure way to have access to time information. I thought about intrinsic functions and subroutines of standards compiler (date_and_time,cpu_time, system_clock, ltime, ctime, ...), the format do not really matter to me. I also thought about MPI function, but it is the same as intrinsic functions, there are all impure functions.
Here is a minimal example:
elemental subroutine add(message, text)
! function add
IMPLICIT NONE
character(len=:),allocatable,intent(inout) :: message
character(len=*), intent(in) :: text
character(len=1), parameter :: nl=char(10)
character(10) :: time
! character(8) :: date
! time= 'hhmmss.sss'
call DATE_AND_TIME(time)
Message= Message//time////text//nl
end subroutine add
and I get a logical error :
Error: Subroutine call to intrinsic ‘date_and_time’ at (1) is not PURE
Thus, I am wandering if a pure way to have a time information exists, or if it is impossible to have it purely (maybe because it has to use cpu information which, for a reason unknown to me, could be thread-unsafe).
Maybe, a subquestion, could be is there a solution to force the compiler to consider date_and_time pure (or any other function of that kind)?
The answer about a pure manner to get time is no. A function that returns the current time or date, is impure because at different times it will yield different results—it refers to some global state.
There are certainly tricks to persuade the compiler that a subroutine is pure. One is to flat out lie in an interface block.
But there are consequences from lying to the compiler. It can do optimizations which are unsafe to do and the results will be undefined (most often correct anyway, but...).
module m
contains
elemental subroutine add(text)
IMPLICIT NONE
character(len=*), intent(in) :: text
character(len=1), parameter :: nl=char(10)
character(10) :: time
intrinsic date_and_time
interface
pure subroutine my_date_and_time(time)
character(10), intent(out) :: time
end subroutine
end interface
call MY_DATE_AND_TIME(time)
end subroutine add
end module
program test
use m
call add("test")
end program
subroutine my_date_and_time(time)
character(10), intent(out) :: time
call date_and_time(time)
end subroutine
Notice I had to delete your message because that was absolutely incompatible with elemental.
The following code is returning a Segmentation Fault because the allocatable array I am trying to pass is not being properly recognized (size returns 1, when it should be 3). In this page (http://www.eng-tips.com/viewthread.cfm?qid=170599) a similar example seems to indicate that it should work fine in F95; my code file has a .F90 extension, but I tried changing it to F95, and I am using gfortran to compile.
My guess is that the problem should be in the way I am passing the allocatable array to the subroutine; What am I doing wrong?
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
PROGRAM test
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
IMPLICIT NONE
DOUBLE PRECISION,ALLOCATABLE :: Array(:,:)
INTEGER :: iii,jjj
ALLOCATE(Array(3,3))
DO iii=1,3
DO jjj=1,3
Array(iii,jjj)=iii+jjj
PRINT*,Array(iii,jjj)
ENDDO
ENDDO
CALL Subtest(Array)
END PROGRAM
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
SUBROUTINE Subtest(Array)
DOUBLE PRECISION,ALLOCATABLE,INTENT(IN) :: Array(:,:)
INTEGER :: iii,jjj
PRINT*,SIZE(Array,1),SIZE(Array,2)
DO iii=1,SIZE(Array,1)
DO jjj=1,SIZE(Array,2)
PRINT*,Array(iii,jjj)
ENDDO
ENDDO
END SUBROUTINE
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
If a procedure has a dummy argument that is an allocatable, then an explicit interface is required in any calling scope.
(There are numerous things that require an explicit interface, an allocatable dummy is but one.)
You can provide that explicit interface yourself by putting an interface block for your subroutine inside the main program. An alternative and far, far, far better option is to put the subroutine inside a module and then USE that module in the main program - the explicit interface is then automatically created. There is an example of this on the eng-tips site that you provided a link to - see the post by xwb.
Note that it only makes sense for a dummy argument to have the allocatable attribute if you are going to do something related to its allocation status - query its status, reallocate it, deallocate it, etc.
Please also note that your allocatable dummy argument array is declared with intent(in), which means its allocation status will be that of the associated actual argument (and it may not be changed during the procedure). The actual argument passed to your subroutine may be unallocated and therefore illegal to reference, even with an explicit interface. The compiler will not know this and the behaviour of inquiries like size is undefined in such cases.
Hence, you first have to check the allocation status of array with allocated(array) before referencing its contents. I would further suggest to implement loops over the full array with lbound and ubound, since in general you can't be sure about array's bounds:
subroutine subtest(array)
double precision, allocatable, intent(in) :: array(:,:)
integer :: iii, jjj
if(allocated(array)) then
print*, size(array, 1), size(array, 2)
do iii = lbound(array, 1), ubound(array, 1)
do jjj = lbound(array, 2), ubound(array, 2)
print*, array(iii,jjj)
enddo
enddo
endif
end subroutine
This is a simple example that uses allocatable dummy arguments with a module.
module arrayMod
real,dimension(:,:),allocatable :: theArray
end module arrayMod
program test
use arrayMod
implicit none
interface
subroutine arraySub
end subroutine arraySub
end interface
write(*,*) allocated(theArray)
call arraySub
write(*,*) allocated(theArray)
end program test
subroutine arraySub
use arrayMod
write(*,*) 'Inside arraySub()'
allocate(theArray(3,2))
end subroutine arraySub
I'm currently attempting to accelerate a spectral element fluids solver by porting most of the routines to a GPGPU using OpenACC with the PGI (15.10) compiler. The source code is written in OO-Fortran. This software has "layers" of subroutines that call other functions and subroutines. To bring the code over to a GPU using openacc, I've been first attempting to place "$acc routine" directives in each routine that needs to be ported. During compilation, using "pgf90 -acc -Minfo=accel", I receive the following error :
nvvmCompileProgram error: 9.
Error: /tmp/pgacc2lMnIf9lMqx8.gpu (146, 24): parse invalid forward reference to function 'innerroutine_' with wrong type!
PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (Test.f90: 1)
This same problem can be reproduced with the following simple fortran program :
PROGRAM Test
IMPLICIT NONE
CONTAINS
SUBROUTINE OuterRoutine( sol, xF, N )
!$acc routine
IMPLICIT NONE
INTEGER :: N
REAL(KIND=8) :: sol(0:N,1:3)
REAL(KIND=8) :: xF(0:N,1:3)
! LOCAL
INTEGER :: i
DO i = 0, N
xF(i,1:3) = InnerRoutine( sol(i,1:3) )
ENDDO
END SUBROUTINE OuterRoutine
FUNCTION InnerRoutine( sol ) RESULT( xF )
!$acc routine
IMPLICIT NONE
REAL(KIND=8) :: sol(1:3)
REAL(KIND=8) :: xF(1:3)
xF(1) = sol(1)*sol(2)
xF(2) = sol(1)*sol(3)
xF(3) = sol(1)*sol(1)
END FUNCTION InnerRoutine
END PROGRAM Test
Again, compiling the above program with "pgf90 -acc -Minfo=accel" yields the problem.
Does openacc support acc-enabled routines calling other acc-enabled routines ?
If so, what am I doing wrong ?
You're using the OpenACC "routine" directive correctly. The problem here is that we (PGI) don't yet support using "routine" with array-valued functions. The problem being that this support requires the compiler to create a temp array to hold the return value. Meaning that every thread would need to allocate this temp array causing a severe performance penalty. Worse is how to handle sharing the temp array if is a gang or worker level routine.
We do have open requests for this feature, but it may be awhile before we can address it. In the meantime, can you try inlining the routine? i.e. compile with "-Minline".
I am facing the following issue when running a hybrid MPI/OpenMP
code with GNU and Intel compilers and OpenMPI. The code is big (commercial)
written in Fortran. It compiles and runs fine with GNU compilers
but crashes with Intel compilers.
I have monitored the part of the code when the program stops working,
it has the following structure:
subroutine test(n,dy,dy)
integer :: i
integer, parameter :: n=6
real*8 :: dx(num),dy(num), ener
ener=0.0
!$omp parallel num_threads(2)
!$omp do
do i=1,100
ener = ener + funct(n,dx,dy) + i
enddo
!$omp end do
!$omp end parallel
end subroutine test
and the function funct has this structure:
real*8 function funct(n,dx,dy)
integer :: n
real*8 :: dx(*),dy(*)
funct = 0.0
do i=1,n
funct = funct + dx(i)+dy(i)
enddo
end function funct
Specifically the code stops inside funct (with Intel). The
program is able to get the end of funct but only one thread
of the two requested is able to return the value, I checked
that by printing the thread numbers.
This issue is only for Intel compilers, for GNU I don't get
the issue.
One way to avoid the issue, I found, is by using plain arrays
inside funct as follows:
real*8 function funct(n,dx,dy)
integer :: n
real*8 :: dx(n),dy(n)
but my point is that I don't understand what is happening.
My guess is that in the Intel case, the compiler cannot
figure out the length of dx and dy inside funct but I am
not sure. I tried to reproduce this issue with a small
Fortran program but I was not able to see that issue.
Any comment is welcome.
One update: I eliminated the issue with the race condition (this is
not the real problem, what I wrote here was the structure of the code).
I realized that subroutinetest is being called from another subroutine
upper which defines dx,dy as pointers:
subroutine upper
real*8,save,pointer :: dx(:)=>Null(), dy(:)=>Null()
....
call test(n,dx,dy)
...
end subroutine upper
what I did now, was to replace pointers by allocatables:
subroutine upper
real*8,save,dimension(:),allocatable :: dx,dy
....
allocate(dx(n),dy(n))
call test(n,dx,dy)
...
end subroutine upper
and I don't get the issue with Intel. I don't know what could be the
difference between pointers and allocatables.
Thanks.