I'm attempting to parallelize some Fortran 90 code using OpenACC, where a parallelized loop calls a sequential routine. When I compile the code with the PGI Fortran compiler (2020.4), I get a message saying that reference argument passing prevents parallelization.
My understanding is that this is likely because one routine exists on the Host while the other is on the Device, but I'm unclear on where I might be missing a pragma that would lead to this outcome.
The basic structure of the calling routine is:
subroutine OuterRoutine(F,G,X,Y)
   real(wp), dimension(:,:), intent(IN) :: X
   real(wp), dimension(:,:), intent(IN) :: Y
   real(wp), dimension(1,PT), intent(OUT) :: F
   real(wp), dimension(N_p,PT), intent(OUT) :: G
   ! Local Variables
   integer :: t, i, j

   !$acc data copyin(X,Y), copyout(F,G)
   !$acc parallel loop
   do t = 1,PT,1
      !$acc loop collapse(2) reduction(+:intr)
      do i = 1,N_int-1,1
         do j = 1,N_int-1,1
            G(i,j) = intgrdJ2(X(i,j),X(j,i),Y(i,j),Y(j,i),t)
         end do
      end do
      !$acc end loop
   end do
   !$acc end parallel loop
   !$acc end data
end subroutine OuterRoutine
And the function being called is:
function intgrdJ2(z,mu,p,q,t)
   !$acc routine seq
   real(wp), intent(IN) :: z, mu, p, q
   integer, intent(IN) :: t
   real(wp) :: intgrdJ2
   ! Local Variables
   real(wp) :: mu2
   real(wp), dimension(N_p) :: nu_m2, psi_m2
   integer :: i

   mu2 = (mu*fh_pdf(z,mu,p))/f_pdf(z,mu,p)
   do i = 1,N_p,1
      nu_m2(i) = interpValue(mu2,mugrid,nu_knots(:,i,t))
      psi_m2(i) = interpValue(mu2,mugrid,psi_knots(:,i,t))
   end do
   intgrdJ2 = nu_m2(i)*psi_m2(i)
end function intgrdJ2
The routines interpValue, fh_pdf, and f_pdf are all contained in a used module and are declared with !$acc routine seq. The variables mugrid, nu_knots, and psi_knots are all module-level variables, which are copied to the Device prior to calling OuterRoutine.
When I compile the code, I get this sort of output from the compiler:
intgrdj2:
576, Generating acc routine seq
Generating Tesla code
593, Reference argument passing prevents parallelization: mu2
Where 593 refers to the "nu_m2(i) = ..." line.
My understanding is that since the variable mu2 is a scalar declared inside of the sequential routine, each thread should have its own copy of the variable, and I don't need to explicitly declare it to be private when I declare the data region. From reading this post it seems that the problem may be related to where the routines are located (Host vs Device). However, it seems as though all of the relevant pieces should be on the device, because I'm specifying that the routines are sequential.
As a first-time OpenACC user, any explanations about what I might be overlooking would be greatly appreciated!
My understanding is that since the variable mu2 is a scalar declared inside of the sequential routine, each thread should have its own copy of the variable, and I don't need to explicitly declare it to be private when I declare the data region
This is true in most cases. But what's likely happening here is that, since Fortran passes arguments by reference by default, the compiler must assume that a reference to mu2 could be retained by the callee (for example, in a module variable). Unlikely, but possible.
The typical way to fix this is to pass the scalar by value, i.e. add the "value" attribute to the argument declaration in "interpValue". Alternately, you can explicitly privatize "mu2" by adding "!$acc loop seq private(mu2)" on the "i" loop.
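For concreteness, here is a minimal sketch of the first fix. The real interface of interpValue isn't shown in the question, so the argument names (x, grid, knots), the assumed-shape dummies, and the placeholder body are assumptions; wp is taken to be the same kind parameter used in the question's module. The relevant change is only the "value" attribute on the scalar dummy:

function interpValue(x, grid, knots) result(val)
   !$acc routine seq
   ! x receives the caller's mu2 by value, so no reference to it escapes the call
   real(wp), value :: x
   real(wp), dimension(:), intent(in) :: grid, knots
   real(wp) :: val
   ! Placeholder body; the actual interpolation lives in the question's module.
   val = x
end function interpValue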
Now the message may just be indicating that the compiler can't auto-parallelize this loop. But since it's in a sequential routine, that wouldn't matter and you can safely ignore the message. Though, I don't have the full context so can't be 100% certain of this.
Related
On a Fortran program accelerated with OpenACC, I need to duplicate an array on GPU. The duplicated array will only be used on GPU and will never be copied on host. The only way I know to create it would be to declare and allocate it on host, then acc data create it:
program test
   implicit none
   integer, parameter :: n = 1000
   real :: total
   real, allocatable :: array(:)
   real, allocatable :: array_d(:)
   allocate(array(n))
   allocate(array_d(n))
   array(:) = 1e0
   !$acc data copy(array) create(array_d) copyout(total)
   !$acc kernels
   array_d(:) = array(:)
   !$acc end kernels
   !$acc kernels
   total = sum(array_d)
   !$acc end kernels
   !$acc end data
   print *, sum(array)
   print *, total
   deallocate(array)
   deallocate(array_d)
end program
This is an illustration code, as the program in question is much more complex.
The problem with this solution is that I have to allocate the duplicated array on the host, even though I do not use it there. Some host memory is wasted, especially for large arrays (even if I know I would run out of device memory before running out of host memory). In CUDA Fortran, I know I can declare a device-only array, but I do not know whether this is possible with OpenACC.
Is there a better way to perform this?
The OpenACC spec has "acc declare device_resident", which allocates a device-only array and which you'd use instead of a "data create". Something like:
program test
   implicit none
   integer, parameter :: n = 1000
   real :: total
   real, allocatable :: array(:)
   real, allocatable :: array_d(:)
   !$acc declare device_resident(array_d)
   allocate(array(n))
   allocate(array_d(n))
   array(:) = 1e0
   !$acc data copy(array) copyout(total)
   !$acc kernels
   array_d(:) = array(:)
   !$acc end kernels
   !$acc kernels
   total = sum(array_d)
   !$acc end kernels
   !$acc end data
   print *, sum(array)
   print *, total
   deallocate(array)
   deallocate(array_d)
end program
Though, due to complexity in implementation and the lack of a compelling use case, our compiler (NVHPC aka PGI) treats device_resident as a create, i.e. the host array is still allocated. So if you're using NVHPC and truly need a device-only array, then you'll want to use the CUDA Fortran "device" attribute on the array. CUDA Fortran and OpenACC are interoperable, so it's fine to mix them.
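If you do go the CUDA Fortran route, a minimal sketch of the earlier example might look like the following. It assumes NVHPC compiling with both OpenACC and CUDA Fortran enabled (e.g. -acc -cuda); the "device" attribute and the host-to-device assignment are CUDA Fortran features:

program test
   use cudafor
   implicit none
   integer, parameter :: n = 1000
   real :: total
   real, allocatable :: array(:)
   real, device, allocatable :: array_d(:)   ! device-only: no host copy is created
   allocate(array(n))
   allocate(array_d(n))                      ! allocates device memory
   array(:) = 1e0
   array_d = array                           ! CUDA Fortran host-to-device copy
   !$acc kernels copyout(total)
   total = sum(array_d)                      ! device data used directly in an OpenACC region
   !$acc end kernels
   print *, sum(array)
   print *, total
   deallocate(array)
   deallocate(array_d)
end program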
However, wasting a bit of host memory isn't an issue for the vast majority of codes, and since no data is copied, there's no performance impact. Hence if you kept the code as is, it shouldn't be a problem.
I am assessing the performance of a Fortran 90 code. Running the code through Intel's Advisor, I see that loops with the following style are not getting vectorized. An example of the loop structure is shown in the subroutine and module below.
The code is compiled with Intel's compiler 19.0.3 with -O3 optimization turned on.
Subroutine SampleProblem
   Use GlobalVariables
   Implicit None
   Integer :: ND, K, LP, L
   Real :: AVTMP
   ! Sample of loop structure that is not vectorized
   DO ND=1,NDM
      DO K=1,KS
         DO LP=1,LLWET(K,ND)
            L = LKWET(LP,K,ND)
            AVTMP = AVMX*HPI(L)
         ENDDO
      ENDDO
   ENDDO
End Subroutine SampleProblem
LLWET and LKWET are allocatable arrays declared in a module 'GlobalVariables'. Something like:
Module GlobalVariables
   Implicit None
   ! Variable declarations
   REAL :: AVMX
   INTEGER :: NDM, KS
   REAL,ALLOCATABLE,DIMENSION(:) :: HPI
   INTEGER,ALLOCATABLE,DIMENSION(:,:) :: LLWET
   INTEGER,ALLOCATABLE,DIMENSION(:,:,:) :: LKWET
End Module GlobalVariables
I don't see why this loop would not get vectorized by the compiler. There are many loops like this all over the code and none of them get vectorized, per the reported results of Intel's Advisor. I have tried forcing vectorization with a !$SIMD block around the loop.
I am writing code to add on a closed-source Finite-Element Framework that forces me (due to relying on some old F77 style approaches) in one place to rely on assumed-size arrays.
Is it possible to write an assumed-size array to the standard output, whatever its size may be?
This is not working:
module fun
   implicit none
contains
   subroutine writer(a)
      integer, dimension(*), intent(in) :: a
      write(*,*) a
   end subroutine writer
end module fun

program test
   use fun
   implicit none
   integer, dimension(2) :: a
   a(1) = 1
   a(2) = 2
   call writer(a)
end program test
With the Intel Fortran compiler throwing
error #6364: The upper bound shall not be omitted in the last dimension of a reference to an assumed size array.
The compiler does not know how large an assumed-size array is; it has only the address of the first element. You are responsible for telling it how large the array is. Once you pass that size into the subroutine, say as n, you can write an array section:
write(*,*) a(1:n)
Equivalently you can use an explicit-size array
integer, intent(in) :: a(n)
and then you can do
write(*,*) a
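Putting that together, a minimal sketch of the question's example with the size passed as an extra dummy argument n (the added argument is the only change assumed here):

module fun
   implicit none
contains
   subroutine writer(a, n)
      integer, intent(in) :: n
      integer, dimension(*), intent(in) :: a
      write(*,*) a(1:n)   ! print only the first n elements
   end subroutine writer
end module fun

program test
   use fun
   implicit none
   integer, dimension(2) :: a
   a = [1, 2]
   call writer(a, size(a))
end program test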
An assumed-size array may not appear as a whole-array reference when that reference requires the shape of the array. Being an output item in a write statement is one such disallowed case.
So, in that sense the answer is: no, it is not possible to have the write statement as you have it.
From an assumed-size array, array sections and array elements may appear:
write (*,*) a(1:2)
write (*,*) a(1), a(2)
write (*,*) (a(i), i=1,2)
This leads simply to the question of how to get the value 2 into the subroutine; at other times the required value may be 7. Let's call it n.
Naturally, changing the subroutine is tempting:
subroutine writer (a,n)
integer n
integer a(n) ! or still a(*)
end subroutine
or even
subroutine writer (a)
integer a(:)
end subroutine
One often hasn't a choice, alas, in particular when associating the procedure with a dummy procedure that has a specific interface. However, n can get into the subroutine in any of several other ways: as a module or host entity, or through a common block (avoid this one if possible). These methods do not require modifying the interface of the procedure. For example:
subroutine writer(a)
   use aux_params, only : n
   integer, dimension(*), intent(in) :: a
   write(*,*) a(1:n)
end subroutine writer
or we could have n as an entity in the module fun and have it accessible in writer through host association. In either case, this n's value must be set in the main program before writer is executed.
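For completeness, here is a minimal sketch of that module-entity variant; the module aux_params and the way n is set in the main program are assumptions, not part of the original code:

module aux_params
   implicit none
   integer :: n
end module aux_params

module fun
   implicit none
contains
   subroutine writer(a)
      use aux_params, only : n
      integer, dimension(*), intent(in) :: a
      write(*,*) a(1:n)
   end subroutine writer
end module fun

program test
   use fun
   use aux_params, only : n
   implicit none
   integer, dimension(2) :: a
   a = [1, 2]
   n = size(a)     ! set the size before writer is executed
   call writer(a)
end program test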
The following code is returning a Segmentation Fault because the allocatable array I am trying to pass is not being properly recognized (size returns 1, when it should be 3). On this page (http://www.eng-tips.com/viewthread.cfm?qid=170599) a similar example seems to indicate that it should work fine in F95; my code file has a .F90 extension, but I tried changing it to F95, and I am using gfortran to compile.
My guess is that the problem should be in the way I am passing the allocatable array to the subroutine; What am I doing wrong?
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
PROGRAM test
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
   IMPLICIT NONE
   DOUBLE PRECISION,ALLOCATABLE :: Array(:,:)
   INTEGER :: iii,jjj
   ALLOCATE(Array(3,3))
   DO iii=1,3
      DO jjj=1,3
         Array(iii,jjj)=iii+jjj
         PRINT*,Array(iii,jjj)
      ENDDO
   ENDDO
   CALL Subtest(Array)
END PROGRAM
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
SUBROUTINE Subtest(Array)
   DOUBLE PRECISION,ALLOCATABLE,INTENT(IN) :: Array(:,:)
   INTEGER :: iii,jjj
   PRINT*,SIZE(Array,1),SIZE(Array,2)
   DO iii=1,SIZE(Array,1)
      DO jjj=1,SIZE(Array,2)
         PRINT*,Array(iii,jjj)
      ENDDO
   ENDDO
END SUBROUTINE
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
If a procedure has a dummy argument that is an allocatable, then an explicit interface is required in any calling scope.
(There are numerous things that require an explicit interface, an allocatable dummy is but one.)
You can provide that explicit interface yourself by putting an interface block for your subroutine inside the main program. An alternative and far, far, far better option is to put the subroutine inside a module and then USE that module in the main program - the explicit interface is then automatically created. There is an example of this on the eng-tips site that you provided a link to - see the post by xwb.
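A minimal sketch of that module approach, reusing the question's code (the module name subs is an assumption):

module subs
   implicit none
contains
   subroutine Subtest(Array)
      double precision, allocatable, intent(in) :: Array(:,:)
      integer :: iii, jjj
      print *, size(Array,1), size(Array,2)   ! now reports 3, 3
      do iii = 1, size(Array,1)
         do jjj = 1, size(Array,2)
            print *, Array(iii,jjj)
         end do
      end do
   end subroutine Subtest
end module subs

program test
   use subs          ! the explicit interface comes along automatically
   implicit none
   double precision, allocatable :: Array(:,:)
   integer :: iii, jjj
   allocate(Array(3,3))
   do iii = 1, 3
      do jjj = 1, 3
         Array(iii,jjj) = iii + jjj
      end do
   end do
   call Subtest(Array)
end program test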
Note that it only makes sense for a dummy argument to have the allocatable attribute if you are going to do something related to its allocation status - query its status, reallocate it, deallocate it, etc.
Please also note that your allocatable dummy argument array is declared with intent(in), which means its allocation status will be that of the associated actual argument (and it may not be changed during the procedure). The actual argument passed to your subroutine may be unallocated and therefore illegal to reference, even with an explicit interface. The compiler will not know this and the behaviour of inquiries like size is undefined in such cases.
Hence, you first have to check the allocation status of array with allocated(array) before referencing its contents. I would further suggest to implement loops over the full array with lbound and ubound, since in general you can't be sure about array's bounds:
subroutine subtest(array)
double precision, allocatable, intent(in) :: array(:,:)
integer :: iii, jjj
if(allocated(array)) then
print*, size(array, 1), size(array, 2)
do iii = lbound(array, 1), ubound(array, 1)
do jjj = lbound(array, 2), ubound(array, 2)
print*, array(iii,jjj)
enddo
enddo
endif
end subroutine
This is a simple example that shares an allocatable array through a module instead of passing it as a dummy argument.
module arrayMod
   real, dimension(:,:), allocatable :: theArray
end module arrayMod

program test
   use arrayMod
   implicit none

   interface
      subroutine arraySub
      end subroutine arraySub
   end interface

   write(*,*) allocated(theArray)
   call arraySub
   write(*,*) allocated(theArray)
end program test

subroutine arraySub
   use arrayMod
   write(*,*) 'Inside arraySub()'
   allocate(theArray(3,2))
end subroutine arraySub
Why does the following Fortran code only work if I pass the loop variables 'i' and 'j' as input arguments to the subroutine 'mat_init'? The loop variables 'i' and 'j' are declared as private, so shouldn't they remain private inside the subroutine when I call it?
program main
   use omp_lib
   implicit none
   real(8), dimension(:,:), allocatable :: A
   integer :: i, j, n

   n = 20
   allocate(A(n,n)); A(:,:) = 0.0d+00

   !$omp parallel do private(i, j)
   do i = 1, n
      do j = 1, n
         call mat_init
      end do
   end do

   do i = 1, n
      write(*,'(20f7.4)') (A(i,j), j=1,n)
   end do

contains

   subroutine mat_init
      A(i,j) = 1.0d+00
   end subroutine

end program main
I know this has something to do with 'lexical' and 'dynamic' extent, but I don't understand why OpenMP is implemented in a way that doesn't recognize private variables in the 'dynamic' extent inside parallel regions. It doesn't seem logical to me, or am I doing something wrong?
First, I think that the subroutine mat_init should take the values of i and j as explicit input arguments. The values of i and j must be private, because each thread works on its own specific values of i and j. I also think that OpenMP recognizes that i is private because the parallelized loop runs over i; the same goes for j. However, this works for the outer variables i and j, not for the ones referenced inside the subroutine. Thus, you have to pass i and j explicitly in order for the variables used inside the subroutine to inherit this private aspect.
I believe that the problem is due to the reentrance of the subroutine mat_init. Indeed, what happens when multiple threads enter the subroutine at the same time with different values of i and j? If you don't do anything special, the called subroutine might not recognize the private status of i and j.
In general, it is not advisable to call a subroutine many times inside a loop, because each call carries some overhead. I suggest writing a subroutine that is parallelized internally rather than calling a subroutine from within a parallelized section.
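As a sketch of the fix described above, here is the question's code with the indices passed explicitly to mat_init, so the contained subroutine uses each thread's private i and j through its dummy arguments (named ii and jj here) rather than through host association:

program main
   use omp_lib
   implicit none
   real(8), dimension(:,:), allocatable :: A
   integer :: i, j, n

   n = 20
   allocate(A(n,n)); A(:,:) = 0.0d+00

   !$omp parallel do private(i, j)
   do i = 1, n
      do j = 1, n
         call mat_init(i, j)      ! pass the private loop indices explicitly
      end do
   end do
   !$omp end parallel do

   do i = 1, n
      write(*,'(20f7.4)') (A(i,j), j=1,n)
   end do

contains

   subroutine mat_init(ii, jj)
      integer, intent(in) :: ii, jj
      A(ii,jj) = 1.0d+00          ! A is shared; each (ii,jj) pair is written by one thread only
   end subroutine

end program main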