In a Fortran program accelerated with OpenACC, I need to duplicate an array on the GPU. The duplicate will only be used on the GPU and will never be copied to the host. The only way I know to create it is to declare and allocate it on the host, then create it on the device with an acc data create clause:
program test
implicit none
integer, parameter :: n = 1000
real :: total
real, allocatable :: array(:)
real, allocatable :: array_d(:)
allocate(array(n))
allocate(array_d(n))
array(:) = 1e0
!$acc data copy(array) create(array_d) copyout(total)
!$acc kernels
array_d(:) = array(:)
!$acc end kernels
!$acc kernels
total = sum(array_d)
!$acc end kernels
!$acc end data
print *, sum(array)
print *, total
deallocate(array)
deallocate(array_d)
end program
This is only illustrative code; the real program is much more complex.
The problem with this solution is that I have to allocate the duplicated array on the host, even though I never use it there. Some host memory is wasted, especially for large arrays (even if I know I would run out of device memory before running out of host memory). In CUDA Fortran, I know I can declare a device-only array, but I do not know if this is possible with OpenACC.
Is there a better way to perform this?
The OpenACC spec has the "acc declare device_resident" directive, which allocates a device-only array; you'd use it instead of a "data create" clause. Something like:
program test
implicit none
integer, parameter :: n = 1000
real :: total
real, allocatable :: array(:)
real, allocatable :: array_d(:)
!$acc declare device_resident(array_d)
allocate(array(n))
allocate(array_d(n))
array(:) = 1e0
!$acc data copy(array) copyout(total)
!$acc kernels
array_d(:) = array(:)
!$acc end kernels
!$acc kernels
total = sum(array_d)
!$acc end kernels
!$acc end data
print *, sum(array)
print *, total
deallocate(array)
deallocate(array_d)
end program
Though, due to the complexity of the implementation and the lack of a compelling use case, our compiler (NVHPC, aka PGI) treats device_resident as a create, i.e., the host array is still allocated. So if you're using NVHPC and truly need a device-only array, then you'll want to use the CUDA Fortran "device" attribute on the array. CUDA Fortran and OpenACC are interoperable, so it's fine to mix them.
However, wasting a bit of host memory isn't an issue for the vast majority of codes, and since no data is copied, there's no performance impact. Hence if you kept the code as is, it shouldn't be a problem.
I'm attempting to parallelize some Fortran 90 code using OpenACC, where a parallelized loop calls a sequential routine. When I attempt to run the code using the PGI Fortran compiler (2020.4), I obtain an error message saying that reference argument passing prevents parallelization.
My understanding is that this is likely because one routine exists on the Host while the other is on the Device, but I'm unclear on where I might be missing a pragma that would lead to this outcome.
The basic structure of the calling routine is:
subroutine OuterRoutine(F,G,X,Y)
real(wp), dimension(:,:), intent(IN) :: X
real(wp), dimension(:,:), intent(IN) :: Y
real(wp), dimension(1,PT), intent(OUT) :: F
real(wp), dimension(N_p,PT), intent(OUT) :: G
! Local Variables
integer :: t, i, j
!$acc data copyin(X,Y), copyout(F,G)
!$acc parallel loop
do t = 1,PT,1
!$acc loop collapse(2) reduction(+:intr)
do i = 1,N_int-1,1
do j = 1,N_int-1,1
G(i,j) = intgrdJ2(X(i,j),X(j,i),Y(i,j),Y(j,i),t)
end do
end do
!$acc end loop
!$acc end parallel loop
!$acc end data
end subroutine OuterRoutine
And the function being called is:
function intgrdJ2(z,mu,p,q,t)
!$acc routine seq
real(wp), intent(IN) :: z, mu, p, q
integer, intent(IN) :: t
real(wp) :: intgrdJ2
! Local Variables
real(wp) :: mu2
real(wp), dimension(N_p) :: nu_m2, psi_m2
integer :: i
mu2 = (mu*fh_pdf(z,mu,p))/f_pdf(z,mu,p)
do i = 1,N_p,1
nu_m2(i) = interpValue(mu2,mugrid,nu_knots(:,i,t))
psi_m2(i) = interpValue(mu2,mugrid,psi_knots(:,i,t))
end do
intgrdJ2 = nu_m2(i)*psi_m2(i)
end function intgrdJ2
The routines interpValue, fh_pdf, and f_pdf are all contained in a used module, and denoted as !$acc routine seq. The variables mugrid, nu_knots, and psi_knots are all module-level variables, which are copied-in to the Device prior to calling OuterRoutine.
When I run the code, I get this sort of output from the compiler:
intgrdj2:
576, Generating acc routine seq
Generating Tesla code
593, Reference argument passing prevents parallelization: mu2
Where 593 refers to the "nu_m2(i) = ..." line.
My understanding is that since the variable mu2 is a scalar declared inside of the sequential routine, each thread should have its own copy of the variable, and I don't need to explicitly declare it to be private when I declare the data region. From reading this post it seems that the problem may be related to where the routines are located (Host vs Device). However, it seems as though all of the relevant pieces should be on the device because I'm specifying that the routines are sequential.
As a first-time OpenACC user, any explanations about what I might be overlooking would be greatly appreciated!
"My understanding is that since the variable mu2 is a scalar declared inside of the sequential routine, each thread should have its own copy of the variable, and I don't need to explicitly declare it to be private when I declare the data region"
This is true in most cases. But what's likely happening here is that, since Fortran passes variables by reference by default, the compiler must assume that a reference to mu2 could be captured by the callee, for example stored through a module variable. Unlikely, but possible.
The typical way to fix this is to pass the scalar by value, i.e. add the "value" attribute to the argument declaration in "interpValue". Alternatively, you can explicitly privatize "mu2" by adding "!$acc loop seq private(mu2)" on the "i" loop.
Now the message may just be indicating that the compiler can't auto-parallelize this loop. But since it's in a sequential routine, that wouldn't matter and you can safely ignore the message. Though, I don't have the full context so can't be 100% certain of this.
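A minimal sketch of the pass-by-value fix mentioned above. The real interpValue is not shown in the question, so the routine below, its name interp_value, its argument names, and its linear-interpolation body are all invented for illustration; only the use of the "value" attribute on the scalar argument is the point.

```fortran
module interp_mod
  implicit none
contains
  ! Hypothetical stand-in for interpValue: linear interpolation of the
  ! tabulated values "knots" on the increasing grid "grid" at the point x.
  function interp_value(x, grid, knots) result(y)
    !$acc routine seq
    real, value      :: x         ! passed by value: the callee works on its own copy
    real, intent(in) :: grid(:)
    real, intent(in) :: knots(:)
    real             :: y
    integer          :: k

    ! Clamp x to the grid range (this only changes the local copy).
    x = max(grid(1), min(x, grid(size(grid))))

    ! Find the interval [grid(k), grid(k+1)] that brackets x.
    k = 1
    do while (k < size(grid) - 1 .and. x > grid(k + 1))
      k = k + 1
    end do

    y = knots(k) + (knots(k + 1) - knots(k)) * (x - grid(k)) / (grid(k + 1) - grid(k))
  end function interp_value
end module interp_mod
```

Because the caller's variable is copied at the call, the compiler no longer has to assume the callee might keep a reference to it. The "value" attribute needs an explicit interface, which a module procedure provides automatically.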
When a situation such as the one described in "Incorrect fortran errors: allocatable array is already allocated; DEALLOCATE points to an array that cannot be deallocated" happens (corrupted memory leaves an allocatable array that appears allocated but does not "point" to a valid address), is there anything that can be done within Fortran to cure it, i.e., reset the array as deallocated, without trying to deallocate the memory it points to?
The situation is a Fortran/C program where a piece of C code purposefully corrupts (writes garbage to) allocated memory. This works fine for arrays of normal types. But with an allocatable array of a user-defined type, which itself includes an allocatable component, the garbage written to the portion belonging to the allocatable component means that the component now appears allocated, even though it's not. Rather than making the C code aware of what it should or should not corrupt, I'd prefer fixing it afterwards, by "nullifying" the allocatable component, when I know I don't care about the memory it currently appears to point to. With a pointer, it would just be a matter of nullify, but with an allocatable array?
If the memory is really corrupted, as in stack corruption or heap corruption, you cannot do anything. The program is bound to fail because the very low-level information is lost. This is true for any programming language, even C.
If what is corrupted is the Fortran array descriptor, you cannot correct it from Fortran. Fortran does not expose these implementation details to Fortran programmers. They are only available from C, via the special header ISO_Fortran_binding.h.
If the only corruption that happened was making Fortran think that the array is allocated when it isn't, it should be rather simple to revert that from C. All that should be necessary is to change the address of the allocated memory. Allocatable arrays are always contiguous.
One could also try dirty tricks like telling a subroutine that what you are passing is a pointer when it is in fact an allocatable, and nullifying it. It will likely work in many implementations. But nullifying the address in a controllable way is much cleaner, even if it is just a single nullifying C function you call from Fortran.
Because you really only want to change the address to 0 and not do anything special with the array extents, strides and other details, it should be simple to do even without the header.
Note that the descriptor will still contain nonsense data in other variables, but those should not matter.
This is a quick and dirty test:
Fortran:
dimension A(:,:)
allocatable A
interface
subroutine write_garbage(A) bind(C)
dimension A(:,:)
allocatable A
end subroutine
subroutine c_null_alloc(A) bind(C)
dimension A(:,:)
allocatable A
end subroutine
end interface
call write_garbage(A)
print *, allocated(A)
call c_null_alloc(A)
print *, allocated(A)
end
C:
#include <stdint.h>
void write_garbage(intptr_t* A){
*A = 999;
}
void c_null_alloc(intptr_t* A){
*A = 0;
}
result:
> gfortran c_allocatables.c c_allocatables.f90
> ./a.out
T
F
A proper version should use ISO_Fortran_binding.h if your compiler provides it. And implicit none and other boring stuff...
A very dirty (and illegal) hack that I do not recommend at all:
dimension A(:,:)
allocatable A
interface
subroutine write_garbage(A) bind(C)
dimension A(:,:)
allocatable A
end subroutine
subroutine null_alloc(A) bind(C)
dimension A(:,:)
allocatable A
end subroutine
end interface
call write_garbage(A)
print *, allocated(A)
call null_alloc(A)
print *, allocated(A)
end
subroutine null_alloc(A) bind(C)
dimension A(:,:)
pointer A
A => null()
end subroutine
> gfortran c_allocatables.c c_allocatables.f90
c_allocatables.f90:27:21:
10 | subroutine null_alloc(A) bind(C)
| 2
......
27 | subroutine null_alloc(A) bind(C)
| 1
Warning: ALLOCATABLE mismatch in argument 'a' between (1) and (2)
> ./a.out
T
F
I am writing add-on code for a closed-source Finite-Element Framework that forces me (because it relies on some old F77-style approaches) to use assumed-size arrays in one place.
Is it possible to write an assumed-size array to the standard output, whatever its size may be?
This is not working:
module fun
implicit none
contains
subroutine writer(a)
integer, dimension(*), intent(in) :: a
write(*,*) a
end subroutine writer
end module fun
program test
use fun
implicit none
integer, dimension(2) :: a
a(1) = 1
a(2) = 2
call writer(a)
end program test
With the Intel Fortran compiler throwing
error #6364: The upper bound shall not be omitted in the last dimension of a reference to an assumed size array.
The compiler does not know how large an assumed-size array is. It has only the address of the first element. You are responsible for telling it how large the array is.
write(*,*) a(1:n)
Equivalently, you can use an explicit-shape array
integer, intent(in) :: a(n)
and then you can do
write(*,*) a
An assumed-size array may not occur as a whole array reference when that reference requires the shape of the array. An output item in a write statement is one such disallowed case.
So, in that sense the answer is: no, it is not possible to have the write statement as you have it.
From an assumed-size array, array sections and array elements may appear:
write (*,*) a(1:2)
write (*,*) a(1), a(2)
write (*,*) (a(i), i=1,2)
leading simply to the question of how to get the value 2 into the subroutine; at other times it may be 7 that is required. Let's call it n.
Naturally, changing the subroutine is tempting:
subroutine writer (a,n)
integer n
integer a(n) ! or still a(*)
end subroutine
or even
subroutine writer (a)
integer a(:)
end subroutine
One often hasn't a choice, alas, in particular when associating a procedure with a dummy procedure that has a specific interface. However, n can get into the subroutine in any of several other ways: as a module or host entity, or through a common block (avoid this one if possible). These methods do not require modifying the interface of the procedure. For example:
subroutine writer(a)
use aux_params, only : n
integer, dimension(*), intent(in) :: a
write(*,*) a(1:n)
end subroutine writer
or we could have n as an entity in the module fun and have it accessible in writer through host association. In either case, setting this n's value in the main program before writer is executed will be necessary.
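The host-association variant might be sketched as follows. The initial value 0 and the comments are additions for illustration; the module name fun and the subroutine writer come from the question.

```fortran
module fun
  implicit none
  ! Number of valid elements in the array passed to writer.
  ! Must be set by the caller before writer is executed.
  integer :: n = 0
contains
  subroutine writer(a)
    integer, dimension(*), intent(in) :: a
    ! n is the module variable above, visible here through host association,
    ! so the interface of writer stays exactly as it was.
    write(*,*) a(1:n)
  end subroutine writer
end module fun
```

In the main program, an assignment such as n = 2 (or n = size(a)) placed before call writer(a) supplies the missing extent.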
The following code is returning a Segmentation Fault because the allocatable array I am trying to pass is not being properly recognized (size returns 1, when it should be 3). In this page (http://www.eng-tips.com/viewthread.cfm?qid=170599) a similar example seems to indicate that it should work fine in F95; my code file has a .F90 extension, but I tried changing it to F95, and I am using gfortran to compile.
My guess is that the problem is in the way I am passing the allocatable array to the subroutine. What am I doing wrong?
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
PROGRAM test
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
IMPLICIT NONE
DOUBLE PRECISION,ALLOCATABLE :: Array(:,:)
INTEGER :: iii,jjj
ALLOCATE(Array(3,3))
DO iii=1,3
DO jjj=1,3
Array(iii,jjj)=iii+jjj
PRINT*,Array(iii,jjj)
ENDDO
ENDDO
CALL Subtest(Array)
END PROGRAM
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
SUBROUTINE Subtest(Array)
DOUBLE PRECISION,ALLOCATABLE,INTENT(IN) :: Array(:,:)
INTEGER :: iii,jjj
PRINT*,SIZE(Array,1),SIZE(Array,2)
DO iii=1,SIZE(Array,1)
DO jjj=1,SIZE(Array,2)
PRINT*,Array(iii,jjj)
ENDDO
ENDDO
END SUBROUTINE
!%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%!
If a procedure has a dummy argument that is an allocatable, then an explicit interface is required in any calling scope.
(There are numerous things that require an explicit interface, an allocatable dummy is but one.)
You can provide that explicit interface yourself by putting an interface block for your subroutine inside the main program. An alternative and far, far, far better option is to put the subroutine inside a module and then USE that module in the main program - the explicit interface is then automatically created. There is an example of this on the eng-tips site that you provided a link to - see the post by xwb.
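As a sketch of the module route, the subroutine from the question could be wrapped like this. The module name subtest_mod is made up here; the body of Subtest is unchanged apart from layout.

```fortran
module subtest_mod
  implicit none
contains
  ! Placing Subtest in a module gives every scope that USEs the module
  ! an explicit interface, which an allocatable dummy argument requires.
  subroutine subtest(array)
    double precision, allocatable, intent(in) :: array(:,:)
    integer :: iii, jjj
    print *, size(array, 1), size(array, 2)
    do iii = 1, size(array, 1)
      do jjj = 1, size(array, 2)
        print *, array(iii, jjj)
      end do
    end do
  end subroutine subtest
end module subtest_mod
```

The main program then only needs a USE subtest_mod statement before IMPLICIT NONE; the CALL Subtest(Array) line stays as it is.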
Note that it only makes sense for a dummy argument to have the allocatable attribute if you are going to do something related to its allocation status - query its status, reallocate it, deallocate it, etc.
Please also note that your allocatable dummy argument array is declared with intent(in), which means its allocation status will be that of the associated actual argument (and it may not be changed during the procedure). The actual argument passed to your subroutine may be unallocated and therefore illegal to reference, even with an explicit interface. The compiler will not know this and the behaviour of inquiries like size is undefined in such cases.
Hence, you first have to check the allocation status of array with allocated(array) before referencing its contents. I would further suggest implementing loops over the full array with lbound and ubound, since in general you can't be sure about the array's bounds:
subroutine subtest(array)
double precision, allocatable, intent(in) :: array(:,:)
integer :: iii, jjj
if(allocated(array)) then
print*, size(array, 1), size(array, 2)
do iii = lbound(array, 1), ubound(array, 1)
do jjj = lbound(array, 2), ubound(array, 2)
print*, array(iii,jjj)
enddo
enddo
endif
end subroutine
This is a simple example that keeps the allocatable array in a module, so it is shared as a module variable rather than passed as a dummy argument.
module arrayMod
real,dimension(:,:),allocatable :: theArray
end module arrayMod
program test
use arrayMod
implicit none
interface
subroutine arraySub
end subroutine arraySub
end interface
write(*,*) allocated(theArray)
call arraySub
write(*,*) allocated(theArray)
end program test
subroutine arraySub
use arrayMod
write(*,*) 'Inside arraySub()'
allocate(theArray(3,2))
end subroutine arraySub
I have a Fortran module that I want to organize following OOP philosophy as much as possible, while still keeping it compatible with Fortran 2003. This module basically: (a) allocates/frees temporary array buffers, and (b) provides a function do_F which operates on some data. This function do_F uses these temporary buffers, but also depends on several auxiliary types.
It is clear to me that I should put the buffers into a type, and initialize/free them when appropriate. However, since each call to do_F takes several arguments, I am not sure what the best design strategy is.
To be more concrete, consider the following implementations:
1. Pass a large number of types every time do_F is called
type object_t
! lots of private buffers
real, allocatable :: buf1(:,:,:), buf2(:,:,:), etc.
end type object_t
subroutine init_object(this)
type(object_t), intent(INOUT) :: this
allocate( this%buf1(..., ..., ...) )
!...
end subroutine init_object
subroutine do_F(this, data, aux1, aux2, ..., auxN)
type(object_t), intent(INOUT) :: this
type(data_t), intent(INOUT) :: data
type(aux1_t), intent(IN) :: aux1
!...
!do stuff on data using the buffers and values stored
! in aux1 .. auxN
end subroutine do_F
2. Save the pointers to the types that do_F needs
type object_t
! lots of private buffers
real, allocatable :: buf1(:,:,:), buf2(:,:,:), etc.
! pointers to auxiliary types
type(aux1_t), pointer :: aux1_ptr
!...
end type object_t
subroutine init_object(this, aux1, aux2, ..., auxN)
type(object_t), intent(INOUT) :: this
type(aux1_t), intent(IN), target :: aux1
!...
allocate( this%buf1(..., ..., ...) )
!...
this%aux1_ptr => aux1
!...
end subroutine init_object
subroutine do_F(this, data)
type(object_t), intent(INOUT) :: this
type(data_t), intent(INOUT) :: data
!do stuff on data using the buffers and values stored
! in this%aux1_ptr .. this%auxN_ptr
end subroutine do_F
My specific questions are:
Is implementation #2 valid? The PGI compiler didn't complain about it, but I heard that an intent(IN) argument is no longer well defined after the function returns.
Is there a performance loss by using this scheme with pointers? Even if I don't write into these aux_ptr's, will the compiler be able to optimize my code as well as in case #1?
Some notes:
The function do_F is called ~100 times, and each call takes a couple of minutes and operates on large arrays.
Apart from do_F, there are also do_G and do_H functions that operate on the same data and use the same aux variables. That is why I wanted to reduce the number of variables passed to the function in the first place.
I don't want to combine all the aux variables into one type, because they are used throughout the rest of a large HPC code.
Thanks!
Intent IN variables are well defined after return, if they were before the call. The procedure is not allowed to change them. The exception is pointer dummy arguments: for an intent(IN) pointer you can change the value of the target, but not the association status of the pointer.
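A small sketch of the pointer exception described above. The module and procedure names are invented for illustration.

```fortran
module intent_ptr_demo
  implicit none
contains
  subroutine bump(p)
    real, pointer, intent(in) :: p

    ! Allowed: intent(in) on a pointer dummy protects the association,
    ! not the value of the target.
    p = p + 1.0

    ! Not allowed (rejected at compile time if uncommented), because
    ! both statements would change the association status of p:
    ! p => null()
    ! nullify(p)
  end subroutine bump
end module intent_ptr_demo
```

After call bump(p) with p pointing at a variable x, the value of x has increased by one and p is still associated with x.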
I'm not sure about the efficiency though. Version 2 looks nicer after a quick read.
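For what it's worth, here is a compilable reduction of version 2. Everything specific in it is an assumption made for illustration: aux1_t holds a single scale factor, the buffer is one-dimensional, a plain real array named field stands in for data_t, and the nbuf argument is added so the buffer can be sized. One caveat that the sketch relies on: as far as I understand the standard, a pointer set to a dummy argument with the target attribute only stays associated after the procedure returns if the actual argument also has the target (or pointer) attribute, so the variable passed as aux1 should be declared that way.

```fortran
module object_mod
  implicit none

  type aux1_t
    real :: scale = 1.0                 ! hypothetical content
  end type aux1_t

  type object_t
    real, allocatable     :: buf1(:)    ! private work buffer
    type(aux1_t), pointer :: aux1_ptr => null()
  end type object_t

contains

  subroutine init_object(this, aux1, nbuf)
    type(object_t), intent(inout)    :: this
    type(aux1_t), intent(in), target :: aux1
    integer, intent(in)              :: nbuf   ! hypothetical buffer size
    allocate(this%buf1(nbuf))
    ! Remember where aux1 lives; the actual argument must itself be a
    ! target for this association to remain valid after return.
    this%aux1_ptr => aux1
  end subroutine init_object

  subroutine do_F(this, field)
    type(object_t), intent(inout) :: this
    real, intent(inout)           :: field(:)
    ! Read-only use of the saved auxiliary object through the pointer.
    this%buf1 = field * this%aux1_ptr%scale
    field = this%buf1
  end subroutine do_F

end module object_mod
```

In the caller, declaring the auxiliary object as type(aux1_t), target :: aux before passing it to init_object keeps the saved pointer usable in later calls to do_F.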