I am currently accelerating a Fortran code where a contained subroutine (subsub) accesses and modifies variables declared in the parent subroutine (sub):
module mod
implicit none
contains
subroutine sub
integer :: var(10)
integer :: i
!$acc kernels loop
do i = 1, 10
call subsub
enddo
contains
subroutine subsub
!$acc routine
var(i) = i
endsubroutine
endsubroutine
endmodule
program test
use mod
call sub
endprogram
When compiling with the PGI compiler version 20.9-0, it complains that subsub cannot refer to the host variable var:
sub:
8, Generating implicit copy(.S0000) [if not already present]
9, Loop is parallelizable
Generating Tesla code
9, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
NVFORTRAN-S-0155-acc routine cannot be used for contained subprograms that refer to host subprogram data: var (test.f90)
0 inform, 0 warnings, 1 severes, 0 fatal for subsub
Which makes sense.
I tried to create var on the device with acc data create(var) or acc declare create(var), but it does not change the outcome.
Can this pattern be accelerated at all?
No, this pattern won't work. For contained routines, the compiler passes a hidden argument holding the parent's stack pointer. In this case that pointer would refer to host memory, which causes problems when the device tries to access it.
The workaround is to pass the variables into the contained subroutine explicitly. For example:
% cat test2.f90
module mod
implicit none
contains
subroutine sub
integer :: var(10)
integer :: i
!$acc kernels loop
do i = 1, 10
call subsub(var,i)
enddo
print *, var
contains
subroutine subsub(var,i)
!$acc routine
integer :: var(10)
integer, value :: i
var(i) = i
endsubroutine
endsubroutine
endmodule
program test
use mod
call sub
endprogram
% nvfortran test2.f90 -acc -Minfo=accel ; a.out
sub:
8, Generating implicit copy(.S0000,var(:)) [if not already present]
9, Loop is parallelizable
Generating Tesla code
9, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
subsub:
14, Generating acc routine seq
Generating Tesla code
1 2 3 4 5 6
7 8 9 10
Related
Say I have a Fortran program that performs two tasks on an array: task A computes its mean and task B doubles it. The point is that task B should be independent from task A. When accelerating the program with OpenACC, it would make sense to run the two tasks concurrently by making task A asynchronous:
program test
implicit none
integer, parameter :: n = 1000000
integer :: i
real(8) :: mean
real(8) :: array(n)
real(8) :: array_d(n)
! initialize array
array = [(i, i=1, n)]
!$acc data copy(array)
!$acc kernels async num_gangs(1)
! Task A: get mean of array
mean = 0d0
!$acc loop independent reduction(+:mean)
do i = 1, n
mean = mean + array(i)
end do
mean = mean / n
!$acc end kernels
!$acc kernels
! Task B: work on array
!$acc loop independent
do i = 1, n
array(i) = array(i) * 2
end do
!$acc end kernels
!$acc wait
!$acc end data
! print array and mean
print "(10(g0.2, x))", array(:10)
print "('mean = ', g0.2)", mean
end program
However, when running the two tasks at the same time, task B will modify the array that task A is reading, leading to incorrect values. On CPU (no acceleration) I get:
2.0 4.0 6.0 8.0 10. 12. 14. 16. 18. 20.
mean = 500000.5000000000
On GPU (using the NVIDIA HPC SDK), I get a different mean which is obviously incorrect:
2.0 4.0 6.0 8.0 10. 12. 14. 16. 18. 20.
mean = 999967.6836640000
Is there an elegant way to "protect" the array that task A is still working on?
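One possible approach, added here as a sketch rather than an answer taken from the thread: let task A reduce over a device-side snapshot of the array, so that task B is free to overwrite the original. The buffer array_copy below is a hypothetical addition to the program above.

real(8) :: array_copy(n)              ! hypothetical snapshot buffer (goes with the declarations)

!$acc data copy(array) create(array_copy)
!$acc kernels
array_copy(:) = array(:)              ! take the snapshot synchronously, before task B starts
!$acc end kernels
!$acc kernels async num_gangs(1)
! Task A: get mean of the snapshot
mean = 0d0
!$acc loop independent reduction(+:mean)
do i = 1, n
  mean = mean + array_copy(i)
end do
mean = mean / n
!$acc end kernels
!$acc kernels
! Task B: work on the original array; task A only reads array_copy
!$acc loop independent
do i = 1, n
  array(i) = array(i) * 2
end do
!$acc end kernels
!$acc wait
!$acc end data

Whether the extra copy pays off depends on how expensive the two tasks are relative to the snapshot itself.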
I have MPI ranks split up to calculate different parts of an array, and then I want to put/send those slices to a different rank that doesn't participate in the calculation. That rank is the master of a new communicator set up to do other things with the array (averaging, IO, etc.). I got it to work with MPI_isend and MPI_irecv, and now I want to try MPI_Put.
use mpi_f08
use iso_c_binding
implicit none
integer, parameter :: n=10, gps = 18, pes=12, dpes = 6
integer :: main=pes, d=dpes
integer :: diag_master
integer :: global_size, global_rank, diag_size, diag_rank
type(MPI_comm),allocatable :: diag_comm
integer :: pelist_diag
TYPE(MPI_Win) :: win
integer :: ierr, i, j
type(MPI_COMM) :: comm, mycomm
integer :: gsz, grk
integer :: lsz, lrk
integer(KIND=MPI_ADDRESS_KIND) :: local_group
logical :: local_flag
integer :: color,key
!!! THIS IS THE ARRAY
real, dimension(n,pes) :: r
!!!
logical :: on_dpes = .false.
logical,allocatable,dimension(:) :: dpes_list ! true if on dpes list
integer :: comm_manager
integer :: dmg
integer(KIND=MPI_ADDRESS_KIND) :: buff_size !< the size of a variable type
integer(kind=MPI_ADDRESS_KIND) :: displacement
integer :: disp_size
integer :: loc_base
integer, pointer :: fptr
!!!!!!!! THIS ALL WORKS BEGIN !!!!!!!!
comm=MPI_COMM_WORLD
call MPI_INIT(ierr)
call MPI_COMM_SIZE(COMM, gsz, ierr)
call MPI_COMM_RANK(COMM, grk, ierr)
allocate(dpes_list(gsz))
! write (6,*) "I am ",grk," of ",gsz
!> Find the group
call MPI_COMM_GET_ATTR(COMM,MPI_APPNUM,local_group,local_flag,ierr)
!> Split a new communicator as mycom
color = int(local_group)
key = 0
call MPI_COMM_SPLIT(COMM, color, key, mycomm, ierr)
!> Get information about the split communicators
call mpi_comm_size(mycomm,lsz,ierr)
call mpi_comm_rank(mycomm,lrk,ierr)
!> Create data on the main communicator
if (lsz == pes) then
comm_manager = main
on_dpes = .false.
r = 0.0
if (mod(lrk,2) == 0) then
c_loop: do concurrent (i=1:n)
r(i,lrk+1) = sin(real(i))+real(i)
enddo c_loop
else
r(:,lrk+1) = 10.0-dble(lrk)
endif
if (lsz == dpes) then
diag_size = lsz
diag_rank = lrk
comm_manager = d
on_dpes = .true.
diag_comm = mycomm
if (lrk==0) then
dmg = grk
endif
endif
call MPI_ALLGATHER(on_dpes,1,MPI_LOGICAL, &
dpes_list,gsz,MPI_LOGICAL, MPI_COMM_WORLD, ierr)
!> Get the master of dpes
do i=1,gsz
if (dpes_list(i)) then
dmg = i-1
exit
endif
enddo
diag_master = dmg
diag_global_master = dmg
!!!!!!!! THIS ALL WORKS END !!!!!!!!
!! At this point, the ranks that participate in the calculation
!! have values in r(i,lrk+1) where lrk is their rank
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!! THIS IS WHERE THINGS GO WRONG? !!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
disp_size = storage_size(r)
buff_size = disp_size*size(r)
call c_f_pointer(c_loc(r(1,1)),fptr)
loc_base = fptr
nullify(fptr)
write (6,*) loc_base, grk
call MPI_Win_create(loc_base,buff_size,disp_size,MPI_INFO_NULL,&
mpi_comm_world,win,ierr)
call MPI_Win_Fence(0,win,ierr)
displacement = loc_base + disp_size *buff_size
! if (.not.allocated(diag_comm)) then
if (grk == 11) then
call MPI_Put(r(:,global_rank+1),size(r,1),MPI_FLOAT,&
diag_master,displacement,size(r,1), MPI_FLOAT, win ,ierr)
endif
call MPI_Win_Fence(0,win,ierr)
CALL MPI_WIN_FREE(win, ierr)
call MPI_FINALIZE(ierr)
I have the line ! if (.not.allocated(diag_comm)) then commented out because I also tried doing this with all of the ranks that calculate r, and I got the same result.
I am compiling with mpiifort -O0 -fpe0 -init=snan,arrays -no-wrap-margin -traceback -stand f18 and run with mpirun -n 12 ./$#.x : -n 6 ./$#.x in my Makefile. The version of mpiifort I am using is
> mpiifort -v
mpiifort for the Intel(R) MPI Library 2019 Update 2 for Linux*
Copyright 2003-2019, Intel Corporation.
ifort version 19.0.2.187
The output (from write (6,*) loc_base, grk) is strange.
1072411986 0
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
0 9
0 10
0 11
2142952877 12
2142952877 13
2142952877 14
2142952877 15
2142952877 16
2142952877 17
Ranks 12-17 are the ranks that don't participate in "calculating r", but I'm not sure why c_loc(r(1,1)) is different for these ranks. It is also different for rank 0.
My actual questions are:
1) How do I calculate the displacement variable? Am I doing it correctly? Is it supposed to differ between ranks? (It will in this case.)
2) Why is c_loc(r(1,1)) different for ranks 12-17? Does it have anything to do with the fact that this is an SPMD program? Why is it different for rank 0?
3) Can I do the one-sided communication with all of the ranks instead of just one? When I did this the other way, each rank called mpi_isend and I called mpi_irecv in a loop over all of the sending ranks. Can I do something similar with MPI_Put? Should I be using MPI_Get? Something else?
4) How do I get this to work? This is just an educational example for myself, and what I actually need to do is much more complicated.
I can answer item 2, at least. You have:
call c_f_pointer(c_loc(r(1,1)),fptr)
loc_base = fptr
where loc_base is declared integer. You seem to be assuming that loc_base is some sort of address, but it is not. In Fortran, intrinsic assignment from a pointer assigns the value of the target, not the location of the target. So you're effectively doing a TRANSFER of the REAL values of r to loc_base - probably not what you want.
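For what it's worth, here is a minimal sketch of how the window setup could look instead, reusing the declarations from the question (an illustration under those assumptions, not the thread's verified fix): expose r itself as the window base, make disp_size a size in bytes (storage_size returns bits), and give MPI_Put a target displacement counted in elements of r rather than an absolute address. MPI_REAL is used because that is the Fortran datatype for default real (MPI_FLOAT is the C name).

! Sketch only: create the window directly on r; no c_loc/c_f_pointer needed.
disp_size = storage_size(r)/8                         ! bytes per element (storage_size is in bits)
buff_size = int(disp_size, MPI_ADDRESS_KIND)*size(r)
call MPI_Win_create(r, buff_size, disp_size, MPI_INFO_NULL, MPI_COMM_WORLD, win, ierr)
call MPI_Win_fence(0, win, ierr)
if (lsz == pes) then
   ! Only the calculating ranks put their slice.
   ! Offset of column lrk+1 within r on the target, in units of disp_size.
   displacement = int(n, MPI_ADDRESS_KIND)*lrk
   call MPI_Put(r(:,lrk+1), n, MPI_REAL, diag_master, displacement, n, MPI_REAL, win, ierr)
endif
call MPI_Win_fence(0, win, ierr)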
I'm writing a subroutine that transforms a regular vector into one that keeps only the non-zero elements. Say the vector is a=(0,0,1,2,3)' (n by 1). Then the non-zero vector is c=(1,2,3), and the row index is recorded as ic=(0,0,0,1,2,3), where ic(1)=0 and ic(i+1)-ic(i) is the number of non-zero elements in the i-th row. The column index vector jc=(1,1,1) has size 3, as there are 3 non-zero entries. See the sparse-matrix wiki page FYI: https://en.wikipedia.org/wiki/Sparse_matrix.
Despite its simplicity, I'm having trouble running the following code, named sparsem.f90:
!This subroutine coverts a regular sparse matrix a into a CSR form
MODULE SPARSEM
CONTAINS
SUBROUTINE vsparse(a,c,jc,ic,counta,ierr,myid)
IMPLICIT NONE
REAL(8), INTENT(IN):: a(:)
INTEGER, INTENT(IN):: counta,myid
REAL(8), INTENT(OUT):: c(counta)
INTEGER, INTENT(OUT):: jc(counta),ic(size(a)+1)
INTEGER:: ierr,countaa,i
character(len=90):: filename
ierr=0
jc=0
c=0.0d0
ic=0
PRINT *, 'SIZE OF A IN VSPARSE', size(a),count(a>0.0d0),counta
IF (COUNT(a>0.0d0) /= counta) THEN
ierr=1
PRINT *, 'ERROR: number count of non-zero a(i,j) is not', counta
ELSE
countaa=0
ic(1)=0
DO i=1,size(a)
IF (a(i) > 0.0d0 ) THEN
countaa=countaa+1
c(countaa)=a(i)
ic(i+1)=ic(i)+1
jc(countaa)=1
IF (countaa<100) PRINT *,'checkcheckcheck', a(i), &
countaa,jc(countaa),c(countaa),jc(1:5)
ELSE
ic(i+1)=ic(i)
END IF
END DO
PRINT *, 'JCJCJCJC',jc(1:5)
END IF
IF (myid==7) THEN
WRITE(filename,'("sparsedens_dcheck",I1,".txt")') myid+1
OPEN(UNIT=212101, FILE="/home/wenya/Workspace/Model4/valuef/"//filename,ACTION='write',status='replace')
DO i=1,counta+1
IF (i<=counta) THEN
WRITE(212101,*) c(i),jc(i)
ELSE
WRITE(212101,*) 0.0D0,0
END IF
END DO
CLOSE(212101)
END IF
return
END SUBROUTINE vsparse
END MODULE SPARSEM
So the three PRINT statements that show jc should all give 1 1 1 1 1 .... Yet starting from the second of them, the result is 6750960 6750691 6750692 .... The array jc has size 9,000,000, and I know the first 2,250,000 elements are 0.
To replicate this problem, here is the main program
PROGRAM MAIN
USE SPARSEM
IMPLICIT NONE
REAL(8):: dens_last(9000000)
REAL(8), ALLOCATABLE :: dens(:)
INTEGER, ALLOCATABLE :: ic(:),jc(:)
INTEGER:: ncount,ierr,myid
dens_last(1:2250000)=0.0d0
dens_last(2250001:9000000)=1.0d0/6750000.0d0
ncount=count(dens_last>0.0d0)
myid=0   ! rank id passed to vsparse; only myid==7 triggers the file dump
ALLOCATE(dens(ncount), ic(9000000+1), jc(ncount))
CALL vsparse(dens_last, dens, jc, ic, ncount, ierr, myid)
DEALLOCATE(dens,ic,jc)
END PROGRAM MAIN
I am using gfortran 6.3.0 and the latest OpenMPI on an Ubuntu 17.04 machine. Although OpenMPI is not used in this example, it is used in the rest of the program. Any thoughts? Thanks!
I'm working on code that serially calls a subroutine (which in turn performs iterations) many times. I wish to parallelize the iterations inside the subroutine. The problem with MPI is that I'm only allowed to initialize it once; hence, I cannot initialize it in my subroutine, which gets called multiple times. Can anyone suggest a way out of this?
My problem is roughly as outlined below:
program p
...
do i=1,10000
call s(i)
end do
end program p
subroutine s(j)
...
do i=1,10000
...
end do
end subroutine s
I wish to parallelize this process.
Thanks a lot. That helped! But let me reframe my question.
Within the iterations of the main program, along with the subroutine s, I have to call another subroutine s2 (which doesn't need to be parallelized). I thought it could be done this way:
! initialize MPI
do i = 1, 1000
  if (rank /= 0) then
    call s
  else
    call s2
  end if
end do
! finalize MPI
But the main problem here is that while the rest of the processes proceed slowly, process 0 will proceed quickly (something not desirable). So, is it possible to make process 0 wait after each iteration until the other processes complete their iteration?
You need to initialize and finalize MPI in the main program. Typically, you then define a load-balancing that is valid for the work in the subroutine.
Then you do your loop inside the subroutine in parallel and gather (reduce?) the results at the end of the subroutine so you have all information you need when the subroutine is called next.
This works the same way as it would with a loop in the main program (without calling the subroutine).
Here is a minimum example:
module testMod
use mpi
implicit none
!#include "mpif.h"
!===
contains
!===
subroutine s(mysize, myrank, array)
integer,intent(in) :: mysize, myrank
integer,intent(inout) :: array(:)
integer :: i, ierror
! Do stuff
do i=1,size(array)
! Skip element that is not associated with the current process
if ( mod(i,mysize) .ne. myrank ) cycle
array(i) = array(i) + 1
enddo ! i
! MPI Allreduce
call MPI_Allreduce(MPI_IN_PLACE, array, size(array), MPI_INTEGER, &
MPI_MAX, MPI_COMM_WORLD, ierror)
end subroutine
end module
program mpiTest
use testMod
use mpi
implicit none
!#include "mpif.h"
integer :: mysize, myrank, ierror
integer,parameter :: ITER=100
integer,parameter :: arraySize=10
integer :: work(arraySize)
integer :: i
! MPI Initialization
call MPI_Init(ierror)
call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierror)
call MPI_Comm_size(MPI_COMM_WORLD, mysize, ierror)
work = 0
do i=1,ITER
call s(mysize, myrank, work)
enddo
if ( myrank .eq. 0 ) write(*,*) work
! MPI Finalize
call MPI_Finalize(ierror)
end program
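Regarding the follow-up about making process 0 wait: a minimal sketch of one possibility (an addition here, not part of the answer's example; it assumes rank and ierror are declared, MPI is initialized in the main program as above, and s itself contains no collectives over MPI_COMM_WORLD that rank 0 would have to join) is a barrier at the end of every iteration:

do i = 1, 1000
  if (rank /= 0) then
    call s        ! worker ranks do the parallel work
  else
    call s2       ! rank 0 does its serial task
  end if
  ! Everyone, including rank 0, waits here until all ranks finish this iteration.
  call MPI_Barrier(MPI_COMM_WORLD, ierror)
end do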
I'd like to do this:
program main
implicit none
integer l
integer, allocatable, dimension(:) :: array
allocate(array(10))
array = 0
!$omp parallel do private(array)
do l = 1, 10
array(l) = l
enddo
!$omp end parallel do
print *, array
deallocate(array)
end
But I am running into error messages:
*** glibc detected *** ./a.out: munmap_chunk(): invalid pointer: 0x00007fff25d05a40 ***
This seems to be a bug in ifort according to some discussions on the Intel forums, but it should already be resolved in the version I am using (11.1.073, Linux). This is a massively downscaled version of my code! Unfortunately I cannot use static arrays as a workaround.
If I put the print into the loop, I get other errors:
*** glibc detected *** ./a.out: double free or corruption (out): 0x00002b22a0c016f0 ***
I didn't get the errors you're getting, but you have an issue with privatizing array in your OpenMP call.
[mjswartz#666-lgn testfiles]$ vi array.f90
[mjswartz#666-lgn testfiles]$ ifort -o array array.f90 -openmp
[mjswartz#666-lgn testfiles]$ ./array
0 0 0 0 0 0
0 0 0 0
[mjswartz#666-lgn testfiles]$ vi array.f90
[mjswartz#666-lgn testfiles]$ ifort -o array array.f90 -openmp
[mjswartz#666-lgn testfiles]$ ./array
1 2 3 4 5 6
7 8 9 10
First run is with private array, second is without.
program main
implicit none
integer l
integer, allocatable, dimension(:) :: array
allocate(array(10))
!$omp parallel do
do l = 1, 10
array(l) = l
enddo
print*, array
deallocate(array)
end program main
I just ran your code with ifort and openmp and it spewed 0d0's. I had to manually quit the execution. What is your expected output? I'm not a big fan of unnecessarily dynamically allocating arrays. You know what you're going to allocate your matrices as, so just make parameters and statically do it. I'll mess with some stuff and edit this response in a few.
Ok, so here's my edits:
program main
implicit none
integer :: l, j
integer, parameter :: lmax = 15e3
integer, parameter :: jmax = 25
integer, parameter :: nk = 300
complex*16, dimension(9*nk) :: x0, xin, xout
complex*16, dimension(lmax) :: e_pump, e_probe
complex*16 :: e_pumphlp, e_probehlp
character*25 :: problemtype
real*8 :: m
! OpenMP variables
integer :: myid, nthreads, omp_get_num_threads, omp_get_thread_num
x0 = 0.0d0
problemtype = 'type1'
if (problemtype .ne. 'type1') then
write(*,*) 'Problem type not specified. Quitting'
stop
else
! Spawn a parallel region explicitly scoping all variables
!$omp parallel
myid = omp_get_thread_num()
if (myid .eq. 0) then
nthreads = omp_get_num_threads()
write(*,*) 'Starting program with', nthreads, 'threads'
endif
!$omp do private(j,l,m,e_pumphlp,e_probehlp,e_pump,e_probe)
do j = 1, jmax - 1
do l = 1, lmax
call electricfield(0.0d0, 0.0d0, e_pumphlp, &
e_probehlp, 0.0d0)
! print *, e_pumphlp, e_probehlp
e_pump(l) = e_pumphlp
e_probe(l) = e_probehlp
print *, e_pump(l), e_probe(l)
end do
end do
!$omp end parallel
end if
end program main
Notice I removed your use of a module since it was unnecessary. You have an external module containing a subroutine, so just make it an external subroutine. Also, I changed your matrices to be statically allocated. Case statements are a fancy and expensive version of if statements. You were casing 15e3*25 times rather than once (expensive), so I moved those outside. I changed the OpenMP calls, but only semantically. I gave you some output so that you know what OpenMP is actually doing.
Here is the new subroutine:
subroutine electricfield(t, tdelay, e_pump, e_probe, phase)
implicit none
real*8, intent(in) :: t, tdelay
complex*16, intent(out) :: e_pump, e_probe
real*8, optional, intent (in) :: phase
e_pump = 0.0d0
e_probe = 0.0d0
return
end subroutine electricfield
I just removed the module shell around it and changed some of your variable names. Fortran is not case sensitive, so don't torture yourself by doing caps and having to repeat it throughout.
I compiled this with
ifort -o diffeq diffeq.f90 electricfield.f90 -openmp
and ran with
./diffeq > output
to catch the program vomiting 0's and to see how many threads I was using:
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
Starting program with 32 threads
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
(0.000000000000000E+000,0.000000000000000E+000)
Hope this helps!
It would appear that you are running into a compiler bug associated with the implementation of OpenMP 3.0.
If you can't update your compiler, then you will need to change your approach. There are a few options - for example, you could make the allocatable arrays shared, increase their rank by one, and have one thread allocate them such that the extent of the additional dimension is the number of workers in the team. All subsequent references to those arrays then need to have the subscript for that additional dimension be the OpenMP thread number (+ 1, depending on what you've used for the lower bound), as sketched below.
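For the downscaled example above, a minimal sketch of that shared, rank-plus-one approach might look like this (nthreads_used and tid are illustrative names, not from the original post):

program main
  use omp_lib
  implicit none
  integer :: l, tid, nthreads_used
  integer, allocatable, dimension(:,:) :: array   ! last dimension: one slice per thread

  !$omp parallel private(tid)
  !$omp single
  nthreads_used = omp_get_num_threads()
  allocate(array(10, nthreads_used))   ! one thread allocates the shared array
  array = 0
  !$omp end single                     ! implicit barrier: allocation finishes before use
  tid = omp_get_thread_num() + 1       ! + 1 because the extra dimension starts at 1
  !$omp do
  do l = 1, 10
    array(l, tid) = l                  ! every reference carries the thread subscript
  end do
  !$omp end do
  !$omp end parallel

  print *, array(:, 1)                 ! slice written by thread 0 (its iterations are nonzero, the rest stay 0)
  deallocate(array)
end program main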
Explicit allocation of the private allocatable arrays inside the parallel construct (only) may also be an option.
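And a sketch of that second option (again only an illustration): leave the array unallocated outside, make it private, and allocate and deallocate it inside the parallel construct only.

program main
  use omp_lib
  implicit none
  integer :: l
  integer, allocatable, dimension(:) :: array   ! never allocated outside the parallel region

  !$omp parallel private(array)
  allocate(array(10))        ! each thread allocates its own private copy
  array = 0
  !$omp do
  do l = 1, 10
    array(l) = l
  end do
  !$omp end do
  ! use array here, per thread, e.g. write(*,*) omp_get_thread_num(), array
  deallocate(array)          ! freed before leaving the region
  !$omp end parallel
end program main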