I have the following serial part of a Fortran code, which must be executed sequentially.
c$acc serial
nppn1=0
c
do 1200 ippas=1,50
c
nppn0=nppn1+1
nppn1=lppas(ippas)
c
c -----did we complete the passes ?
c
if(nppn1.eq.0) goto 1201
c
c -----do we have any ?
c
if(nppn0.gt.nppn1) goto 1199
c
c -----loop over the receiving points
c
c$acc loop seq
do 1400 ippne=nppn0,nppn1
c
c -----points
c
ipoin=bppni(1,ippne)
jpoin=bppni(2,ippne)
c
c -----variables 1-nunkp
c
c$acc loop seq
do 1410 iva=1,nunkp
unkno(iva,ipoin)=unkno(iva,jpoin)+bppnr(iva,ippne)
1410 continue
c
c ----end of loop over the receiving points
c
1400 continue
c
c
c ----end of loop over the passes
c
1199 continue
1200 continue
1201 continue
c$acc end serial
The results obtained using the PGI compiler with the OpenACC directives deactivated look like this:
ipoin jpoin unkno(iva,ipoin) unkno(iva,jpoin)
before loop 160215 160165 100.3518075025082 100.3517910648527
after loop 160215 160165 100.3517910648527 100.3517910648527
before loop 160165 157415 100.3517910648527 100.3517910648527
after loop 160165 157415 100.3517910648527 100.3517910648527
which is the expected behavior. However, when the OpenACC directives are activated, the values are not updated:
ipoin jpoin unkno(iva,ipoin) unkno(iva,jpoin)
before loop 160215 160165 100.3518075025082 100.3517910648527
after loop 160215 160165 100.3518075025082 100.3517910648527
When compiling, the PGI compiler says the following:
2552, Accelerator serial kernel generated
Generating Tesla code
2556, !$acc do seq
2588, !$acc do seq
2603, !$acc do seq
2552, Generating implicit copyin(bppnr(:nunkp,:),bppni(:2,:),lppas(:))
Generating implicit copy(unkno(:nunkp,:))
So I don't know what is happening here or how to solve this issue. Any ideas?
I'm trying to use a worker-private array with OpenACC, but I keep getting wrong results. I guess there is some kind of race condition going on, but I can't find where.
I'm using the PGI compiler (18.10, OpenPower) and compile with :
pgf90 -gopt -O3 -Minfo=all -Mcuda=ptxinfo -acc -ta=tesla:cc35 main.F90
Here is a minimal example of what I'm trying to achieve:
#define lx 7000
#define ly 500
program test
implicit none
integer :: tmp(ly,1), a(lx,ly), b(lx,ly)
integer :: x,y,i
do x=1,lx
do y=1,ly
a(x,y) = x+y
end do
end do
!$acc parallel num_gangs(1)
!$acc loop worker private(tmp)
do x=1,lx
!$acc loop vector
do y=1,ly
tmp(y,1) = -a(x,y)
end do
!$acc loop vector
do y=1,ly
b(x,y) = -tmp(y,1)
end do
end do
!$acc end parallel
print *, "check"
do x=1,lx
do y=1,ly
if(b(x,y) /= x+y) print *, x, y, b(x,y), x+y
end do
end do
print*, "end"
end program
What I expected was to get b == a, but that's not the case.
Please note that I defined tmp(ly,1) because I get the expected result when I define tmp(ly) as a 1D array. Even though it works with a 1D array, I'm not sure it fully respects the OpenACC standard.
Am I missing something here?
EDIT: The last loop checks whether a==b and prints the values that are wrong. The expected output (which I get with OpenACC disabled) is:
check
end
What I get with OpenACC enabled is something like this (it changes between runs):
check
1 1 5 2
1 2 6 3
1 3 7 4
[...]
end
Looks like a compiler issue where "tmp" is being shared by the workers instead of each worker getting a private copy. This in turn causes a race condition in your code.
I've filed a problem report with PGI (TPR#27025) and sent it to our engineers for further investigation.
The workaround is to use "gang" instead of "worker" on the outer loop (see the sketch below) or, as you note, to make "tmp" a one-dimensional array.
Update: TPR #27025 was fixed in the PGI 19.7 release.
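For reference, a minimal sketch of the gang workaround, using the same program as above: num_gangs(1) is dropped so the outer loop can be spread across gangs, and private(tmp) now gives each gang its own copy.
!$acc parallel
!$acc loop gang private(tmp)
do x=1,lx
!$acc loop vector
do y=1,ly
tmp(y,1) = -a(x,y) ! each gang fills only its own copy of tmp
end do
!$acc loop vector
do y=1,ly
b(x,y) = -tmp(y,1) ! read back within the same gang
end do
end do
!$acc end parallel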
These two acc loops
!$acc loop vector
do y=1,ly
tmp(y,1) = -a(x,y)
end do
!$acc loop vector
do y=1,ly
b(x,y) = -tmp(y,1)
end do
will be executed on the GPU at the same time; that is, they are executed in parallel. To ensure tmp is assigned the correct values in the first loop before it is used in the second, the loops have to be in different acc parallel constructs.
The corrected code would look like this:
do x=1,lx
!$acc parallel loop
do y=1,ly
tmp(y,1) = -a(x,y)
end do
!$acc parallel loop
do y=1,ly
b(x,y) = -tmp(y,1)
end do
end do
I would like to vectorize the code below (just as an example); assume that, for some reason, I have to index an array with another array.
PROGRAM TEST
IMPLICIT NONE
REAL, DIMENSION(2000):: A,B,C !100000
INTEGER, DIMENSION(2000):: E
REAL(KIND=8):: TIME1,TIME2
INTEGER::I
DO I=1, 2000 !Actually only this loop could be vectorized
B(I)=100.00 !by the compiler
C(I)=200.00
E(I)=I
END DO
!Computing computer's running time (start)
CALL CPU_TIME (TIME1)
DO I=1, 2000 !This is the problem, somehow I should put
A(E(I))=B(E(I))*C(E(I)) !an integer array E(I) inside an array
END DO !I would like to vectorize this loop also, but it didn't work
PRINT *, 'Results =', A(2000)
PRINT *, ' '
!Computing computer's running time (finish)
CALL CPU_TIME (TIME2)
PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'
END PROGRAM TEST
At first I thought that the compiler would understand what I want and somehow vectorize it like this:
DO I=1, 2000, 4 !Unrolled 4 times
A(E(I))=B(E(I))*C(E(I))
A(E(I+1))=B(E(I+1))*C(E(I+1))
A(E(I+2))=B(E(I+2))*C(E(I+2))
A(E(I+3))=B(E(I+3))*C(E(I+3))
END DO
but I was wrong. I used gfortran -Ofast -fopt-info-optimized Tes.F95 and got the information that only the first loop was successfully vectorized.
Do you have any idea how I could vectorize it? Or can't it be vectorized at all?
If E has equal values for different I, then you would be manipulating the same elements of A multiple times, in which case the order could matter. (Though not in your case.) Also, if you have multiple index arrays, like E1, E2 and E3, and
DO I=1, 2000
A(E3(I))=B(E1(I))*C(E2(I))
END DO
the order could matter too. So I think this kind of indexing is not in general allowed in parallel loops.
With ifort one can use !DIR$ IVDEP, which means "ignore vector dependence". It only works when E(I) is linear, as in the example...
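Applied to the loop from the question, a minimal sketch looks like this:
!Directive asserts there is no vector dependence through E
!DIR$ IVDEP
DO I=1, 2000
A(E(I))=B(E(I))*C(E(I))
END DO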
Assuming one wants to process all the indices anyway, just replace (E(I)) with (I), as sketched below, and work out the order implied by E(I) later...
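That is, the directly indexed variant, which is equivalent here because E(I)=I and which the compiler already vectorizes without hints:
DO I=1, 2000
A(I)=B(I)*C(I)
END DO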
I am working on a piece of legacy F77 code and trying to convert it to equivalent F90 code. I ran into the lines below; could someone advise whether my conversion is correct?
Fortran 77 code:
Subroutine area(x,y,z,d)
do 15 j=1,10
if (a.gt.b) go to 20
15 CONTINUE
20 Statement 1
Statement 2
Statement 3
end subroutine
I tried to convert it to F90 and came up as below:
Subroutine area(x,y,z,d)
dloop: do j=1,10
if (a>b) then
statement 1
statement 2
statement 3
else
write(*,*) 'Exiting dloop'
exit dloop
end if
end do dloop
end subroutine
Could anyone advise whether this approach is correct? I am not getting the results I expect, so there could be a problem with my logic.
You got the translation slightly wrong... The first step is to rebuild the do loop, which terminates at label 15:
Subroutine area(x,y,z,d)
do j=1,10
if (a.gt.b) go to 20
enddo
20 Statement 1
Statement 2
Statement 3
end subroutine
Now you can see that the goto results in "jumping out of the loop". In this particular example, this is equivalent to an exit, and the code can be written as:
Subroutine area(x,y,z,d)
do j=1,10
if (a.gt.b) exit
enddo
Statement 1
Statement 2
Statement 3
end subroutine
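For concreteness, here is a runnable sketch of the converted pattern, with the placeholder statements replaced by hypothetical ones (a, b and the loop body are invented for illustration):
Subroutine area(x,y,z,d)
implicit none
real, intent(in) :: x,y,z,d
real :: a,b
integer :: j
a = x
b = y
do j=1,10
if (a > b) exit !replaces "if (a.gt.b) go to 20"
a = a + z !hypothetical loop body
end do
print *, 'statements after the loop run either way; j =', j
end subroutine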
The number of files that get written is always less than the number of threads. Since I can have 4 threads and the CPU is working at 400%, I expected the number of files to be 4 (one corresponding to each thread). I don't know if there is a problem with my code or if this is how it is supposed to work. The code is as follows:
!!!!!!!! module
module common
use iso_fortran_env
implicit none
integer,parameter:: dp=real64
real(dp):: aa,bb
contains
subroutine evolve(y,yevl)
implicit none
integer(dp),parameter:: id=2
real(dp),intent(in):: y(id)
real(dp),intent(out):: yevl(id)
yevl(1)=y(2)+1.d0-aa*y(1)**2
yevl(2)=bb*y(1)
end subroutine evolve
end module common
use common
implicit none
integer(dp):: iii,iter,i
integer(dp),parameter:: id=2
real(dp),allocatable:: y(:),yt(:)
integer(dp):: OMP_GET_THREAD_NUM, IXD
allocate(y(id)); allocate(yt(id)); y=0.d0; yt=0.d0; bb=0.3d0
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
!$OMP DO
do iii=1,20000; print*,iii !! EXPECTED THREADS TO BE OF 5000 ITERATIONS EACH
aa=1.d0+dfloat(iii-1)*0.4d0/80000.d0
loop1: do iter=1,10 !! THE INITIAL CONDITION LOOP
call random_number(y)!! RANDOM INITIALIZATION OF THE VARIABLE
loop2: do i=1,70000 !! ITERATION OF THE SYSTEM
call evolve(y,yt)
y=yt
enddo loop2 !! END OF SYSTEM ITERATION
write(IXD+1,*)aa,yt !!! WRITING FILE CORRESPONDING TO EACH THREAD
enddo loop1 !!INITIAL CONDITION ITERATION DONE
enddo
!$OMP ENDDO
!$OMP END PARALLEL
end
Is this behavior the result of some race issue in the code? The code compiles and executes just fine, without any warnings or errors, with ifort version 13.1.0 on Ubuntu. Thanks a bunch for any comments or suggestions.
The variable IXD should be explicitly declared as private so that every thread has its own copy of it. As written, IXD is shared, so all threads end up using whichever value was assigned last and write to the same unit, which is why you see fewer files than threads. Changing the lines
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
to
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt,ixd) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
solves the problem.
I'm writing a matrix multiplication subroutine in Fortran, using the Intel Fortran compiler. I've written a simple, statically scheduled parallel do loop. Unfortunately, it runs on only one thread. Here's the code:
SUBROUTINE MATMULT(A,B,C,L,M,N)
REAL*8 A,B,C
INTEGER NCORES, CHUNK, TID
DIMENSION A(L,N),B(L,M),C(M,N)
PARAMETER (NCORES=8)
CHUNK=(L/(NCORES+1))+1
TID=0
!$OMP PARALLELDO SHARED(A,B,C,L,M,N,CHUNK) PRIVATE(I,J,K,TID)
!$OMP+DEFAULT(NONE) SCHEDULE(STATIC,CHUNK)
DO I=1,L
TID = OMP_GET_THREAD_NUM()
PRINT *, "THREAD ", TID, " ON I=", I
DO K=1,N
DO J=1,M
A(I,K) = A(I,K) + B(I,J)*C(J,K)
END DO
END DO
END DO
!$OMP END PARALLELDO
RETURN
END
Note:
There are no parallel directives in the main program that calls the routine
The arrays A,B,C are initialized serially in the main program. A is initialized to zeros
I am enforcing the Fortran fixed source form during compilation
I have confirmed the following:
Another example program works fine with 8 threads (so no hardware issue)
I have used the -openmp compiler argument
OMP_GET_NUM_PROCS() and OMP_GET_MAX_THREADS() both return 0
TID is 0 for every iteration over I (which shouldn't be the case)
I am unable to diagnose my mistake. I'd appreciate any input on this.
The identifier OMP_GET_THREAD_NUM is not explicitly declared. The default implicit typing rules mean it will be of type real. That's not consistent with the declaration in the OpenMP spec for the function of that name.
Adding USE OMP_LIB would fix that issue. Further, not using implicit typing (IMPLICIT NONE) would avoid this and a multitude of similar problems.
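A minimal sketch applying both suggestions to the routine from the question; the loop body is unchanged:
SUBROUTINE MATMULT(A,B,C,L,M,N)
USE OMP_LIB !declares OMP_GET_THREAD_NUM with the correct INTEGER type
IMPLICIT NONE !forces every remaining name to be declared
INTEGER L,M,N,I,J,K,TID,NCORES,CHUNK
PARAMETER (NCORES=8)
REAL*8 A(L,N),B(L,M),C(M,N)
CHUNK=(L/(NCORES+1))+1
!$OMP PARALLEL DO SHARED(A,B,C,L,M,N,CHUNK) PRIVATE(I,J,K,TID)
!$OMP+DEFAULT(NONE) SCHEDULE(STATIC,CHUNK)
DO I=1,L
TID = OMP_GET_THREAD_NUM()
PRINT *, "THREAD ", TID, " ON I=", I
DO K=1,N
DO J=1,M
A(I,K) = A(I,K) + B(I,J)*C(J,K)
END DO
END DO
END DO
!$OMP END PARALLEL DO
RETURN
END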