Why does my OpenMP code with reduction give wrong results? - fortran

My Fortran code is as follows:
! ...................... initialization
do ia=1,NLEV
do ic=1,NLEV
ZGamma(ia,ic) =zero
enddo
enddo
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(H,ZRO) REDUCTION(+: ZGamma)
!$OMP DO SCHEDULE(DYNAMIC)
do iabcd=1,H%iabcd_max
ia = H%ka(iabcd)
ib = H%kb(iabcd)
ic = H%kc(iabcd)
id = H%kd(iabcd)
ZGamma(ia,ic)=ZGamma(ia,ic) + H%ME2BM(iabcd)*ZRO(id,ib)
ZGamma(ib,ic)=ZGamma(ib,ic) - H%ME2BM(iabcd)*ZRO(id,ia)
ZGamma(ia,id)=ZGamma(ia,id) - H%ME2BM(iabcd)*ZRO(ic,ib)
ZGamma(ib,id)=ZGamma(ib,id) + H%ME2BM(iabcd)*ZRO(ic,ia)
if(ia+ib.eq.ic+id) cycle
ZGamma(ic,ia)=ZGamma(ic,ia) + H%ME2BM(iabcd)*ZRO(ib,id)
ZGamma(id,ia)=ZGamma(id,ia) - H%ME2BM(iabcd)*ZRO(ib,ic)
ZGamma(ic,ib)=ZGamma(ic,ib) - H%ME2BM(iabcd)*ZRO(ia,id)
ZGamma(id,ib)=ZGamma(id,ib) + H%ME2BM(iabcd)*ZRO(ia,ic)
enddo ! iabcd
!$OMP END DO
!$OMP END PARALLEL
In the above code, I calculate the 2D array ZGamma(i,j) using OpenMP. The code compiles without any problem, but the results are wrong. Could anyone tell me what the problem is in the code? What changes should I make?
By the way, as the index iabcd runs from 1 to H%iabcd_max, the values of (ia,ib,ic,id) can be (1,1,1,1), (1,1,1,2), (1,1,1,...), (1,1,2,1), (1,1,2,...), etc.
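For reference, one common workaround when an array REDUCTION misbehaves on a given compiler (a sketch, not from the original thread; ZG_loc is a hypothetical thread-private work array assumed to be declared with the same shape and kind as ZGamma) is to accumulate into a private copy and merge the partial sums under a CRITICAL section:

```fortran
ZGamma = zero
!$OMP PARALLEL DEFAULT(NONE) SHARED(H, ZRO, ZGamma) &
!$OMP PRIVATE(iabcd, ia, ib, ic, id, ZG_loc)
ZG_loc = zero                 ! per-thread partial sum
!$OMP DO SCHEDULE(DYNAMIC)
do iabcd = 1, H%iabcd_max
  ia = H%ka(iabcd)
  ib = H%kb(iabcd)
  ic = H%kc(iabcd)
  id = H%kd(iabcd)
  ZG_loc(ia,ic) = ZG_loc(ia,ic) + H%ME2BM(iabcd)*ZRO(id,ib)
  ! ... the remaining updates from the question, applied to ZG_loc ...
enddo
!$OMP END DO
!$OMP CRITICAL                ! merge the partial sums one thread at a time
ZGamma = ZGamma + ZG_loc
!$OMP END CRITICAL
!$OMP END PARALLEL
```

This avoids relying on the compiler's array-reduction support at the cost of one CRITICAL merge per thread.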

Related

OpenACC | Fortran 90: What is the best way to parallelize nested DO loop?

I am trying to parallelize the following nested DO loop structure (the first code below) using the 'collapse' clause in OpenACC. The variable 'nbl' from the outermost loop appears in the bounds of the inner DO loops, so there is a dependency. Fortunately, the compiler reports an error in advance, so I had to compromise and apply 'collapse' only to the four innermost loops. Is there a way to parallelize this loop for maximum performance by also exploiting the parallelism of "nbl = 1,nblocks"?
Compiler: pgfortran
Flags: -acc -fast -ta=tesla:managed -Minfo=accel
Code that's giving error due to data dependency between outer most DO loop and other inner DO loops:
!$acc parallel loop collapse(5)
DO nbl = 1,nblocks
  DO n_prim = 1,nprims
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)
          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
!$acc end parallel loop
Compromised working code with less parallelism:
DO nbl = 1,nblocks
  !$acc parallel loop collapse(4)
  DO n_prim = 1,nprims
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)
          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
  !$acc end parallel loop
ENDDO
Thanks!
The dependency is with the array look-ups for the upper bounds of the loops. In order to collapse loops, the iteration count of the loop must be known before entering, but here the count is variable.
Try something like the following and split the parallelism into two levels:
!$acc parallel loop collapse(2)
DO nbl = 1,nblocks
  DO n_prim = 1,nprims
    !$acc loop collapse(3)
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)
          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
!$acc end parallel loop

fortran arrays same shape and size parallel in OpenMP

I would like to ask whether OpenMP can parallelize operations on Fortran arrays of the same shape and size written in simple array notation. I did some research but could not find out whether this is possible.
By simple notation I mean the following form:
a = b + c * 1.1
Find below a full example:
PROGRAM Parallel_Hello_World
USE OMP_LIB
implicit none
integer, parameter :: ILEN = 1000
integer :: a(ILEN,ILEN), b(ILEN,ILEN), c(ILEN,ILEN), d(ILEN,ILEN)
integer :: i, j
a = 1
b = 2
!$OMP PARALLEL SHARED(a, b, c, d)
!$OMP DO
DO i=1,ILEN
DO j=1, ILEN
c(j,i) = a(j,i) + b(j,i) * 1.1
ENDDO
END DO
!$OMP END DO
! is this loop parallel?
d = a + b * 1.1
!$OMP END PARALLEL
write (*,*) "Total C: ", c(1:5, 1)
write (*,*) "Total D: ", d(1:5, 1)
write (*,*) "C same D? ", all(c == d)
END
Is the assignment to d parallelized by OpenMP with the current notation?
As commented by @Gilles, the answer is to wrap it in a WORKSHARE construct:
!$OMP WORKSHARE
d = a + b * 1.1
!$OMP END WORKSHARE
Find more info here

parallel do mistake in fortran

program main
use omp_lib
implicit none
integer :: n=8
integer :: i, j, myid, a(8, 8), b, c(8)
! Generate a 8*8 array A
!$omp parallel default(none), private(i, myid), &
!$omp shared(a, n)
myid = omp_get_thread_num()+1
do i = 1, n
a(i, myid) = i*myid
end do
!$omp end parallel
! Array A
print*, 'Array A is'
do i = 1, n
print*, a(:, i)
end do
! Sum of array A
b = 0
!$omp parallel reduction(+:b), shared(a, n), private(i, myid)
myid = omp_get_thread_num()+1
do i = 1, n
b = b + a(i, myid)
end do
!$omp end parallel
print*, 'Sum of array A by reduction is ', b
b = 0
c = 0
!$omp parallel do
do i = 1, n
do j = 1, n
c(i) = c(i) + a(j, i)
end do
end do
!$omp end parallel do
print*, 'Sum of array A by using parallel do is', sum(c)
!$omp parallel do
do i = 1, n
do j = 1, n
b = b + a(j, i)
end do
end do
!$omp end parallel do
print*, 'Sum of array A by using parallel do in another way is', b
end program main
I wrote the Fortran code above to sum all elements of an 8×8 array with OpenMP in three different ways. The first uses a reduction and works. In the second, I created a one-dimensional array with 8 elements, summed each column in parallel, and then summed those partial sums; this works as well. In the third, I used a single integer to accumulate every element inside a parallel do region. This result is not correct and varies from run to run. I don't understand why this happens. Is it because I didn't specify shared and private, or is the variable b being overwritten during the procedure?
There is a race condition on b in your third scenario: several threads read and write the same variable without proper synchronization or privatization.
Note that there is no race condition in the second scenario: each thread updates data (i.e. c(i)) that no other thread accesses.
Finally, some solutions for your last scenario:
Add a reduction(+:b) clause to the directive
Add an !$omp atomic directive before the b = b + a(j, i) statement
Implement a manual privatization
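The first suggestion, applied to the loop from the question, could look like this (a sketch; a and n as declared in the program):

```fortran
b = 0
!$omp parallel do reduction(+:b) private(j)
do i = 1, n
  do j = 1, n
    b = b + a(j, i)   ! each thread accumulates into its own private copy of b
  end do
end do
!$omp end parallel do
! the per-thread copies of b are summed into the shared b when the loop ends
```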

Don't always get expected results

The following is the code of a simple tidal-transport model to which I have added OpenMP directives to parallelize the computation:
!$OMP PARALLEL SHARED(w, u, v, nthreads, chunk) PRIVATE(i, j, tid)
do it = 1, itlast
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
do j = 2, nyw-1
do i = 2, nxw-1
w(i,j) = w(i,j) - rx*depth*(u(i,j) - u(i-1,j)) &
- ry*depth*(v(i,j) - v(i,j-1))
end do
end do
!$OMP END DO
!$OMP SINGLE
call boudary_condition(it)
!$OMP END SINGLE
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
do j = 1, nyw
jv1 = j
if (jv1 .ge. nyw-1) jv1 = nyw-1
do i = 1, nxw-1
u(i,j) = u(i,j) - rxg*(w(i+1,j) - w(i,j)) &
- constant*u(i,j)*sqrt((u(i,j)**2.) + (v(i,jv1)**2.))
end do
end do
!$OMP END DO
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
do j = 1, nyw-1
do i = 1, nxw
iu1 = i
if (iu1 .ge. nxw-1) iu1 = nxw-1
v(i,j) = v(i,j) - ryg*(w(i,j+1) - w(i,j)) &
- constant*v(i,j)*sqrt((u(iu1,j)**2.) + (v(i,j)**2.))
end do
end do
!$OMP END DO
call transport_equation(chunk)
!$OMP MASTER
mtprint = it/ntserprint
if (ntserprint*mtprint .ne. it) goto 20
call timeseries(it)
20 continue
!$OMP END MASTER
end do
!$OMP END PARALLEL
The problem is that I don't always get the expected results. Using the same input file, I should always get the same results, but sometimes the run produces NaN in the output file. I don't quite understand why this happens. I'm using Intel Visual Fortran Composer XE 2013 on Windows 10 to compile and run the executable.
You need at least to make it, jv1, and iu1 private.
Try fixing these first, and let us know.
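A sketch of how the opening of the parallel region could look with those variables privatized (the loop bodies stay exactly as in the question):

```fortran
!$OMP PARALLEL SHARED(w, u, v, nthreads, chunk) &
!$OMP PRIVATE(i, j, tid, it, jv1, iu1)
do it = 1, itlast
  ! ... the three worksharing loops and the SINGLE/MASTER blocks ...
end do
!$OMP END PARALLEL
```

Without this, jv1 and iu1 are shared and are written concurrently by every thread inside the worksharing loops, which is a data race.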

Nested Loop Optimization in OpenMP

I can't get the correct output once OpenMP is applied. Is there any way to get this right?
!$OMP PARALLEL DO SHARED(outmtresult,inpa,inpb,dynindexlist) PRIVATE(i,j) REDUCTION(+:outcountb)
do i=1,size1
do j=1, size1
outcountb = outcountb + 1
outmtresult(j) = tan(inpa(j) + inpb(j)) + alpha1 + dynindexlist(i)
enddo
enddo
!$OMP END PARALLEL DO
Just swap your loops and everything will be fine:
!$OMP PARALLEL DO SHARED(outmtresult,inpa,inpb,dynindexlist) PRIVATE(i,j) REDUCTION(+:outcountb)
do j=1,size1 ! <-- Swap i and
do i=1, size1 ! j here
outcountb = outcountb + 1
outmtresult(j) = tan(inpa(j) + inpb(j)) + alpha1 + dynindexlist(i)
enddo
enddo
!$OMP END PARALLEL DO
In your example, multiple threads write to the same memory location outmtresult(j), because you parallelize the do i loop while the output is indexed by j.
By swapping the loops, you parallelize over do j, so no two concurrent threads ever write to the same destination.