Vectorization vs SIMD vs Workshare in Fortran

In your experience, which of the following code snippets is the fastest?
Vectorization
x(2:ndof - 1) = x(2:ndof - 1) - dt*coef*( &
x(ndof + 1:ndof + ndof - 2) + x(ndof + 3:ndof + ndof))
Vectorization + workshare
!$omp parallel workshare
x(2:ndof - 1) = x(2:ndof - 1) - dt*coef*( &
x(ndof + 1:ndof + ndof - 2) + x(ndof + 3:ndof + ndof))
!$omp end parallel workshare
Explicit loop + OpenMP SIMD (single instruction, multiple data)
!$omp parallel do simd schedule (runtime)
do idx = 2, ndof - 1, 1
x(idx) = x(idx) - (dt*coef*( &
x(ndof+idx - 1) + x(ndof + idx+1)))
end do
!$omp end parallel do simd
I do not include a plain explicit loop, as that is the version whose performance I want to improve.
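Whichever variant you pick, the only reliable answer is to time each one on your own machine and compiler. A minimal, self-contained timing sketch using omp_get_wtime() (ndof, dt, and coef are placeholder values, and the SIMD variant is used as the example kernel):
program time_variant
use omp_lib, only : omp_get_wtime
implicit none
integer, parameter :: ndof = 1000000
real(8), allocatable :: x(:)
real(8) :: dt, coef, t0, t1
integer :: idx
allocate(x(2*ndof))
dt = 1.0d-3
coef = 1.0d0
call random_number(x)
t0 = omp_get_wtime()
! kernel being timed (here: explicit loop + OpenMP SIMD)
!$omp parallel do simd schedule(runtime)
do idx = 2, ndof - 1
x(idx) = x(idx) - dt*coef*(x(ndof+idx-1) + x(ndof+idx+1))
end do
!$omp end parallel do simd
t1 = omp_get_wtime()
print *, 'elapsed seconds:', t1 - t0
end program time_variant
In practice, repeat the kernel many times and take the best or median timing, since a single sweep over the array is dominated by memory traffic and timer resolution.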

Related

Why does my OpenMP code with reduction give wrong results?

My Fortran code is as follows:
! ...................... initialization
do ia=1,NLEV
do ic=1,NLEV
ZGamma(ia,ic) =zero
enddo
enddo
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(H,ZRO) REDUCTION(+: ZGamma)
!$OMP DO SCHEDULE(DYNAMIC)
do iabcd=1,H%iabcd_max
ia = H%ka(iabcd)
ib = H%kb(iabcd)
ic = H%kc(iabcd)
id = H%kd(iabcd)
ZGamma(ia,ic)=ZGamma(ia,ic) + H%ME2BM(iabcd)*ZRO(id,ib)
ZGamma(ib,ic)=ZGamma(ib,ic) - H%ME2BM(iabcd)*ZRO(id,ia)
ZGamma(ia,id)=ZGamma(ia,id) - H%ME2BM(iabcd)*ZRO(ic,ib)
ZGamma(ib,id)=ZGamma(ib,id) + H%ME2BM(iabcd)*ZRO(ic,ia)
if(ia+ib.eq.ic+id) cycle
ZGamma(ic,ia)=ZGamma(ic,ia) + H%ME2BM(iabcd)*ZRO(ib,id)
ZGamma(id,ia)=ZGamma(id,ia) - H%ME2BM(iabcd)*ZRO(ib,ic)
ZGamma(ic,ib)=ZGamma(ic,ib) - H%ME2BM(iabcd)*ZRO(ia,id)
ZGamma(id,ib)=ZGamma(id,ib) + H%ME2BM(iabcd)*ZRO(ia,ic)
enddo ! iabcd
!$OMP END DO
!$OMP END PARALLEL
In the above code, I calculate the 2D array ZGamma(i,j) using OpenMP. Even though the code compiles without any problem, the results are wrong. Could anyone tell me what the problem is in the code? What changes should I make?
By the way, as the index iabcd runs from 1 to H%iabcd_max, the values of (ia,ib,ic,id) can be (1,1,1,1), (1,1,1,2), (1,1,1,...), (1,1,2,1), (1,1,2,...), etc.
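For reference, a common workaround when an array reduction misbehaves is manual privatization: each thread accumulates into its own copy, and the copies are combined under a critical section. This is only a sketch, not a verified fix for this exact code; it assumes ZGamma is a plain NLEV x NLEV array and introduces a hypothetical thread-local array ZGamma_local of the same shape and kind, which would need to be declared:
ZGamma = zero
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(H,ZRO,ZGamma)
ZGamma_local = zero  ! thread-local copy, declared with the same shape as ZGamma
!$OMP DO SCHEDULE(DYNAMIC)
do iabcd=1,H%iabcd_max
ia = H%ka(iabcd)
ib = H%kb(iabcd)
ic = H%kc(iabcd)
id = H%kd(iabcd)
ZGamma_local(ia,ic) = ZGamma_local(ia,ic) + H%ME2BM(iabcd)*ZRO(id,ib)
! ... the remaining seven updates, written against ZGamma_local ...
enddo
!$OMP END DO
!$OMP CRITICAL
ZGamma = ZGamma + ZGamma_local  ! combine per-thread partial results
!$OMP END CRITICAL
!$OMP END PARALLEL
Whether the original REDUCTION(+: ZGamma) itself is legal also depends on how ZGamma is declared and on the level of Fortran array-reduction support in your compiler, which is worth checking first.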

Parallel do mistake in Fortran

program main
use omp_lib
implicit none
integer :: n=8
integer :: i, j, myid, a(8, 8), b, c(8)
! Generate a 8*8 array A
!$omp parallel default(none), private(i, myid), &
!$omp shared(a, n)
myid = omp_get_thread_num()+1
do i = 1, n
a(i, myid) = i*myid
end do
!$omp end parallel
! Array A
print*, 'Array A is'
do i = 1, n
print*, a(:, i)
end do
! Sum of array A
b = 0
!$omp parallel reduction(+:b), shared(a, n), private(i, myid)
myid = omp_get_thread_num()+1
do i = 1, n
b = b + a(i, myid)
end do
!$omp end parallel
print*, 'Sum of array A by reduction is ', b
b = 0
c = 0
!$omp parallel do
do i = 1, n
do j = 1, n
c(i) = c(i) + a(j, i)
end do
end do
!$omp end parallel do
print*, 'Sum of array A by using parallel do is', sum(c)
!$omp parallel do
do i = 1, n
do j = 1, n
b = b + a(j, i)
end do
end do
!$omp end parallel do
print*, 'Sum of array A by using parallel do in another way is', b
end program main
I wrote the piece of Fortran code above to sum up all elements of an 8*8 array with OpenMP in three different ways. The first uses reduction and works. In the second, I created a one-dimensional array with 8 elements, summed each column inside the parallel region, and then summed those partial results; this works as well. In the third, I used a single integer to sum up every element of the array inside a parallel do region. This result is not correct and varies every time. I don't understand why this happens. Is it because I didn't specify shared and private, or is the variable b being overwritten during the procedure?
There is a race condition on b in your third scenario: several threads read and write the same variable without proper synchronization or privatization.
Note that you don't have a race condition in the second scenario: each thread updates data (i.e., c(i)) that no other thread is accessing.
Finally, some solutions for your last scenario:
Add the reduction(+:b) clause to the directive (see the sketch below)
Add an !$omp atomic directive before the b = b + a(j, i) statement
Implement manual privatization (a per-thread partial sum that is combined at the end)
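A minimal sketch of the first option, as a drop-in replacement for the third loop in the program above:
b = 0
!$omp parallel do private(j) reduction(+:b)
do i = 1, n
do j = 1, n
b = b + a(j, i)
end do
end do
!$omp end parallel do
print*, 'Sum of array A by using parallel do in another way is', b
With the reduction clause, each thread accumulates into its own private copy of b, and the private copies are combined into the shared b when the construct ends.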

Don't always get expected results

The following is the code of a simple tidal-transport model that I have included OpenMP to parallelize the computation:
!$OMP PARALLEL SHARED(w, u, v, nthreads, chunk) PRIVATE(i, j, tid)
do it = 1, itlast
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
do j = 2, nyw-1
do i = 2, nxw-1
w(i,j) = w(i,j) - rx*depth*(u(i,j) - u(i-1,j)) &
- ry*depth*(v(i,j) - v(i,j-1))
end do
end do
!$OMP END DO
!$OMP SINGLE
call boudary_condition(it)
!$OMP END SINGLE
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
do j = 1, nyw
jv1 = j
if (jv1 .ge. nyw-1) jv1 = nyw-1
do i = 1, nxw-1
u(i,j) = u(i,j) - rxg*(w(i+1,j) - w(i,j)) &
- constant*u(i,j)*sqrt((u(i,j)**2.) + (v(i,jv1)**2.))
end do
end do
!$OMP END DO
!$OMP DO SCHEDULE(DYNAMIC, CHUNK)
do j = 1, nyw-1
do i = 1, nxw
iu1 = i
if (iu1 .ge. nxw-1) iu1 = nxw-1
v(i,j) = v(i,j) - ryg*(w(i,j+1) - w(i,j)) &
- constant*v(i,j)*sqrt((u(iu1,j)**2.) + (v(i,j)**2.))
end do
end do
!$OMP END DO
call transport_equation(chunk)
!$OMP MASTER
mtprint = it/ntserprint
if (ntserprint*mtprint .ne. it) goto 20
call timeseries(it)
20 continue
!$OMP END MASTER
end do
!$OMP END PARALLEL
The problem is that I don't always get the expected results. Using the same input file, I should always get the same results, but sometimes the code produces NaN in the output file. I don't quite understand why this happens. I'm using Intel Visual Fortran Composer XE 2013 on Windows 10 to compile and run the executable.
At the very least, you need to make it, jv1, and iu1 private.
Try fixing these first, and let us know.
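Concretely, the opening directive would become something like the following (a sketch; any other scratch variables assigned inside the region would need the same treatment):
!$OMP PARALLEL SHARED(w, u, v, nthreads, chunk) PRIVATE(i, j, tid, it, jv1, iu1)
Without this, jv1 and iu1 are shared by default, so one thread can overwrite them between another thread's assignment and use, which can produce the intermittent wrong values.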

Nested Loop Optimization in OpenMP

I can't get the correct output once OpenMP is applied. Is there any way to get this right?
!$OMP PARALLEL DO SHARED(outmtresult,inpa,inpb,dynindexlist) PRIVATE(i,j) REDUCTION(+:outcountb)
do i=1,size1
do j=1, size1
outcountb = outcountb + 1
outmtresult(j) = tan(inpa(j) + inpb(j)) + alpha1 + dynindexlist(i)
enddo
enddo
!$OMP END PARALLEL DO
Just swap your loops and everything will be fine:
!$OMP PARALLEL DO SHARED(outmtresult,inpa,inpb,dynindexlist) PRIVATE(i,j) REDUCTION(+:outcountb)
do j=1,size1 ! <-- Swap i and
do i=1, size1 ! j here
outcountb = outcountb + 1
outmtresult(j) = tan(inpa(j) + inpb(j)) + alpha1 + dynindexlist(i)
enddo
enddo
!$OMP END PARALLEL DO
In your example, multiple threads write to the same memory address outmtresult(j), since you parallelize the do i loop.
By swapping the loops, you parallelize over do j, so multiple concurrent threads no longer write to the same destination.

Trouble parallelizing OpenACC loop

I have an old code written in FORTRAN that I need to accelerate using OpenACC, but when I try using directives, the compiler reports a dependence on un, vn, and pn that prevents parallelism. Is it possible to parallelize this loop? I am new to OpenACC, but I have parallelized code with OpenMP before.
!$acc parallel loop
do 9000 j=2,jmaxm
jm=j-1
jp=j+1
do 9001 i=2,imaxm
im=i-1
ip=i+1
if(rmask(i,j).eq.1.0) then
! Calculate un field
un(i,j,kp)=un(i,j,km)+ tdt*rmask(i,j)*(
+ txsav(i,j)*zn(nmm)/xpsi2(nmm)+ visch*zetun(i,j)
+ -recdx*(pn(ip,j,k)-pn(i,j,k))-a*un(i,j,km)/cn(nmm)**2
+ +0.25* fu(i,j)*(vn(i,j,k)+vn(ip,j,k)+vn(i,jm,k)+
+ vn(ip,jm,k))
+ -damp(i,j)*un(i,j,km)
+ )
c SBnd damper is not used
cc + -(1./timkwd)*dampu(i,j)*un(i,j,km)
! Calculate vn field
vn(i,j,kp)=vn(i,j,km)+ tdt*rmask(i,j)*(
+ tysav(i,j)*zn(nmm)/xpsi2(nmm)+visch*zetvn(i,j)
+ -recdy*(pn(i,jp,k)-pn(i,j,k))-a*vn(i,j,km)/cn(nmm)**2
+ -0.25*fv(i,j)*(un(im,jp,k)+un(i,jp,k)+un(im,j,k)+
+ un(i,j,k))
+ )
c EBnd damper is not used
cc + -(1./timkwd)*dampv(i,j)*vn(i,j,km)
! Calculate pn field
pn(i,j,kp)=pn(i,j,km)+tdt*rmask(i,j)*(
+ cn(nmm)**2*(
+ -recdx*(un(i,j,k)-un(im,j,k))
+ -recdy*(vn(i,j,k)-vn(i,jm,k)))
+ -a*pn(i,j,k)/cn(nmm)**2
+ -dampu(i,j)*cn(nmm)/dx*pn(i,j,km)
+ -dampv(i,j)*cn(nmm)/dx*pn(i,j,km)
+ -damp(i,j)*pn(i,j,km)
+ )
rhon(i,j)=-pn(i,j,kp)/g
wn(i,j)=
+ -recdx*(un(i,j,kp)-un(im,j,kp))
+ -recdy*(vn(i,j,kp)-vn(i,jm,kp))
endif
9001 continue
9000 continue
!$acc end parallel loop
You have a data dependency, which means your algorithm as written is inherently sequential.
A simple example is the difference between Gauss-Seidel and Jacobi iterations, and why people use Jacobi on GPUs and not Gauss-Seidel.
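To illustrate the distinction, a Jacobi-style update reads only the old values and writes only a separate array, so every grid point is independent and the loops parallelize cleanly. A generic sketch (p, pnew, imax, and jmax are placeholders, not variables from the model above):
!$acc parallel loop collapse(2)
do j = 2, jmax - 1
do i = 2, imax - 1
! reads old values of p only, writes pnew only: no loop-carried dependence
pnew(i,j) = 0.25d0*(p(i-1,j) + p(i+1,j) + p(i,j-1) + p(i,j+1))
end do
end do
!$acc end parallel loop
In the posted loop, by contrast, wn(i,j) reads un(im,j,kp) and vn(i,jm,kp), values that earlier iterations of the same sweep may just have written, which looks like the Gauss-Seidel-style pattern described above.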