omp sum reduction not updating variable - fortran

I am having issues using the OpenMP sum reduction in my code.
en_par = 0.0d0
!$omp parallel do reduction(+:en_par) default(private) shared(r,listvar,it,ic)
!--- loop over neighboring cells
do cell_index = 1, 26
   !--- new_cell is an integer: the neighbor of cell ic
   new_cell = listvar%cv(cell_index,ic)%cnum
   !--- loop over atoms in cell new_cell
   do j = 1, listvar%cl(new_cell)%num
      !--- pick particle in the cell list
      !--- particle is an integer
      particle = listvar%cl(new_cell)%cmem(j)
      !--- obtain displacements
      !--- apply minimum image here
      !--- the min variables are doubles
      dx = r(1,particle) - r(1,it) - listvar%cv(cell_index,ic)%min_x
      dy = r(2,particle) - r(2,it) - listvar%cv(cell_index,ic)%min_y
      dz = r(3,particle) - r(3,it) - listvar%cv(cell_index,ic)%min_z
      dxmin = r(1,particle) - r(1,it)
      dymin = r(2,particle) - r(2,it)
      dzmin = r(3,particle) - r(3,it)
      dr2 = dx*dx + dy*dy + dz*dz
      if (dr2 .lt. param%rcut2) then
         dr2i = 1.0d0/dr2
         dr6i = dr2i*dr2i*dr2i
         dr12i = dr6i*dr6i
         en_par = en_par + dr12i - dr6i
      endif
   enddo
   print*, 'enpar inside', en_par
enddo
!$omp end parallel do
Whenever I compile and run this with -fopenmp, en_par is printed to the screen as 0.0. However, when run serially, the value is not zero. Can anyone tell me why?

The reduction variable (here en_par) should not be read inside the parallel do: within the construct each thread accumulates into its own private copy, and the combined result is only written back to en_par once the construct finishes. Print it after !$omp end parallel do instead.
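A minimal, self-contained illustration of the same pattern (a toy sum, not your code; compile with -fopenmp):

program reduction_demo
   implicit none
   integer :: i
   double precision :: en_par
   en_par = 0.0d0
!$omp parallel do reduction(+:en_par)
   do i = 1, 1000
      ! each thread accumulates into its own private copy of en_par;
      ! printing en_par here would only show that thread's partial sum
      en_par = en_par + 1.0d0/dble(i)
   end do
!$omp end parallel do
   ! the reduced total is only available after the construct has ended
   print*, 'en_par after the parallel do:', en_par
end program reduction_demo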

Related

OpenACC | Fortran 90: What is the best way to parallelize nested DO loop?

I am trying to parallelize the following nested DO loop structure (the first code below) using the 'collapse' clause in OpenACC. The variable 'nbl' from the outermost loop appears in the bounds of the inner DO loops, so there is a dependency; helpfully, the compiler reports an error in advance. So I had to compromise and apply the 'collapse' clause only to the four innermost loops. Is there a way to parallelize this loop nest for maximum performance by exploiting the parallelism of "nbl = 1,nblocks" as well?
Compiler: pgfortran
Flags: -acc -fast -ta=tesla:managed -Minfo=accel
Code that gives an error due to the data dependency between the outermost DO loop and the inner DO loops:
!$acc parallel loop collapse(5)
DO nbl = 1,nblocks
  DO n_prim = 1,nprims
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)
          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
!$acc end parallel loop
Compromise working code with less parallelism:
DO nbl = 1,nblocks
!$acc parallel loop collapse(4)
  DO n_prim = 1,nprims
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)
          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
!$acc end parallel loop
ENDDO
Thanks!
The dependency comes from the array look-ups used for the upper bounds of the inner loops. In order to collapse loops, the total iteration count must be known before entering the loop nest, but here the inner loop counts vary with nbl.
Try something like the following, splitting the parallelism into two levels:
!$acc parallel loop collapse(2)
DO nbl = 1,nblocks
  DO n_prim = 1,nprims
!$acc loop collapse(3)
    DO k = 1, NK(nbl)
      DO j = 1, NJ(nbl)
        DO i = 1, NI(nbl)
          Px(i,j,k,nbl,n_prim) = i*j + Cx(i,j,k,nbl,1)*Cx(i,j,k,nbl,5) + Cx(i,j,k,nbl,2)
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO
!$acc end parallel loop

How to parallelize the nested loop

A small example serial code, which has the same structure as my code, is shown below.
PROGRAM MAIN
  IMPLICIT NONE
  INTEGER :: i, j
  DOUBLE PRECISION :: en, ei, es
  DOUBLE PRECISION :: ki(1000,2000), et(200), kn(2000)
  OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
  DO i = 1, 1000, 1
    DO j = 1, 2000, 1
      ki(i,j) = DBLE(i) + DBLE(j)
    END DO
  END DO
  DO i = 1, 200, 1
    en = 2.0d0/DBLE(200)*(i-1)-1.0d0
    et(i) = en
    es = 0.0d0
    DO j = 1, 1000, 1
      kn = ki(j,:)
      CALL CAL(en,kn,ei)
      es = es + ei
    END DO
    WRITE (UNIT=3, FMT=*) et(i), es
  END DO
  CLOSE(UNIT=3)
  STOP
END PROGRAM MAIN

SUBROUTINE CAL (en,kn,ei)
  IMPLICIT NONE
  INTEGER :: i
  DOUBLE PRECISION :: en, ei, gf, p
  DOUBLE PRECISION :: kn(2000)
  p = 3.14d0
  ei = 0.0d0
  DO i = 1, 2000, 1
    gf = 1.0d0 / (en - kn(i) * p)
    ei = ei + gf
  END DO
  RETURN
END SUBROUTINE CAL
I am running my code on a cluster that has 32 CPUs per node, with 250 GB of memory shared by the 32 CPUs on each node. I can use at most 32 nodes.
Each time the inner loop finishes, one data point is collected; after all iterations of the outer loop are done, 200 data points have been collected in total. If the inner loop is executed by only one CPU, it would take more than 3 days (more than 72 hours).
I want to parallelize both the inner loop and the outer loop. Would anyone please suggest how to parallelize this code?
Can I use MPI for both the inner loop and the outer loop? If so, how do I distinguish the CPUs that execute the different loops (inner and outer)?
On the other hand, I have seen parallelization with a hybrid MPI and OpenMP method mentioned. Can I use MPI for the outer loop and OpenMP for the inner loop? If so, how do I collect one data point after each inner loop finishes, and 200 data points in total after all outer-loop iterations are done? And how do I distinguish the CPUs that execute the inner loop and the outer loop?
Alternatively, could anyone provide any other suggestion for parallelizing the code and improving its efficiency? Thank you very much in advance.
As mentioned in the comments, a good answer would require a more detailed question. However, at first sight it seems that parallelizing the inner loop
DO j = 1, 1000, 1
  kn = ki(j,:)
  CALL CAL(en,kn,ei)
  es = es + ei
END DO
should be enough to solve your problem, or at least be a good starting point. First of all, I suspect there is an error in the loop
DO i = 1, 1000, 1
  DO j = 1, 2000, 1
    ki(j,k) = DBLE(j) + DBLE(k)
  END DO
END DO
since k is set to 0 and there is no element with an index of 0 (see your variable declarations). Also, ki is declared as a ki(1000,2000) array, while ki(j,i) indexes it as if it were (2000,1000). Besides these errors, I guess that ki should be calculated as
ki(i,j) = DBLE(j) + DBLE(i)
If that is true, I suggest the following solution:
PROGRAM MAIN
  USE OMP_LIB
  IMPLICIT NONE
  INTEGER :: i, j, k, icr, icr0, icr1
  DOUBLE PRECISION :: en, ei, es, timerRate
  DOUBLE PRECISION :: ki(1000,2000), et(200), kn(2000)
  INTEGER, PARAMETER :: nthreads = 1
  call system_clock(count_rate=icr)
  timerRate = real(icr)
  call system_clock(icr0)
  call omp_set_num_threads(nthreads)
  OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
  DO i = 1, 1000, 1
    DO j = 1, 2000, 1
      ki(i,j) = DBLE(j) + DBLE(i)
    END DO
  END DO
  DO i = 1, 200, 1
    en = 2.0d0/DBLE(200)*(i-1)-1.0d0
    et(i) = en
    es = 0.0d0
    !$OMP PARALLEL DO private(j,kn,ei) firstprivate(en) shared(ki) reduction(+:es)
    DO j = 1, 1000, 1
      kn = ki(j,:)
      CALL CAL(en,kn,ei)
      es = es + ei
    END DO
    !$OMP END PARALLEL DO
    WRITE (UNIT=3, FMT=*) et(i), es
  END DO
  CLOSE(UNIT=3)
  call system_clock(icr1)
  write (*,*) (icr1-icr0)/timerRate ! report the computing time
  STOP
END PROGRAM MAIN

SUBROUTINE CAL (en,kn,ei)
  IMPLICIT NONE
  INTEGER :: i
  DOUBLE PRECISION :: en, ei, gf, p
  DOUBLE PRECISION :: kn(2000)
  p = 3.14d0
  ei = 0.0d0
  DO i = 1, 2000, 1
    gf = 1.0d0 / (en - kn(i) * p)
    ei = ei + gf
  END DO
  RETURN
END SUBROUTINE CAL
I added some variables to check the computing time ;-).
This version runs in 5.14 s with nthreads=1 and in 2.75 s with nthreads=2. It does not quite divide the computing time by 2, but it seems like a good deal for a first shot. Unfortunately, this machine only has a Core i3 processor, so I cannot go beyond nthreads=2. However, I wonder how the code would behave with nthreads=16?
Please let me know.
I hope that this helps you.
Finally, a word of caution about the data-sharing attributes of the variables (private, firstprivate, and shared): they need to be chosen carefully in your real code.

OpenMP parallel do loop working correctly ~50% of the time

I am currently adding OpenMP parallelization to a do loop in one of the codes I have written for research. I am fairly new to OpenMP, so I would appreciate any suggestions about what might be going wrong.
Basically, I have added a parallel do loop to the following code (which works prior to parallelization). r(:,:,:,:) is a large array of molecular coordinates indexed by time, molecule, atom, and (x,y,z). It holds about 100 GB of data (I am working on an HPC system with plenty of RAM). I am trying to parallelize the outer loop and subdivide it between processors so that I can reduce the time this calculation takes. I thought it would be a good candidate because msd and msd_cm are the only arrays that need to be written by multiple processors and stored for later, and since each iteration writes to its own elements of these arrays, they should not cause a race condition.
The problem: if I run this code 5 times, I get varying results; sometimes msd is calculated correctly (or appears to be), and sometimes it comes out as all zeros later when I average it. Without parallelization there are no issues.
I have tried altering the shared vs. private variables in the code and I think I have accounted for everything. The first index of the msd and msd_cm arrays should never be the same between threads, so I would not expect them to be an issue.
! Loop over time origins
counti = 0
ind = 0
!$OMP PARALLEL DO schedule(static) PRIVATE(i,j,k,it,r_old,r_cm_old,shift,shift_cm,dsq,ind) &
!$OMP& SHARED(msd,msd_cm)
do i = 1, nconfigs-nt, or_int
   if (MOD(counti*or_int,500) == 0) then
      write(*,*) 'Reached the ', counti*or_int,'th time origin'
   end if
   ! Set the Old Coordinates
   counti = counti + 1
   ind = (i-1)/or_int + 1
   r_old(:,:,:) = r(i,:,:,:)
   r_cm_old(:,:) = r_cm(i,:,:)
   shift = 0.0
   shift_cm = 0.0
   ! Loop over the timesteps in each trajectory
   do it = i+2, nt+i
      ! Loop over molecules
      do j = 1, nmols
         do k = 1, atms_per_mol
            ! Calculate the shift if it occurs.
            shift(j,k,:) = shift(j,k,:) - L(:)*anint((r(it,j,k,:) - &
                           r_old(j,k,:) )/L(:))
            ! Calculate the square displacements
            dsq = ( r(it,j,k,1) + shift(j,k,1) - r(i,j,k,1) ) ** 2. &
                 +( r(it,j,k,2) + shift(j,k,2) - r(i,j,k,2) ) ** 2. &
                 +( r(it,j,k,3) + shift(j,k,3) - r(i,j,k,3) ) ** 2.
            msd(ind, it-1-i, k) = msd(ind, it-1-i, k) + dsq
            ! Calculate the contribution to the c1,c2
         enddo ! End Atoms Loop (k)
         ! Calculate the shift if it occurs.
         shift_cm(j,:) = shift_cm(j,:) - L(:)*anint((r_cm(it,j,:) - &
                         r_cm_old(j,:) )/L(:))
         ! Calculate the square displacements
         dsq = ( r_cm(it,j,1) + shift_cm(j,1) - r_cm(i,j,1) ) ** 2. &
              +( r_cm(it,j,2) + shift_cm(j,2) - r_cm(i,j,2) ) ** 2. &
              +( r_cm(it,j,3) + shift_cm(j,3) - r_cm(i,j,3) ) ** 2.
         msd_cm(ind,it-1-i) = msd_cm(ind, it-1-i) + dsq
      enddo ! End Molecules Loop (j)
      r_old(:,:,:) = r(it,:,:,:)
      r_cm_old(:,:) = r_cm(it,:,:)
   enddo ! End t's loop (it)
enddo
!$OMP END PARALLEL DO
When this code is run and I later print out the averaged msd results, they either come out correctly or come out as all zeros, and it is always one or the other. Do you see an issue that might explain why it only works part of the time? I am brand new to OpenMP, so it is completely possible there is just something incredibly stupid in how I am trying to do this.
Thanks in advance!

Fortran OpenMP code much slower than its serial version

I want to solve the random walk problem, so I wrote a sequential Fortran code and now I need to parallelize it.
subroutine random_walk(walkers)
   implicit none
   include "omp_lib.h"
   integer :: i, j, col, row, walkers, m, n, iter
   real, dimension(:, :), allocatable :: matrix, res
   real :: point, z
   col = 12
   row = 12
   allocate (matrix(row, col), res(row, col))
   ! Read from file
   open(2, file='matrix.txt')
   do i = 1, row
      read(2, *) (matrix(i, j), j=1,col)
   end do
   res = matrix
   ! Solve task
   !$omp parallel private(i,j,m,n,point,iter)
   !$omp do collapse(2)
   do i = 2, 11
      do j = 2, 11
         m = i
         n = j
         iter = 1
         point = 0
         do while (iter <= walkers)
            call random_number(z)
            if (z <= 0.25) m = m - 1
            if (z > 0.25 .and. z <= 0.5) n = n + 1
            if (z > 0.5 .and. z <= 0.75) m = m + 1
            if (z > 0.75) n = n - 1
            if (m == 1 .or. m == 12 .or. n == 1 .or. n == 12) then
               point = point + matrix(m, n)
               m = i
               n = j
               iter = iter + 1
            end if
         end do
         point = point / walkers
         res(i, j) = point
      end do
   end do
   !$omp end do
   !$omp end parallel
   ! Write to file
   open(2, file='out_omp.txt')
   do i = 1, row
      write(2, *) (res(i, j), j=1,col)
   end do
contains
end
So, the problem is that the parallel program runs MUCH slower than its sequential version.
Where is the mistake (apart from my terrible code in general)?
Update: the code now uses the !$omp do directives shown above, but the result is still the same: it is much slower than the sequential version.
Most probably, the behavior is related to the random number generation. The Fortran RANDOM_NUMBER procedure is not even guaranteed to be thread-safe, although it is thread-safe at least in the GNU compiler thanks to a GNU extension. In any case, the performance seems to be very bad, as you note.
If you switch to a different, thread-safe random number generator, the scalability of your code can be good. I used the classical ran2.f generator:
http://www-star.st-and.ac.uk/~kw25/research/montecarlo/ran2.f
modified to make it thread-safe. If I am not mistaken, the steps to do that are:
in the calling unit declare and define:
integer :: iv(32), iy, idum2, idum
idum2 = 123456789 ; iv(:) = 0 ; iy = 0
in the OpenMP directives, add idum as private and idum2, iv, iy as firstprivate (by the way, you need to add z as private too)
in the parallel section, add (before the do directive)
idum = - omp_get_thread_num()
to have different random numbers for different threads
from the ran2 function, remove the DATA and SAVE lines and pass idum2, iv, iy as arguments:
FUNCTION ran2(idum, iv, iy, idum2)
call ran2 instead of the random_number intrinsic (a combined sketch of the parallel section follows below):
z = ran2(idum, iv, iy, idum2)
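Putting those pieces together, the parallel section of random_walk might look roughly like the sketch below (my composition, only meant as an illustration). It assumes ran2 has been rewritten as FUNCTION ran2(idum, iv, iy, idum2), a REAL function with its DATA and SAVE lines removed, and declared as external in the subroutine:

integer :: iv(32), iy, idum2, idum
real, external :: ran2                 ! the modified, thread-safe ran2
idum2 = 123456789 ; iv(:) = 0 ; iy = 0
!$omp parallel private(i,j,m,n,point,iter,z,idum) firstprivate(idum2,iv,iy)
idum = -omp_get_thread_num()           ! a different seed for each thread
!$omp do collapse(2)
do i = 2, 11
   do j = 2, 11
      m = i; n = j; iter = 1; point = 0.0
      do while (iter <= walkers)
         z = ran2(idum, iv, iy, idum2)  ! thread-safe generator instead of random_number
         if (z <= 0.25) m = m - 1
         if (z > 0.25 .and. z <= 0.5) n = n + 1
         if (z > 0.5 .and. z <= 0.75) m = m + 1
         if (z > 0.75) n = n - 1
         if (m == 1 .or. m == 12 .or. n == 1 .or. n == 12) then
            point = point + matrix(m, n)
            m = i; n = j; iter = iter + 1
         end if
      end do
      res(i, j) = point / walkers
   end do
end do
!$omp end do
!$omp end parallel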
With walkers=100000 (GNU compiler) these are my times:
1 thread => 4.7s
2 threads => 2.4s
4 threads => 1.5s
8 threads => 0.78s
16 threads => 0.49s
Not strictly related to the question, but I have to say that extracting a full real number for the roughly two bits of information you need at each step (one of four directions) and then using a chain of conditionals could probably be replaced by a more efficient strategy.
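For example, a single random number can be mapped straight onto one of the four directions with small lookup tables instead of four IF tests. A self-contained toy (the names dm, dn and direction_demo are mine, not from the code above):

program direction_demo
   implicit none
   integer, parameter :: dm(4) = (/ -1, 0, 1, 0 /)   ! change in m for each direction
   integer, parameter :: dn(4) = (/ 0, 1, 0, -1 /)   ! change in n for each direction
   integer :: m, n, dir, step
   real :: z
   m = 6; n = 6
   do step = 1, 10
      call random_number(z)
      dir = min(1 + int(4.0*z), 4)   ! map z in [0,1) onto a direction index 1..4
      m = m + dm(dir)
      n = n + dn(dir)
   end do
   print*, 'final position:', m, n
end program direction_demo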

parallel do mistake in fortran

program main
   use omp_lib
   implicit none
   integer :: n = 8
   integer :: i, j, myid, a(8, 8), b, c(8)
   ! Generate a 8*8 array A
   !$omp parallel default(none), private(i, myid), &
   !$omp shared(a, n)
   myid = omp_get_thread_num() + 1
   do i = 1, n
      a(i, myid) = i*myid
   end do
   !$omp end parallel
   ! Array A
   print*, 'Array A is'
   do i = 1, n
      print*, a(:, i)
   end do
   ! Sum of array A
   b = 0
   !$omp parallel reduction(+:b), shared(a, n), private(i, myid)
   myid = omp_get_thread_num() + 1
   do i = 1, n
      b = b + a(i, myid)
   end do
   !$omp end parallel
   print*, 'Sum of array A by reduction is ', b
   b = 0
   c = 0
   !$omp parallel do
   do i = 1, n
      do j = 1, n
         c(i) = c(i) + a(j, i)
      end do
   end do
   !$omp end parallel do
   print*, 'Sum of array A by using parallel do is', sum(c)
   !$omp parallel do
   do i = 1, n
      do j = 1, n
         b = b + a(j, i)
      end do
   end do
   !$omp end parallel do
   print*, 'Sum of array A by using parallel do in another way is', b
end program main
I wrote the piece of Fortran code above to use OpenMP to sum all the elements of an 8*8 array in three different ways. The first one uses reduction and works. For the second, I created a one-dimensional array with 8 elements, summed each column inside the parallel region, and then summed the partial results; this works as well. For the third, I used a single integer to sum every element of the array inside a parallel do region. This result is not correct and varies every time. I don't understand why this happens. Is it because I didn't specify shared and private attributes, or because the variable b is overwritten in the process?
There is a race condition on b in your third scenario: several threads read and write the same variable without proper synchronization or privatization.
Note that you don't have a race condition in the second scenario: each thread is updating some data (i.e. c(i)) that no one else is accessing.
Finally, some solutions to your last scenario:
Add the reduction(+:b) clause to the directive
Add an !$omp atomic directive immediately before the b = b + a(j,i) statement
You can implement a manual privatization
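For instance, the reduction variant (the first option) only needs an extra clause on the directive, while the atomic variant (the second option) keeps b shared but protects each update at the cost of serializing it. Both are sketches of your last loop:

! Option 1: reduction clause
b = 0
!$omp parallel do private(j) reduction(+:b)
do i = 1, n
   do j = 1, n
      b = b + a(j, i)
   end do
end do
!$omp end parallel do

! Option 2: atomic update of the shared b (correct, but slower)
b = 0
!$omp parallel do private(j)
do i = 1, n
   do j = 1, n
!$omp atomic
      b = b + a(j, i)
   end do
end do
!$omp end parallel do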