OpenMP parallelization of two do-loops with a condition on indices - Fortran

I've tried to parallelize a code that contains a double do-loop like the one below. It's certainly not efficient, but that's not a big problem right now.
The first problem is that the output tauv is NaN.
The second problem is that the Intel compiler gives a fatal error whenever the number of threads is less than the maximum number of threads (which is 8 on my machine).
How can I deal with these problems?
!$omp parallel do private(i, j, ro11, ro21, ro12, ro22, &
!$omp&                    u11, u21, u12, u22, &
!$omp&                    v11, v21, v12, v22, &
!$omp&                    es11, es21, es12, es22, &
!$omp&                    p11, p21, p12, p22, &
!$omp&                    te11, te21, te12, te22, &
!$omp&                    emu11, emu21, emu12, emu22) &
!$omp& shared(i1l, i2l, j1l, j2l, emumax, tauv, tauvij, ro, u, v, es)
do i = i1l+2, i2l-2, 2
   do j = j1l+2, j2l-2, 2
      if (i.le.niii.and.i.ge.0.and.j.ge.0.and.j.le.nj .or. &
          i.le.ni.and.i.ge.niik.and.j.gt.njjv.and.j.le.nj .or. &
          i.le.ni.and.i.ge.niik.and.j.ge.0.and.j.lt.njjn .or. &
          i.gt.niii.and.i.lt.niik.and.j.gt.njj0+i-niii .or. &
          i.gt.niii.and.i.lt.niik.and.j.lt.njj0-i+niii) then
         ro11 = ro(i-1,j-1)
         ro21 = ro(i+1,j-1)
         ro12 = ro(i-1,j+1)
         ro22 = ro(i+1,j+1)
         u11 = u(i-1,j-1)
         u21 = u(i+1,j-1)
         u12 = u(i-1,j+1)
         u22 = u(i+1,j+1)
         v11 = v(i-1,j-1)
         v21 = v(i+1,j-1)
         v12 = v(i-1,j+1)
         v22 = v(i+1,j+1)
         es11 = es(i-1,j-1)
         es21 = es(i+1,j-1)
         es12 = es(i-1,j+1)
         es22 = es(i+1,j+1)
         p11 = (es11-0.5*ro11*(u11*u11+v11*v11))*ga1
         p21 = (es21-0.5*ro21*(u21*u21+v21*v21))*ga1
         p12 = (es12-0.5*ro12*(u12*u12+v12*v12))*ga1
         p22 = (es22-0.5*ro22*(u22*u22+v22*v22))*ga1
         te11 = p11/ro11
         te21 = p21/ro21
         te12 = p12/ro12
         te22 = p22/ro22
         emu11 = te11**1.5*(1.0+s1)/(te11+s1)
         emu21 = te21**1.5*(1.0+s1)/(te21+s1)
         emu12 = te12**1.5*(1.0+s1)/(te12+s1)
         emu22 = te22**1.5*(1.0+s1)/(te22+s1)
         emumax = emu11
         if (emu21.gt.emumax) then
            emumax = emu21
         end if
         if (emu12.gt.emumax) then
            emumax = emu12
         end if
         if (emu22.gt.emumax) then
            emumax = emu22
         end if
         tauvij = re*flkv*hx*hx/emumax
         if (tauvij .le. tauv) then
            tauv = tauvij
         endif
      endif
   enddo
enddo
!$omp end parallel do

The thing is that it executes without errors, but the OpenMP do-loop runs more slowly than the sequential one...
From your reproducible example:
1.) Your code is using only 1 thread in the OpenMP region:
! Set number of threads
nthreads = 1
call omp_set_num_threads(nthreads)
print *, 'The number of threads are used is ', omp_get_max_threads ( )
I would avoid the call to omp_set_num_threads(). Instead, specify the number of threads with the environment variable OMP_NUM_THREADS. On a Unix machine: export OMP_NUM_THREADS=<number of threads>
2.) In your "reproducible" example, the parallelized loop (line 312) is missing the private/shared declarations. From what you wrote above, fix it to:
!$omp parallel do default(private) shared(i1l, i2l, j1l, j2l, emumax, tauv, tauvij, ro, u, v, es)
With all of the above, the result I get on my machine (4c/4t) using the GNU Fortran compiler is:
...
Executed time in SEQ code is 60.2720146
...
Executed time in OMP code is 27.1342430
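For what it's worth, here is a sketch of how the whole directive could alternatively be written: default(shared) keeps the read-only bounds, constants and arrays (niii, niik, ga1, s1, re, flkv, hx, ro, u, v, es, ...) shared, the scratch variables are listed as private, and the running minimum on tauv is expressed as a min reduction. This variant is my own assumption about the intent of the loop, not what was measured above; the loop body stays exactly as in the question.
!$omp parallel do default(shared) reduction(min:tauv) &
!$omp&   private(i, j, ro11, ro21, ro12, ro22, u11, u21, u12, u22, &
!$omp&           v11, v21, v12, v22, es11, es21, es12, es22,       &
!$omp&           p11, p21, p12, p22, te11, te21, te12, te22,       &
!$omp&           emu11, emu21, emu12, emu22, emumax, tauvij)
do i = i1l+2, i2l-2, 2
   do j = j1l+2, j2l-2, 2
      ! ... loop body exactly as in the question ...
   end do
end do
!$omp end parallel do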

Related

How to properly use OpenMP for this nested loop (Fortran)

I have recently started to parallelize a serial code I've been developing and was curious if anyone had input on how to properly apply OpenMP for these loops.
F_vol = 0.0_bp
G_vol = 0.0_bp
H_vol = 0.0_bp
RHS_vol = 0.0_bp
!$OMP PARALLEL DO PRIVATE(e,i1,j1,F,G,H)
DO e=1,NE
DO i1=1,Np
DO j1=1,NPTS
CALL Flux(Q(j1,e,:),F,G,H)
F_Vol(i1,e,:) = F_Vol(i1,e,:) + Stuff(j1)*F(:)
G_Vol(i1,e,:) = G_vol(i1,e,:) + Stuff(j1)*G(:)
H_Vol(i1,e,:) = H_vol(i1,e,:) + Stuff(j1)*H(:)
END DO
END DO
END DO
!$OMP END PARALLEL DO
As a note, the arrays F, G, and H are of size 5 and are temporary arrays. Additionally, F_Vol, G_Vol, and H_Vol are of dimension (NE,Np,5). The part I am unsure about is how to properly parallelize the arrays I sum over j1=1,NPTS. Given that they are not dependent on each other but vary with i1 and e, I think using PRIVATE() is required to avoid overwriting. Lastly, I am focused on these loops because, according to gprof, a good portion of my computational expense is in this area of the code.
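A minimal sketch of one way to make the data-sharing explicit for this loop, under the assumption that Q and Stuff are only read and that the slices F_Vol(:,e,:), G_Vol(:,e,:) and H_Vol(:,e,:) written for different values of e never overlap (each thread gets its own values of e). If NE, Np and NPTS are named constants rather than variables, drop them from the SHARED list.
!$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(e, i1, j1, F, G, H) &
!$OMP&   SHARED(NE, Np, NPTS, Q, Stuff, F_Vol, G_Vol, H_Vol)
DO e = 1, NE
   DO i1 = 1, Np
      DO j1 = 1, NPTS
         ! Flux fills the small work arrays F, G, H, which are private per thread
         CALL Flux(Q(j1,e,:), F, G, H)
         ! each value of e updates a distinct slice, so no two threads
         ! write the same elements of F_Vol, G_Vol or H_Vol
         F_Vol(i1,e,:) = F_Vol(i1,e,:) + Stuff(j1)*F(:)
         G_Vol(i1,e,:) = G_Vol(i1,e,:) + Stuff(j1)*G(:)
         H_Vol(i1,e,:) = H_Vol(i1,e,:) + Stuff(j1)*H(:)
      END DO
   END DO
END DO
!$OMP END PARALLEL DO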

slow down of a parallel subroutine when called inside a loop

I have parallelized a subroutine. It has a very good benchmark: a 4X speedup on a quad core. I have the two versions in two different sources: serial.f and paral.f. The comparison is made by running them from the terminal and printing the elapsed wall clock time. Inside each source there is only a call to the associated subroutine. But when I modify the sources like this:
serial.f:
do i = 1, 100
   call serial
end do
and like this
paral.f:
do i = 1, 100
   call paral
end do
performance goes down to a 0.96X speedup: the parallel version is worse than the serial one! The code can be found in why calling many N times a serial subroutine is faster than calling N times the parallel version of the same subroutine.
To obtain serial.f, just comment out the block containing the call to paral. To obtain paral.f, just comment out the block containing the call to serial.
I'm asking: is this a common problem? How can I solve it so that I keep the 4X speedup while keeping the loop of calls?
Please note:
(1) I've tried translating to C and timing it; the benchmarks and problems remain all the same.
(2) I've tried translating to modern Fortran and timing it; the benchmarks and problems remain all the same.
(3) I've tried all kinds of tricks and rewrites of the code. I'm sure the problem is not how the subroutine is parallelized (I achieved 4X) but that it is called too many times inside a loop.
Thank you.
EDIT:
As requested, I'm posting a program written in modern Fortran that exhibits the same issue:
program main
  use omp_lib
  implicit none
  integer ( kind = 4 ), parameter :: m = 5000
  integer ( kind = 4 ), parameter :: n = 5000
  integer ( kind = 4 ) i
  integer ( kind = 4 ) j
  integer ( kind = 4 ) nn
  real ( kind = 8 ) u(m,n)
  real ( kind = 8 ) w(m,n)
  real ( kind = 8 ) wtime, h

  call random_seed()
  do j = 1, n
    do i = 1, m
      call random_number(u(i,j))
    end do
  end do

  wtime = omp_get_wtime ( )
  do nn = 1, 100
!$omp parallel do default(none) shared(u, w) private(i,j)
    do j = 2, n - 1
      do i = 2, m - 1
        w(i,j) = 0.25D+00 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
      end do
    end do
!$omp end parallel do
  end do
  wtime = omp_get_wtime ( ) - wtime

  h = 0.0D+00
  do j = 1, n
    do i = 1, m
      h = h + w(i,j)
    end do
  end do

  write ( *, '(a,g14.6)' ) ' Wall clock time serial= ', wtime
  write ( *, '(a,g14.6)' ) ' h ', h
  stop
end
In order to get serial_with_loop.f90, just comment out the OpenMP directives and the nn loop. With a similar method you must also obtain parall_with_loop.f90 and the serial and parallel versions without the loop. You can compile with "gfortran -o name.out -fopenmp -O3 name.f90" and launch from the terminal with the output redirected to a text file: "name.out > time_result.txt".
The problem you have is that you are parallelizing the loop on j, which is located inside the loop on nn. Therefore, for each value of nn, your machine needs time to create a pool of threads that do the job for the different values of j. This time (required for creating the pool) is serial and cannot be divided by the number of threads used. As I see your code, there is no reason why you could not parallelize the nn loop instead and create that pool only once, instead of nn times. I think that your code will work better if you write:
wtime = omp_get_wtime ( )
!$omp parallel do default(none) shared(u, w) private(nn,i,j)
do nn = 1, 100
  do j = 2, n - 1
    do i = 2, m - 1
      w(i,j) = 0.25D+00 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
    end do
  end do
end do
!$omp end parallel do
wtime = omp_get_wtime ( ) - wtime
I hope that this helps you.
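An alternative sketch, in case the nn loop has to stay sequential (for example if later iterations were to depend on earlier ones): open the parallel region once around the nn loop and workshare only the j loop, so the region-entry cost is paid once instead of 100 times.
wtime = omp_get_wtime ( )
!$omp parallel default(none) shared(u, w) private(nn, i, j)
do nn = 1, 100
   ! every thread executes the nn loop, but the j iterations are
   ! shared out once per nn by the do construct below
   !$omp do
   do j = 2, n - 1
      do i = 2, m - 1
         w(i,j) = 0.25D+00 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
      end do
   end do
   !$omp end do      ! implicit barrier keeps the nn iterations in step
end do
!$omp end parallel
wtime = omp_get_wtime ( ) - wtime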

OpenMP: how to protect an array from race condition

This is a follow-up to questions 36182486, 41421437, and several others. I want to speed up the assembly of skewness and mass matrices for a FEM calculation by using multiple processors to deal with individual elements in parallel. This little MWE shows the guts of the operation.
!! compile with gfortran -fopenmp -o FEMassembly FEMassembly.f90
Program FEMassembly
  use, intrinsic :: iso_c_binding
  implicit none
  real (c_double) :: arrayM(3,3) = reshape((/2.d0,1.d0,1.d0,1.d0, &
       & 2.d0,1.d0,1.d0,1.d0,2.d0/),(/3,3/))    ! contrib from one element
  integer (c_int) :: ke, ne=4, kx, nx=6, nodes(3)
  real (c_double) :: L(6,6)
  integer (c_int) :: t(4,3) = reshape((/1,2,5,6,2,3,4,5,4,5,2,3/),(/4,3/))

!! first, no OMP
  do ke = 1, ne                 ! for each triangular element
    nodes = t(ke,:)
    L(nodes,nodes) = L(nodes,nodes) + arrayM
  end do
  print *,'L no OMP'
  write(*,fmt="(6(1x,f3.0))") (L(kx,1:6), kx=1,nx)

  L = 0
!$omp parallel do private (nodes)
  do ke = 1, ne                 ! for each triangular element
    nodes = t(ke,:)
!! !$omp atomic
    L(nodes,nodes) = L(nodes,nodes) + arrayM
!! !$omp end atomic
  end do
!$omp end parallel do
  print *,'L with OMP and race'
  write(*,fmt="(6(1x,f3.0))") (L(kx,1:6), kx=1,nx)
End Program FEMassembly
With the atomic directives commented out, the array L contains several wrong values, presumably because of the race condition I was trying to avoid with the atomic directives. The results are:
L no OMP
2. 1. 0. 1. 0. 0.
1. 6. 1. 2. 2. 0.
0. 1. 4. 0. 2. 1.
1. 2. 0. 4. 1. 0.
0. 2. 2. 1. 6. 1.
0. 0. 1. -0. 1. 2.
L with OMP and race
2. 1. 0. 1. 0. 0.
1. 6. 1. 2. 2. 0.
0. 1. 2. 0. 2. 1.
1. 2. 0. 4. 1. 0.
0. 2. 2. 1. 6. 1.
0. 0. 1. 0. 1. 2.
If the "atomic" directives are uncommented, the compiler return the error:
Error: !$OMP ATOMIC statement must set a scalar variable of intrinsic type at (1)
where (1) points to arrayM in the line L(nodes,nodes).....
What I am hoping to achieve is to have the time-consuming contributions from each element (here the trivial arrayM) computed in parallel, but since several threads address the same matrix elements, something has to be done so that the sums occur in an orderly fashion. Can anyone suggest a way to do this?
In Fortran the simplest way is to use a reduction, because OpenMP for Fortran supports reductions on arrays. Below is what I think you are trying to do, but take it with a pinch of salt because:
1.) You don't provide the correct expected output, so it's difficult to test.
2.) With such a small array, race conditions are sometimes difficult to trigger.
!! compile with gfortran -fopenmp -o FEMassembly FEMassembly.f90
Program FEMassembly
  use, intrinsic :: iso_c_binding
  Use omp_lib, Only : omp_get_num_threads
  implicit none
  real (c_double) :: arrayM(3,3) = reshape((/2.d0,1.d0,1.d0,1.d0, &
       & 2.d0,1.d0,1.d0,1.d0,2.d0/),(/3,3/))    ! contrib from one element
  integer (c_int) :: ke, ne=4, nodes(3)
  real (c_double) :: L(6,6)
  integer (c_int) :: t(4,3) = reshape((/1,2,5,6,2,3,4,5,4,5,2,3/),(/4,3/))
  ! Not declared in original program
  Integer :: nx, kx

  ! Not set in original program
  nx = Size( L, Dim = 1 )

!$omp parallel default( none ) private ( ke, nodes ) shared( ne, t, L, arrayM )
!$omp single
  Write( *, * ) 'Working on ', omp_get_num_threads(), ' threads'
!$omp end single
!$omp do reduction( +:L )
  do ke = 1, ne                 ! for each triangular element
    nodes = t(ke,:)
    L(nodes,nodes) = L(nodes,nodes) + arrayM
  end do
!$omp end do
!$omp end parallel
  write(*,fmt="(6(1x,f3.0))") (L(kx,1:6), kx=1,nx)
End Program FEMassembly
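If replicating L on every thread is not wanted (a reduction gives each thread its own private copy of L, which can be costly when L is large), the atomic route the error message hints at also works, provided the update is applied one scalar array element at a time. A sketch only; the extra inner indices ii and jj are additions of mine and would need integer declarations, they are not variables of the original program.
!$omp parallel do private(ke, nodes, ii, jj)
do ke = 1, ne                       ! for each triangular element
   nodes = t(ke,:)
   do jj = 1, 3
      do ii = 1, 3
         ! atomic update of one scalar element of L at a time
         !$omp atomic
         L(nodes(ii), nodes(jj)) = L(nodes(ii), nodes(jj)) + arrayM(ii, jj)
      end do
   end do
end do
!$omp end parallel do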

forrtl: severe (151): allocatable array is already allocated-

/var/spool/torque/mom_priv/jobs/775.head.cluster.SC: line 22: 28084 Segmentation fault ./a.out
I am new to Fortran and this is the first time I have worked with HPC and OpenMP.
In my code, I have a loop that should run in parallel. I use some dynamically allocated variables, all of which are only scratch arrays inside the parallel loop.
I allocate the dynamic variables inside the parallel loop:
!$OMP PARALLEL DO
do 250 iconf = 1,config
allocate(randx(num),randy(num),randz(num),unit_cg(num), &
& x(nfftdim1),y(nfftdim2),z(nfftdim3),fr1(num), &
& fr2(num),fr3(num),theta1(order,num), &
& theta2(order,num),theta3(order,num), &
& Q(nfftdim1,nfftdim2,nfftdim3))
... call some subroutines and do calculations ...
deallocate(randx,randy,randz,unit_cg,fr1,fr2,fr3,theta1,theta2, &
& theta3,x,y,z,Q)
250 continue
!$OMP END PARALLEL DO
I omitted some irrelevant parts of the code. When the program is executed, this error occurs:
forrtl: severe (151): allocatable array is already allocated
I allocated the variables outside the parallel region; it works for small data, but for large data this error occurs:
/var/spool/torque/mom_priv/jobs/775.head.cluster.SC: line 22: 28084 Segmentation fault ./a.out
I used the PRIVATE clause for the dynamic (scratch) variables:
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(randx,randy,randz, &
!$OMP& unit_cg,fr1,fr2,fr3,theta1,theta2,theta3,x,y,z,Q, &
!$OMP& dir_ene,rec_ene,corr_ene,energy_final,plproduct_avg, &
!$OMP& correlation_term)
and allocated the variables inside the parallel loop, but got the same error.
At last I changed the code to:
allocate(randx(num),randy(num),randz(num),unit_cg(num), &
& x(nfftdim1),y(nfftdim2),z(nfftdim3),fr1(num), &
& fr2(num),fr3(num),theta1(order,num), &
& theta2(order,num),theta3(order,num), &
& Q(nfftdim1,nfftdim2,nfftdim3))
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(randx,randy,randz, &
!$OMP& unit_cg,fr1,fr2,fr3,theta1,theta2,theta3,x,y,z,Q, &
!$OMP& dir_ene,rec_ene,corr_ene,energy_final,plproduct_avg, &
!$OMP& correlation_term)
do 250 iconf = 1,config
... call some subroutines and do calculations ...
250 continue
!$OMP END PARALLEL DO
deallocate(randx,randy,randz,unit_cg,fr1,fr2,fr3,theta1,theta2, &
& theta3,x,y,z,Q)
It fails at run time. It starts N (number of threads) loops, but cannot complete them, and again this error:
/var/spool/torque/mom_priv/jobs/775.head.cluster.SC: line 22: 28084 Segmentation fault ./a.out
Any idea?
I changed the code and finally it WORKS!
The directive !$OMP PARALLEL DO is the shortcut for the two directives !$OMP PARALLEL and !$OMP DO. I used these two directives (instead of !$OMP PARALLEL DO) and put the allocation inside the parallel region, before the worksharing loop. I guess (but I'm not sure) that the compiler now knows how to get memory for the private variables, because I put the private clause before the allocation, and so the segmentation fault does not occur.
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(iconf,d,randx, &
!$OMP& randy,randz,unit_cg,theta1,theta2,theta3,fr1,fr2,fr3,Q, &
!$OMP& plproduct_avg)
allocate(randx(num),randy(num),randz(num),unit_cg(num), &
& fr1(num),fr2(num),fr3(num),theta1(order,num), &
& theta2(order,num),theta3(order,num), &
& Q(nfftdim1,nfftdim2,nfftdim3))
!$OMP DO
do 250 iconf = 1,config
... call some subroutines and do calculations ...
250 continue
!$OMP END DO
deallocate(randx,randy,randz,unit_cg,fr1,fr2,fr3,theta1,theta2, &
& theta3,Q)
!$OMP END PARALLEL
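For illustration only, here is a small self-contained toy program (my own example, not the OP's code) showing the same pattern: split !$OMP PARALLEL DO into !$OMP PARALLEL plus !$OMP DO, allocate the private work array once per thread inside the region, and deallocate it before the region ends.
program private_alloc_demo
   implicit none
   integer, parameter :: nconf = 8, m = 1000
   real(kind=8), allocatable :: work(:)
   real(kind=8) :: total(nconf)
   integer :: iconf, k

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(iconf, k, work)
   allocate(work(m))                ! one private allocation per thread
!$OMP DO
   do iconf = 1, nconf
      do k = 1, m
         work(k) = real(iconf*k, kind=8)
      end do
      total(iconf) = sum(work)      ! each iconf writes a distinct element
   end do
!$OMP END DO
   deallocate(work)                 ! freed before leaving the region
!$OMP END PARALLEL
   print *, 'totals:', total
end program private_alloc_demo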

Understanding the correct use of !$omp parallel do reduction(...)

I am trying to write a program that counts the number of primes between 1 and some number n in Fortran 90, using OpenMP. The nested loop simply counts the numbers that are not prime. I want to use an omp parallel do to speed this up. As far as I understand, since I am just counting numbers that are not prime, it is appropriate to use something like !$omp parallel do reduction(+:not_primes). When I run the code below in serial, without the !$omp lines, I get the following output:
Primes: 5134
OpenMP time elapsed 0.49368596076965332
but when I include the !$omp lines I get
Primes: -1606400834
OpenMP time elapsed 0.37933206558227539
Have I used the parallel do correctly here? (apparently not, but why?) Thanks!
program prime_counter
  integer n, not_primes, i, j
  real*8 :: ostart, oend, omp_get_wtime

  ostart = omp_get_wtime()
  n = 50000
!$omp parallel do reduction(+:not_primes)
  do i = 2, n
    do j = 2, i-1
      if (mod(i,j) == 0) then
        not_primes = not_primes + 1
        exit
      end if
    end do
  end do
!$omp end parallel do
  print*, 'Primes:', n - not_primes
  oend = omp_get_wtime()
  write(*,*) 'OpenMP time elapsed', oend - ostart
end program
You do not initialize not_primes anywhere; it is undefined. The usage of the OpenMP reduction is OK. The index j should be marked as private; I normally mark all loop indexes as private, although that is not strictly necessary.
not_primes = 0
!$omp parallel do reduction(+:not_primes) private(i,j)
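Putting the two fixes together, a minimal sketch of the corrected program (the use omp_lib and implicit none lines are small additions of mine; they replace the manual real*8 declaration of omp_get_wtime):
program prime_counter
   use omp_lib                          ! provides omp_get_wtime with the proper interface
   implicit none
   integer :: n, not_primes, i, j
   real(kind=8) :: ostart, oend

   ostart = omp_get_wtime()
   n = 50000
   not_primes = 0                       ! the missing initialization
!$omp parallel do reduction(+:not_primes) private(i, j)
   do i = 2, n
      do j = 2, i-1
         if (mod(i,j) == 0) then
            not_primes = not_primes + 1
            exit
         end if
      end do
   end do
!$omp end parallel do
   print *, 'Primes:', n - not_primes
   oend = omp_get_wtime()
   write(*,*) 'OpenMP time elapsed', oend - ostart
end program prime_counter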