Summation error in openmp fortran - fortran

I am trying to sum up of a variable with openmp with code given below.
normr=0.0
!$omp parallel default(private) shared(nelem,normr,cell_data,alphar,betar,k)
!$omp do REDUCTION(+:normr)
do ii=1,nelem
nnodese=cell_data(ii)%num_vertex
pe=cell_data(ii)%porder
ndofe=cell_data(ii)%ndof
num_neighboure=cell_data(ii)%num_neighbour
be=>cell_data(ii)%Force
Ke=>cell_data(ii)%K
Me=>cell_data(ii)%M
pressuree=>cell_data(ii)%p
Rese=>cell_data(ii)%Res
neighbour_indexe=>cell_data(ii)%neighbour_index(:)
Rese(:)=be(:)
Rese(:)=Rese(:)-cmplx(-1.0,1.0*alphar/k)*matmul(Me(:,:),pressuree(:))
Rese(:)=Rese(:)-cmplx(1.0,1.0*k*betar)*matmul(Ke(:,:),pressuree(:))
do jj=1,num_neighboure
nbeindex=neighbour_indexe(jj)
Knbe=>cell_data(ii)%neighbour(jj)%Knb
pressurenb=>cell_data(nbeindex)%p
ndofnb=cell_data(nbeindex)%ndof
Rese(:)=Rese(:)-cmplx(1.0,1.0*k*betar)*matmul(Knbe(:,:),pressurenb(:))
nullify(pressurenb)
nullify(Knbe)
end do
normr=normr+dot_product(Rese(:),Rese(:))
nullify(pressuree)
nullify(Ke)
nullify(Me)
nullify(Rese)
nullify(neighbour_indexe)
nullify(be)
end do
!$omp end do
!$omp end parallel
The result for summed variable, normr, is different for parallel and sequantial code. In one of the posts I have seen that inner loop variable should be defined inside the parallel construct(Why I don't know). I also changed the pointers to locall allocated variables but result did not changed. normr is a saved real variable.
Any suggestions and helps will be appreciated.
Best Regards,
Gokmen

normr can be different for the parallel and the sequential code, because the summation does not take place in the same order. Hence, the difference does not need to be an error and can be expected from the reduction operation.
Not being an error does not necessary mean not being a problem. One way around this would be to move the summation out of the parallel loop:
!$omp parallel default(private) shared(... keep_dot_product)
!$OMP do
do ii=1,nelem
! ...
keep_dot_product(ii) = dot_product(Rese(:),Rese(:))
! ...
end do
!$omp end do
!$omp end parallel
normr = sum(keep_dot_product)

Related

Openmp nested parallelism use available threads

So, I have this simple Fortran do loop and inside that loop a couple of subroutines are called. I have
Made the do loop parallel with OpenMP, like this
!$omp parallel do
do i=1,n
call a()
call b()
enddo
!$omp end parallel do
Now most of the times the number of iterations in the loop is
less compared to the number of processor/threads available and the subroutines that are called inside the
loop can be called in parallel. So, is there a way to call the subroutines in parallel inside the parallel
do loop ? I have tried with task like this
!$omp parallel do
do i=1,n
!$omp task
call a(i , j )
!$omp end task
!$omp task
call b(i, k)
!$omp end task
!$omp taskwait
enddo
!$omp end parallel do
But this shows some error with segmentation fault. Is there any way to achieve this.
UPDATE:
So, I found out the main reason for the segmentation fault is coming from the fftw library. Lets consider a dummy program
program name
!$use omp_lib
implicit real*8(a-h,p-z)
call system_clock(count_rate=irate)
call system_clock(it1)
!$ call omp_set_nested(.true.)
!$omp parallel do
do i =1,5
call test(i)
print *, i
enddo
!$omp end parallel do
call system_clock(it2)
print *, (it2-it1)/real(irate, kind=8)
end program name
subroutine test(ii)
! just a dummy subroutine for heavy computation
implicit real*8(a-h,p-z)
do j=1,40000
!$omp task
do k=1,40000
x = exp(sqrt(sqrt(2.0d0*ii**3)**2))
enddo
!$omp end task
enddo
end subroutine
This program works exactly what I wants and using the task directives, uses the remaining threads and improves the performance. Now lets consider another dummy program but with fftw, similar to what I'm working.
program name
!$use omp_lib
implicit real*8(a-h,p-z)
integer, parameter :: n=8192*8
complex(kind=8) :: arr(n)
real(kind=8) :: tmp1(n), tmp2(n)
integer(kind=8) :: pF
integer :: i
call system_clock(count_rate=irate)
call dfftw_plan_dft_1d(pF,n,arr,arr,-1,0) ! forward
call system_clock(it1)
!$ call omp_set_nested(.true.)
!$omp parallel do private(arr)
do i =1,5
call random_number(tmp1)
call random_number(tmp2)
arr = cmplx(tmp1, tmp2, kind=8)
call test(pF, arr)
print *, i
enddo
!$omp end parallel do
call system_clock(it2)
print *, (it2-it1)/real(irate, kind=8)
end program name
subroutine test(pF, arr)
implicit real*8(a-h,p-z)
complex(kind=8) :: arr(:)
integer(kind=8) :: pF
do j=1,100
!$omp task private(arr)
do k=1, 100
call dfftw_execute_dft(pF, arr, arr)
enddo
!$omp end task
enddo
end subroutine
Now, this throws the segmentation fault. (NOTE: I have no random numer call in my actual program, they are here just for a dummy purpose). I have checked http://www.fftw.org/fftw3_doc/Thread-safety.html and fftw_execute is thread safe and the program works without the task directives. But with the task it throws error. Anyone knows how to fix this ?
Sigh, yet another example of why !$omp do parallel is a bad idea ... I really do think it is best to clearly separate the thread creation and worksharing phases.
As Vladimir says in the comments you haven't provided nearly enough detail to tell why you are getting a segmentation fault. However you seem to have a few misconceptions about OpenMP which I can try to address.
Firstly a very quick and dirty way to achieve what you want and avoiding any extra OpenMP directives is
!$omp parallel default( none ) private( i ) shared( n ) ! Create threads
!$omp do ! Now share out the work
Do i = 1, 2 * n
If( Mod( i, 2 ) == 1 ) Then
Call a
Else
Call b
End Do
!$omp end do
!$omp end parallel
However if you want to use tasks you're probably not doing it the easiest way if all calls to a and b are completely independent. In that case remember that a new task is created whenever ANY thread hits a !$omp task, and that that task can be executed by any thread, not just the one that created it. Following that logic something like
!$omp parallel default( none ) private( i ) shared( n ) ! Crate the threads
!$omp single
Do i = 1, n
!$omp task
Call a
!$omp end task
!$omp task
call b
!$omp end task
end do
!$omp end single
!$omp end parallel
is what you want - you use one thread to create the list of tasks, and then (or more probably while the list is being created) all the available threads will execute them, each task being taken by the next available thread. Note I have also missed out the taskwait directive as from your description I'm not sure why you think you need it as I can see no need for synchronisation at that point.

Openmp: Have a MASTER construct inside parallel do

I have a fortran code that looks like this
!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(var1, var2, var3, numberOfCalculationsPerformed)
do ix = 1,nx
! Do parallel work
do iy = 1,ny
! Do a lot of work....
!$OMP ATOMIC
numberOfCalculationsPerformed = numberOfCalculationsPerformed+1
!$OMP END ATOMIC
!$OMP MASTER
! Report progress
call progressCallBack(numberOfCalculationsPerformed/totalNCalculations)
!$OMP END MASTER
end do
end do
When I try to compile it reports that
error #7102: An OpenMP* MASTER directive is not permitted in the
dynamic extent of a DO, PARALLEL DO, SECTIONS, PARALLEL SECTIONS, or
SINGLE directive.
I do not understand this. I have tried to modify the parallel do construct to this
!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(var1, var2, var3, numberOfCalculationsPerformed), &
!$OMP& SCHEDULE(STATIC)
(in the thought that it had something to do with the scheduling) but that did nothing to change the error.
Does anyone know what I am not getting right? Is it just impossible to use master inside a parallel do construct or what? If that is so, are there alternatives?
Edit:
!$OMP SINGLE
!$OMP END SINGLE
Instead of the MASTER equivalent yields the same result... (error message)
Ps. I only need one of the threads to execute progressCallback.
The question is a bit old, but since I recently stumbled across the same issue, I wanted to share a simple solution. The idea is to formulate an if-clause which only evaluates to TRUE for one of the threads. This can easily be achieved by querying the current thread number. By requiring it to be zero, the clause is guaranteed to be true for at least one thread:
!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(var1, var2, var3, numberOfCalculationsPerformed)
do ix = 1,nx
! Do parallel work
do iy = 1,ny
! Do a lot of work....
!$OMP ATOMIC
numberOfCalculationsPerformed = numberOfCalculationsPerformed+1
!$OMP END ATOMIC
if (OMP_GET_THREAD_NUM() == 0) then
! Report progress
call progressCallBack(numberOfCalculationsPerformed/totalNCalculations)
end if
end do
end do

Nesting OMP DO directives - Fortran

I'm having problems trying to nest a OMP DO directive inside another OMP DO directive in Fortran.
Here's the following code:
DO in=2,n_niveles
allocate(cvalor(2,npuntosp(in),npuntost(in)))
!allocate(avalor(2,npuntosp(in-1),npuntost(in-1)))
allocate(valor_t2(npuntost(in),npuntosp(in-1),2))
!$OMP PARALLEL NUM_THREADS(hilos) DEFAULT(PRIVATE) FIRSTPRIVATE(n_niveles,in) SHARED(npuntosp,npuntost,cubos,central_reg,sumazm1n,expo,mphi,mtheta)
!$OMP DO SCHEDULE(STATIC)
DO aux=1,cubos(in-1)%ncubos_nivel
...
(some code here)
...
!$OMP PARALLEL NUM_THREADS(hilos) DEFAULT(PRIVATE) FIRSTPRIVATE(cuboj,in) SHARED(valor_t2,cvalor)
!$OMP DO SCHEDULE(STATIC)
do i=1,npuntost(in)
val=mtheta(in-1)%inicio(i,1)
do jj=val,val+mtheta(in-1)%inicio(i,2)
do k=1,npuntosp(in-1)
valor_t2(i,k,1)=valor_t2(i,k,1)+mtheta(in-1)%matriz(i,jj)*sumazm1n(in-1)%region(cuboj)%valor(1,k,jj)
valor_t2(i,k,2)=valor_t2(i,k,2)+mtheta(in-1)%matriz(i,jj)*sumazm1n(in-1)%region(cuboj)%valor(2,k,jj)
end do
end do
do k=1,npuntosp(in)
val=mphi(in-1)%inicio(k,1)
do jj=val,val+mphi(in-1)%inicio(k,2)
cvalor(1,k,i)=cvalor(1,k,i)+valor_t2(i,jj,1)*mphi(in-1)%matriz(jj,k)
cvalor(2,k,i)=cvalor(2,k,i)+valor_t2(i,jj,2)*mphi(in-1)%matriz(jj,k)
end do
end do
end do
!$OMP END DO
!$OMP END PARALLEL
...
(some code here)
...
END DO
!$OMP END DO
!$OMP END PARALLEL
deallocate(cvalor)
deallocate(valor_t2)
END DO
When the code is executed, an access violation exception occurs inside the second OpenMP parallel region. Sometimes that exception is changed for an overflow at the variable valor_t2.
Maybe OpenMP does not support this kind of parallelization, but I've searched over the net and didn't found anything about. I know that OpenMP supports the use of various OMP PARALLEL directives nested one inside another and I know how it works. But I'm having a headache with this problem.
Any ideas about what it's happening?
Thank you so much!
You're going to want to use the collapse clause in the do loop at the top level. See the link below for information:
https://computing.llnl.gov/tutorials/openMP/
As long as the code represented by (some code here) doesn't contain any loops, this should work.

Strange gfortran compilation error when adding Openmp directives

I have legacy fortran source file named pot.f,
which I need to apply OpenMP to parallel as shown below,but I can error messages about unexpected end state etc. But when I comment out $OMP lines by adding additional ! in the first column, there are not errors.
It is really weird to me. Can anybody tell me what went wrong?
subroutine pot_osc(rvp,R_pot,e_pot,pe_pot,ftmp,gtmp,vtmp,natoms)
implicit none
include 'sizes.h'
include 'constants.h'
include 'omp_lib.h'
double precision ftmp(maxatoms,3),gtmp(3),R_pot(maxatoms,3)
!$OMP PARALLEL WORKSHARE SHARED(gtmp,ftmp)
!$OMP PARALLEL NUM_THREADS(16)
gtmp = 0d0
ftmp = 0d0
!$OMP END PARALLEL WORKSHARE
return
end
subroutine pot_asym(rvp,vtmp)
implicit none
include 'constants.h'
return
end
Error messages:
end
1
Error: Unexpected END statement at (1)
subroutine pot_asym(rvp,vtmp)
1
Error: Unclassifiable statement at (1)
You start a second parallel section in the second OpenMP directive, which is not terminated by an end parallel. So the OpenMP directive should read
!$OMP PARALLEL WORKSHARE SHARED(gtmp,ftmp) NUM_THREADS(16)
gtmp = 0d0
ftmp = 0d0
!$OMP END PARALLEL WORKSHARE
or if you like to keep the line break use
!$OMP PARALLEL WORKSHARE SHARED(gtmp,ftmp) &
!$OMP NUM_THREADS(16)
gtmp = 0d0
ftmp = 0d0
!$OMP END PARALLEL WORKSHARE
In the past, I experienced some problems with exactly this kind of initialization. It seems that when compiled with gfortran the master thread did all the work. Even worse, by means of the "first-stouch principle", the whole array was located in the memory associated with the first thread. On our CCNUMA machine this lead to a huge slowdown.
To solve this I used explicit loops to initialize:
!$OMP PARALLEL DO SHARED(gtmp,ftmp) NUM_THREADS(16)
do i=1,maxatoms
ftmp(i,:) = 0d0
enddo
!$OMP END PARALLEL DO
! No need to do three elements in parallel
gtmp = 0d0
I don't know whether they fixed this problem, but I use this way of initialization for arrays in shared memory since then.

PARALLEL DO with or without CRITICAL?

Focusing in the parallel part of the code, which of the options presented below is preferred? Any better solution? I am trying to make an average of independent realizations of do_something
Option 1: Using CRITICAL
resultado%uno = 0.d0
!$OMP PARALLEL DO shared(large) private(i_omp) schedule(static,1)
do i_omp=1, nthreads
call do_something(large, resultadoOmp(i_omp))
!$OMP CRITICAL (forceloop)
resultado%uno = resultado%uno + resultadoOmp(i_omp)%uno
!$OMP END CRITICAL (forceloop)
enddo
!$OMP END PARALLEL DO
resultado%uno = resultado%uno/nthreads
Option 2: Avoiding CRITICAL (and ATOMIC)
!$OMP PARALLEL DO shared(large) private(i_omp) schedule(static,1)
do i_omp=1, nthreads
call do_something(large, resultadoOmp(i_omp))
enddo
!$OMP END PARALLEL DO
uno = 0.d0
!$OMP PARALLEL DO shared(resultado) private(i_omp) schedule(static,1) &
!$OMP & REDUCTION(+:uno)
do i_omp=1, nthreads
uno = uno + resultadoOmp(i_omp)%uno
end do
!$OMP END PARALLEL DO
resultado%uno = uno/nthreads
I couldn't use REDUCTION(+:resultado%uno) nor REDUCTION(+:resultado) in this respect, only numeric types are allowed.
The disadvantage of this approach, IMO, is that one has to dimension the derived tipe resultadoOmp with the number of threads. The advantage is that one avoids the CRITICAL clause that could affect the performance, I am right?
The disadvantage of this approach, IMO, is that one has to dimension the derived tipe resultadoOmp with the number of threads. The advantage is that one avoids the CRITICAL clause that could affect the performance, I am right?
Yes, you are right. It looks like you are dimensioning resultadoOmp with the number of threads anyway, so it is not really a disadvantage? Performance should indeed be better with the second part, though the two parallel regions might eat up this advantage again. Thus, you should only use a single parallel region for both parts. Depending on the running time of do_something I might even ignore parallelism for the reduction operation completely and just do a sum on a single thread after computing all uno entries in parallel:
!$OMP PARALLEL DO shared(large) private(i_omp) schedule(static,1)
do i_omp=1, nthreads
call do_something(large, resultadoOmp(i_omp))
end do
!$OMP END PARALLEL DO
resultado%uno = sum(resultadoOmp(:)%uno)/nthreads
You will need to measure the various implementations with your actual setup to draw a conclusion.