OpenMP accumulate array inside nested parallelism - fortran

I am converting an existing application to work with multiple threads and using OpenMP with nested parallelism for this purpose.
The code looks like this (Fortran):
!$omp parallel do private(array) ...
DO i=1...
   ...
C ---- plenty of code ----
   ...
!$omp parallel do private(z1,z2,z3,value)...
   DO j=1...
      ...
!$omp critical
      DO z1=..
         DO z2=..
            DO z3=..
               ...
               value = ...
               array(z1,z2,z3) = array(z1,z2,z3) + value
            END DO
         END DO
      END DO
!$omp end critical
   END DO
END DO
I added an OMP CRITICAL because the accumulation was not thread safe, but this is causing threads from other teams to wait unnecessarily.
What is the best way to parallelize this? Is there any way to make a reduction work in this case?
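One possible direction, as a sketch only: OpenMP for Fortran accepts whole arrays in a reduction clause, so the inner loop could accumulate into per-thread copies of array without a critical section. The program below is a minimal free-form illustration under made-up assumptions: the sizes n, ni and nj, the placeholder expression for val, and the checksum array are all hypothetical and only exist so the example can check itself. Nested parallelism also has to be enabled in the runtime for the inner region to get more than one thread.

```fortran
program array_reduction_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 4, ni = 8, nj = 16   ! hypothetical sizes
  real(dp) :: array(n, n, n), checksum(ni), val
  integer :: i, j, z1, z2, z3

  ! Outer team: every thread owns a private copy of array, as in the question.
  !$omp parallel do private(array, j, z1, z2, z3, val)
  do i = 1, ni
    array = 0.0_dp
    ! Inner team: reduction(+:array) gives each inner thread a zero-initialized
    ! copy and adds all copies into the outer thread's array after the loop.
    !$omp parallel do private(z1, z2, z3, val) reduction(+:array)
    do j = 1, nj
      do z3 = 1, n
        do z2 = 1, n
          do z1 = 1, n
            val = real(i + j, dp)   ! placeholder for the real computation
            array(z1, z2, z3) = array(z1, z2, z3) + val
          end do
        end do
      end do
    end do
    !$omp end parallel do
    checksum(i) = sum(array)
  end do
  !$omp end parallel do

  ! Every element equals sum over j of (i + j) = nj*i + nj*(nj+1)/2.
  do i = 1, ni
    if (checksum(i) /= real(n**3 * (nj*i + nj*(nj + 1)/2), dp)) error stop "mismatch"
  end do
  print '(a)', 'ok'
end program array_reduction_sketch
```

Each thread's private copy of a reduction array usually lives on that thread's stack, so a large array may require a bigger stack size; whether this beats the critical section is something to measure.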

Related

Is there a way to completely stop all calculations on a thread?

Explanation of code and approach:
There are various mathematical methods (Fortran subroutines) to solve for a variable y; each method is sequential and runs on a single thread. The speed of each method's solution depends on unknown conditions (i.e. it is a no-free-lunch situation and I do not know in advance which method is fastest). Therefore, the approach is to run each method on a separate thread, and once one method has found the solution, calculations on the other threads should stop (as those threads are required for operations after the parallel sections region).
!$omp parallel sections lastprivate(x, y)
!$omp section
call method_1_for_solving_y(x)
!$omp cancel sections
!$omp section
call method_2_for_solving_y(x)
!$omp cancel sections
. . .
!$omp section
call method_z_for_solving_y(x)
!$omp cancel sections
!$omp end parallel sections
The question:
The !$omp cancel sections construct does not completely cancel all operations on the threads that have not found the solution yet. Is there a way to completely stop calculations on those threads?
Any additional advice, or possible other approaches would be appreciated.
Regards.
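For context, cancellation in OpenMP is cooperative: it only takes effect when the OMP_CANCELLATION environment variable is set to true, and the other threads only notice it when they reach a cancellation point, so a long sequential solver that never reaches one keeps running. One alternative that does not rely on cancellation is a shared flag that each method polls. The sketch below assumes the methods are iterative and can check the flag once per step; solver, done, winner and the step counts are hypothetical stand-ins, not the real methods.

```fortran
program first_finisher_sketch
  implicit none
  logical :: done      ! shared flag: set once a method has found the solution
  integer :: winner    ! id of the method that finished first

  done = .false.
  winner = 0

  !$omp parallel sections shared(done, winner)
  !$omp section
  call solver(1, 1000)        ! hypothetical fast method
  !$omp section
  call solver(2, 20000000)    ! hypothetical slow method
  !$omp end parallel sections

  if (winner < 1 .or. winner > 2) error stop "no method finished"
  print '(a,i0)', 'solution found by method ', winner

contains

  subroutine solver(id, nsteps)
    integer, intent(in) :: id, nsteps
    integer :: step
    logical :: stop_now

    do step = 1, nsteps
      ! Poll the shared flag; leave as soon as another method has finished.
      !$omp atomic read
      stop_now = done
      if (stop_now) return
      ! ... one step of the actual method would go here ...
    end do

    ! Only the first finisher records its result and raises the flag.
    !$omp critical (claim_result)
    if (.not. done) then
      winner = id
      !$omp atomic write
      done = .true.
    end if
    !$omp end critical (claim_result)
  end subroutine solver

end program first_finisher_sketch
```

How quickly the other threads stop depends on how often they poll, so the check belongs in a loop that runs frequently but is not so hot that the atomic read dominates the cost.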

programming issue with openmp

I am having issues with openmp, described as follows:
I have the serial code like this
subroutine ...
...
do i=1,N
....
end do
end subroutine ...
and the openmp code is
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end parallel do
end subroutine ...
No issues with compiling, however when I run the program, there are two major issues compared to the result of serial code:
The program runs even slower than the serial code (the do-loop is supposed to do matrix multiplications with matmul).
The numerical accuracy seems to have dropped compared to the serial code (I have a check for it).
Any ideas what might be going on?
Thanks,
Xiaoyu
When parallelizing with OpenMP, you can control the number of threads your program uses. One way is the environment variable OMP_NUM_THREADS, e.g. calling your program by means of
OMP_NUM_THREADS=5 ./myprogram
to execute it using 5 threads.
Alternatively, you may set the number of threads at runtime by calling omp_set_num_threads.
Side Notes
Don't forget to set private variables, if there are any within the loop!
Example:
!$omp parallel do private(prelimRes)
do i = 1, N
prelimRes = myFunction(i)
res(i) = prelimRes + someValue
end do
!$omp end parallel do
Note how the variable prelimRes is declared private so that every thread has its own workspace.
Depending on what you actually do within the loop (e.g. use OpenBLAS), your results may indeed vary (variations should be smaller than 1e-8 for double precision variables), because the parallel run processes the data in a different order.
If you are unsure about what is happening, you should check the CPU load using htop or a similar program while your program is running.
Addendum: Setting the number of threads to automatically match the number of CPUs
If you would like to use the maximum number of useful threads, e.g. use as many threads as there are CPUs, you can do so by using (just like you stated in your question):
subroutine ...
   use omp_lib
   ...
   call omp_set_num_threads(omp_get_num_procs())
   !$omp parallel do
   do i=1,N
      ....
   end do
   !$omp end parallel do
end subroutine ...

Nesting OMP DO directives - Fortran

I'm having problems trying to nest an OMP DO directive inside another OMP DO directive in Fortran.
Here's the code:
DO in=2,n_niveles
allocate(cvalor(2,npuntosp(in),npuntost(in)))
!allocate(avalor(2,npuntosp(in-1),npuntost(in-1)))
allocate(valor_t2(npuntost(in),npuntosp(in-1),2))
!$OMP PARALLEL NUM_THREADS(hilos) DEFAULT(PRIVATE) FIRSTPRIVATE(n_niveles,in) SHARED(npuntosp,npuntost,cubos,central_reg,sumazm1n,expo,mphi,mtheta)
!$OMP DO SCHEDULE(STATIC)
DO aux=1,cubos(in-1)%ncubos_nivel
...
(some code here)
...
!$OMP PARALLEL NUM_THREADS(hilos) DEFAULT(PRIVATE) FIRSTPRIVATE(cuboj,in) SHARED(valor_t2,cvalor)
!$OMP DO SCHEDULE(STATIC)
do i=1,npuntost(in)
val=mtheta(in-1)%inicio(i,1)
do jj=val,val+mtheta(in-1)%inicio(i,2)
do k=1,npuntosp(in-1)
valor_t2(i,k,1)=valor_t2(i,k,1)+mtheta(in-1)%matriz(i,jj)*sumazm1n(in-1)%region(cuboj)%valor(1,k,jj)
valor_t2(i,k,2)=valor_t2(i,k,2)+mtheta(in-1)%matriz(i,jj)*sumazm1n(in-1)%region(cuboj)%valor(2,k,jj)
end do
end do
do k=1,npuntosp(in)
val=mphi(in-1)%inicio(k,1)
do jj=val,val+mphi(in-1)%inicio(k,2)
cvalor(1,k,i)=cvalor(1,k,i)+valor_t2(i,jj,1)*mphi(in-1)%matriz(jj,k)
cvalor(2,k,i)=cvalor(2,k,i)+valor_t2(i,jj,2)*mphi(in-1)%matriz(jj,k)
end do
end do
end do
!$OMP END DO
!$OMP END PARALLEL
...
(some code here)
...
END DO
!$OMP END DO
!$OMP END PARALLEL
deallocate(cvalor)
deallocate(valor_t2)
END DO
When the code is executed, an access violation exception occurs inside the second OpenMP parallel region. Sometimes an overflow in the variable valor_t2 is reported instead.
Maybe OpenMP does not support this kind of parallelization, but I've searched the net and didn't find anything about it. I know that OpenMP supports nesting several OMP PARALLEL directives one inside another, and I know how that works. But I'm having a headache with this problem.
Any ideas about what is happening?
Thank you so much!
You're going to want to use the collapse clause in the do loop at the top level. See the link below for information:
https://computing.llnl.gov/tutorials/openMP/
As long as the code represented by (some code here) doesn't contain any loops, this should work.
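As a minimal illustration of the collapse clause, assuming the two loops can be written as a perfect nest with no statements between the two do lines (which may require restructuring the code in the question), the following sketch merges both loops into one iteration space that is divided among a single team. The array a and the sizes n_outer and n_inner are hypothetical.

```fortran
program collapse_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n_outer = 6, n_inner = 50   ! hypothetical sizes
  real(dp) :: a(n_inner, n_outer)
  integer :: i, k

  ! collapse(2) fuses the k and i loops into a single iteration space of
  ! n_outer*n_inner iterations, so no nested parallel region is needed.
  !$omp parallel do collapse(2) schedule(static)
  do k = 1, n_outer
    do i = 1, n_inner
      a(i, k) = real(i, dp) * real(k, dp)
    end do
  end do
  !$omp end parallel do

  ! sum over i and k of i*k = (n_inner*(n_inner+1)/2) * (n_outer*(n_outer+1)/2)
  if (sum(a) /= real((n_inner*(n_inner + 1)/2) * (n_outer*(n_outer + 1)/2), dp)) &
    error stop "mismatch"
  print '(a)', 'ok'
end program collapse_sketch
```

Each (k, i) pair writes its own element of a here. In the code from the question, every outer iteration updates the same shared valor_t2 and cvalor arrays, so those would still need to be private per outer iteration or reduced.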

Summation error in openmp fortran

I am trying to sum up a variable with OpenMP using the code given below.
normr=0.0
!$omp parallel default(private) shared(nelem,normr,cell_data,alphar,betar,k)
!$omp do REDUCTION(+:normr)
do ii=1,nelem
nnodese=cell_data(ii)%num_vertex
pe=cell_data(ii)%porder
ndofe=cell_data(ii)%ndof
num_neighboure=cell_data(ii)%num_neighbour
be=>cell_data(ii)%Force
Ke=>cell_data(ii)%K
Me=>cell_data(ii)%M
pressuree=>cell_data(ii)%p
Rese=>cell_data(ii)%Res
neighbour_indexe=>cell_data(ii)%neighbour_index(:)
Rese(:)=be(:)
Rese(:)=Rese(:)-cmplx(-1.0,1.0*alphar/k)*matmul(Me(:,:),pressuree(:))
Rese(:)=Rese(:)-cmplx(1.0,1.0*k*betar)*matmul(Ke(:,:),pressuree(:))
do jj=1,num_neighboure
nbeindex=neighbour_indexe(jj)
Knbe=>cell_data(ii)%neighbour(jj)%Knb
pressurenb=>cell_data(nbeindex)%p
ndofnb=cell_data(nbeindex)%ndof
Rese(:)=Rese(:)-cmplx(1.0,1.0*k*betar)*matmul(Knbe(:,:),pressurenb(:))
nullify(pressurenb)
nullify(Knbe)
end do
normr=normr+dot_product(Rese(:),Rese(:))
nullify(pressuree)
nullify(Ke)
nullify(Me)
nullify(Rese)
nullify(neighbour_indexe)
nullify(be)
end do
!$omp end do
!$omp end parallel
The result for the summed variable, normr, is different for the parallel and sequential code. In one of the posts I have seen that the inner loop variable should be defined inside the parallel construct (why, I don't know). I also changed the pointers to locally allocated variables, but the result did not change. normr is a saved real variable.
Any suggestions and helps will be appreciated.
Best Regards,
Gokmen
normr can be different for the parallel and the sequential code, because the summation does not take place in the same order. Hence, the difference does not need to be an error and can be expected from the reduction operation.
Not being an error does not necessarily mean not being a problem. One way around this would be to move the summation out of the parallel loop:
!$omp parallel default(private) shared(... keep_dot_product)
!$OMP do
do ii=1,nelem
! ...
keep_dot_product(ii) = dot_product(Rese(:),Rese(:))
! ...
end do
!$omp end do
!$omp end parallel
normr = sum(keep_dot_product)
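To make the effect visible, here is a small self-contained sketch that computes the same sum twice: once with a reduction clause, where the order of the additions depends on how iterations are distributed over threads, and once by storing each contribution and summing afterwards on one thread. The series 1/ii**2 and the names contrib, norm_reduction and norm_ordered are made up for the illustration.

```fortran
program ordered_sum_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nelem = 100000   ! hypothetical problem size
  real(dp), allocatable :: contrib(:)
  real(dp) :: norm_reduction, norm_ordered
  integer :: ii

  allocate(contrib(nelem))
  norm_reduction = 0.0_dp

  !$omp parallel do reduction(+:norm_reduction)
  do ii = 1, nelem
    contrib(ii) = 1.0_dp / real(ii, dp)**2             ! stand-in for dot_product(Rese, Rese)
    norm_reduction = norm_reduction + contrib(ii)      ! summation order depends on threads
  end do
  !$omp end parallel do

  ! Summed on a single thread after the loop: independent of the thread count.
  norm_ordered = sum(contrib)

  print *, 'reduction :', norm_reduction
  print *, 'ordered   :', norm_ordered
  print *, 'difference:', abs(norm_reduction - norm_ordered)

  ! Both results should agree up to floating-point rounding.
  if (abs(norm_reduction - norm_ordered) > 1.0e-9_dp) error stop "unexpectedly large difference"
end program ordered_sum_sketch
```

Any difference printed should be at the level of rounding error. If the parallel and sequential results in the real code differ by much more than that, a data race (for example a variable that should be private) is a more likely cause than the summation order.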

PARALLEL DO with or without CRITICAL?

Focusing on the parallel part of the code, which of the options presented below is preferred? Is there any better solution? I am trying to compute an average of independent realizations of do_something.
Option 1: Using CRITICAL
resultado%uno = 0.d0
!$OMP PARALLEL DO shared(large) private(i_omp) schedule(static,1)
do i_omp=1, nthreads
call do_something(large, resultadoOmp(i_omp))
!$OMP CRITICAL (forceloop)
resultado%uno = resultado%uno + resultadoOmp(i_omp)%uno
!$OMP END CRITICAL (forceloop)
enddo
!$OMP END PARALLEL DO
resultado%uno = resultado%uno/nthreads
Option 2: Avoiding CRITICAL (and ATOMIC)
!$OMP PARALLEL DO shared(large) private(i_omp) schedule(static,1)
do i_omp=1, nthreads
call do_something(large, resultadoOmp(i_omp))
enddo
!$OMP END PARALLEL DO
uno = 0.d0
!$OMP PARALLEL DO shared(resultado) private(i_omp) schedule(static,1) &
!$OMP & REDUCTION(+:uno)
do i_omp=1, nthreads
uno = uno + resultadoOmp(i_omp)%uno
end do
!$OMP END PARALLEL DO
resultado%uno = uno/nthreads
I couldn't use REDUCTION(+:resultado%uno) nor REDUCTION(+:resultado) here, since only numeric types are allowed.
The disadvantage of this approach, IMO, is that one has to dimension the derived type resultadoOmp with the number of threads. The advantage is that one avoids the CRITICAL construct that could affect the performance, am I right?
Yes, you are right. It looks like you are dimensioning resultadoOmp with the number of threads anyway, so it is not really a disadvantage? Performance should indeed be better with the second part, though the two parallel regions might eat up this advantage again. Thus, you should only use a single parallel region for both parts. Depending on the running time of do_something I might even ignore parallelism for the reduction operation completely and just do a sum on a single thread after computing all uno entries in parallel:
!$OMP PARALLEL DO shared(large) private(i_omp) schedule(static,1)
do i_omp=1, nthreads
call do_something(large, resultadoOmp(i_omp))
end do
!$OMP END PARALLEL DO
resultado%uno = sum(resultadoOmp(:)%uno)/nthreads
You will need to measure the various implementations with your actual setup to draw a conclusion.
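For comparison, a single parallel region can also do both steps at once by reducing into a plain scalar and using a private scratch variable of the derived type, which avoids the array dimensioned by the number of threads. This is only a sketch: result_t, the body of do_something, nruns and partial are hypothetical stand-ins for the real types and routine.

```fortran
program average_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  type :: result_t            ! hypothetical stand-in for the real derived type
    real(dp) :: uno
  end type result_t

  integer, parameter :: nruns = 8   ! number of independent realizations
  type(result_t) :: resultado, partial
  real(dp) :: uno
  integer :: i

  uno = 0.0_dp
  ! The reduction works on the plain scalar uno; partial is a per-thread scratch.
  !$omp parallel do private(partial) reduction(+:uno) schedule(static,1)
  do i = 1, nruns
    call do_something(i, partial)
    uno = uno + partial%uno
  end do
  !$omp end parallel do
  resultado%uno = uno / nruns

  ! With the placeholder routine the average of 1..8 is exactly 4.5.
  if (resultado%uno /= 4.5_dp) error stop "mismatch"
  print '(a)', 'ok'

contains

  subroutine do_something(run, res)   ! placeholder for the real computation
    integer, intent(in) :: run
    type(result_t), intent(out) :: res
    res%uno = real(run, dp)
  end subroutine do_something

end program average_sketch
```

As with any reduction, the order of the additions is not fixed, so the last bits of the result can vary between runs when the contributions are not exactly representable.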