Get the maximum value among OpenMP threads in Fortran

I have a do loop which updates the value of T and computes dumax, the maximum difference between the old and new values during the iteration.
Since dumax needs to be initialized, I made it firstprivate. But then I can no longer use reduction(max: dumax), because the reduction clause does not accept a variable that is already firstprivate.
So how can I get the maximum value of dumax across all threads before the end of the parallel region?
My program is shown below:
DUMAX=0.0D0
!$OMP PARALLEL DEFAULT(PRIVATE), SHARED(T_R, T_B), FIRSTPRIVATE(DUMAX)
!$OMP DO
DO I=2, N-1, 2
   DO J=2, N-1, 2
      T_OLD=T_B(I,J)
      T_B(I,J)=0.25*(T_R(I,J-1)+T_R(I,J+1)+T_R(I+1,J)+ &
               T_R(I-1,J)-DX**2*S(I,J))
      DUMAX=MAX(DUMAX, ABS(T_OLD-T_B(I,J)))
   END DO
END DO
!$OMP END DO
!$OMP END PARALLEL

You should not make dumax firstprivate. A variable in a reduction clause must be shared in the enclosing parallel region: each thread then works on its own private copy, and at the end of the region the copies are combined with the reduction operator (here MAX) into the shared variable, together with its value from before the region. So declare dumax as shared and use reduction(max: dumax); your initialization is kept.
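A minimal sketch of the corrected directives, keeping the loop body unchanged. Note that with DEFAULT(PRIVATE), the values N, DX and S used inside the loop must also be made shared (or firstprivate), otherwise the threads see private, undefined copies of them:

DUMAX=0.0D0
!$OMP PARALLEL DEFAULT(PRIVATE), SHARED(T_R, T_B, S, N, DX, DUMAX)
!$OMP DO REDUCTION(MAX: DUMAX)
DO I=2, N-1, 2
   DO J=2, N-1, 2
      T_OLD=T_B(I,J)
      T_B(I,J)=0.25*(T_R(I,J-1)+T_R(I,J+1)+T_R(I+1,J)+ &
               T_R(I-1,J)-DX**2*S(I,J))
      DUMAX=MAX(DUMAX, ABS(T_OLD-T_B(I,J)))
   END DO
END DO
!$OMP END DO
!$OMP END PARALLEL

Inside the region each thread starts from a private copy of DUMAX initialized to the identity value of MAX, so the MAX intrinsic works unchanged; on exit the per-thread maxima are combined with the initial 0.0D0 into the shared DUMAX.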

Related

Fortran & OpenMP: How to declare and allocate an allocatable THREADPRIVATE array

I had a serial code where I would declare a bunch of variables in modules and then use those modules across the rest of my program and subroutines. Now I am trying to parallelize this code. There is a portion of the code that I want to run in parallel, which seems to be working except for one array, gtmp. I want each thread to have its own private version of gtmp, so I've used the threadprivate directive. gtmp is only used inside the parallel region or within subroutines that are called only from the parallel part of the code.
At first I allocated gtmp in a serial portion of the code before the parallel region, but that was a problem: only the master thread's 'version' of gtmp got allocated, and the other threads' 'versions' had a size of 1 rather than the expected allocated size (this was shown by the "test" print statement below). I think this happened because the master thread is the only thread executing code in the serial portions. So I moved the allocate line into the parallel region, which gave all threads appropriately sized and allocated gtmp arrays, but since my parallel region is inside a loop, I get an error when the program tries to allocate gtmp a second time, in the second iteration of the r loop.
Note: elsewhere in the code all the other variables in mymod are given values.
Here is a simplified portion of the code that is having the issue:
module mymod
   integer :: xBins, zBins, rBins, histCosThBins, histPhiBins, cfgRBins
   real(kind=dp), allocatable :: gtmp(:,:,:)   ! dp is a kind parameter defined elsewhere
   !$omp THREADPRIVATE( gtmp )
end module mymod
subroutine compute_avg_force
   use mymod
   implicit none
   integer :: r, i, j, ip
   integer :: omp_get_thread_num, tid

   ! I used to allocate 'gtmp' here.
   do r = 1, cfgRBins
      !$omp PARALLEL DEFAULT( none ) &
      !$omp PRIVATE( ip, i, j, tid ) &
      !$omp SHARED( r, xBins, zBins, histCosThBins, histPhiBins )
      allocate( gtmp(4,0:histCosThBins+1,0:histPhiBins+1) )
      tid = omp_get_thread_num() !debug
      print*, 'test', tid, histCosThBins, histPhiBins, size(gtmp)
      !$omp DO SCHEDULE( guided )
      do ip = 1, (xBins*zBins)
         call subroutine_where_i_alter_gtmp(...)
         ...code to be executed in parallel using gtmp...
      end do !ip
      !$omp END DO
      !$omp END PARALLEL
   end do !r
end subroutine compute_avg_force
So the issue comes from the fact that I need all threads to be active (i.e. in a parallel region) to appropriately allocate all 'versions' of gtmp, but my parallel region is inside a loop and I can't allocate gtmp more than once.
In short, what is the correct way to allocate gtmp in this code? I've thought that I could just make another omp parallel region before the loop and use it only to allocate gtmp, but that seems clunky, so I'm wondering what the "right" way to do something like this is.
Thanks for the help!
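No answer is reproduced here, but to illustrate the "extra parallel region" idea from the question: a minimal sketch, assuming the THREADPRIVATE(gtmp) directive in mymod above. It relies on the rule that threadprivate data persists between parallel regions only while dynamic thread adjustment is disabled and the thread count stays the same:

   ! Sketch only: allocate every thread's copy of gtmp once, before the r loop.
   call omp_set_dynamic( .false. )   ! keep the thread team size fixed

   !$omp PARALLEL
   allocate( gtmp(4,0:histCosThBins+1,0:histPhiBins+1) )
   !$omp END PARALLEL

   do r = 1, cfgRBins
      ! ... the existing parallel region, minus the allocate statement ...
   end do !r

   !$omp PARALLEL
   deallocate( gtmp )   ! every thread frees its own copy
   !$omp END PARALLEL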

OpenMP SIMD vectorization of nested loop

I am trying to vectorize a nested loop using OpenMP 4.0's simd feature, but I'm afraid I'm doing it wrong. My loop nest looks like this:
do iy = iyfirst, iylast
  do ix = ixfirst, ixlast
    !$omp simd
    do iz = izfirst, izlast
      dudx(iz,ix,iy) = ax(1)*( u(iz,ix,iy) - u(iz,ix-1,iy) )
      do ishift = 2, ophalf
        dudx(iz,ix,iy) = dudx(iz,ix,iy) + ax(ishift)*( u(iz,ix+ishift-1,iy) - u(iz,ix-ishift,iy) )
      enddo
      dudx(iz,ix,iy) = dudx(iz,ix,iy)*buoy_x(iz,ix,iy)
    enddo
    !$omp end simd
  enddo
enddo
Note that ophalf is a small integer, usually 2 or 4, so it makes sense to vectorize the iz loop and not the inner-most loop.
My question is: Do I have to mark ishift as a private variable?
In standard OpenMP parallel do loops, you certainly do need a private(ishift) to ensure other threads don't stomp over each other's data. Yet when I instead rewrite the first line as !$omp simd private(ishift), I get the ifort compilation error:
error #8592: Within a SIMD region, a DO-loop control-variable must not be specified in a PRIVATE SIMD clause. [ISHIFT]
Looking online, I couldn't find any successful resolution of this question. It seems to me that ishift should be private, but the compiler is not allowing it. Is an inner-loop variable automatically forced to be private?
Follow-up question: Later, when I add an omp parallel do around the iy loop, should I include a private(ishift) clause in the omp parallel do directive, the omp simd directive, or both?
Thanks for any clarifications.
In a SIMD context, the private clause essentially means that the value of ishift is private to each SIMD lane within the SIMD register. That model fits when the innermost loop is vectorized, since ishift is then the loop induction variable. But when you vectorize an outer loop, every SIMD lane holds a different value of the iz loop index, while for any given iz the variable ishift still runs through all values from 2 to ophalf. It is therefore not per-lane data and does not qualify for a private clause in the SIMD context.
When it comes to multiple threads, you do want per-thread copies of ishift, so that one thread incrementing the variable does not make another thread skip an iteration. So a private(ishift) clause makes sense on the omp parallel do, not on the omp simd. It would also be interesting to check the generated code to see whether the ishift loop is completely unrolled and the iz loop vectorized.
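Putting the two together, a minimal sketch of the threaded and vectorized version this answer implies: private(ishift) goes on the thread-level directive only. (In Fortran, index variables of sequential loops inside the construct are predetermined private anyway, so listing ix, iz and ishift is mostly documentation.)

!$omp parallel do private(ix, iz, ishift)
do iy = iyfirst, iylast
  do ix = ixfirst, ixlast
    !$omp simd
    do iz = izfirst, izlast
      dudx(iz,ix,iy) = ax(1)*( u(iz,ix,iy) - u(iz,ix-1,iy) )
      do ishift = 2, ophalf
        dudx(iz,ix,iy) = dudx(iz,ix,iy) + ax(ishift)*( u(iz,ix+ishift-1,iy) - u(iz,ix-ishift,iy) )
      enddo
      dudx(iz,ix,iy) = dudx(iz,ix,iy)*buoy_x(iz,ix,iy)
    enddo
    !$omp end simd
  enddo
enddo
!$omp end parallel do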

OpenMP parallel do read write race condition?

I am a little bit confused about the race conditions that can occur in OpenMP.
Specifically, I have two arrays A and B that contain data; I wish to use the data in one, compute something, and store it in the other.
My Fortran code would look like this:
!$OMP PARALLEL DO PRIVATE(tmp,data)
DO i = 1, 10000
   tmp = A(i)           !! Extract A(i)
   data = Do_Stuff(tmp) !! Compute
   B(i) = data          !! Store
END DO
!$OMP END PARALLEL DO
Are there any lurking race conditions here?
I'm asking because on pages 11-12 of the introduction I'm reading, the code below is said to have this problem, even though the index i is different for all iterations.
!$OMP PARALLEL DO
do i = 1, 1000
   B(i) = 10 * i
   A(i) = A(i) + B(i)
end do
!$OMP END PARALLEL DO
As written, with PRIVATE(tmp,data), there is no race condition in your first example: each thread works on its own copies, and every iteration writes only its own element B(i). The PRIVATE clause is essential, though. The variable data has no predetermined data-sharing attribute, so without the clause it would be shared in the parallel construct, and multiple threads reading and writing it would be a race.
There is no such condition in your second example either: each iteration reads and writes only A(i) and B(i) at its own index i, and the loop index i is predetermined private.
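A sketch of a defensive variant (not from the original answer): DEFAULT(NONE) forces every variable to be given an explicit data-sharing attribute, so forgetting to privatize data becomes a compile-time error instead of a silent race:

!$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(tmp, data) SHARED(A, B)
DO i = 1, 10000
   tmp = A(i)           !! Extract A(i)
   data = Do_Stuff(tmp) !! Compute
   B(i) = data          !! Store
END DO
!$OMP END PARALLEL DO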

programming issue with openmp

I am having issues with OpenMP, described as follows:
I have serial code like this:
subroutine ...
   ...
   do i=1,N
      ....
   end do
end subroutine ...
and the OpenMP code is:
subroutine ...
   use omp_lib
   ...
   call omp_set_num_threads(omp_get_num_procs())
   !$omp parallel do
   do i=1,N
      ....
   end do
   !$omp end parallel do
end subroutine ...
It compiles without issues, but when I run the program there are two major problems compared to the serial result:
1. The program runs even slower than the serial code (the do loop essentially performs matrix multiplications with matmul).
2. The numerical accuracy seems to have dropped compared to the serial code (I have a check for it).
Any ideas what might be going on?
Thanks,
Xiaoyu
When parallelizing with OpenMP, you need to specify the number of threads your program should use. You can do so via the environment variable OMP_NUM_THREADS, e.g. by launching your program as
OMP_NUM_THREADS=5 ./myprogram
to execute it using 5 threads.
Alternatively, you may set the number of threads at runtime by calling omp_set_num_threads (see the documentation).
Side Notes
Don't forget to declare variables private if each thread needs its own copy within the loop!
Example:
!$omp parallel do private(prelimRes)
do i = 1, N
   prelimRes = myFunction(i)
   res(i) = prelimRes + someValue
end do
!$omp end parallel do
Note how the variable prelimRes is declared private so that every thread has its own workspace.
Depending on what you actually do within the loop (e.g. calls into OpenBLAS), your results may indeed vary slightly (differences should be smaller than about 1e-8 for double-precision variables), because parallel execution changes the order of floating-point operations.
If you are unsure about what is happening, you should check the CPU load using htop or a similar program while your program is running.
Addendum: Setting the number of threads to automatically match the number of CPUs
If you would like to use the maximum number of useful threads, e.g. as many threads as there are CPUs, you can do so (just as you stated in your question) with:
subroutine ...
   use omp_lib
   ...
   call omp_set_num_threads(omp_get_num_procs())
   !$omp parallel do
   do i=1,N
      ....
   end do
   !$omp end parallel do
end subroutine ...

Summation error in openmp fortran

I am trying to sum up a variable with OpenMP, using the code given below.
normr=0.0
!$omp parallel default(private) shared(nelem,normr,cell_data,alphar,betar,k)
!$omp do REDUCTION(+:normr)
do ii=1,nelem
   nnodese=cell_data(ii)%num_vertex
   pe=cell_data(ii)%porder
   ndofe=cell_data(ii)%ndof
   num_neighboure=cell_data(ii)%num_neighbour
   be=>cell_data(ii)%Force
   Ke=>cell_data(ii)%K
   Me=>cell_data(ii)%M
   pressuree=>cell_data(ii)%p
   Rese=>cell_data(ii)%Res
   neighbour_indexe=>cell_data(ii)%neighbour_index(:)
   Rese(:)=be(:)
   Rese(:)=Rese(:)-cmplx(-1.0,1.0*alphar/k)*matmul(Me(:,:),pressuree(:))
   Rese(:)=Rese(:)-cmplx(1.0,1.0*k*betar)*matmul(Ke(:,:),pressuree(:))
   do jj=1,num_neighboure
      nbeindex=neighbour_indexe(jj)
      Knbe=>cell_data(ii)%neighbour(jj)%Knb
      pressurenb=>cell_data(nbeindex)%p
      ndofnb=cell_data(nbeindex)%ndof
      Rese(:)=Rese(:)-cmplx(1.0,1.0*k*betar)*matmul(Knbe(:,:),pressurenb(:))
      nullify(pressurenb)
      nullify(Knbe)
   end do
   normr=normr+dot_product(Rese(:),Rese(:))
   nullify(pressuree)
   nullify(Ke)
   nullify(Me)
   nullify(Rese)
   nullify(neighbour_indexe)
   nullify(be)
end do
!$omp end do
!$omp end parallel
The result for the summed variable, normr, differs between the parallel and the sequential code. In one of the posts I have seen that the inner loop variable should be defined inside the parallel construct (why, I don't know). I also changed the pointers to locally allocated variables, but the result did not change. normr is a saved real variable.
Any suggestions and help will be appreciated.
Best Regards,
Gokmen
normr can be different for the parallel and the sequential code because the summation does not take place in the same order. Hence, the difference does not have to be an error; it is to be expected from the reduction operation.
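To see why the order matters, here is a tiny self-contained illustration (with made-up values): floating-point addition is not associative, so grouping partial sums differently changes the result.

program fp_order
   implicit none
   real :: a, b, c
   a = 1.0e8
   b = -1.0e8
   c = 1.0e-4
   ! (a + b) + c keeps c, but a + (b + c) loses it:
   ! adding 1.0e-4 to -1.0e8 in single precision leaves -1.0e8 unchanged.
   print *, '(a+b)+c =', (a + b) + c
   print *, 'a+(b+c) =', a + (b + c)
end program fp_order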
Not being an error does not necessarily mean not being a problem, though. One way around this would be to move the summation out of the parallel loop:
!$omp parallel default(private) shared(... keep_dot_product)
!$omp do
do ii=1,nelem
   ! ...
   keep_dot_product(ii) = dot_product(Rese(:),Rese(:))
   ! ...
end do
!$omp end do
!$omp end parallel
normr = sum(keep_dot_product)
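One assumption in the sketch above: keep_dot_product is not in the original code, so it has to be declared and allocated as a shared array with one slot per loop iteration before the parallel region, e.g.:

real, allocatable :: keep_dot_product(:)   ! one partial result per element
allocate( keep_dot_product(nelem) )

Since sum(keep_dot_product) then adds the per-element results in a fixed index order, the total no longer depends on the number of threads or the schedule.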