OpenMP parallel do read write race condition? - fortran

I am a little bit confused about the race conditions that can occur in OpenMP.
Specifically, I have two arrays A and B that contain data; I wish to use the data in one, compute something, and store the result in the other.
My Fortran code would look like this:
!$OMP PARALLEL DO PRIVATE(tmp)
DO i = 1, 10000
tmp = A(i) !!Extract A(i)
data = Do_Stuff(tmp) !!Compute
B(i)=data !!Store
END DO
!$OMP END PARALLEL DO
Are there any lurking race conditions here?
I'm asking because, on pages 11-12 of the introduction I'm reading, the code below is said to have this problem, even though the index i is different for every iteration.
!$OMP PARALLEL DO
do i = 1, 1000
B(i) = 10 * i
A(i) = A(i) + B(i)
end do
!$OMP END PARALLEL DO

There is a race condition in your first example.
The variable data is not explicitly given a data-sharing attribute and does not have a predetermined one, so in a parallel construct it is shared. Multiple threads read and write the single shared copy, so the value stored in B(i) may come from another thread's iteration.
There is no such race in your second example: each iteration reads and writes only its own elements A(i) and B(i).
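A minimal fix along those lines is to give every temporary an explicit attribute. The sketch below assumes Do_Stuff has no side effects, and adds DEFAULT(NONE) so the compiler forces a decision for each variable:
!$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(i, tmp, data) SHARED(A, B)
DO i = 1, 10000
   tmp = A(i)           !!Each thread has its own tmp
   data = Do_Stuff(tmp) !!...and its own data
   B(i) = data          !!Distinct elements of B, so no conflict
END DO
!$OMP END PARALLEL DO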

Get the maximum value among OpenMP threads in Fortran

I have a do loop that updates the value of T and computes the maximum change during the iteration, called dumax.
I need to initialize dumax, so I made it firstprivate. But then I cannot also use reduction(max: dumax), since the reduction clause does not seem to accept a variable that is already private.
How, then, can I get the maximum value of dumax across all threads before the parallel region ends?
My program is shown below:
DUMAX=0.0D0
!$OMP PARALLEL DEFAULT(PRIVATE), SHARED(T_R, T_B), FIRSTPRIVATE(DUMAX)
!$OMP DO
DO I=2, N-1, 2
   DO J=2, N-1, 2
      T_OLD=T_B(I,J)
      T_B(I,J)=0.25*(T_R(I,J-1)+T_R(I,J+1)+T_R(I+1,J)+&
               T_R(I-1,J)-DX**2*S(I,J))
      DUMAX=MAX(DUMAX, ABS(T_OLD-T_B(I,J)))
   END DO
END DO
!$OMP END DO
!$OMP END PARALLEL
You should not make dumax firstprivate. Reduction variables must be shared in the enclosing parallel region: make it shared and use reduction(max: dumax) on the worksharing loop. Your initialisation before the parallel region will then be kept.
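Applied to the loop in the question, that looks roughly like this (a sketch; S, DX and N from the loop body presumably also need to be shared rather than left under DEFAULT(PRIVATE)):
DUMAX=0.0D0
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(T_R, T_B, S, DX, N, DUMAX)
!$OMP DO REDUCTION(MAX: DUMAX)
DO I=2, N-1, 2
   DO J=2, N-1, 2
      T_OLD=T_B(I,J)
      T_B(I,J)=0.25*(T_R(I,J-1)+T_R(I,J+1)+T_R(I+1,J)+&
               T_R(I-1,J)-DX**2*S(I,J))
      DUMAX=MAX(DUMAX, ABS(T_OLD-T_B(I,J)))
   END DO
END DO
!$OMP END DO
!$OMP END PARALLEL
! DUMAX now holds the maximum over all threads, combined with its initial value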

Calling subroutine in parallel environment

I think my problem is related or even identical to the problem described here. But I don't understand what's actually happening.
I'm using OpenMP with the gfortran compiler and I have the following task: I have a density distribution F(X, Y) on a two-dimensional surface with x-coordinates X and y-coordinates Y. The matrix F has size Nx x Ny.
I now have a set of coordinates Xp(i) and Yp(i) and I need to interpolate the density F onto these points. This problem is made for parallelization.
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
do i=1, Nmax
! Some stuff to be done here
Fint(i) = interp2d(Xp(i), Yp(i), X, Y, F, Nx, Ny)
! Some other stuff to be done here
end do
!$OMP END PARALLEL DO
Everything is shared except for i. The function interp2d is doing some simple linear interpolation.
That works fine with one thread but fails with multithreading. I traced the problem down to the hunt subroutine taken from Numerical Recipes, which is called by interp2d. The hunt subroutine basically calculates the index ix such that X(ix) <= Xp(i) < X(ix+1), which is needed as the starting point for the interpolation.
With multithreading it happens every now and then that one thread gets the correct index ix from hunt, while the thread that calls hunt next gets exactly the same index, even though its Xp(i) is not even close to that point.
I can prevent this by using the CRITICAL construct:
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
do i=1, Nmax
! Some stuff to be done here
!$OMP CRITICAL
Fint(i) = interp2d(Xp(i), Yp(i), X, Y, F, Nx, Ny)
!$OMP END CRITICAL
! Some other stuff to be done here
end do
!$OMP END PARALLEL DO
But this decreases the efficiency. If I use, for example, three threads, I get a load average of 1.5 with the CRITICAL construct. Without it I get a load average of 2.75, but wrong results and sometimes even a SIGSEGV runtime error.
What exactly is happening here? It seems to me that all the threads are calling the same hunt subroutine, and if they do so at the same time there is a conflict. Does that make sense?
How can I prevent this?
Combining variable declaration and initialisation in Fortran 90+ has the side effect of giving the variable the SAVE attribute.
integer :: i = 0
is roughly equivalent to:
integer, save :: i
if (first_invocation) then
i = 0
end if
SAVE'd variables retain their value between multiple invocations of the routine and are therefore often implemented as static variables. By the rules governing the implicit data sharing classes in OpenMP, such variables are shared unless listed in a threadprivate directive.
OpenMP mandates that compliant compilers should apply the above semantics even when the underlying language is Fortran 77.
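In other words, the cure is to remove the initialiser from the declaration inside hunt (or any routine called from the parallel region) and assign the value at run time instead, or to declare the variable threadprivate. Below is a hedged sketch of the pattern, using a simplified bisection search in place of the real Numerical Recipes routine and a hypothetical name locate_index:
subroutine locate_index(xx, n, x, jlo)
   ! Find jlo such that xx(jlo) <= x < xx(jlo+1), assuming xx is ascending.
   implicit none
   integer, intent(in) :: n
   double precision, intent(in) :: xx(n), x
   integer, intent(out) :: jlo
   integer :: jm, jhi   ! note: no "= 0" initialisers, so no implicit SAVE
   jlo = 0              ! assigned here instead, on every call
   jhi = n + 1
   do while (jhi - jlo > 1)
      jm = (jhi + jlo) / 2
      if (x >= xx(jm)) then
         jlo = jm
      else
         jhi = jm
      end if
   end do
end subroutine locate_index
Because every local is now an ordinary automatic variable, each thread calling the routine gets its own copy and the race disappears.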

OpenMP race condition (Fortran 77 w/ COMMON block)

I am trying to parallelise some legacy Fortran code with OpenMP.
Checking for race conditions with Intel Inspector, I have come across a problem in the following code (simplified, tested example):
PROGRAM TEST
!$ use omp_lib
implicit none
DOUBLE PRECISION :: x,y,z
COMMON /firstcomm/ x,y,z
!$OMP THREADPRIVATE(/firstcomm/)
INTEGER :: i
!$ call omp_set_num_threads(3)
!$OMP PARALLEL DO
!$OMP+ COPYIN(/firstcomm/)
!$OMP+ PRIVATE(i)
do i=1,3000
z = 3.D0
y = z+log10(z)
x=y+z
enddo
!$OMP END PARALLEL DO
END PROGRAM TEST
Intel Inspector detects a race condition between the following lines:
!$OMP PARALLEL DO (read)
z = 3.D0 (write)
The Inspector "Disassembly" view offers the following about the two lines, respectively (I do not understand much about these, apart from the fact that the memory addresses in both lines seem to be different):
0x3286 callq 0x2a30 <memcpy>
0x3338 movq %r14, 0x10(%r12)
As in my main application, the problem occurs for one (or some) of the variables in the common block, but not for others that are treated in what appears to be the same way.
Can anyone spot my mistake, or is this race condition a false positive?
I am aware that the use of COMMON blocks, in general, is discouraged, but I am not able to change this for the current project.
Technically speaking, your example code is incorrect, since you use COPYIN to initialise the threadprivate copies with data from an uninitialised COMMON block. But that is not the reason for the data race - adding a DATA statement or simply assigning to x, y, and z before the parallel region does not change the outcome.
This is either a (very old) bug in the Intel Fortran Compiler, or Intel interprets the text of the OpenMP standard (section 2.15.4.1 of the current version) strangely:
The copy is done, as if by assignment, after the team is formed and prior to the start of execution of the associated structured block.
Intel implements the "prior to the start of execution of the associated structured block" part by inserting a memcpy at the beginning of the outlined procedure. In other words:
!$OMP PARALLEL DO COPYIN(/firstcomm/)
do i = 1, 3000
...
end do
!$OMP END PARALLEL DO
becomes (in a mixture of Fortran and pseudo-code):
par_region0:
   my_firstcomm = get_threadprivate_copy(/firstcomm/)
   if (my_firstcomm != firstcomm) then
      memcpy(my_firstcomm, firstcomm, size of firstcomm)
   end if
   // Actual implementation of the DO worksharing construct
   call determine_iterations(1, 3000, low_it, high_it)
   do i = low_it, high_it
      ...
      ... my_firstcomm used here instead of firstcomm
      ...
   end do
   call openmp_barrier
end par_region0

MAIN:
   // Prepare a parallel region with 3 threads
   // and fire the outlined code in the worker threads
   call start_parallel_region(3, par_region0)
   // Fire the outlined code in the master thread
   call par_region0
   call end_parallel_region
The outlined procedure first finds the address of the threadprivate copy of the common block, then compares that address to the address of the common block itself. If both addresses match, then the code is being executed in the master thread and no copy is needed, otherwise memcpy is called to make a bitwise copy of the master's data into the threadprivate block.
Now, one would expect that there should be a barrier at the end of the initialisation part and right before the start of the loop, and although Intel employees claim that there is one, there is none (tested with ifort 11.0, 14.0, and 16.0). Even more, the Intel Fortran Compiler does not honour the list of variables in the COPYIN clause and copies the entire common block if any variable contained in it is listed in the clause, i.e. COPYIN(x) is treated the same as COPYIN(/firstcomm/).
Whether those are bugs or features of Intel Fortran Compiler, only Intel could tell. It could also be that I'm misreading the assembly output. If anyone could find the missing barrier, please let me know. One possible workaround would be to split the combined directive and insert an explicit barrier before the worksharing construct:
!$OMP PARALLEL COPYIN(/firstcomm/) PRIVATE(I)
!$OMP BARRIER
!$OMP DO
do i = 1, 3000
z = 3.D0
y = z+log10(z)
x = y+z
end do
!$OMP END DO
!$OMP END PARALLEL
With that change, the data race will shift into the initialisation of the internal dispatch table within the log10 call, which is probably a false positive.
GCC implements COPYIN differently. It creates a shared copy of the master thread's threadprivate data, which it then passes to the worker threads for use in the copy process.

OpenMP SIMD vectorization of nested loop

I am trying to vectorize a nested loop using OpenMP 4.0's simd feature, but I'm afraid I'm doing it wrong. My loop nest looks like this:
do iy = iyfirst, iylast
  do ix = ixfirst, ixlast
    !$omp simd
    do iz = izfirst, izlast
      dudx(iz,ix,iy) = ax(1)*( u(iz,ix,iy) - u(iz,ix-1,iy) )
      do ishift = 2, ophalf
        dudx(iz,ix,iy) = dudx(iz,ix,iy) + ax(ishift)*( u(iz,ix+ishift-1,iy) - u(iz,ix-ishift,iy) )
      enddo
      dudx(iz,ix,iy) = dudx(iz,ix,iy)*buoy_x(iz,ix,iy)
    enddo
    !$omp end simd
  enddo
enddo
Note that ophalf is a small integer, usually 2 or 4, so it makes sense to vectorize the iz loop and not the inner-most loop.
My question is: Do I have to mark ishift as a private variable?
In standard OpenMP parallel do loops, you certainly do need a private(ishift) to ensure other threads don't stomp over each other's data. Yet when I instead rewrite the first line as !$omp simd private(ishift), I get the ifort compilation error:
error #8592: Within a SIMD region, a DO-loop control-variable must not be specified in a PRIVATE SIMD clause. [ISHIFT]
Looking online, I couldn't find any successful resolution of this question. It seems to me that ishift should be private, but the compiler is not allowing it. Is an inner-loop variable automatically forced to be private?
Follow-up question: Later, when I add an omp parallel do around the iy loop, should I include a private(ishift) clause in the omp parallel do directive, the omp simd directive, or both?
Thanks for any clarifications.
In a SIMD context, the private clause essentially means that each SIMD lane gets its own copy of ishift within the SIMD register. That would make sense if the innermost loop (the ishift loop) were the one being vectorised, since ishift would then be the SIMD loop's induction variable. But here the iz loop is vectorised, so every SIMD lane holds a different value of iz, and for a given iz the ishift loop still runs from 2 to ophalf within each lane. Hence ishift does not qualify for a private clause in the SIMD context.
When it comes to multiple threads, you do want separate copies of ishift, so that one thread incrementing the variable does not make another thread skip an iteration. So a private(ishift) clause does make sense on the omp parallel do. It would also be interesting to check the generated code to see whether the inner ishift loop is completely unrolled and the iz loop vectorised.
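For the follow-up question, a hedged sketch of the combined version would put private(ishift) on the parallel do around the iy loop and leave the simd directive without a private clause; ix and iz are loop indices of sequential loops inside the parallel region and are therefore predetermined private anyway:
!$omp parallel do private(ishift)
do iy = iyfirst, iylast
  do ix = ixfirst, ixlast
    !$omp simd
    do iz = izfirst, izlast
      dudx(iz,ix,iy) = ax(1)*( u(iz,ix,iy) - u(iz,ix-1,iy) )
      do ishift = 2, ophalf
        dudx(iz,ix,iy) = dudx(iz,ix,iy) + ax(ishift)*( u(iz,ix+ishift-1,iy) - u(iz,ix-ishift,iy) )
      enddo
      dudx(iz,ix,iy) = dudx(iz,ix,iy)*buoy_x(iz,ix,iy)
    enddo
    !$omp end simd
  enddo
enddo
!$omp end parallel do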

Summation error in openmp fortran

I am trying to sum up a variable with OpenMP, using the code given below.
normr=0.0
!$omp parallel default(private) shared(nelem,normr,cell_data,alphar,betar,k)
!$omp do REDUCTION(+:normr)
do ii=1,nelem
  nnodese=cell_data(ii)%num_vertex
  pe=cell_data(ii)%porder
  ndofe=cell_data(ii)%ndof
  num_neighboure=cell_data(ii)%num_neighbour
  be=>cell_data(ii)%Force
  Ke=>cell_data(ii)%K
  Me=>cell_data(ii)%M
  pressuree=>cell_data(ii)%p
  Rese=>cell_data(ii)%Res
  neighbour_indexe=>cell_data(ii)%neighbour_index(:)
  Rese(:)=be(:)
  Rese(:)=Rese(:)-cmplx(-1.0,1.0*alphar/k)*matmul(Me(:,:),pressuree(:))
  Rese(:)=Rese(:)-cmplx(1.0,1.0*k*betar)*matmul(Ke(:,:),pressuree(:))
  do jj=1,num_neighboure
    nbeindex=neighbour_indexe(jj)
    Knbe=>cell_data(ii)%neighbour(jj)%Knb
    pressurenb=>cell_data(nbeindex)%p
    ndofnb=cell_data(nbeindex)%ndof
    Rese(:)=Rese(:)-cmplx(1.0,1.0*k*betar)*matmul(Knbe(:,:),pressurenb(:))
    nullify(pressurenb)
    nullify(Knbe)
  end do
  normr=normr+dot_product(Rese(:),Rese(:))
  nullify(pressuree)
  nullify(Ke)
  nullify(Me)
  nullify(Rese)
  nullify(neighbour_indexe)
  nullify(be)
end do
!$omp end do
!$omp end parallel
The result for the summed variable, normr, is different for the parallel and the sequential code. In one of the posts I have seen that the inner loop variable should be declared inside the parallel construct (why, I don't know). I also changed the pointers to locally allocated variables, but the result did not change. normr is a saved real variable.
Any suggestions and help will be appreciated.
Best Regards,
Gokmen
normr can be different for the parallel and the sequential code because the summation does not take place in the same order. Hence, the difference need not be an error; it is to be expected from a floating-point reduction.
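A tiny single-precision example (not taken from the question) shows why the order matters: floating-point addition is not associative, so grouping the same terms differently changes the rounding on typical IEEE hardware.
program fp_order
   implicit none
   real :: a, b, c
   a = 1.0e8
   b = -1.0e8
   c = 1.0
   print *, (a + b) + c   ! typically prints 1.0
   print *, a + (b + c)   ! typically prints 0.0: c is absorbed when added to b first
end program fp_order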
Not being an error does not necessarily mean not being a problem. One way around this would be to move the summation out of the parallel loop:
!$omp parallel default(private) shared(... keep_dot_product)
!$OMP do
do ii=1,nelem
! ...
keep_dot_product(ii) = dot_product(Rese(:),Rese(:))
! ...
end do
!$omp end do
!$omp end parallel
normr = sum(keep_dot_product)
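For this to work, keep_dot_product has to be a shared array with one element per iteration, allocated before the parallel region; a hypothetical declaration (not part of the original answer) could look like this:
real, allocatable :: keep_dot_product(:)   ! same kind as normr; the complex dot_product is assigned as in the original code
allocate(keep_dot_product(nelem))
Because the final sum is evaluated outside the parallel region, the summation order no longer depends on the number of threads or on the OpenMP schedule, so the result is reproducible from run to run.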