OpenMP race condition (Fortran 77 w/ COMMON block) - fortran

I am trying to parallelise some legacy Fortran code with OpenMP.
Checking for race conditions with Intel Inspector, I have come across a problem in the following code (simplified, tested example):
PROGRAM TEST
!$ use omp_lib
   implicit none
   DOUBLE PRECISION :: x, y, z
   COMMON /firstcomm/ x, y, z
!$OMP THREADPRIVATE(/firstcomm/)
   INTEGER :: i
!$ call omp_set_num_threads(3)
!$OMP PARALLEL DO
!$OMP+ COPYIN(/firstcomm/)
!$OMP+ PRIVATE(i)
   do i = 1, 3000
      z = 3.D0
      y = z + log10(z)
      x = y + z
   enddo
!$OMP END PARALLEL DO
END PROGRAM TEST
Intel Inspector detects a race condition between the following lines:
!$OMP PARALLEL DO (read)
z = 3.D0 (write)
The Inspector "Disassembly" view offers the following about the two lines, respectively (I do not understand much about these, apart from the fact that the memory addresses in both lines seem to be different):
0x3286 callq 0x2a30 <memcpy>
0x3338 movq %r14, 0x10(%r12)
As in my main application, the problem occurs for one (or some) of the variables in the common block, but not for others that are treated in what appears to be the same way.
Can anyone spot my mistake, or is this race condition a false positive?
I am aware that the use of COMMON blocks, in general, is discouraged, but I am not able to change this for the current project.

Technically speaking, your example code is incorrect since you are using COPYIN to initialise the threadprivate copies with data from an uninitialised COMMON block. But that is not the reason for the data race - adding a DATA statement, or simply assigning to x, y, and z before the parallel region, does not change the outcome.
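For reference, a DATA statement that removes the non-conformance (but, again, not the race) would be:
DOUBLE PRECISION :: x, y, z
COMMON /firstcomm/ x, y, z
DATA x, y, z /0.D0, 0.D0, 0.D0/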
As for the race itself, it is either a (very old) bug in the Intel Fortran Compiler, or Intel is interpreting the text of the OpenMP standard (section 2.15.4.1 of the current version) strangely:
The copy is done, as if by assignment, after the team is formed and prior to the start of execution of the associated structured block.
Intel implements the "prior to the start of execution of the associated structured block" part by inserting a memcpy at the beginning of the outlined procedure. In other words:
!$OMP PARALLEL DO COPYIN(/firstcomm/)
do i = 1, 3000
...
end do
!$OMP END PARALLEL DO
becomes (in a mixture of Fortran and pseudo-code):
par_region0:
   my_firstcomm = get_threadprivate_copy(/firstcomm/)
   if (my_firstcomm != firstcomm) then
      memcpy(my_firstcomm, firstcomm, size of firstcomm)
   end if
   // Actual implementation of the DO worksharing construct
   call determine_iterations(1, 3000, low_it, high_it)
   do i = low_it, high_it
      ...
      ... my_firstcomm used here instead of firstcomm
      ...
   end do
   call openmp_barrier
end par_region0

MAIN:
   // Prepare a parallel region with 3 threads
   // and fire the outlined code in the worker threads
   call start_parallel_region(3, par_region0)
   // Fire the outlined code in the master thread
   call par_region0
   call end_parallel_region
The outlined procedure first finds the address of the threadprivate copy of the common block, then compares that address to the address of the common block itself. If both addresses match, then the code is being executed in the master thread and no copy is needed, otherwise memcpy is called to make a bitwise copy of the master's data into the threadprivate block.
Now, one would expect a barrier at the end of the initialisation part, right before the start of the loop. Although Intel employees claim that there is one, there is none (tested with ifort 11.0, 14.0, and 16.0). Even more, the Intel Fortran Compiler does not honour the list of variables in the COPYIN clause and copies the entire common block if any variable contained in it is listed in the clause, i.e. COPYIN(x) is treated the same as COPYIN(/firstcomm/).
Whether those are bugs or features of Intel Fortran Compiler, only Intel could tell. It could also be that I'm misreading the assembly output. If anyone could find the missing barrier, please let me know. One possible workaround would be to split the combined directive and insert an explicit barrier before the worksharing construct:
!$OMP PARALLEL COPYIN(/firstcomm/) PRIVATE(i)
!$OMP BARRIER
!$OMP DO
   do i = 1, 3000
      z = 3.D0
      y = z + log10(z)
      x = y + z
   end do
!$OMP END DO
!$OMP END PARALLEL
With that change, the data race will shift into the initialisation of the internal dispatch table within the log10 call, which is probably a false positive.
GCC implements COPYIN differently. It creates a shared copy of the master thread's threadprivate data, which is then passed on to the worker threads for use in the copy process.
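In terms of the pseudo-code above, my reading of the GCC scheme (not verified against the libgomp sources) is roughly:
MAIN:
   // master snapshots its threadprivate copy into a shared buffer
   memcpy(shared_buf, get_threadprivate_copy(/firstcomm/), size of firstcomm)
   call start_parallel_region(3, par_region0, shared_buf)
   call par_region0(shared_buf)
   call end_parallel_region

par_region0(shared_buf):
   my_firstcomm = get_threadprivate_copy(/firstcomm/)
   if (my_firstcomm is not the master's copy) then
      memcpy(my_firstcomm, shared_buf, size of firstcomm)
   end if
   // worksharing loop and barrier as before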

Related

Fortran Openmp: Ghost threads interrupt application upon allocate

I am experiencing an issue which starts to drive me crazy. As part of a large software tool I am writing a Fortran module to perform astrodynamic computations.
The main functionality is contained in a subroutine which uses OpenMP and other subroutines to perform its tasks.
The workflow is as follows:
1) Preparations - read in files
2) Start the first parallel omp parallel do
3) Based on the working mode, the function might call itself recursively and process a separate part of the data (thereby executing a second omp do loop)
4) Combine the results into one large array of a derived type
Up to this point everything has been running as expected. Next is the part that fails:
5) Using the results from 4, start another parallel loop to continue the processing.
Step 5 sometimes works and sometimes stops with an access violation. This is independent of the optimization level selected (nominally we use O2, but it also happens with O0).
Hence I fired up Intel Inspector and looked for data races, deadlocks, and so on.
It reported 5 very strange data races, which make no sense to me, as the code locations are either at the definition/end of a subroutine, at a code line reading a threadprivate global variable, in an Intel MKL routine, or at a location reading a local allocatable array inside a subroutine that is called within the parallel region.
After reading up on things a bit, I enabled the recursive switch to force the local arrays onto the stack. Another Inspector analysis was interrupted by the same access violation. When loading the results, no errors were detected, but the tool warned that data may have been lost due to the abnormal termination.
I then tried to put the entire loop into an OMP CRITICAL - just as a test. Now the access violation disappeared, but the code got stuck very early: after processing 15/500 objects it just stops. This looked like a deadlock.
So I continued the search and commented out every OMP statement in the do loop of step 5.
Interestingly, the code still stops at iteration 15!
Hence I used the debugger to locate the call where the infinite waiting state occurs.
It is at the final allocate statement:
Subroutine doWork(t0, Pert, allAtmosData, considerArray, perigeeFirstTimeStepUncertainties)
   real(dp), intent(in)                                :: t0
   type (tPert), intent(in)                            :: Pert
   type (tThermosphereData), dimension(:), allocatable, intent(in) :: allAtmosData
   integer(i4), dimension(3), intent(in)               :: considerArray
   real(dp), dimension(:,:), allocatable, intent(out)  :: perigeeFirstTimeStepUncertainties

   ! Locals:
   real(dp), dimension(:), allocatable :: solarFluxPerigeeUnc, magneticIndexUnc, modelUnc
   integer(i4) :: dataPoints

   ! Allocate memory for the computations
   dataPoints = size(allAtmosData, 1)
   allocate(solarFluxPerigeeUnc(0:dataPoints-1), source=0.0_dp)
   allocate(magneticIndexUnc(0:dataPoints-1), source=0.0_dp)
   allocate(modelUnc(0:dataPoints-1), source=0.0_dp)

   ... Subroutine body ...

   ! Assign outputs
   allocate(perigeeFirstTimeStepUncertainties(3, 0:dataPoints-1), source=0.0_dp)
   perigeeFirstTimeStepUncertainties(MGN_SOLAR_FLUX_UNCERTAINTY,:) = solarFluxPerigeeUnc
   perigeeFirstTimeStepUncertainties(MGN_MAG_INDEX_UNCERTAINTY,:)  = magneticIndexUnc
   perigeeFirstTimeStepUncertainties(MGN_MODEL_UNCERTAINTY,:)      = modelUnc
End Subroutine doWork
The subroutine works perfectly in other places of the overall program and also for the first 14 iterations of step 5. In the 15th iteration, however, it either gets stuck at the last allocate or - as I have just managed to reproduce - results in an access violation.
When I pause the code using the debugger, I see that somehow the OpenMP library has been loaded and the code hangs/crashes in frontend.cpp, which I have no access to.
What is happening here? How can a simple allocate suddenly cause OMP activity in a loop where every OMP statement has been commented out?
As a side note: if I comment out the allocate and pass the size as an extra argument, the same error happens at the next allocate in the code.
Any help or workaround is highly appreciated! I already googled how to shut down any prior thread pool, so that there is definitely no OMP ghost-thread activity left, but apparently there is no way to do so. Also, ultimately the loop is supposed to run in parallel.
EDIT: I just did a second test on the machine with 2018 v5. It also shows that the allocate statement results in a call to for_alloc_allocatable() and then in OMP activity; the mov statement where the program crashes is shown as well.
I guess that something is probably causing memory corruption in the second recursive OMP loop, as somehow the allocate is redirected to a thread-pool memory allocation, even though all threads but the main one should be passive. I cannot read assembler, but I was surprised to find that TBB is used by OpenMP. Could this be a bug in TBB?

Fortran & OpenMP: How to declare and allocate an allocatable THREADPRIVATE array

I had a serial code where I would declare a bunch of variables in modules and then use those modules across the rest of my program and subroutines. Now I am trying to parallelize this code. There is a portion of the code that I want to run in parallel, which seems to be working except for one array, gtmp. I want each thread to have its own version of gtmp, and I want that version to be private to its respective thread, so I've used the threadprivate directive. gtmp is only used inside the parallel region of the code or within subroutines that are only called from the parallel part of the code.
At first I allocated gtmp in a serial portion of the code, before the parallel portion, but that was an issue: only the master thread's 'version' of gtmp got allocated, and the other threads' 'versions' had a size of 1 rather than the expected allocated size (this was shown by the "test" print statement). I think this happened because the master thread is the only thread executing code in the serial portions. So I moved the allocate line into the parallel region, which allowed all threads to have appropriately sized/allocated gtmp arrays, but since my parallel region is inside a loop, I get an error when the program tries to allocate gtmp a second time, in the second iteration of the r loop.
Note: elsewhere in the code all the other variables in mymod are given values.
Here is a simplified portion of the code that is having the issue:
module mymod
   integer, parameter :: dp = kind(1.0d0)   ! kind parameter (defined here so the snippet is self-contained)
   integer :: xBins, zBins, rBins, histCosThBins, histPhiBins, cfgRBins
   real(kind=dp), allocatable :: gtmp(:,:,:)
   !$omp THREADPRIVATE( gtmp )
end module mymod

subroutine compute_avg_force
   use mymod
   implicit none
   integer :: r, i, j, ip
   integer :: omp_get_thread_num, tid

   ! I used to allocate 'gtmp' here.

   do r = 1, cfgRBins
      !$omp PARALLEL DEFAULT( none ) &
      !$omp PRIVATE( ip, i, j, tid ) &
      !$omp SHARED( r, xBins, zBins, histCosThBins, histPhiBins )

      allocate( gtmp(4,0:histCosThBins+1,0:histPhiBins+1) )

      tid = omp_get_thread_num() !debug
      print*, 'test', tid, histCosThBins, histPhiBins, size(gtmp)

      !$omp DO SCHEDULE( guided )
      do ip = 1, (xBins*zBins)
         call subroutine_where_i_alter_gtmp(...)
         ...code to be executed in parallel using gtmp...
      end do !ip
      !$omp END DO
      !$omp END PARALLEL
   end do !r
end subroutine compute_avg_force
So, the issue is coming from the fact that I need all threads to be active, (ie. in a parallel region), to appropriately initialize all 'versions' of gtmp but my parallel region is inside a loop and I can't allocate gtmp more than once.
In short, what is the correct way to allocate gtmp in this code? I've thought that I could just make another omp parallel region before the loop and use that to allocate gtmp, but that seems clunky, so I'm wondering what the "right" way to do something like this is.
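For concreteness, the clunky version I have in mind would be something like this (just a sketch, assuming gtmp stays THREADPRIVATE as above):
! Allocate every thread's copy once, in a throwaway parallel region:
!$omp PARALLEL
allocate( gtmp(4,0:histCosThBins+1,0:histPhiBins+1) )
!$omp END PARALLEL

do r = 1, cfgRBins
   !$omp PARALLEL DEFAULT( none ) &
   !$omp PRIVATE( ip, i, j, tid ) &
   !$omp SHARED( r, xBins, zBins, histCosThBins, histPhiBins )
   !$omp DO SCHEDULE( guided )
   do ip = 1, (xBins*zBins)
      ...
   end do !ip
   !$omp END DO
   !$omp END PARALLEL
end do !r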
Thanks for the help!

Calling subroutine in parallel environment

I think my problem is related or even identical to the problem described here. But I don't understand what's actually happening.
I'm using openMP with the gfortran compiler and I have the following task to do: I have a density distribution F(X, Y) on a two-dimensional surface with x-coordinates X and y-coordinates Y. The matrix F has the size Nx x Ny.
I now have a set of coordinates Xp(i) and Yp(i) and I need to interpolate the density F onto these points. This problem is made for parallelization.
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
do i=1, Nmax
! Some stuff to be done here
Fint(i) = interp2d(Xp(i), Yp(i), X, Y, F, Nx, Ny)
! Some other stuff to be done here
end do
!$OMP END PARALLEL DO
Everything is shared except for i. The function interp2d is doing some simple linear interpolation.
That works fine with one thread but fails with multithreading. I traced the problem down to the hunt-subroutine taken from Numerical Recipes, which gets called by interp2d. The hunt-subroutine basically calculates the index ix such that X(ix) <= Xp(i) < X(ix+1). This is needed to get the starting point for the interpolation.
With multithreading it happens every now and then that one thread gets the correct index ix from hunt, and the thread that calls hunt next gets the exact same index, even though Xp(i) is not even close to that point.
I can prevent this by using the CRITICAL environment:
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
do i=1, Nmax
! Some stuff to be done here
!$OMP CRITICAL
Fint(i) = interp2d(Xp(i), Yp(i), X, Y, F, Nx, Ny)
!$OMP END CRITICAL
! Some other stuff to be done here
end do
!$OMP END PARALLEL DO
But this decreases the efficiency. If I use, for example, three threads, I have a load average of 1.5 with the CRITICAL environment. Without it I have a load average of 2.75, but wrong results and sometimes even a SIGSEGV runtime error.
What exactly is happening here? It seems to me that all the threads are calling the same hunt-subroutine and if they do it at the same time there is a conflict. Does that make sense?
How can I prevent this?
Combining variable declaration and initialisation in Fortran 90+ has the side effect of giving the variable the SAVE attribute.
integer :: i = 0
is roughly equivalent to:
integer, save :: i
if (first_invocation) then
i = 0
end if
SAVE'd variables retain their value between multiple invocations of the routine and are therefore often implemented as static variables. By the rules governing the implicit data sharing classes in OpenMP, such variables are shared unless listed in a threadprivate directive.
OpenMP mandates that compliant compilers should apply the above semantics even when the underlying language is Fortran 77.
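To make the failure mode concrete, here is a hypothetical sketch of hunt (the argument list and the name of the saved variable are assumptions; your copy from Numerical Recipes will differ):
subroutine hunt(xx, n, x, jlo)
   implicit none
   integer, intent(in)    :: n
   real,    intent(in)    :: xx(n), x
   integer, intent(inout) :: jlo
   ! PROBLEM: an initialised local like this is implicitly SAVEd and
   ! therefore shared between all threads calling hunt concurrently:
   !    integer :: inc = 1
   ! FIX 1: separate declaration and initialisation
   integer :: inc
   inc = 1
   ! FIX 2 (alternative): keep the initialiser but make it per-thread
   !    integer :: inc = 1
   !    !$omp threadprivate(inc)
   ! ... the usual hunting/bisection logic using inc and jlo ...
end subroutine hunt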

programming issue with openmp

I am having issues with openmp, described as follows:
I have the serial code like this
subroutine ...
...
do i=1,N
....
end do
end subroutine ...
and the openmp code is
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end parallel do
end subroutine ...
It compiles without issues; however, when I run the program, there are two major issues compared to the serial code:
The program runs even slower than the serial code (which supposedly does matrix multiplications (matmul) in the do-loop).
The numerical accuracy seems to have dropped compared to the serial code (I have a check for it)
Any ideas what might be going on?
Thanks,
Xiaoyu
When parallelizing with OpenMP, you need to specify the number of threads your program should use. You can do so via the environment variable OMP_NUM_THREADS, e.g. by calling your program as
OMP_NUM_THREADS=5 ./myprogram
to execute it using 5 threads.
Alternatively, you may set the number of threads at runtime by calling omp_set_num_threads (documentation).
Side Notes
Don't forget to set private variables, if there are any within the loop!
Example:
!$omp parallel do private(prelimRes)
do i = 1, N
prelimRes = myFunction(i)
res(i) = prelimRes + someValue
end do
!$omp end parallel do
Note how the variable prelimRes is declared private so that every thread has its own workspace.
Depending on what you actually do within the loop (e.g. if you use OpenBLAS), your results may indeed vary slightly (variations should be smaller than about 1e-8 for double precision variables) due to the different order of operations in parallel processing.
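A toy illustration (not taken from your code) of why the evaluation order matters: in double precision, the same terms summed with a different grouping can give different results, which is exactly what a parallel reduction does when it regroups partial sums.
program order_demo
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   ! The same three additions, grouped differently:
   print *, ((1.0e16_dp + 1.0_dp) + 1.0_dp) - 1.0e16_dp   ! prints 0.0 (the 1.0s are lost)
   print *, (1.0e16_dp + (1.0_dp + 1.0_dp)) - 1.0e16_dp   ! prints 2.0
end program order_demo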
If you are unsure about what is happening, you should check the CPU load using htop or a similar program while your program is running.
Addendum: Setting the number of threads to automatically match the number of CPUs
If you would like to use the maximum number of useful threads, e.g. use as many threads as there are CPUs, you can do so by using (just like you stated in your question):
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end parallel do
end subroutine ...

openMP not improving runtime

I inherited a piece of Fortran code and am tasked with parallelizing it for the 8-core machine we have. I have two versions of the code, and I am trying to use openMP compiler directives to speed it up. It works on one piece of code, but not the other, and I cannot figure out why - they're almost identical!
I ran each piece of code with and without the openMP tags, and the first one showed speed improvements, but not the second one. I hope I am explaining this clearly...
Code sample 1: (significant improvement)
!$OMP PARALLEL DO
DO IN2=1,NN(2)
   DO IN1=1,NN(1)
      SCATT(IN1,IN2) = DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2))
      UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
   ENDDO
ENDDO
!$OMP END PARALLEL DO
Code sample 2: (no improvement)
!$OMP PARALLEL DO
DO IN2=1,NN(2)
   DO IN1=1,NN(1)
      SCATREL = DATA(2*((IN2-1)*NN(1)+IN1)-1)/(NN(1)*NN(2))
      SCATIMG = DATA(2*((IN2-1)*NN(1)+IN1))/(NN(1)*NN(2))
      SCATT(IN1,IN2) = DCOMPLX(SCATREL, SCATIMG)
      UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
   ENDDO
ENDDO
!$OMP END PARALLEL DO
I thought it might be an issue with memory overhead and such, and have tried various combinations of putting variables in shared() and private() clauses, but they either cause segmentation faults or make it even slower.
I also thought it might be that I'm not doing enough work in the loop to see an improvement, but since there's improvement in the smaller loop that doesn't make sense to me.
Can anyone shed some light onto what I can to do see a real speed boost in the second one?
Data on speed boost for code sample 1:
Average runtime (for the whole code not just this snippet)
Without openMP tags: 2m 21.321s
With openMP tags: 2m 20.640s
Average runtime (profile for just this snippet)
Without openMP tags: 6.3s
With openMP tags: 4.75s
Data on speed boost for code sample 2:
Average runtime (for the whole code not just this snippet)
Without openMP tags: 4m 46.659s
With openMP tags: 4m 49.200s
Average runtime (profile for just this snippet)
Without openMP tags: 15.14s
With openMP tags: 46.63s
The observation that the code runs slower in parallel than in serial tells me that the culprit is very likely false sharing.
The SCATT array is shared and each thread accesses a slice of it for both reading and writing. There is no race condition in your code; however, the threads writing to the same array (albeit to different slices of it) make things slower.
The reason is that each thread loads a portion of the array SCATT into its cache, and whenever another thread writes to that portion of SCATT, the previously cached data is invalidated. Although the input data has not actually been changed, since there is no race condition (the other thread updated a different slice of SCATT), the processor gets a signal that the cache line is invalid and thus reloads the data. This causes high data transfer overhead.
The solution is to make each slice private to a given thread. In your case it is even simpler as you do not require reading access to SCATT at all. Just replace
SCATT(IN1,IN2) = DCOMPLX(SCATREL, SCATIMG)
UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
with
SCATT0 = DCOMPLX(SCATREL, SCATIMG)
UADI(IN1,IN2) = SCATT0+1.0
SCATT(IN1,IN2) = SCATT0
where SCATT0 is a private variable.
And why does this not happen in the first snippet? It certainly does; however, I suspect that the compiler might have optimized the problem away: when it computed DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2)), it very likely kept the result in a register and reused that value instead of reading SCATT(IN1,IN2) back in UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0.
Besides, if you want to speed the code up, you should make the loops themselves more efficient. The first rule of parallelization is: don't do it! Optimize the serial code first. So replace snippet 1 with the following (you could even throw a WORKSHARE construct around the last line):
DATA = DATA/(NN(1)*NN(2))
!$OMP PARALLEL DO PRIVATE(temp)
DO IN2=1,NN(2)
   temp = (IN2-1)*NN(1)
   SCATT(:,IN2) = DATA(temp+1:temp+NN(1))
ENDDO
!$OMP END PARALLEL DO
UADI = SCATT+1.0
You can do something similar with snippet 2 as well.
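For instance, a sketch of the same restructuring applied to snippet 2 (this assumes DCOMPLX in your code is the usual DCMPLX intrinsic, that DATA stores alternating real/imaginary parts as in your loop, and it reuses the same private integer temp as above):
DATA = DATA/(NN(1)*NN(2))
!$OMP PARALLEL DO PRIVATE(temp)
DO IN2=1,NN(2)
   temp = 2*(IN2-1)*NN(1)
   ! real parts at odd offsets, imaginary parts at even offsets
   SCATT(:,IN2) = DCMPLX(DATA(temp+1:temp+2*NN(1)-1:2), &
                         DATA(temp+2:temp+2*NN(1):2))
ENDDO
!$OMP END PARALLEL DO
UADI = SCATT+1.0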