It seems to me that having a critical section within a parallel block in OpenMP makes no sense! I might just as well write a simple serial do loop, right?
In the following (trivial) example, instead of
1.
!$omp parallel
!$omp critical
!$ thread_num = omp_get_thread_num()
print *, "Hello world from thread number ", thread_num
!$omp end critical
!$omp end parallel
2.
do i=1,num_threads
print *, "Hello world from thread number ", thread_num
end do
That being said, I understand the difference: 1. uses different threads while 2. doesn't.
Is there a non-trivial context where the former might actually provide a speed advantage over the latter?
!$omp critical specifies that the enclosed code is executed by only one thread at a time, so both of your examples run serially rather than in parallel. The purpose of a critical section is clearly described on the wiki, so look there for details; the typical situation is when all threads need to update or wait for some common value (calculated earlier in parallel, e.g. some sort of sum of elements) before continuing their calculations.
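For instance, a critical section pays off when the bulk of the work runs in parallel and only a short update has to be serialised. A minimal sketch of that partial-sum situation (the program name, array and sizes are made up for illustration):
program critical_sum
  implicit none
  integer, parameter :: n = 100000
  double precision :: a(n), total, partial
  integer :: i

  a = 1.0d0
  total = 0.0d0

  !$omp parallel private(partial)
  partial = 0.0d0
  !$omp do
  do i = 1, n
     partial = partial + a(i)   ! each thread sums its own chunk in parallel
  end do
  !$omp end do
  !$omp critical
  total = total + partial       ! only this short update runs one thread at a time
  !$omp end critical
  !$omp end parallel

  print *, "total = ", total
end program critical_sum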
Explanation of code and approach:
There are various mathematical methods (Fortran subroutines) for solving for a variable y; each method is sequential and runs on a single thread. How fast each method finds the solution depends on unknown conditions (i.e. it is a no-free-lunch situation and I do not know beforehand which method is fastest). Therefore, the approach is to run each method on a separate thread, and once one method has found the solution, the calculations on the other threads should stop (as those threads are needed for operations after the parallel sections region).
!$omp parallel sections lastprivate(x, y)
!$omp section
call method_1_for_solving_y(x)
!$omp cancel sections
!$omp section
call method_2_for_solving_y(x)
!$omp cancel sections
. . .
!$omp section
call method_z_for_solving_y(x)
!$omp cancel sections
!$omp end parallel sections
The question:
The !$omp cancel sections construct does not completely cancel all operations on the threads that have not found the solution yet. Is there a way to completely stop the calculations on those threads?
Any additional advice, or possible other approaches would be appreciated.
Regards.
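For what it is worth, a !$omp cancel sections request is only acted upon when a thread reaches a cancellation point, and the methods as written never check for one while they are computing. One possible direction, sketched here only as an illustration (the module name, loop bound and convergence test are hypothetical; an explicit !$omp flush can be added for stricter visibility guarantees), is a shared flag that the winning method sets and the others poll inside their own iteration loops:
! Shared "stop" flag visible to all sections (hypothetical module name).
module early_stop
  implicit none
  logical :: solution_found = .false.
end module early_stop

! Sketch of one of the methods; the loop bound and convergence test are placeholders.
subroutine method_2_for_solving_y(x)
  use early_stop
  implicit none
  double precision, intent(inout) :: x
  logical :: done, converged
  integer :: it

  do it = 1, 1000000
     ! ... one iteration of the solver, updating x ...
     converged = .false.            ! placeholder for the real convergence test

     if (converged) then
        !$omp atomic write
        solution_found = .true.     ! tell the other methods to stop
        return
     end if

     !$omp atomic read
     done = solution_found          ! has another method already finished?
     if (done) return               ! abandon this method early
  end do
end subroutine method_2_for_solving_y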
I am converting an existing application to work with multiple threads and using OpenMP with nested parallelism for this purpose.
The code looks like this (Fortran)
!$omp parallel do private(array) ...
DO i=1...
...
C ---- plenty of code ----
...
!$omp parallel do private(z1,z2,z3,value)...
DO j=1...
...
!$omp critical
DO z1=..
DO z2=..
DO z3=..
...
value = ...
array(z1,z2,z3) = array(z1,z2,z3) + value
END DO
END DO
END DO
!$omp end critical
END DO
END DO
I added an OMP CRITICAL because the accumulation was not thread safe, but this is causing threads from other teams to wait unnecessarily.
What is the best way to parallelize this? Is there any way to make a reduction work in this case?
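One option worth trying, sketched below, is an array reduction on the inner loop: in Fortran the reduction clause may be applied to a whole array, so each thread of the inner team accumulates into its own zero-initialised copy of array and the copies are summed once at the end of the loop, removing the need for the critical section (at the cost of one private copy of array per inner thread). The program below is a self-contained toy version with made-up sizes and a placeholder computation, not the original code:
program array_reduction_sketch
  implicit none
  integer, parameter :: n1 = 4, n2 = 4, n3 = 4, nj = 100
  double precision :: array(n1, n2, n3), value
  integer :: j, z1, z2, z3

  array = 0.0d0

  ! Each thread gets a private, zero-initialised copy of "array";
  ! the copies are summed into the shared array when the loop ends.
  !$omp parallel do private(z1, z2, z3, value) reduction(+:array)
  do j = 1, nj
     do z1 = 1, n1
        do z2 = 1, n2
           do z3 = 1, n3
              value = dble(j)     ! placeholder for the real computation
              array(z1, z2, z3) = array(z1, z2, z3) + value
           end do
        end do
     end do
  end do
  !$omp end parallel do

  print *, "array(1,1,1) = ", array(1, 1, 1)
end program array_reduction_sketch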
I am trying to parallelise some legacy Fortran code with OpenMP.
Checking for race conditions with Intel Inspector, I have come across a problem in the following code (simplified, tested example):
PROGRAM TEST
!$ use omp_lib
implicit none
DOUBLE PRECISION :: x,y,z
COMMON /firstcomm/ x,y,z
!$OMP THREADPRIVATE(/firstcomm/)
INTEGER :: i
!$ call omp_set_num_threads(3)
!$OMP PARALLEL DO
!$OMP+ COPYIN(/firstcomm/)
!$OMP+ PRIVATE(i)
do i=1,3000
z = 3.D0
y = z+log10(z)
x=y+z
enddo
!$OMP END PARALLEL DO
END PROGRAM TEST
Intel Inspector detects a race condition between the following lines:
!$OMP PARALLEL DO (read)
z = 3.D0 (write)
The Inspector "Disassembly" view offers the following about the two lines, respectively (I do not understand much about these, apart from the fact that the memory addresses in both lines seem to be different):
0x3286 callq 0x2a30 <memcpy>
0x3338 movq %r14, 0x10(%r12)
As in my main application, the problem occurs for one (/some) variable in the common block, but not for others that are treated in what appears to be the same way.
Can anyone spot my mistake, or is this race condition a false positive?
I am aware that the use of COMMON blocks, in general, is discouraged, but I am not able to change this for the current project.
Technically speaking, your example code is incorrect, since you are using COPYIN to initialise the threadprivate copies with data from an uninitialised COMMON block. But that is not the reason for the data race: adding a DATA statement or simply assigning to x, y, and z before the parallel region does not change the outcome.
This is either a (very old) bug in the Intel Fortran Compiler, or Intel interprets the following text of the OpenMP standard (section 2.15.4.1 of the current version) in a strange way:
The copy is done, as if by assignment, after the team is formed and prior to the start of execution of the associated structured block.
Intel implements this requirement by inserting a memcpy at the beginning of the outlined procedure. In other words:
!$OMP PARALLEL DO COPYIN(/firstcomm/)
do i = 1, 3000
...
end do
!$OMP END PARALLEL DO
becomes (in a mixture of Fortran and pseudo-code):
par_region0:
my_firstcomm = get_threadprivate_copy(/firstcomm/)
if (my_firstcomm != firstcomm) then
memcpy(my_firstcomm, firstcomm, size of firstcomm)
end if
// Actual implementation of the DO worksharing construct
call determine_iterations(1, 3000, low_it, high_it)
do i = low_it, high_it
...
... my_firstcomm used here instead of firstcomm
...
end do
call openmp_barrier
end par_region0
MAIN:
// Prepare a parallel region with 3 threads
// and fire the outlined code in the worker threads
call start_parallel_region(3, par_region0)
// Fire the outlined code in the master thread
call par_region0
call end_parallel_region
The outlined procedure first finds the address of the threadprivate copy of the common block, then compares that address to the address of the common block itself. If both addresses match, then the code is being executed in the master thread and no copy is needed, otherwise memcpy is called to make a bitwise copy of the master's data into the threadprivate block.
Now, one would expect a barrier at the end of the initialisation part, right before the start of the loop, and although Intel employees claim that there is one, there is none (tested with ifort 11.0, 14.0, and 16.0). What is more, the Intel Fortran Compiler does not honour the list of variables in the COPYIN clause and copies the entire common block if any variable contained in it is listed in the clause, i.e. COPYIN(x) is treated the same as COPYIN(/firstcomm/).
Whether those are bugs or features of Intel Fortran Compiler, only Intel could tell. It could also be that I'm misreading the assembly output. If anyone could find the missing barrier, please let me know. One possible workaround would be to split the combined directive and insert an explicit barrier before the worksharing construct:
!$OMP PARALLEL COPYIN(/firstcomm/) PRIVATE(I)
!$OMP BARRIER
!$OMP DO
do i = 1, 3000
z = 3.D0
y = z+log10(z)
x = y+z
end do
!$OMP END DO
!$OMP END PARALLEL
With that change, the data race will shift into the initialisation of the internal dispatch table within the log10 call, which is probably a false positive.
GCC implements COPYIN differently. It creates a shared copy of the master thread's threadprivate data, which it then passes on to the worker threads to use in the copy process.
I am having issues with OpenMP, described as follows:
I have the serial code like this
subroutine ...
...
do i=1,N
....
end do
end subroutine ...
and the openmp code is
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end parallel do
end subroutine ...
It compiles without issues; however, when I run the program, there are two major problems compared to the serial code:
The program runs even slower than the serial code (the do-loop essentially performs matrix multiplications with matmul).
The numerical accuracy seems to have dropped compared to the serial code (I have a check for it)
Any ideas what might be going on?
Thanks,
Xiaoyu
When parallelizing with OpenMP, you need to specify the number of threads your program should use. You can do so via the environment variable OMP_NUM_THREADS, e.g. by calling your program as
OMP_NUM_THREADS=5 ./myprogram
to execute it using 5 threads.
Alternatively, you may set the number of threads at runtime with omp_set_num_threads (documentation).
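A minimal sketch of the runtime variant (program name made up; the call has to happen before the parallel region is entered):
program set_threads
  use omp_lib
  implicit none
  call omp_set_num_threads(5)   ! same effect as OMP_NUM_THREADS=5
  !$omp parallel
  print *, "hello from thread ", omp_get_thread_num()
  !$omp end parallel
end program set_threads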
Side Notes
Don't forget to set private variables, if there are any within the loop!
Example:
!$omp parallel do private(prelimRes)
do i = 1, N
prelimRes = myFunction(i)
res(i) = prelimRes + someValue
end do
!$omp end parallel do
Note how the variable prelimRes is declared private so that every thread has its own workspace.
Depending on what you actually do within the loop (e.g. calling OpenBLAS), your results may indeed vary slightly (variations should be smaller than about 1e-8 for double precision variables) because parallel processing changes the order of operations.
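The underlying reason is that floating-point addition is not associative, so summing the same numbers in a different order, as different threads inevitably do, can change the last bits of the result. A tiny standalone illustration:
program fp_order
  implicit none
  double precision :: a, b, c
  a = 1.0d0
  b = 1.0d-16
  c = 1.0d-16
  ! Same operands, different grouping, slightly different results.
  print *, (a + b) + c   ! typically prints 1.0000000000000000
  print *, a + (b + c)   ! typically prints 1.0000000000000002
end program fp_order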
If you are unsure about what is happening, you should check the CPU load using htop or a similar program while your program is running.
Addendum: Setting the number of threads to automatically match the number of CPUs
If you would like to use the maximum number of useful threads, e.g. as many threads as there are CPUs, you can do so (just as you stated in your question) with:
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end parallel do
end subroutine ...
Focusing on the parallel part of the code, which of the options presented below is preferred? Is there a better solution? I am trying to compute an average of independent realizations of do_something.
Option 1: Using CRITICAL
resultado%uno = 0.d0
!$OMP PARALLEL DO shared(large) private(i_omp) schedule(static,1)
do i_omp=1, nthreads
call do_something(large, resultadoOmp(i_omp))
!$OMP CRITICAL (forceloop)
resultado%uno = resultado%uno + resultadoOmp(i_omp)%uno
!$OMP END CRITICAL (forceloop)
enddo
!$OMP END PARALLEL DO
resultado%uno = resultado%uno/nthreads
Option 2: Avoiding CRITICAL (and ATOMIC)
!$OMP PARALLEL DO shared(large) private(i_omp) schedule(static,1)
do i_omp=1, nthreads
call do_something(large, resultadoOmp(i_omp))
enddo
!$OMP END PARALLEL DO
uno = 0.d0
!$OMP PARALLEL DO shared(resultado) private(i_omp) schedule(static,1) &
!$OMP & REDUCTION(+:uno)
do i_omp=1, nthreads
uno = uno + resultadoOmp(i_omp)%uno
end do
!$OMP END PARALLEL DO
resultado%uno = uno/nthreads
I couldn't use REDUCTION(+:resultado%uno) or REDUCTION(+:resultado) here; only numeric types are allowed.
The disadvantage of this approach, IMO, is that one has to dimension the derived type resultadoOmp with the number of threads. The advantage is that one avoids the CRITICAL construct, which could affect performance. Am I right?
The disadvantage of this approach, IMO, is that one has to dimension the derived type resultadoOmp with the number of threads. The advantage is that one avoids the CRITICAL construct, which could affect performance. Am I right?
Yes, you are right. It looks like you are dimensioning resultadoOmp with the number of threads anyway, so it is not really a disadvantage. Performance should indeed be better with the second option, though the two parallel regions might eat up this advantage again. Thus, you should use only a single parallel region for both parts. Depending on the running time of do_something, I might even skip parallelism for the reduction operation completely and just do the sum on a single thread after computing all the uno entries in parallel:
!$OMP PARALLEL DO shared(large) private(i_omp) schedule(static,1)
do i_omp=1, nthreads
call do_something(large, resultadoOmp(i_omp))
end do
!$OMP END PARALLEL DO
resultado%uno = sum(resultadoOmp(:)%uno)/nthreads
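For completeness, a sketch of the single-parallel-region variant mentioned above, reusing the variables from your code, in case the reduction itself turns out to be worth parallelising:
uno = 0.d0
!$OMP PARALLEL shared(large, resultadoOmp) private(i_omp)
!$OMP DO schedule(static,1)
do i_omp = 1, nthreads
   call do_something(large, resultadoOmp(i_omp))
end do
!$OMP END DO   ! implicit barrier: all partial results are ready
!$OMP DO schedule(static,1) reduction(+:uno)
do i_omp = 1, nthreads
   uno = uno + resultadoOmp(i_omp)%uno
end do
!$OMP END DO
!$OMP END PARALLEL
resultado%uno = uno/nthreads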
You will need to measure the various implementations with your actual setup to draw a conclusion.