Fortran OpenMP - process things in parallel as they are released

I'm trying to speed up an existing Fortran program with OpenMP.
A fixed number of things are to be processed in parallel. Some can be processed immediately, while others can only be processed after they are released later on. In the end, each thing is released exactly once. Here is some non-compilable pseudocode:
! Each thing, 1..numthings, is released exactly once --- some now, some later
numleft = numthings
do i = 1, numthings
   if ( ok_to_release(i) ) then
      release(i)
   end if
end do
!$OMP PARALLEL DO SCHEDULE(DYNAMIC,1) DEFAULT(SHARED) PRIVATE(i)
do while ( .TRUE. )
   wait until there is at least one released, unprocessed thing
   i = the first released, unprocessed thing   ! no other thread gets i
   numleft = numleft - 1
   process(i)                                  ! may release zero or more other things
   if ( numleft < numthreads ) exit            ! probably incorrect exit condition
end do
!$OMP END PARALLEL DO
Am I on the right track please? Can a multi-server fifo queue like this be easily implemented with OpenMP?
Thanks, Peter McGavin.
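One way this release/process pattern is commonly expressed in OpenMP is with explicit tasks rather than a worksharing loop: each call to process spawns a new task for every thing it releases, and the barrier at the end of the single/parallel region waits for all outstanding tasks. A minimal, hedged sketch (the ok_to_release condition and the body of process below are placeholder stand-ins, not the original program):

program release_tasks
   use omp_lib
   implicit none
   integer, parameter :: numthings = 8
   integer :: i
   !$omp parallel private(i)
   !$omp single
   ! Seed the pool with everything releasable right away; idle threads
   ! pick up tasks as they are spawned.
   do i = 1, numthings
      if (ok_to_release(i)) call spawn(i)
   end do
   !$omp end single   ! implicit barrier: waits for all outstanding tasks
   !$omp end parallel
contains
   logical function ok_to_release(i)
      integer, intent(in) :: i
      ok_to_release = (mod(i, 2) == 1)   ! placeholder release condition
   end function ok_to_release
   recursive subroutine spawn(i)
      integer, intent(in) :: i
      !$omp task firstprivate(i)
      call process(i)
      !$omp end task
   end subroutine spawn
   recursive subroutine process(i)
      integer, intent(in) :: i
      write (*,*) 'processing', i, 'on thread', omp_get_thread_num()
      ! Placeholder: pretend processing an odd thing releases the next one
      if (mod(i, 2) == 1 .and. i + 1 <= numthings) call spawn(i + 1)
   end subroutine process
end program release_tasks

This avoids hand-rolling the FIFO queue and the tricky exit condition entirely, since the OpenMP runtime's task pool plays that role.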

Related

Openmp fortran, task dependency on overlapping subarrays

TLDR: task parallel code fails to recognize task dependency on partially overlapping subarrays
I'm trying to parallelize some Fortran code using task parallelism, but I'm struggling to get the task dependencies to work.
The following minimal example illustrates my problem:
program test
   integer, parameter :: n = 100
   double precision :: A(n,n), B(n,n)
   integer :: istart, istop
   !$omp parallel
   !$omp single
   istart = 20
   istop = 50
   !$omp task shared(A,B) firstprivate(istart,istop) &
   !$omp& depend( in: B(istart:istop,istart:istop) ) &
   !$omp& depend( inout: A(istart:istop,istart:istop) )
   write(*,*) "starting task 1"
   call sleep(5)
   write(*,*) "stopping task 1"
   !$omp end task
   istart = 30
   istop = 60
   !$omp task shared(A,B) firstprivate(istart,istop) &
   !$omp& depend( in: B(istart:istop,istart:istop) ) &
   !$omp& depend( inout: A(istart:istop,istart:istop) )
   write(*,*) "starting task 2"
   call sleep(5)
   write(*,*) "stopping task 2"
   !$omp end task
   !$omp end single nowait
   !$omp end parallel
end program
I compile this with gcc version 9.4.0 and the -fopenmp flag.
The code has two tasks. Task 1 depends on A(20:50,20:50) and task 2 depends on A(30:60,30:60). These subarrays overlap, so I would expect task 2 to wait until task 1 has completed, but the output I get is:
starting task 1
starting task 2
stopping task 1
stopping task 2
If I comment out the lines
istart = 30
istop = 60
so that the subarrays are exactly the same instead of just overlapping, task 2 does wait on task 1.
Are task dependencies with overlapping subarrays not supported in OpenMP? Or am I defining the dependencies wrong somehow?
These subarrays overlap
This is forbidden by the OpenMP standard, as pointed out by @ThijsSteel.
This choice was made because of the resulting runtime overhead. Checking for overlapping array regions in dependencies is very expensive in pathological cases (especially for arrays with many dimensions). A typical pathological case is when many tasks write to the same array (on disjoint, exclusive parts) and one task then operates on the whole array. This creates a lot of checks and dependencies, even though it could be optimized using a specific dependency pattern. An even more pathological case is a transposition of the sub-arrays (from n horizontal bands to n vertical bands): this quickly results in n² dependencies, which is prohibitively inefficient when n is large. One optimization is to aggregate the dependencies so as to get 2*n of them, but this is expensive to do at runtime, and a better way is actually not to use dependencies at all but task-group synchronization.
From the user's point of view, there are only a few options:
operate at a coarser grain, even though it means more synchronization (and so possibly poor performance);
try to group tasks operating on distinct sub-arrays by checking for overlaps manually. You can cheat OpenMP with fake (coarse) dependencies in order to do that, or add more task synchronization (typically a taskwait). You can also make use of advanced task-dependence-type modifiers like depobj, inoutset or mutexinoutset, now available in recent versions of the OpenMP standard.
Most of the time, an algorithm is composed of several waves of tasks, and each wave operates on non-overlapping sub-arrays. In this case, you can just add a task synchronization at the end of each wave (or use recursive tasks).
This is a known limitation of OpenMP, and some researchers have worked on it (including me). I expect new specific dependency modifiers to be added in the future, and possibly user-defined partitioning-related features in the long run (some research runtimes like StarPU do that, for example, but not everyone agrees on adding it to OpenMP, as it is rather high-level and not simple to include).
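As a concrete illustration of the coarse-grain option, here is a sketch of the example above in which both tasks depend on the whole of A rather than on sub-blocks. This serializes them at the cost of some parallelism; it is one possible workaround among several, not the only fix:

program test_coarse
   implicit none
   integer, parameter :: n = 100
   double precision :: A(n,n), B(n,n)
   !$omp parallel
   !$omp single
   ! Both tasks name the whole of A in their dependencies, so the
   ! runtime serializes them even though only sub-blocks overlap.
   !$omp task shared(A,B) depend( in: B ) depend( inout: A )
   write(*,*) "starting task 1"   ! would work on A(20:50,20:50)
   call sleep(5)
   write(*,*) "stopping task 1"
   !$omp end task
   !$omp task shared(A,B) depend( in: B ) depend( inout: A )
   write(*,*) "starting task 2"   ! would work on A(30:60,30:60)
   call sleep(5)
   write(*,*) "stopping task 2"
   !$omp end task
   !$omp end single nowait
   !$omp end parallel
end program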

Is omp barrier equivalent to omp end parallel in Fortran

My question is about synchronizing threads. Basically, if I have OpenMP code in Fortran, each thread is doing something. There are two possibilities, I think, for synchronizing them (making some variable have the same value in each thread):
add !$OMP BARRIER
add !$OMP END PARALLEL. If necessary, start another !$OMP PARALLEL and !$OMP END PARALLEL block later on.
Are options 1) and 2) equivalent? I saw a question about barriers with nested threads (omp barrier nested threads), but so far I am more interested in simpler scenarios in Fortran. E.g., for the code below, if I use a barrier, it seems the two if (sum > 500) then conditions behave the same, at least with gfortran.
PROGRAM test
   USE OMP_LIB
   integer :: numthreads, i, sum
   numthreads = 2
   sum = 0
   call omp_set_num_threads(numthreads)
   !$OMP PARALLEL
   if (OMP_GET_THREAD_NUM() == 0) then
      write (*,*) 'a'
      do i = 1, 30
         write (*,*) sum
         sum = sum + i
      end do
      !write (*,*) 'sum', sum
   else if (OMP_GET_THREAD_NUM() == 1) then
      write (*,*) 'b'
      do i = 1, 15
         write (*,*) sum
         sum = sum + i
      end do
      !write (*,*) 'sum', sum
   end if
   !$OMP BARRIER
   if (sum > 500) then
      write (*,*) 'sum v1'
   else
      write (*,*) 'not yet v1'
   end if
   !$OMP END PARALLEL
   if (sum > 500) then
      write (*,*) 'sum v2', sum
   else
      write (*,*) 'not yet v2', sum
   end if
END
My concern is, for code like
blah1
!$OMP PARALLEL
!$OMP END PARALLEL
blah2
whether the computer will execute it in the order blah1 -> omp -> blah2. If the variables used in blah2 (e.g., sum in the example code) have been evaluated completely within the omp block, then I don't need to worry that some thread in the omp block runs ahead, computes only part of a value (e.g., sum in the question), and reaches the if condition in the blah2 section, leading to an unexpected result.
No, they are not equivalent at all.
For !$omp end parallel let's think a little bit about how parallelism works within OpenMP. At the start of your program you just have a single so called master thread available. This remains the case until you reach a parallel region, within which you have multiple threads available, the master and (possibly) a number of others. In Fortran a parallel region is started with the !$omp parallel directive. It is closed by a !$omp end parallel directive, after which you just have the master thread available to your code until you start another parallel region. Thus !$omp end parallel simply marks the end of a parallel region.
Within a parallel region a number of OpenMP directives start to have an effect. One of these is !$omp barrier, which requires that a given thread waits at that point in the code until all threads have reached that point (for a carefully chosen value of "all" when things like nested parallelism are in use - see the standard at https://www.openmp.org/spec-html/5.0/openmpsu90.html for more details). !$omp barrier has nothing to do with delimiting parallel regions. Thus after its use all threads are still available for use, and outside of a parallel region it has no effect.
The following little code might help illustrate things
ijb@ijb-Latitude-5410:~/work/stack$ cat omp_bar.f90
Program omp_bar
  !$ Use omp_lib, Only : omp_get_num_threads, omp_in_parallel
  Implicit None
  Integer n_th
  !$omp parallel default( none ) private( n_th )
  n_th = 1
  !$ n_th = omp_get_num_threads()
  Write( *, * ) 'Hello at 1 on ', n_th, ' threads. ', &
       'Are we in a parallel region ?', omp_in_parallel()
  !$omp barrier
  Write( *, * ) 'Hello at 2', omp_in_parallel()
  !$omp end parallel
  Write( *, * ) 'Hello at 3', omp_in_parallel()
End Program omp_bar
ijb@ijb-Latitude-5410:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ijb@ijb-Latitude-5410:~/work/stack$ gfortran -fopenmp -std=f2008 -Wall -Wextra -fcheck=all -O -g omp_bar.f90
ijb@ijb-Latitude-5410:~/work/stack$ ./a.out
Hello at 1 on 2 threads. Are we in a parallel region ? T
Hello at 1 on 2 threads. Are we in a parallel region ? T
Hello at 2 T
Hello at 2 T
Hello at 3 F
[Yes, I know the barrier is not guaranteed to synchronise the output order, I got lucky here]

Critical section inside Parallel block in OpenMP

It seems to me that having a critical section within a parallel block in OpenMP makes no sense! I might just as well write a simple serial do loop, right?
In the following (trivial) example, instead of
1.
!$omp parallel
!$omp critical
!$ thread_num = omp_get_thread_num()
print *, "Hello world from thread number ", thread_num
!$omp end critical
!$omp end parallel
I could write
2.
do i = 1, num_threads
   print *, "Hello world from thread number ", thread_num
end do
That being said, I understand the difference: 1. uses different threads while 2. doesn't.
Is there a non-trivial context where the former might actually provide a speed advantage over the latter?
!$omp critical specifies that the enclosed code is executed by one thread at a time. So both of your examples run serially, not in parallel. The sense of using a critical section is clearly described on the wiki, so look there for the details (the typical situation is when all threads need to wait for some common value, calculated earlier in a parallel way, e.g. some sort of sum of elements, in order to continue their calculations).
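A minimal sketch of that typical situation (the expensive() function below is a placeholder, not from the original post): the per-element work runs fully in parallel, and only the cheap accumulation into the shared total is serialized by the critical section, so the critical section costs almost nothing relative to the parallel speedup.

program critical_demo
   implicit none
   integer, parameter :: n = 1000000
   integer :: i
   double precision :: total, partial
   total = 0.0d0
   !$omp parallel private(partial)
   partial = 0.0d0
   !$omp do
   do i = 1, n
      partial = partial + expensive(i)   ! the bulk of the work, fully parallel
   end do
   !$omp end do
   !$omp critical
   total = total + partial               ! tiny serialized update, once per thread
   !$omp end critical
   !$omp end parallel
   write (*,*) 'total =', total
contains
   double precision function expensive(i)
      integer, intent(in) :: i
      expensive = sqrt(dble(i))          ! placeholder for genuinely expensive work
   end function expensive
end program critical_demo

For this particular pattern a reduction(+:total) clause would be the idiomatic choice; the critical section becomes genuinely useful when the protected update is more complex than a built-in reduction allows.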

Openmp: Have a MASTER construct inside parallel do

I have Fortran code that looks like this:
!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(var1, var2, var3, numberOfCalculationsPerformed)
do ix = 1, nx
   ! Do parallel work
   do iy = 1, ny
      ! Do a lot of work....
      !$OMP ATOMIC
      numberOfCalculationsPerformed = numberOfCalculationsPerformed + 1
      !$OMP END ATOMIC
      !$OMP MASTER
      ! Report progress
      call progressCallBack(numberOfCalculationsPerformed/totalNCalculations)
      !$OMP END MASTER
   end do
end do
When I try to compile it, the compiler reports:
error #7102: An OpenMP* MASTER directive is not permitted in the
dynamic extent of a DO, PARALLEL DO, SECTIONS, PARALLEL SECTIONS, or
SINGLE directive.
I do not understand this. I have tried to modify the parallel do construct to this
!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(var1, var2, var3, numberOfCalculationsPerformed), &
!$OMP& SCHEDULE(STATIC)
(thinking that it had something to do with the scheduling), but that did nothing to change the error.
Does anyone know what I am not getting right? Is it just impossible to use master inside a parallel do construct or what? If that is so, are there alternatives?
Edit: Using
!$OMP SINGLE
!$OMP END SINGLE
instead of the MASTER construct yields the same result (error message).
P.S. I only need one of the threads to execute progressCallBack.
The question is a bit old, but since I recently stumbled across the same issue, I wanted to share a simple solution. The idea is to formulate an if clause that only evaluates to true for one of the threads. This can easily be achieved by querying the current thread number: by requiring it to be zero, the clause is guaranteed to be true for exactly one thread, since the master thread always exists:
!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(var1, var2, var3, numberOfCalculationsPerformed)
do ix = 1, nx
   ! Do parallel work
   do iy = 1, ny
      ! Do a lot of work....
      !$OMP ATOMIC
      numberOfCalculationsPerformed = numberOfCalculationsPerformed + 1
      !$OMP END ATOMIC
      if (OMP_GET_THREAD_NUM() == 0) then
         ! Report progress
         call progressCallBack(numberOfCalculationsPerformed/totalNCalculations)
      end if
   end do
end do

Assigning different thread numbers in OpenMP do-loops

I have two do-loops inside OpenMP parallel region as follows:
!$OMP PARALLEL
...
!$OMP DO
...
!$OMP END DO
...
!$OMP DO
...
!$OMP END DO
...
!$OMP END PARALLEL
Let's say OMP_NUM_THREADS=6. I want to run the first do-loop with 4 threads and the second do-loop with 3 threads. Can you show me how to do that? I want them to be inside one parallel region, though. Also, is it possible to specify which thread numbers should execute each of the do-loops? For example, for the first do-loop I could ask it to use thread numbers 1, 2, 4, and 5. Thanks.
Well, you can add the num_threads clause to an OpenMP parallel directive, but the thread count it sets applies to everything inside that region. In your case you could split your program into two regions, like
!$OMP PARALLEL DO num_threads(4)
...
!$OMP END PARALLEL DO
...
!$OMP PARALLEL DO num_threads(3)
...
!$OMP END PARALLEL DO
This, of course, is precisely what you say you don't want to do: have only one parallel region. But there is no mechanism for throttling the number of threads in use inside a parallel region. Personally, I can't see why anyone would want to do that.
As for assigning parts of the computation to particular threads: again, no, OpenMP does not provide a mechanism for doing that, and why would you want to?
I suppose that I am dreadfully conventional, but when I see signs of parallel programs where the programmer has tried to take precise control over individual threads, I usually see a program with one or more of the following characteristics:
OpenMP directives are used to ensure that the code runs in serial with the result that run time exceeds that of the original serial code;
the program is incorrect because the programmer has failed to deal correctly with the subtleties of data races;
it has been carefully arranged to run only on a specific number of threads.
None of these is desirable in a parallel program and if you want the level of control over numbers of threads and the allocation of work to individual threads you will have to use a lower-level approach than OpenMP provides. Such approaches abound so giving up OpenMP should not limit you.
What you want cannot be achieved with the existing OpenMP constructs, only manually. Imagine that the original parallel loop was:
!$OMP DO
DO i = 1, 100
   ...
END DO
!$OMP END DO
The modified version with custom selection of the participating threads would be:
USE OMP_LIB
INTEGER, DIMENSION(:), ALLOCATABLE :: threads
INTEGER :: tid, i, imin, imax, tidx
! IDs of threads that should execute the loop
! Make sure there are no repeated items inside
threads = (/ 0, 1, 3, 4 /)
IF (MAXVAL(threads, 1) >= omp_get_max_threads()) THEN
   STOP 'Error: insufficient number of OpenMP threads'
END IF
!$OMP PARALLEL PRIVATE(tid,i,imin,imax,tidx)
! Get the current thread's ID
tid = omp_get_thread_num()
...
! Check if the current thread should execute part of the loop
IF (ANY(threads == tid)) THEN
   ! Find out what the thread's index is
   tidx = MAXLOC(threads, 1, threads == tid)
   ! Compute the iteration range based on the thread index
   imin = 1 + ((100-1+1)*(tidx-1))/SIZE(threads)
   imax = 1 + ((100-1+1)*tidx)/SIZE(threads) - 1
   PRINT *, 'Thread', tid, imin, imax
   DO i = imin, imax
      ...
   END DO
ELSE
   PRINT *, 'Thread', tid, 'not taking part'
END IF
! This simulates the barrier at the end of the worksharing construct
! Remove it in order to implement the "nowait" clause
!$OMP BARRIER
...
!$OMP END PARALLEL
Here are three example executions:
$ OMP_NUM_THREADS=2 ./custom_loop.x | sort
STOP Error: insufficient number of OpenMP threads
$ OMP_NUM_THREADS=5 ./custom_loop.x | sort
Thread 0 1 33
Thread 1 34 66
Thread 2 not taking part
Thread 3 not taking part
Thread 4 67 100
$ OMP_NUM_THREADS=7 ./custom_loop.x | sort
Thread 0 1 33
Thread 1 34 66
Thread 2 not taking part
Thread 3 not taking part
Thread 4 67 100
Thread 5 not taking part
Thread 6 not taking part
Note that this is an awful hack and goes against the basic premises of the OpenMP model. I would strongly advise against doing it and relying on certain threads to execute certain portions of the code as it creates highly non-portable programs and hinders runtime optimisations.
If you decide to abandon the idea of explicitly assigning the threads that should execute the loop and only want to dynamically change the number of threads, then the chunk size parameter in the SCHEDULE clause is your friend:
!$OMP PARALLEL
...
! 2 threads = 10 iterations / 5 iterations per chunk
!$OMP DO SCHEDULE(static,5)
DO i = 1, 10
   PRINT *, i, omp_get_thread_num()
END DO
!$OMP END DO
...
! 10 threads = 10 iterations / 1 iteration per chunk
!$OMP DO SCHEDULE(static,1)
DO i = 1, 10
   PRINT *, i, omp_get_thread_num()
END DO
!$OMP END DO
...
!$OMP END PARALLEL
And the output with 10 threads:
$ OMP_NUM_THREADS=10 ./loop_chunks.x | sort_manually :)
First loop
Iteration Thread ID
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
Second loop
Iteration Thread ID
1 0
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
10 9