My question is about synchronizing threads. Basically, if I have OpenMP code in Fortran, each thread is doing something. I think there are two possibilities for synchronizing them (making some variable have the same value in each thread):
1) add !$OMP BARRIER
2) add !$OMP END PARALLEL and, if necessary, open a new !$OMP PARALLEL / !$OMP END PARALLEL block later on.
Are options 1) and 2) equivalent? I saw a question about barriers in nested threads (omp barrier nested threads).
So far I am more interested in simpler scenarios in Fortran. E.g., for the code below, if I use a barrier, it seems the two if (sum > 500) then conditions behave the same, at least with gfortran.
PROGRAM test
  USE OMP_LIB
  integer :: numthreads, i, sum

  numthreads = 2
  sum = 0
  call omp_set_num_threads(numthreads)

  !$OMP PARALLEL
  if (OMP_GET_THREAD_NUM() == 0) then
    write (*,*) 'a'
    do i = 1, 30
      write (*,*) sum
      sum = sum + i
    end do
    !write (*,*) 'sum', sum
  else if (OMP_GET_THREAD_NUM() == 1) then
    write (*,*) 'b'
    do i = 1, 15
      write (*,*) sum
      sum = sum + i
    end do
    !write (*,*) 'sum', sum
  end if
  !$OMP BARRIER
  if (sum > 500) then
    write (*,*) 'sum v1'
  else
    write (*,*) 'not yet v1'
  end if
  !$OMP END PARALLEL

  if (sum > 500) then
    write (*,*) 'sum v2', sum
  else
    write (*,*) 'not yet v2', sum
  end if
END
My concern is, for code like
blah1
!$OMP PARALLEL
!$OMP END PARALLEL
blah2
whether the computer will execute it as blah1 -> omp block -> blah2. If the variables used in blah2 (e.g., sum in the example code) have been evaluated completely inside the omp block, then I don't need to worry that some thread in the omp block runs ahead, computes only part of a value (e.g., sum), and reaches the if condition in the blah2 section, leading to an unexpected result.
No, they are not equivalent at all.
For !$omp end parallel, let's think a little bit about how parallelism works within OpenMP. At the start of your program you have just a single, so-called master thread available. This remains the case until you reach a parallel region, within which you have multiple threads available, the master and (possibly) a number of others. In Fortran a parallel region is started with the !$omp parallel directive and closed with an !$omp end parallel directive, after which you again have just the master thread available to your code until you start another parallel region. Thus !$omp end parallel simply marks the end of a parallel region. Note also that there is an implicit barrier at the end of a parallel region: the master thread does not proceed past !$omp end parallel until every thread in the team has finished its work inside the region.
Within a parallel region a number of OpenMP directives come into effect. One of these is !$omp barrier, which requires that a given thread waits at that point in the code until all threads have reached that point (for a carefully chosen value of "all" when things like nested parallelism are in use - see the standard at https://www.openmp.org/spec-html/5.0/openmpsu90.html for more details). !$omp barrier has nothing to do with delimiting parallel regions: after its use all threads are still available for work, and outside of a parallel region it has no effect.
The following little code might help illustrate things:
ijb@ijb-Latitude-5410:~/work/stack$ cat omp_bar.f90
Program omp_bar
!$ Use omp_lib, Only : omp_get_num_threads, omp_in_parallel
Implicit None
Integer n_th
!$omp parallel default( none ) private( n_th )
n_th = 1
!$ n_th = omp_get_num_threads()
Write( *, * ) 'Hello at 1 on ', n_th, ' threads. ', &
'Are we in a parallel region ?', omp_in_parallel()
!$omp barrier
Write( *, * ) 'Hello at 2', omp_in_parallel()
!$omp end parallel
Write( *, * ) 'Hello at 3', omp_in_parallel()
End Program omp_bar
ijb@ijb-Latitude-5410:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ijb@ijb-Latitude-5410:~/work/stack$ gfortran -fopenmp -std=f2008 -Wall -Wextra -fcheck=all -O -g omp_bar.f90
ijb@ijb-Latitude-5410:~/work/stack$ ./a.out
Hello at 1 on 2 threads. Are we in a parallel region ? T
Hello at 1 on 2 threads. Are we in a parallel region ? T
Hello at 2 T
Hello at 2 T
Hello at 3 F
[Yes, I know the barrier is not guaranteed to synchronise the output order, I got lucky here]
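To tie this back to the original question: because of the implicit barrier at the end of a parallel region mentioned above, option 2) does guarantee that code after !$omp end parallel sees the fully computed values, provided they are shared variables. A minimal sketch of the blah1 -> omp -> blah2 pattern (my illustration, not the questioner's exact code; it assumes sum is shared and updated without races inside the region):

! blah1: serial code, executed by the master thread only
!$omp parallel
   ! ... threads update sum here; concurrent updates to the same
   ! shared variable need !$omp atomic or !$omp critical ...
!$omp end parallel   ! implicit barrier: all threads have finished
! blah2: serial again; sum is fully computed at this point
if (sum > 500) then
   write (*,*) 'sum is final here'
end if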
So, I have this simple Fortran do loop, and inside that loop a couple of subroutines are called. I have made the do loop parallel with OpenMP, like this:
!$omp parallel do
do i=1,n
   call a()
   call b()
enddo
!$omp end parallel do
Now, most of the time the number of iterations in the loop is small compared to the number of processors/threads available, and the subroutines called inside the loop could run in parallel. So, is there a way to call the subroutines in parallel inside the parallel do loop? I have tried with tasks, like this:
!$omp parallel do
do i=1,n
   !$omp task
   call a(i , j )
   !$omp end task
   !$omp task
   call b(i, k)
   !$omp end task
   !$omp taskwait
enddo
!$omp end parallel do
But this fails with a segmentation fault. Is there any way to achieve this?
UPDATE:
So, I found out that the main reason for the segmentation fault is coming from the fftw library. Let's consider a dummy program:
program name
   !$use omp_lib
   implicit real*8(a-h,p-z)

   call system_clock(count_rate=irate)
   call system_clock(it1)

   !$ call omp_set_nested(.true.)
   !$omp parallel do
   do i =1,5
      call test(i)
      print *, i
   enddo
   !$omp end parallel do

   call system_clock(it2)
   print *, (it2-it1)/real(irate, kind=8)
end program name

subroutine test(ii)
   ! just a dummy subroutine for heavy computation
   implicit real*8(a-h,p-z)
   do j=1,40000
      !$omp task
      do k=1,40000
         x = exp(sqrt(sqrt(2.0d0*ii**3)**2))
      enddo
      !$omp end task
   enddo
end subroutine
This program does exactly what I want: using the task directives, it uses the remaining threads and improves the performance. Now let's consider another dummy program, but with fftw, similar to what I'm working on:
program name
   !$use omp_lib
   implicit real*8(a-h,p-z)
   integer, parameter :: n=8192*8
   complex(kind=8) :: arr(n)
   real(kind=8) :: tmp1(n), tmp2(n)
   integer(kind=8) :: pF
   integer :: i

   call system_clock(count_rate=irate)
   call dfftw_plan_dft_1d(pF,n,arr,arr,-1,0) ! forward
   call system_clock(it1)

   !$ call omp_set_nested(.true.)
   !$omp parallel do private(arr)
   do i =1,5
      call random_number(tmp1)
      call random_number(tmp2)
      arr = cmplx(tmp1, tmp2, kind=8)
      call test(pF, arr)
      print *, i
   enddo
   !$omp end parallel do

   call system_clock(it2)
   print *, (it2-it1)/real(irate, kind=8)
end program name

subroutine test(pF, arr)
   implicit real*8(a-h,p-z)
   complex(kind=8) :: arr(:)
   integer(kind=8) :: pF
   do j=1,100
      !$omp task private(arr)
      do k=1, 100
         call dfftw_execute_dft(pF, arr, arr)
      enddo
      !$omp end task
   enddo
end subroutine
Now, this throws the segmentation fault. (NOTE: I have no random number calls in my actual program; they are here just as placeholders.) I have checked http://www.fftw.org/fftw3_doc/Thread-safety.html and fftw_execute is thread safe, and the program works without the task directives. But with the tasks it throws the error. Does anyone know how to fix this?
Sigh, yet another example of why !$omp parallel do is a bad idea ... I really do think it is best to clearly separate the thread creation and worksharing phases.
As Vladimir says in the comments you haven't provided nearly enough detail to tell why you are getting a segmentation fault. However you seem to have a few misconceptions about OpenMP which I can try to address.
Firstly, a very quick and dirty way to achieve what you want while avoiding any extra OpenMP directives is:
!$omp parallel default( none ) private( i ) shared( n ) ! Create threads
!$omp do                                                ! Now share out the work
Do i = 1, 2 * n
   If( Mod( i, 2 ) == 1 ) Then
      Call a
   Else
      Call b
   End If
End Do
!$omp end do
!$omp end parallel
However if you want to use tasks you're probably not doing it the easiest way if all calls to a and b are completely independent. In that case remember that a new task is created whenever ANY thread hits a !$omp task, and that that task can be executed by any thread, not just the one that created it. Following that logic something like
!$omp parallel default( none ) private( i ) shared( n ) ! Create the threads
!$omp single
Do i = 1, n
   !$omp task
   Call a
   !$omp end task
   !$omp task
   Call b
   !$omp end task
End Do
!$omp end single
!$omp end parallel
is what you want - you use one thread to create the list of tasks, and then (or, more probably, while the list is being created) all the available threads will execute them, each task being taken by the next available thread. Note I have also left out the taskwait directive: from your description I'm not sure why you think you need it, as I can see no need for synchronisation at that point.
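As an aside, if your compiler supports OpenMP 4.5 or later, the taskloop construct expresses much the same idea more compactly. A hedged sketch (note that, unlike the two-task version above, a and b for a given iteration here run one after the other inside the same task):

!$omp parallel default( none ) private( i ) shared( n )
!$omp single
!$omp taskloop
Do i = 1, n
   Call a
   Call b
End Do
!$omp end taskloop
!$omp end single
!$omp end parallel

The single is still needed so that only one thread generates the tasks; the rest of the team picks them up as they become available.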
I have a Fortran code that looks like this:
!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(var1, var2, var3, numberOfCalculationsPerformed)
do ix = 1,nx
   ! Do parallel work
   do iy = 1,ny
      ! Do a lot of work....
      !$OMP ATOMIC
      numberOfCalculationsPerformed = numberOfCalculationsPerformed+1
      !$OMP END ATOMIC
      !$OMP MASTER
      ! Report progress
      call progressCallBack(numberOfCalculationsPerformed/totalNCalculations)
      !$OMP END MASTER
   end do
end do
When I try to compile, it reports that
error #7102: An OpenMP* MASTER directive is not permitted in the
dynamic extent of a DO, PARALLEL DO, SECTIONS, PARALLEL SECTIONS, or
SINGLE directive.
I do not understand this. I have tried to modify the parallel do construct to this:
!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(var1, var2, var3, numberOfCalculationsPerformed), &
!$OMP& SCHEDULE(STATIC)
(thinking that it had something to do with the scheduling), but that did nothing to change the error.
Does anyone know what I am not getting right? Is it just impossible to use master inside a parallel do construct or what? If that is so, are there alternatives?
Edit:
!$OMP SINGLE
!$OMP END SINGLE
Using this instead of the MASTER construct yields the same result (the equivalent error message).
PS: I only need one of the threads to execute progressCallBack.
The question is a bit old, but since I recently stumbled across the same issue, I wanted to share a simple solution. The idea is to formulate an if condition which only evaluates to true for one of the threads. This can easily be achieved by querying the current thread number. By requiring it to be zero, the condition is guaranteed to be true for exactly one thread:
!$OMP PARALLEL DO DEFAULT(PRIVATE) SHARED(var1, var2, var3, numberOfCalculationsPerformed)
do ix = 1,nx
   ! Do parallel work
   do iy = 1,ny
      ! Do a lot of work....
      !$OMP ATOMIC
      numberOfCalculationsPerformed = numberOfCalculationsPerformed+1
      !$OMP END ATOMIC
      ! OMP_GET_THREAD_NUM is provided by the omp_lib module
      if (OMP_GET_THREAD_NUM() == 0) then
         ! Report progress
         call progressCallBack(numberOfCalculationsPerformed/totalNCalculations)
      end if
   end do
end do
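For completeness, here is a minimal self-contained sketch of this pattern (program and variable names are hypothetical; note that OMP_GET_THREAD_NUM comes from the omp_lib module):

program progress_demo
   use omp_lib
   implicit none
   integer, parameter :: nx = 1000
   integer :: ix, ndone

   ndone = 0
   !$OMP PARALLEL DO SHARED(ndone)
   do ix = 1, nx
      ! ... a lot of work ...
      !$OMP ATOMIC
      ndone = ndone + 1
      !$OMP END ATOMIC
      if (OMP_GET_THREAD_NUM() == 0) then
         ! Reading ndone unsynchronised is formally a race, but is
         ! usually tolerated for progress output; real() avoids the
         ! integer division that would truncate the fraction to zero
         write (*,*) 'progress: ', real(ndone) / real(nx)
      end if
   end do
   !$OMP END PARALLEL DO
end program progress_demo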
To practice parallelizing do loops, I am computing the following integral in Fortran:
$\int_{0}^{1} \frac{4}{1+x^{2}}\,dx = \pi$
The following is the code that I implemented:
program mpintegrate
   integer i,nmax,nthreads,OMP_GET_NUM_THREADS
   real xn,dx,value
   real X(100000)

   nthreads = 4
   nmax = 100000
   xn = 0.0
   dx = 1.0/nmax
   value = 0.0

   do i=1,nmax
      X(i) = xn
      xn = xn + dx
   enddo

   call OMP_SET_NUM_THREADS(nthreads)

   !$OMP Parallel
   !$OMP Do Schedule(Static) Private(i,X)
   do i=1,nmax
      value = value + dx*(4.0/(1+X(i)*X(i)))
   enddo
   !$OMP End Do NoWait
   !$OMP End Parallel

   print *, value
end
I have no problems compiling the program
gfortran -fopenmp -o mpintegrate mpintegrate.f
The problem is when I execute the program. When I run it as is, I get values ranging between 1 and 4. However, when I uncomment the print statement within the omp do loop, the final value is around what it should be, pi.
Why is the answer in value incorrect?
One problem here is that X needs to not be private (and the data-sharing needs to be specified on the parallel line, not the do line); everyone needs to see it, and there's no point in having separate copies for each thread. Worse, the results you get from accessing the private copy here are undefined, as that private variable hasn't been initialized once you get into the parallel region. You could use firstprivate rather than private, which initializes it for you with what was there before the parallel region, but the easiest/best option here is just shared.
There's also not much point in having the end do be no wait, as the end parallel has to wait for everyone to be done anyway.
However, that being said, you still have a pretty major (and classic) correctness problem. What's happening here is clearer if you're a little more explicit in the loop (dropping the schedule for clarity since the issue doesn't depend on the schedule chosen):
!$OMP Parallel do Private(i) Default(none) Shared(value,X,dx,nmax)
do i=1,nmax
   value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP End Parallel Do
print *, value
Running this repeatedly gives different values:
$ ./foo
1.6643878
$ ./foo
1.5004054
$ ./foo
1.2746993
The problem is that all of the threads are writing to the same shared variable value. This is wrong - everyone is writing at once and the result is gibberish, as a thread can calculate its own contribution, get ready to add it to value, and just as it's about to, another thread can do its write to value, which then gets promptly clobbered. Concurrent writes to the same shared variable are a classic race condition, a standard family of bugs that happen particularly often in shared-memory programming like OpenMP.
In addition to being wrong, it's slow. A number of threads contending for the same few bytes of memory - memory close enough together to fall in the same cache line - can be very slow because of contention in the memory system. Even if they aren't exactly the same variable (as they are in this case), this memory contention - False Sharing in the case that they only happen to be neighbouring variables - can significantly slow things down. Taking out the explicit thread-number setting, and using environment variables:
$ export OMP_NUM_THREADS=1
$ time ./foo
3.1407621
real 0m0.003s
user 0m0.001s
sys 0m0.001s
$ export OMP_NUM_THREADS=2
$ time ./foo
3.1224852
real 0m0.007s
user 0m0.012s
sys 0m0.000s
$ export OMP_NUM_THREADS=8
$ time ./foo
1.1651508
real 0m0.008s
user 0m0.042s
sys 0m0.000s
So things get almost 3 times slower (and increasingly wrong) when running with more threads.
So what can we do to fix this? One thing we could do is make sure that everyone's additions aren't overwriting each other, with the atomic directive:
!$OMP Parallel do Schedule(Static) Private(i) Default(none) Shared(X,dx,value,nmax)
do i=1,nmax
   !$OMP atomic
   value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP end parallel do
which solves the correctness problem:
$ export OMP_NUM_THREADS=8
$ ./foo
3.1407621
but does nothing for the speed problem:
$ export OMP_NUM_THREADS=1
$ time ./foo
3.1407621
real 0m0.004s
user 0m0.001s
sys 0m0.002s
$ export OMP_NUM_THREADS=2
$ time ./foo
3.1407738
real 0m0.014s
user 0m0.023s
sys 0m0.001s
(Note you get slightly different answers with different numbers of threads. This is because the final sum is calculated in a different order than in the serial case. With single precision reals, differences showing up in the 7th digit due to different orderings of the operations are hard to avoid, and here we're doing 100,000 operations.)
So what else could we do? One approach is for everyone to keep track of their own partial sums, and then sum them all together when we're done:
!...
integer, parameter :: nthreads = 4
integer, parameter :: space=8
integer :: threadno
real, dimension(nthreads*space) :: partials
!...
partials = 0
!...
!$OMP Parallel Private(value,i,threadno) Default(none) Shared(X,dx,nmax,partials)
value = 0
threadno = omp_get_thread_num()
!$OMP DO
do i=1,nmax
   value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP END DO
partials((threadno+1)*space) = value
!$OMP end parallel
value = sum(partials)
print *, value
end
This works - we get the right answer, and if you play with the number of threads, you'll find it's pretty zippy - we've spaced out the entries in the partial sums array to avoid false sharing (and the sharing really is false this time, as everyone is writing to a different entry in the array - there is no overwriting).
Still, this is a silly amount of work just to get a sum correct across threads! There's a simpler way to do this - OpenMP has a reduction construct that does it automatically (and more efficiently than the handmade version above):
!$OMP Parallel do reduction(+:value) Private(i) Default(none) Shared(X,dx,nmax)
do i=1,nmax
   value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP end parallel do
print *, value
and now the program works correctly, is fast, and the code is fairly simple. The final code, in more modern Fortran, looks something like this:
program mpintegrate
   use omp_lib
   integer, parameter :: nmax = 100000
   integer, parameter :: nthreads = 4
   real :: xn,dx,value
   real :: X(nmax)
   integer :: i

   xn = 0.0
   dx = 1.0/nmax
   value = 0.0

   do i=1,nmax
      X(i) = xn
      xn = xn + dx
   enddo

   call omp_set_num_threads(nthreads)

   !$OMP Parallel do reduction(+:value) Private(i) Default(none) Shared(X,dx)
   do i=1,nmax
      value = value + dx*(4.0/(1+X(i)*X(i)))
   enddo
   !$OMP end parallel do

   print *, value
end program mpintegrate
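For reference, the final version compiles and runs along these lines with gfortran (the file name is assumed):

gfortran -fopenmp -o mpintegrate mpintegrate.f90
OMP_NUM_THREADS=4 ./mpintegrate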
I have legacy fortran source file named pot.f,
which I need to parallelize with OpenMP as shown below, but I get error messages about an unexpected END statement etc. When I comment out the $OMP lines by adding an additional ! in the first column, there are no errors.
It is really weird to me. Can anybody tell me what went wrong?
      subroutine pot_osc(rvp,R_pot,e_pot,pe_pot,ftmp,gtmp,vtmp,natoms)
      implicit none
      include 'sizes.h'
      include 'constants.h'
      include 'omp_lib.h'
      double precision ftmp(maxatoms,3),gtmp(3),R_pot(maxatoms,3)
!$OMP PARALLEL WORKSHARE SHARED(gtmp,ftmp)
!$OMP PARALLEL NUM_THREADS(16)
      gtmp = 0d0
      ftmp = 0d0
!$OMP END PARALLEL WORKSHARE
      return
      end

      subroutine pot_asym(rvp,vtmp)
      implicit none
      include 'constants.h'
      return
      end
Error messages:
end
1
Error: Unexpected END statement at (1)
subroutine pot_asym(rvp,vtmp)
1
Error: Unclassifiable statement at (1)
You start a second parallel region with the second OpenMP directive, and it is not terminated by an end parallel. The OpenMP directive should instead read:
!$OMP PARALLEL WORKSHARE SHARED(gtmp,ftmp) NUM_THREADS(16)
gtmp = 0d0
ftmp = 0d0
!$OMP END PARALLEL WORKSHARE
or, if you would like to keep the line break, use:
!$OMP PARALLEL WORKSHARE SHARED(gtmp,ftmp) &
!$OMP NUM_THREADS(16)
gtmp = 0d0
ftmp = 0d0
!$OMP END PARALLEL WORKSHARE
In the past, I experienced some problems with exactly this kind of initialization. It seems that, when compiled with gfortran, the master thread did all the work. Even worse, by means of the "first-touch principle", the whole array was placed in the memory associated with the first thread. On our ccNUMA machine this led to a huge slowdown.
To solve this I used explicit loops to initialize:
!$OMP PARALLEL DO SHARED(gtmp,ftmp) NUM_THREADS(16)
      do i=1,maxatoms
         ftmp(i,:) = 0d0
      enddo
!$OMP END PARALLEL DO
! No need to do three elements in parallel
gtmp = 0d0
I don't know whether this problem has been fixed since, but I have used this way of initializing arrays in shared memory ever since.
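A related point, since first touch came up: on a ccNUMA machine it can also pay to give the later compute loops the same static schedule as the initialization loop, so that each thread keeps working on the rows (and hence the memory pages) it touched first. A sketch of that pairing, reusing the names from the question (the compute-loop body is a placeholder):

! Initialization: each thread first-touches its own block of ftmp
!$OMP PARALLEL DO SCHEDULE(STATIC) SHARED(ftmp) NUM_THREADS(16)
      do i=1,maxatoms
         ftmp(i,:) = 0d0
      enddo
!$OMP END PARALLEL DO

! Compute: the same schedule makes each thread revisit the same rows
!$OMP PARALLEL DO SCHEDULE(STATIC) SHARED(ftmp) NUM_THREADS(16)
      do i=1,maxatoms
         ftmp(i,:) = ftmp(i,:) + 1d0 ! placeholder for real work
      enddo
!$OMP END PARALLEL DO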
I have two do-loops inside an OpenMP parallel region, as follows:
!$OMP PARALLEL
...
!$OMP DO
...
!$OMP END DO
...
!$OMP DO
...
!$OMP END DO
...
!$OMP END PARALLEL
Let's say OMP_NUM_THREADS=6. I want to run the first do-loop with 4 threads and the second do-loop with 3 threads. Can you show how to do it? I want them to be inside one parallel region, though. Also, is it possible to specify which thread numbers should execute each of the do-loops? For example, for the first do-loop I could ask it to use thread numbers 1, 2, 4, and 5. Thanks.
Well, you can add the num_threads clause to an OpenMP parallel directive, but that sets the team size for the whole region, not for individual constructs inside it. In your case you could split your program into two regions, like:
!$OMP PARALLEL DO num_threads(4)
...
!$OMP END PARALLEL DO
...
!$OMP PARALLEL DO num_threads(3)
...
!$OMP END PARALLEL DO
This, of course, is precisely what you say you don't want to do, since you want only one parallel region. But there is no mechanism for throttling the number of threads in use inside a parallel region, and personally I can't see why anyone would want to do that.
As for assigning parts of the computation to particular threads: again, no, OpenMP does not provide a mechanism for doing that - and why would you want to?
I suppose that I am dreadfully conventional, but when I see signs of parallel programs where the programmer has tried to take precise control over individual threads, I usually see a program with one or more of the following characteristics:
OpenMP directives are used to ensure that the code runs in serial with the result that run time exceeds that of the original serial code;
the program is incorrect because the programmer has failed to deal correctly with the subtleties of data races;
it has been carefully arranged to run only on a specific number of threads.
None of these is desirable in a parallel program and if you want the level of control over numbers of threads and the allocation of work to individual threads you will have to use a lower-level approach than OpenMP provides. Such approaches abound so giving up OpenMP should not limit you.
What you want cannot be achieved with the existing OpenMP constructs, only manually. Imagine that the original parallel loop was:
!$OMP DO
DO i = 1, 100
...
END DO
!$OMP END DO
The modified version with custom selection of the participating threads would be:
USE OMP_LIB

INTEGER, DIMENSION(:), ALLOCATABLE :: threads
INTEGER :: tid, i, imin, imax, tidx

! IDs of threads that should execute the loop
! Make sure no repeated items inside
threads = (/ 0, 1, 3, 4 /)
IF (MAXVAL(threads, 1) >= omp_get_max_threads()) THEN
   STOP 'Error: insufficient number of OpenMP threads'
END IF

!$OMP PARALLEL PRIVATE(tid,i,imin,imax,tidx)
! Get current thread's ID
tid = omp_get_thread_num()
...

! Check if current thread should execute part of the loop
IF (ANY(threads == tid)) THEN
   ! Find out what thread's index is
   tidx = MAXLOC(threads, 1, threads == tid)
   ! Compute iteration range based on the thread index
   imin = 1 + ((100-1 + 1)*(tidx-1))/SIZE(threads)
   imax = 1 + ((100-1 + 1)*tidx)/SIZE(threads) - 1
   PRINT *, 'Thread', tid, imin, imax
   DO i = imin, imax
      ...
   END DO
ELSE
   PRINT *, 'Thread', tid, 'not taking part'
END IF

! This simulates the barrier at the end of the worksharing construct
! Remove in order to implement the "nowait" clause
!$OMP BARRIER
...
!$OMP END PARALLEL
Here are three example executions:
$ OMP_NUM_THREADS=2 ./custom_loop.x | sort
STOP Error: insufficient number of OpenMP threads
$ OMP_NUM_THREADS=5 ./custom_loop.x | sort
Thread 0 1 33
Thread 1 34 66
Thread 2 not taking part
Thread 3 not taking part
Thread 4 67 100
$ OMP_NUM_THREADS=7 ./custom_loop.x | sort
Thread 0 1 33
Thread 1 34 66
Thread 2 not taking part
Thread 3 not taking part
Thread 4 67 100
Thread 5 not taking part
Thread 6 not taking part
Note that this is an awful hack and goes against the basic premises of the OpenMP model. I would strongly advise against doing it: relying on certain threads to execute certain portions of the code creates highly non-portable programs and hinders runtime optimisations.
If you decide to abandon the idea of explicitly assigning the threads that should execute the loop and only want to dynamically change the number of threads, then the chunk size parameter in the SCHEDULE clause is your friend:
!$OMP PARALLEL
...
! 2 threads = 10 iterations / 5 iterations/chunk
!$OMP DO SCHEDULE(static,5)
DO i = 1, 10
   PRINT *, i, omp_get_thread_num()
END DO
!$OMP END DO
...
! 10 threads = 10 iterations / 1 iteration/chunk
!$OMP DO SCHEDULE(static,1)
DO i = 1, 10
   PRINT *, i, omp_get_thread_num()
END DO
!$OMP END DO
...
!$OMP END PARALLEL
And the output with 10 threads:
$ OMP_NUM_THREADS=10 ./loop_chunks.x | sort_manually :)
First loop
Iteration Thread ID
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
Second loop
Iteration Thread ID
1 0
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
10 9
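A closing note on this trick: to run a loop of n iterations on at most k threads, choose the chunk size as the ceiling of n/k, which can be computed before the directive (a sketch; n and k are assumed to be integer variables):

! Use at most k of the team's threads for this loop
chunk = (n + k - 1) / k   ! ceiling of n/k in integer arithmetic
!$OMP DO SCHEDULE(static, chunk)
DO i = 1, n
   ! ... work ...
END DO
!$OMP END DO

With n = 10 and k = 2 this gives chunk = 5, reproducing the first loop above.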