It seems many people have similar issues with OpenMP, but I couldn't find a solution to my problem.
I'm using this simple code:
PROGRAM Parallel_Hello_World
USE OMP_LIB
integer :: i
real :: start, finish, B
call cpu_time(start)
!$OMP PARALLEL
!$OMP DO
DO i = 1,2000000
B = cos(cos(cos(sin(cos(sqrt(sqrt(sqrt( cos( real(i) ) ))))))))
B = cos(cos(cos(sin(cos(sqrt(sqrt(sqrt( cos( B ) ))))))))
B = cos(cos(cos(sin(cos(sqrt(sqrt(sqrt( cos( B ) ))))))))
END DO
!$OMP END DO
!$OMP END PARALLEL
call cpu_time(finish)
print '("Time = ",f6.3," seconds.")',finish-start
END
I'm confused about where the overhead is coming from. Even when I increase the number of sin/cos/sqrt operations, fewer threads always win.
export OMP_NUM_THREADS=1
Time = 1.58 seconds. (average)
export OMP_NUM_THREADS=8
Time = 2.376 seconds. (average)
Compile:
ifort para.f90 -o para.exe -qopenmp -O2
The Intel compiler is from 2020.
My question is about synchronizing threads. Basically, if I have an OpenMP code in Fortran where each thread is doing something, I think there are two possibilities for synchronizing them (letting some variable have the same value in each thread):
1) add !$OMP BARRIER
2) add !$OMP END PARALLEL; if necessary, open another !$OMP PARALLEL and !$OMP END PARALLEL block later on.
Are options 1) and 2) equivalent? I saw a question about barriers in nested threads: omp barrier nested threads
So far I am more interested in simpler scenarios with Fortran. E.g., for the code below, if I use a barrier, the two if (sum > 500) then conditions seem to behave the same, at least with gfortran.
PROGRAM test
USE OMP_LIB
integer :: numthreads, i, sum
numthreads = 2
sum = 0
call omp_set_num_threads(numthreads)
!$OMP PARALLEL
if (OMP_GET_THREAD_NUM() == 0) then
write (*,*) 'a'
do i = 1, 30
write (*,*) sum
sum = sum + i
end do
!write (*,*) 'sum', sum
else if (OMP_GET_THREAD_NUM() == 1) then
write (*,*) 'b'
do i = 1, 15
write (*,*) sum
sum = sum + i
end do
!write (*,*) 'sum', sum
end if
!$OMP BARRIER
if (sum > 500) then
write (*,*) 'sum v1'
else
write (*,*) 'not yet v1'
end if
!$OMP END PARALLEL
if (sum > 500) then
write (*,*) 'sum v2', sum
else
write (*,*) 'not yet v2', sum
end if
END
My concern is, for code like
blah1
!$OMP PARALLEL
!$OMP END PARALLEL
blah2
whether the computer will execute it as blah1 -> omp -> blah2. If the variables used in blah2 (e.g., sum in the example code) have been evaluated completely in the omp block, then I don't need to worry that some thread in the omp block runs ahead, computes only part of a value (e.g., sum in the question), and reaches the if condition in the blah2 section, leading to an unexpected result.
No, they are not equivalent at all.
For !$omp end parallel let's think a little bit about how parallelism works within OpenMP. At the start of your program you just have a single so called master thread available. This remains the case until you reach a parallel region, within which you have multiple threads available, the master and (possibly) a number of others. In Fortran a parallel region is started with the !$omp parallel directive. It is closed by a !$omp end parallel directive, after which you just have the master thread available to your code until you start another parallel region. Thus !$omp end parallel simply marks the end of a parallel region.
Within a parallel region a number of OpenMP directives start to have an effect. One of these is !$omp barrier, which requires that a given thread waits at that point in the code until all threads have reached that point (for a carefully chosen value of "all" when things like nested parallelism are in use - see the standard at https://www.openmp.org/spec-html/5.0/openmpsu90.html for more details). !$omp barrier has nothing to do with delimiting parallel regions. Thus after its use all threads are still available for use, and outside of a parallel region it has no effect.
The following little code might help illustrate things
ijb#ijb-Latitude-5410:~/work/stack$ cat omp_bar.f90
Program omp_bar
!$ Use omp_lib, Only : omp_get_num_threads, omp_in_parallel
Implicit None
Integer n_th
!$omp parallel default( none ) private( n_th )
n_th = 1
!$ n_th = omp_get_num_threads()
Write( *, * ) 'Hello at 1 on ', n_th, ' threads. ', &
'Are we in a parallel region ?', omp_in_parallel()
!$omp barrier
Write( *, * ) 'Hello at 2', omp_in_parallel()
!$omp end parallel
Write( *, * ) 'Hello at 3', omp_in_parallel()
End Program omp_bar
ijb#ijb-Latitude-5410:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ijb#ijb-Latitude-5410:~/work/stack$ gfortran -fopenmp -std=f2008 -Wall -Wextra -fcheck=all -O -g omp_bar.f90
ijb#ijb-Latitude-5410:~/work/stack$ ./a.out
Hello at 1 on 2 threads. Are we in a parallel region ? T
Hello at 1 on 2 threads. Are we in a parallel region ? T
Hello at 2 T
Hello at 2 T
Hello at 3 F
[Yes, I know the barrier is not guaranteed to synchronise the output order, I got lucky here]
This is a part of my Fortran code.
iabcd=0
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(icheck,iv,HO,tnljm,H)
!$OMP DO SCHEDULE(DYNAMIC) REDUCTION(+: iabcd)
do ia=1,HO%NLEV
do ib=ia+1,HO%NLEV
do ic=1,HO%NLEV
do id=ic+1,HO%NLEV
if(tnljm%t(ia)+tnljm%t(ib) .ne. tnljm%t(ic)+tnljm%t(id)) cycle
iabcd = iabcd + 1
H%ka(iabcd) = ia
H%kb(iabcd) = ib
H%kc(iabcd) = ic
H%kd(iabcd) = id
H%ME2BM(iabcd) = 0.d0
enddo
enddo
enddo
enddo
!$OMP END DO
!$OMP END PARALLEL
I can run the code without any warnings, but the results are weird and different from the results obtained without OpenMP. What's the problem with the code? Thanks.
When using OMP DO REDUCTION(+:iabcd), each thread creates its own private copy of iabcd and uses that one within the loop; only after the loop has finished is the addition into the original variable performed.
That is, the access
H%ka(iabcd) = ia
H%kb(iabcd) = ib
H%kc(iabcd) = ic
H%kd(iabcd) = id
H%ME2BM(iabcd) = 0.d0
only uses the local, private version of iabcd, which is not the same behaviour as in the serial version of the code.
What you can do instead is use OMP CRITICAL for the update of iabcd, as you want all threads to use the same version of iabcd (and, of course, also make iabcd shared). You then also have to take a private copy of iabcd inside the critical region, so that the update of H happens with the correct iabcd. This will, however, decrease the efficiency of the parallelization, unless tnljm%t(ia)+tnljm%t(ib) .ne. tnljm%t(ic)+tnljm%t(id) is almost always true.
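A minimal sketch of that approach (assuming the same HO, tnljm and H structures as in the question, and introducing a hypothetical private integer my_iabcd that you would need to declare) could look like this:

iabcd = 0
!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(icheck,iv,HO,tnljm,H,iabcd)
!$OMP DO SCHEDULE(DYNAMIC)
do ia=1,HO%NLEV
  do ib=ia+1,HO%NLEV
    do ic=1,HO%NLEV
      do id=ic+1,HO%NLEV
        if(tnljm%t(ia)+tnljm%t(ib) .ne. tnljm%t(ic)+tnljm%t(id)) cycle
        !$OMP CRITICAL
        iabcd = iabcd + 1    ! shared counter, updated by one thread at a time
        my_iabcd = iabcd     ! private copy taken inside the critical region
        !$OMP END CRITICAL
        H%ka(my_iabcd) = ia
        H%kb(my_iabcd) = ib
        H%kc(my_iabcd) = ic
        H%kd(my_iabcd) = id
        H%ME2BM(my_iabcd) = 0.d0
      enddo
    enddo
  enddo
enddo
!$OMP END DO
!$OMP END PARALLEL

Note that the entries of H will then generally be stored in a different order than in the serial run (and may differ from run to run), since the threads reach the critical region in a nondeterministic order.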
I was trying to parallelize the following code; however, when it was executed in the main program, there didn't seem to be a significant speed-up. I tested the same subroutine in another program, and it took even longer to run than the serial code.
SUBROUTINE rotate(r,qt,n,np,i,a,b)
IMPLICIT NONE
INTEGER n,np,i
DOUBLE PRECISION a,b,r(np,np),qt(np,np)
INTEGER j
DOUBLE PRECISION c,fact,s,w,y
if(a.eq.0.d0)then
c=0.d0
s=sign(1.d0,b)
else if(abs(a).gt.abs(b))then
fact=b/a
c=sign(1.d0/sqrt(1.d0+fact**2),a)
s=fact*c
else
fact=a/b
s=sign(1.d0/sqrt(1.d0+fact**2),b)
c=fact*s
endif
!$omp parallel shared(i,n,c,s,r,qt) private(y,w,j)
!$omp do schedule(static,2)
do 11 j=i,n
y=r(i,j)
w=r(i+1,j)
r(i,j)=c*y-s*w
r(i+1,j)=s*y+c*w
11 continue
!$omp do schedule(static,2)
do 12 j=1,n
y=qt(i,j)
w=qt(i+1,j)
qt(i,j)=c*y-s*w
qt(i+1,j)=s*y+c*w
12 continue
!$omp end parallel
return
END
C (C) Copr. 1986-92 Numerical Recipes Software Vs94z&):9+X%1j49#:`*.
However, when I used the built-in time command in Linux to measure the run time, I got:
real 0m12.160s
user 4m49.894s
sys 0m0.880s
which is ridiculous compared to the time of the serial code:
real 0m2.078s
user 0m2.068s
sys 0m0.000s
So you have something like
do i=1,n
do j=1,n
do k=1,n
call rotate()
end do
end do
end do
for n = 100 and you are parallelizing two simple loops inside rotate.
That is hopeless. If you want decent performance, you must parallelize the outermost loop that you can.
There is simply not enough work inside the loops in rotate, and it is called far too many times: you call it 1000000 times, so the threads must be synchronized or re-launched 2000000 times. That synchronization accounts for all of the run-time increase you see.
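To illustrate the general point with a made-up example (this is not your QR code; the array a and the loop bounds are just for demonstration): creating the parallel region inside the innermost work, as rotate does, pays the fork/synchronization cost on every call, whereas parallelizing the outermost independent loop pays it once.

program outer_vs_inner
  implicit none
  integer, parameter :: n = 1000
  double precision :: a(n,n)
  integer :: i, j

  ! Slow pattern: the parallel region is created and the threads are
  ! synchronized n times, and each region only contains n cheap iterations.
  do i = 1, n
     !$omp parallel do
     do j = 1, n
        a(i,j) = cos(dble(i)) + sin(dble(j))
     end do
     !$omp end parallel do
  end do

  ! Better pattern: one parallel region around the outermost loop, so each
  ! thread gets a large chunk of independent work and synchronizes once.
  !$omp parallel do private(j)
  do i = 1, n
     do j = 1, n
        a(i,j) = cos(dble(i)) + sin(dble(j))
     end do
  end do
  !$omp end parallel do

  print *, a(1,1), a(n,n)
end program outer_vs_inner

Whether the outermost loop of your actual program can be parallelized this way depends on whether its iterations are independent, which is often not the case for a sequence of Givens rotations; if they are not, you have to look for parallelism at a different level.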
I am writing a Fortran program that needs to have reproducible results (for publication). My understanding of the following code is that it should be reproducible.
program main
implicit none
real(8) :: ybest,xbest,x,y
integer :: i
ybest = huge(0d0)
!$omp parallel do ordered private(x,y) shared(ybest,xbest) schedule(static,1)
do i = 1,10
!$omp ordered
!$omp critical
call random_number(x)
!$omp end critical
!$omp end ordered
! Do a lot of work
call sleep(1)
y = -1d0
!$omp ordered
!$omp critical
if (y<ybest) then
ybest = y
xbest = x
end if
!$omp end critical
!$omp end ordered
end do
!$omp end parallel do
end program
In my case, there is a function in place of "sleep" that takes a long time to compute, and I want it done in parallel. According to the OpenMP standard, should sleep in this example execute in parallel? I thought it should (based on this: How does the omp ordered clause work?), but with gfortran 5.2.0 (Mac) and gfortran 5.1.0 (Linux) it is not executing in parallel (at least, there is no speedup from it). The timing results are below.
Also, my guess is the critical statements are not necessary, but I wasn't completely sure.
Thanks.
-Edit-
In response to Vladmir's comments, I added a full working program with timing results.
#!/bin/bash
mpif90 main.f90
time ./a.out
mpif90 main.f90 -fopenmp
time ./a.out
The code runs as
real 0m10.047s
user 0m0.003s
sys 0m0.003s
real 0m10.037s
user 0m0.003s
sys 0m0.004s
BUT, if you comment out the ordered blocks, it runs with the following times:
real 0m10.044s
user 0m0.002s
sys 0m0.003s
real 0m3.021s
user 0m0.002s
sys 0m0.004s
Edit -
In response to innoSPG, here are the results for a non-trivial function in place of sleep:
real(8) function f(x)
implicit none
real(8), intent(in) :: x
! local
real(8) :: tmp
integer :: i
tmp = 0d0
do i = 1,10000000
tmp = tmp + cos(sin(x))/real(i,8)
end do
f = tmp
end function
real 0m2.229s --- no openmp
real 0m2.251s --- with openmp and ordered
real 0m0.773s --- with openmp but ordered commented out
This program is non-conforming to the OpenMP standard. Specifically, the problem is that you have more than one ordered region and every iteration of your loop will execute both of them. The OpenMP 4.0 standard has this to say (2.12.8, Restrictions, line 16, p 139):
During execution of an iteration of a loop or a loop nest within a loop region, a thread must not execute more than one ordered region that binds to the same loop region.
If you have more than one ordered region, you must have conditional code paths such that only one of them can be executed for any loop iteration.
It is also worth noting that the position of your ordered region seems to have performance implications. Testing with gfortran 5.2, it appears everything after the ordered region is executed in order for each loop iteration, so having the ordered block at the beginning of the loop leads to serial performance, while having the ordered block at the end of the loop does not, as the code before the block is parallelized. Testing with ifort 15 is not as dramatic, but I would still recommend structuring your code so that your ordered block occurs after any code that needs parallelization in a loop iteration, rather than before.
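As an illustration of that restructuring (this is my own sketch of one conforming arrangement, not the only one; note that x becomes an array here): the random numbers are drawn serially before the loop, leaving a single ordered region per iteration, placed after the expensive work.

program main
  implicit none
  real(8) :: ybest, xbest, y
  real(8) :: x(10)
  integer :: i
  ybest = huge(0d0)
  ! Draw all the random numbers serially, before the loop, so the sequence
  ! is reproducible and only one ordered region is needed per iteration.
  call random_number(x)
  !$omp parallel do ordered private(y) shared(ybest,xbest,x) schedule(static,1)
  do i = 1,10
     ! Do a lot of work (placeholder, as in the question)
     call sleep(1)
     y = -1d0
     !$omp ordered
     if (y < ybest) then      ! the comparison runs in iteration order
        ybest = y
        xbest = x(i)
     end if
     !$omp end ordered
  end do
  !$omp end parallel do
  print *, xbest, ybest
end program main

Since the expensive work now comes before the only ordered region in each iteration, it can proceed in parallel, while the update of ybest and xbest is still performed in iteration order, which keeps the result reproducible.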
To practice parallelizing the do loop, I am doing the following integral in Fortran
$\int_{0}^{1} \frac{4}{1+x^{2}}\,dx = \pi$
The following is the code that I implemented:
program mpintegrate
integer i,nmax,nthreads,OMP_GET_NUM_THREADS
real xn,dx,value
real X(100000)
nthreads = 4
nmax = 100000
xn = 0.0
dx = 1.0/nmax
value = 0.0
do i=1,nmax
X(i) = xn
xn = xn + dx
enddo
call OMP_SET_NUM_THREADS(nthreads)
!$OMP Parallel
!$OMP Do Schedule(Static) Private(i,X)
do i=1,nmax
value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP End DO NoWait
!$OMP End Parallel
print *, value
end
I have no problems compiling the program
gfortran -fopenmp -o mpintegrate mpintegrate.f
The problem is when I execute the program. When I run the program as is, I get values ranging from 1 to 4. However, when I uncomment the print statement within the OMP do loop, the final value is around what it should be, pi.
Why is the answer in value incorrect?
One problem here is that X should not be private (and its data-sharing attribute needs to be specified on the parallel line, not the do line); everyone needs to see it, and there's no point in having separate copies for each thread. Worse, the results you get from accessing the private copy here are undefined, as that private variable hasn't been initialized once you get into the parallel region. You could use firstprivate rather than private, which initializes it for you with what was there before the parallel region, but the easiest and best choice here is just shared.
There's also not much point in having the end do be no wait, as the end parallel has to wait for everyone to be done anyway.
However, that being said, you still have a pretty major (and classic) correctness problem. What's happening here is clearer if you're a little more explicit in the loop (dropping the schedule for clarity since the issue doesn't depend on the schedule chosen):
!$OMP Parallel do Private(i) Default(none) Shared(value,X,dx,nmax)
do i=1,nmax
value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP End Parallel Do
print *, value
Running this repeatedly gives different values:
$ ./foo
1.6643878
$ ./foo
1.5004054
$ ./foo
1.2746993
The problem is that all of the threads are writing to the same shared variable value. This is wrong - everyone is writing at once and the result is gibberish, as a thread can calculate its own contribution, get ready to add it to value, and just as it's about to, another thread can do its writing to value, which then gets promptly clobbered. Concurrent writes to the same shared variable are a classic race condition, a standard family of bugs that happen particularly often in shared-memory programming like with OpenMP.
In addition to being wrong, it's slow. A number of threads contending for the same few bytes of memory - memory close enough together to fall in the same cache line - can be very slow because of contention in the memory system. Even if they aren't exactly the same variable (as they are in this case), this memory contention - False Sharing in the case that they only happen to be neighbouring variables - can significantly slow things down. Taking out the explicit thread-number setting, and using environment variables:
$ export OMP_NUM_THREADS=1
$ time ./foo
3.1407621
real 0m0.003s
user 0m0.001s
sys 0m0.001s
$ export OMP_NUM_THREADS=2
$ time ./foo
3.1224852
real 0m0.007s
user 0m0.012s
sys 0m0.000s
$ export OMP_NUM_THREADS=8
$ time ./foo
1.1651508
real 0m0.008s
user 0m0.042s
sys 0m0.000s
So things get almost 3 times slower (and increasingly wrong) running with more threads.
So what can we do to fix this? One thing we could do is make sure that everyone's additions aren't overwriting each other, with the atomic directive:
!$OMP Parallel do Schedule(Static) Private(i) Default(none) Shared(X,dx, value, nmax)
do i=1,nmax
!$OMP atomic
value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP end parallel do
which solves the correctness problem:
$ export OMP_NUM_THREADS=8
$ ./foo
3.1407621
but does nothing for the speed problem:
$ export OMP_NUM_THREADS=1
$ time ./foo
3.1407621
real 0m0.004s
user 0m0.001s
sys 0m0.002s
$ export OMP_NUM_THREADS=2
$ time ./foo
3.1407738
real 0m0.014s
user 0m0.023s
sys 0m0.001s
(Note you get slightly different answers with different numbers of threads. This is due to the final sum being calculated in a different order than in the serial case. With single precision reals, differences showing up in the 7th digit due to different ordering of operations are hard to avoid, and here we're doing 100,000 operations.)
So what else could we do? One approach is for everyone to keep track of their own partial sums, and then sum them all together when we're done:
!...
integer, parameter :: nthreads = 4
integer, parameter :: space=8
integer :: threadno
real, dimension(nthreads*space) :: partials
!...
partials=0
!...
!$OMP Parallel Private(value,i,threadno) Default(none) Shared(X,dx,nmax,partials)
value = 0
threadno = omp_get_thread_num()
!$OMP DO
do i=1,nmax
value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP END DO
partials((threadno+1)*space) = value
!$OMP end parallel
value = sum(partials)
print *, value
end
This works - we get the right answer, and if you play with the number of threads, you'll find it's pretty zippy - we've spaced out the entries in the partial sums array to avoid false sharing (and it is false, this time, as everyone is writing to a different entry in the array - no overwriting).
Still, this is a silly amount of work just to get a sum correct across threads! There's a simpler way to do this - OpenMP has a reduction construct to do this automatically (and more efficiently than the handmade version above):
!$OMP Parallel do reduction(+:value) Private(i) Default(none) Shared(X,dx)
do i=1,nmax
value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP end parallel do
print *, value
and now the program works correctly, is fast, and the code is fairly simple. The final code, in more modern Fortran, looks something like this:
program mpintegrate
use omp_lib
integer, parameter :: nmax = 100000
real :: xn,dx,value
real :: X(nmax)
integer :: i
integer, parameter :: nthreads = 4
xn = 0.0
dx = 1.0/nmax
value = 0.0
do i=1,nmax
X(i) = xn
xn = xn + dx
enddo
call omp_set_num_threads(nthreads)
!$OMP Parallel do reduction(+:value) Private(i) Default(none) Shared(X,dx)
do i=1,nmax
value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP end parallel do
print *, value
end