Results of parallel program with nested loops differ from serial program - fortran

I would like to use OpenMP to parallelize this single-threaded code:
PROGRAM SINGLE
INTEGER, DIMENSION(30000)::SUMGRM
INTEGER, DIMENSION(90000)::GRI,H
REAL*8::HSTEP1X,HSTEP2X
REAL*8::TIME1,TIME2
!Just initial values
DO I=1, 30000
SUMGRM(I)=I*3
END DO
DO I=1, 90000
GRI(I)=I
H(I)=0.5*I/10000
END DO
!Computing computer's running time (start) : for serial programming
CALL CPU_TIME(TIME1)
DO K=1, 50000
DO I=2, 30000
HSTEP1X=0.0
DO J=SUMGRM(I-1)+1, SUMGRM(I)-1
HSTEP2X=H(GRI(J))/0.99
HSTEP1X=HSTEP1X+HSTEP2X
END DO
HSTEP2X=H(GRI(SUMGRM(I)))/0.99
HSTEP1X=HSTEP1X+HSTEP2X
END DO
END DO
PRINT *, 'Results =', HSTEP1X
PRINT *, ' '
!Computing computer's running time (finish) : for serial programming
CALL CPU_TIME(TIME2)
PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'
END PROGRAM SINGLE
As you can see, the main difficulty is the innermost loop (over J), whose bounds depend on the enclosing loop index I. I've tried to parallelize this program like this:
PROGRAM PARALLEL
INTEGER, DIMENSION(30000)::SUMGRM
INTEGER, DIMENSION(90000)::GRI,H
REAL*8::HSTEP1X,HSTEP2X
REAL*8::TIME1,TIME2,OMP_GET_WTIME
INTEGER::Q2,P2
!Just initial values
DO I=1, 30000
SUMGRM(I)=I*3
END DO
DO I=1, 90000
GRI(I)=I
H(I)=0.5*I/10000
END DO
!Computing computer's running time (start) : for parallel programming
TIME1= OMP_GET_WTIME()
DO K=1, 50000
!$OMP PARALLEL DO PRIVATE (HSTEP1X,Q2,P2)
DO I=2, 30000
HSTEP1X=0.0
Q2=SUMGRM(I-1)+1
P2=SUMGRM(I)-1
DO J=Q2, P2
HSTEP2X=H(GRI(J))/0.99
HSTEP1X=HSTEP1X+HSTEP2X
END DO
HSTEP2X=H(GRI(SUMGRM(I)))/0.99
HSTEP1X=HSTEP1X+HSTEP2X
END DO
!$OMP END PARALLEL DO
END DO
PRINT *, 'Results =', HSTEP1X
PRINT *, ' '
!Computing computer's running time (finish) : for parallel programming
TIME2= OMP_GET_WTIME()
PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'
END PROGRAM PARALLEL
I'm compiling with gfortran -O3 -fopenmp and then export OMP_NUM_THREADS=... The parallel program runs faster, but its result differs from the single-threaded code: the serial program gives 12.1212 (which is the correct value), while the parallel one gives 0.000, so something must be wrong.
What did I do wrong?

Firstly, we can note that by default you're likely to find that both j and hstep2x are shared between threads. I don't think that's really what you want, as it will lead to some very odd behaviour where multiple threads use the same iteration index while trying to loop over different ranges.
Next, let's note that your serial code actually just prints the result of the i=30000 iteration, because hstep1x is reset to 0 at the start of every iteration. So, to get the "correct" answer in the OpenMP code, we could just focus on reproducing that final iteration -- which, I think, completely negates the point of using OpenMP here. I'm guessing this is a simplified case standing in for your real problem, and that some of the real problem may have been lost in producing it.
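For illustration only, that final iteration can be reproduced on its own with a short serial snippet like the sketch below (it folds the separate SUMGRM(I) term into the J loop, which adds up exactly the same terms):
I = 30000
HSTEP1X = 0.0D0
DO J = SUMGRM(I-1)+1, SUMGRM(I)
  HSTEP1X = HSTEP1X + H(GRI(J))/0.99
END DO
PRINT *, 'Results =', HSTEP1X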
Nevertheless, the code below produces the "correct" answer on my machine. I'm not sure how flexible it is, but it works here.
PROGRAM PARALLEL
INTEGER, DIMENSION(30000)::SUMGRM
INTEGER, DIMENSION(90000)::GRI,H
REAL*8::HSTEP1X,HSTEP2X
REAL*8::TIME1,TIME2,OMP_GET_WTIME
INTEGER::Q2,P2
!Just initial values
DO I=1, 30000
SUMGRM(I)=I*3
END DO
DO I=1, 90000
GRI(I)=I
H(I)=0.5*I/10000
END DO
!Computing computer's running time (start) : for parallel programming
TIME1= OMP_GET_WTIME()
DO K=1, 50000
!$OMP PARALLEL DO PRIVATE (Q2,P2,J,HSTEP2X) DEFAULT(SHARED) LASTPRIVATE(HSTEP1X)
DO I=2, 30000
HSTEP1X=0.0
Q2= SUMGRM(I-1)+1
P2= SUMGRM(I)-1
DO J=Q2,P2
HSTEP2X=H(GRI(J))/0.99
HSTEP1X=HSTEP1X+HSTEP2X
END DO
HSTEP2X=H(GRI(SUMGRM(I)))/0.99
HSTEP1X=HSTEP1X+HSTEP2X
END DO
!$OMP END PARALLEL DO
END DO
PRINT *, 'Results =', HSTEP1X
PRINT *, ' '
!Computing computer's running time (finish) : for parallel programming
TIME2= OMP_GET_WTIME()
PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'
END PROGRAM PARALLEL
I've done three things here:
1. Made sure j and hstep2x are private to each thread.
2. Explicitly declared the default behaviour to be shared (not needed here, but never mind).
3. Specified that hstep1x is lastprivate. This means that after exiting the parallel region, the value of hstep1x is the one from the thread that executed the last iteration (see here for details).

Have you tried using
!$OMP PARALLEL DO DEFAULT(PRIVATE) REDUCTION(+:HSTEP1X)
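For reference, a minimal sketch of a reduction version is shown below. Two assumptions: DEFAULT(SHARED) is used instead of DEFAULT(PRIVATE), so the arrays SUMGRM, GRI and H stay visible to all threads, and HSTEP1X is meant to accumulate the total over all I (the serial code as posted only keeps the sum for I=30000):
HSTEP1X=0.0D0
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(J,HSTEP2X) REDUCTION(+:HSTEP1X)
DO I=2, 30000
  DO J=SUMGRM(I-1)+1, SUMGRM(I)-1
    HSTEP2X=H(GRI(J))/0.99
    HSTEP1X=HSTEP1X+HSTEP2X
  END DO
  HSTEP2X=H(GRI(SUMGRM(I)))/0.99
  HSTEP1X=HSTEP1X+HSTEP2X
END DO
!$OMP END PARALLEL DO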

Related

Bad performance of parallel subroutine

I was trying to parallelize the following code; however, when it was executed in the main program, there didn't seem to be any significant speed-up. I tested the same subroutine in another program, and it took even longer to run than the serial code.
SUBROUTINE rotate(r,qt,n,np,i,a,b)
IMPLICIT NONE
INTEGER n,np,i
DOUBLE PRECISION a,b,r(np,np),qt(np,np)
INTEGER j
DOUBLE PRECISION c,fact,s,w,y
if(a.eq.0.d0)then
c=0.d0
s=sign(1.d0,b)
else if(abs(a).gt.abs(b))then
fact=b/a
c=sign(1.d0/sqrt(1.d0+fact**2),a)
s=fact*c
else
fact=a/b
s=sign(1.d0/sqrt(1.d0+fact**2),b)
c=fact*s
endif
!$omp parallel shared(i,n,c,s,r,qt) private(y,w,j)
!$omp do schedule(static,2)
do 11 j=i,n
y=r(i,j)
w=r(i+1,j)
r(i,j)=c*y-s*w
r(i+1,j)=s*y+c*w
11 continue
!$omp do schedule(static,2)
do 12 j=1,n
y=qt(i,j)
w=qt(i+1,j)
qt(i,j)=c*y-s*w
qt(i+1,j)=s*y+c*w
12 continue
!$omp end parallel
return
END
C (C) Copr. 1986-92 Numerical Recipes Software Vs94z&):9+X%1j49#:`*.
However, when I used the Linux time command to measure the run time, I got:
real 0m12.160s
user 4m49.894s
sys 0m0.880s
which is ridiculous compared to the time of the serial code:
real 0m2.078s
user 0m2.068s
sys 0m0.000s
So you have something like
do i=1,n
do j=1,n
do k=1,n
call rotate()
end do
end do
end do
for n = 100 and you are parallelizing two simple loops inside rotate.
That is hopeless. If you want decent performance, you must parallelize the outermost loop for which that is possible.
There is simply not enough work inside the loops in rotate, and it is called far too many times. You call it 1,000,000 times, and with two worksharing loops per call the threads must be synchronised or re-launched 2,000,000 times. That takes all of your run time; the run-time increase you see is this synchronization overhead.

Can the compiler vectorize a loop over an array that is indexed by another array?

I would like to vectorize the code below (it's just an example); assume that, for some reason, I have to index one array with another array.
PROGRAM TEST
IMPLICIT NONE
REAL, DIMENSION(2000):: A,B,C !100000
INTEGER, DIMENSION(2000):: E
REAL(KIND=8):: TIME1,TIME2
INTEGER::I
DO I=1, 2000 !Actually only this loop could be vectorized
B(I)=100.00 !by the compiler
C(I)=200.00
E(I)=I
END DO
!Computing computer's running time (start)
CALL CPU_TIME (TIME1)
DO I=1, 2000 !This is the problem, somehow I should put
A(E(I))=B(E(I))*C(E(I)) !an integer array E(I) inside an array
END DO !I would like to vectorize this loop also, but it didn't work
PRINT *, 'Results =', A(2000)
PRINT *, ' '
!Computing computer's running time (finish)
CALL CPU_TIME (TIME2)
PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'
END PROGRAM TEST
I thought at first that the compiler would understand what I want and vectorize it somehow like this:
DO I=1, 2000, 4 !Unrolled 4 times
A(E(I))=B(E(I))*C(E(I))
A(E(I+1))=B(E(I+1))*C(E(I+1))
A(E(I+2))=B(E(I+2))*C(E(I+2))
A(E(I+3))=B(E(I+3))*C(E(I+3))
END DO
but I was wrong. I used gfortran -Ofast -fopt-info-optimized Tes.F95 and the report said that only the first loop was successfully vectorized.
Do you have any idea how I could vectorize it? Or can't it be vectorized at all?
If E has equal values for different I, then you would be manipulating the same elements of A multiple times, in which case the order could matter. (Though not in your case.) Also, if you have multiple index arrays, like E1, E2 and E3, and
DO I=1, 2000
A(E3(I))=B(E1(I))*C(E2(I))
END DO
the order could matter too. So I think this kind of indexing is not, in general, safe for vectorized (or parallel) loops.
With ifort one can use !DIR$ IVDEP, which means "ignore vector dependence". It only works when E(I) is linear, as in the example...
Assuming one wants to touch all the indices anyway, just replace (E(I)) with (I) and work out the (here obvious) E(I) ordering later...
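A minimal sketch of those two suggestions, assuming E(I)=I as in the example (!DIR$ IVDEP is an Intel Fortran directive; gfortran simply treats it as a comment):
!DIR$ IVDEP
DO I=1, 2000
  A(E(I))=B(E(I))*C(E(I))
END DO
! Or, since E(I)=I here, drop the indirection entirely so the loop vectorizes directly:
DO I=1, 2000
  A(I)=B(I)*C(I)
END DO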

programming issue with openmp

I am having issues with openmp, described as follows:
I have the serial code like this
subroutine ...
...
do i=1,N
....
end do
end subroutine ...
and the openmp code is
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end parallel do
end subroutine ...
It compiles without issues; however, when I run the program, there are two major problems compared to the serial code:
1. The program runs even slower than the serial code (the do-loop essentially performs matrix multiplications with matmul).
2. The numerical accuracy seems to have dropped compared to the serial code (I have a check for it).
Any ideas what might be going on?
Thanks,
Xiaoyu
When parallelizing with OpenMP, you need to specify the number of threads your program should use. You can do so with the environment variable OMP_NUM_THREADS, e.g. by calling your program as
OMP_NUM_THREADS=5 ./myprogram
to execute it using 5 threads.
Alternatively, you may set the number of threads at runtime with omp_set_num_threads (documentation).
Side Notes
Don't forget to set private variables, if there are any within the loop!
Example:
!$omp parallel do private(prelimRes)
do i = 1, N
prelimRes = myFunction(i)
res(i) = prelimRes + someValue
end do
!$omp end parallel do
Note how the variable prelimRes is declared private so that every thread has its own workspace.
Depending on what you actually do within the loop (e.g. use OpenBLAS), your results may indeed vary slightly (differences should be smaller than about 1e-8 for double-precision variables) due to the different ordering of operations in parallel processing.
If you are unsure about what is happening, you should check the CPU load using htop or a similar program while your program is running.
Addendum: Setting the number of threads to automatically match the number of CPUs
If you would like to use the maximum number of useful threads, e.g. use as many threads as there are CPUs, you can do so by using (just like you stated in your question):
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end parallel do
end subroutine ...

Threads summing a variable giving wrong answer in OpenMP

To practice parallelizing the do loop, I am doing the following integral in Fortran
$\int_{0}^{1} \frac{4}{1+x^{2}}\,dx = \pi$
The following is the code that I implemented:
program mpintegrate
integer i,nmax,nthreads,OMP_GET_NUM_THREADS
real xn,dx,value
real X(100000)
nthreads = 4
nmax = 100000
xn = 0.0
dx = 1.0/nmax
value = 0.0
do i=1,nmax
X(i) = xn
xn = xn + dx
enddo
call OMP_SET_NUM_THREADS(nthreads)
!$OMP Parallel
!$OMP Do Schedule(Static) Private(i,X)
do i=1,nmax
value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP End DO NoWait
!$OMP End Parallel
print *, value
end
I have no problems compiling the program
gfortran -fopenmp -o mpintegrate mpintegrate.f
The problem is when I execute the program. When I run it as is, I get values ranging between 1 and 4. However, when I uncomment the print statement within the omp do loop, the final value is around what it should be, pi.
Why is the answer in value incorrect?
One problem here is that X must not be private (and the data-sharing clauses need to go on the parallel line, not the do line); every thread needs to see it, and there's no point in having a separate copy for each thread. Worse, the results you get from accessing the private copy here are undefined, as that private variable hasn't been initialized when you enter the parallel region. You could use firstprivate rather than private, which initializes each copy with what was there before the parallel region, but the easiest and best choice here is simply shared.
There's also not much point in making the end do nowait, as the end parallel has to wait for everyone to finish anyway.
However, that being said, you still have a pretty major (and classic) correctness problem. What's happening here is clearer if you're a little more explicit in the loop (dropping the schedule for clarity since the issue doesn't depend on the schedule chosen):
!$OMP Parallel do Private(i) Default(none) Shared(value,X,dx,nmax)
do i=1,nmax
value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP End Parallel Do
print *, value
Running this repeatedly gives different values:
$ ./foo
1.6643878
$ ./foo
1.5004054
$ ./foo
1.2746993
The problem is that all of the threads are writing to the same shared variable value. This is wrong - everyone is writing at once and the result is gibberish; a thread can calculate its own contribution, get ready to add it to value, and just as it's about to, another thread can do its own write to value, which then gets promptly clobbered. Concurrent writes to the same shared variable are a classic race condition, a standard family of bugs that happen particularly often in shared-memory programming such as OpenMP.
In addition to being wrong, it's slow. A number of threads contending for the same few bytes of memory - memory close enough together to fall in the same cache line - can be very slow because of contention in the memory system. Even when they aren't exactly the same variable (as they are in this case), this memory contention - false sharing, in the case where they merely happen to be neighbouring variables - can significantly slow things down. Taking out the explicit thread-number setting, and using environment variables:
$ export OMP_NUM_THREADS=1
$ time ./foo
3.1407621
real 0m0.003s
user 0m0.001s
sys 0m0.001s
$ export OMP_NUM_THREADS=2
$ time ./foo
3.1224852
real 0m0.007s
user 0m0.012s
sys 0m0.000s
$ export OMP_NUM_THREADS=8
$ time ./foo
1.1651508
real 0m0.008s
user 0m0.042s
sys 0m0.000s
So things get almost 3 times slower (and increasingly wrong) when running with more threads.
So what can we do to fix this? One thing we could do is make sure that everyone's additions aren't overwriting each other, with the atomic directive:
!$OMP Parallel do Schedule(Static) Private(i) Default(none) Shared(X,dx, value, nmax)
do i=1,nmax
!$OMP atomic
value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP end parallel do
which solves the correctness problem:
$ export OMP_NUM_THREADS=8
$ ./foo
3.1407621
but does nothing for the speed problem:
$ export OMP_NUM_THREADS=1
$ time ./foo
3.1407621
real 0m0.004s
user 0m0.001s
sys 0m0.002s
$ export OMP_NUM_THREADS=2
$ time ./foo
3.1407738
real 0m0.014s
user 0m0.023s
sys 0m0.001s
(Note that you get slightly different answers with different numbers of threads. This is because the final sum is calculated in a different order than in the serial case. With single-precision reals, differences showing up in the 7th digit due to a different ordering of operations are hard to avoid, and here we're doing 100,000 operations.)
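As a tiny standalone illustration of that ordering sensitivity (not part of the integration code, just single-precision non-associativity):
program assoc
  implicit none
  real :: a, b, c
  a = 1.0e8; b = -1.0e8; c = 1.0e-3
  ! the two groupings sum the same numbers in a different order:
  print *, (a + b) + c   ! about 1.0e-3
  print *, a + (b + c)   ! 0.0, because c is absorbed when added to b
end program assoc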
So what else could we do? One approach is for everyone to keep track of their own partial sums, and then sum them all together when we're done:
!...
integer, parameter :: nthreads = 4
integer, parameter :: space=8
integer :: threadno
real, dimension(nthreads*space) :: partials
!...
partials=0
!...
!$OMP Parallel Private(value,i,threadno) Default(none) Shared(X,dx, partials)
value = 0
threadno = omp_get_thread_num()
!$OMP DO
do i=1,nmax
value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP END DO
partials((threadno+1)*space) = value
!$OMP end parallel
value = sum(partials)
print *, value
end
This works - we get the right answer, and if you play with the number of threads you'll find it's pretty zippy - we've spaced out the entries in the partial sums array to avoid false sharing (and the sharing really is false this time, as everyone writes to a different entry in the array - no overwriting).
Still, this is a silly amount of work just to get a sum correct across threads! There's a simpler way to do it - OpenMP has a reduction construct that does this automatically (and more efficiently than the handmade version above):
!$OMP Parallel do reduction(+:value) Private(i) Default(none) Shared(X,dx)
do i=1,nmax
value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP end parallel do
print *, value
and now the program works correctly, is fast, and the code is fairly simple. The final code, in more modern Fortran, looks something like this:
program mpintegrate
use omp_lib
integer, parameter :: nmax = 100000
real :: xn,dx,value
real :: X(nmax)
integer :: i
integer, parameter :: nthreads = 4
xn = 0.0
dx = 1.0/nmax
value = 0.0
do i=1,nmax
X(i) = xn
xn = xn + dx
enddo
call omp_set_num_threads(nthreads)
!$OMP Parallel do reduction(+:value) Private(i) Default(none) Shared(X,dx)
do i=1,nmax
value = value + dx*(4.0/(1+X(i)*X(i)))
enddo
!$OMP end parallel do
print *, value
end

Thread issues when writing to files with OpenMP in Fortran

The number of files that get written is always less than the number of threads. Since I have 4 threads and the CPU is working at 400%, I was expecting 4 files (one corresponding to each thread). I don't know whether there is a problem with my code or whether this is how it is supposed to work. The code is as follows:
!!!!!!!! module
module common
use iso_fortran_env
implicit none
integer,parameter:: dp=real64
real(dp):: aa,bb
contains
subroutine evolve(y,yevl)
implicit none
integer(dp),parameter:: id=2
real(dp),intent(in):: y(id)
real(dp),intent(out):: yevl(id)
yevl(1)=y(2)+1.d0-aa*y(1)**2
yevl(2)=bb*y(1)
end subroutine evolve
end module common
use common
implicit none
integer(dp):: iii,iter,i
integer(dp),parameter:: id=2
real(dp),allocatable:: y(:),yt(:)
integer(dp):: OMP_GET_THREAD_NUM, IXD
allocate(y(id)); allocate(yt(id)); y=0.d0; yt=0.d0; bb=0.3d0
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
!$OMP DO
do iii=1,20000; print*,iii !! EXPECTED THREADS TO BE OF 5000 ITERATIONS EACH
aa=1.d0+dfloat(iii-1)*0.4d0/80000.d0
loop1: do iter=1,10 !! THE INITIAL CONDITION LOOP
call random_number(y)!! RANDOM INITIALIZATION OF THE VARIABLE
loop2: do i=1,70000 !! ITERATION OF THE SYSTEM
call evolve(y,yt)
y=yt
enddo loop2 !! END OF SYSTEM ITERATION
write(IXD+1,*)aa,yt !!! WRITING FILE CORRESPONDING TO EACH THREAD
enddo loop1 !!INITIAL CONDITION ITERATION DONE
enddo
!$OMP ENDDO
!$OMP END PARALLEL
end
Is this behavior the result of some race issue in the code? The code compiles and executes just fine, without any warnings or errors, with ifort version 13.1.0 on Ubuntu. Thanks a bunch for any comments or suggestions.
The variable IXD should be explicitly declared as private to make sure every thread has its own copy of it. Changing the lines
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
to
!$OMP PARALLEL PRIVATE(iii,iter,y,i,yt,ixd) SHARED(bb)
IXD=OMP_GET_THREAD_NUM()
solves the problem.