Fortran + OpenMP slower than sequential - fortran

I have this sequential code in Fortran. My problem is that when I add OpenMP directives, the parallelized code is slower than the sequential version, and I can't see the error.
REAL, DIMENSION(:), ALLOCATABLE :: current, next
ALLOCATE ( current(TOTAL_Z), next(TOTAL_Z) )

CALL CPU_TIME(t1)
!$OMP PARALLEL SHARED(current, next) PRIVATE(z)
DO t = 1, TOTAL_TIME
  !$OMP DO SCHEDULE(STATIC, 2)
  DO z = 2, (TOTAL_Z - 1)
    next(z) = current(z) + KAPPA*DELTA_T*((current(z - 1) - 2.0*current(z) + current(z + 1)) / DELTA_Z**2)
  END DO
  !$OMP END DO
  current = next
END DO
CALL CPU_TIME(t2)
!$OMP END PARALLEL
TOTAL_Z, TOTAL_TIME, KAPPA, DELTA_T, DELTA_Z are constants.
When I run the parallel code, I can see in htop that my 2 cores are working at 100%.
With the sequential code CPU_TIME reports 79 seconds, while with the parallel code it reports 132 seconds.
Thanks

I've just been experiencing the same problem.
It seems that cpu_time() is not suitable for measuring the performance of multi-threaded code: it adds up the CPU time of all the threads, which is likely to increase with the number of threads.
I've found this in another forum,
http://software.intel.com/en-us/forums/topic/281897
You should use system_clock() or omp_get_wtime() functions to get a more accurate timing of your routine.
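For example, a minimal wall-clock timing sketch using omp_get_wtime() (the names t_start and t_end are only illustrative):
USE omp_lib
DOUBLE PRECISION :: t_start, t_end

t_start = omp_get_wtime()
!$OMP PARALLEL
! ... the parallel region you want to time ...
!$OMP END PARALLEL
t_end = omp_get_wtime()
PRINT *, 'Wall-clock time (s): ', t_end - t_start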

It is probably slow because the threads are contending for access to the shared variables. If you could change it to use a reduction it would likely be faster, but that might not be easy here, since the calculation of each element of "next" accesses multiple elements of "current".

Depending on the number of iterations, you might also be facing a false-sharing problem on the next array. Since the chunk size for the distribution of the DO loop is rather small, the cache lines holding next(z), next(z+1), next(z+2), next(z+3), etc. might be thrashing between the L1/L2 caches of the CPUs.
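A minimal sketch of one way to reduce this (assuming the stencil loop from the question): drop the small chunk size so each thread works on one large contiguous block of iterations, e.g.
!$OMP DO SCHEDULE(STATIC)
DO z = 2, (TOTAL_Z - 1)
  next(z) = current(z) + KAPPA*DELTA_T*((current(z - 1) - 2.0*current(z) + current(z + 1)) / DELTA_Z**2)
END DO
!$OMP END DO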
Cheers,
-michael

Related

Poor scaling and a segmentation fault in a Fortran OpenMP code

I'm having some trouble when executing a program with a parallel do. Here is a test code.
module test
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
contains
   subroutine Addition(x,y,s)
      real(dp), intent(in)  :: x,y
      real(dp), intent(out) :: s
      s = x+y
   end subroutine Addition

   function linspace(length,xi,xf) result (vec)
      ! function to create an equally spaced vector given a begin and end point
      real(dp), intent(in) :: xi,xf
      integer, intent(in)  :: length
      real(dp), dimension(1:length) :: vec
      integer  :: i
      real(dp) :: increment

      increment = (xf-xi)/(real(length)-1)
      vec(1) = xi
      do i = 2,length
         vec(i) = vec(i-1) + increment
      end do
   end function linspace
end module test

program paralleltest
   use, intrinsic :: iso_fortran_env, only: dp => real64
   use test
   use :: omp_lib
   implicit none
   integer, parameter :: length = 1000
   real(dp), dimension(length) :: x,y
   real(dp) :: s
   integer  :: i,j
   integer  :: num_threads = 8
   real(dp), dimension(length,length) :: SMatrix

   x = linspace(length,.0d0,1.0d0)
   y = linspace(length,2.0d0,3.0d0)

   !$ call omp_set_num_threads(num_threads)
   !$OMP PARALLEL DO
   do i=1,size(x)
      do j = 1,size(y)
         call Addition(x(i),y(j),s)
         SMatrix(i,j) = s
      end do
   end do
   !$OMP END PARALLEL DO

   open(unit=1,file ='Add6.dat')
   do i= 1,size(x)
      do j= 1,size(y)
         write(1,*) x(i),";",y(j),";",SMatrix(i,j)
      end do
   end do
   close(unit=1)
end program paralleltest
I'm running the program in the following way: gfortran-8 -fopenmp paralleltest.f03 -o pt.out -mcmodel=medium and then export OMP_NUM_THREADS=8.
This simple code raises at least two big questions about parallel do. The first is that if I run it with length = 1100 or greater, I get a Segmentation fault (core dumped) error message, but with smaller values it runs with no problem. The second is about the time it takes. When I run it with length = 1000 (via time ./pt.out), it takes 1.732 s, but if I run it sequentially (without the -fopenmp flag and with taskset -c 4 time ./pt.out) it takes 1.714 s. I guess the difference between the two would show up in longer and more complex code where parallelism is more useful. In fact, when I tried more complex calculations running in parallel with eight threads, the time was cut roughly in half compared to the sequential run, but not to an eighth as I expected. In view of this, my questions are: is the speed-up always available, or is it code dependent? And second, is there a friendly way to control which thread runs which iterations? That is, the first thread running the first length/8 iterations, and so on, as if performing several taskset's with different code, each handling the iterations I want.
As I commented, the segmentation fault has been treated elsewhere (Why Segmentation fault is happening in this openmp code?). I would use an allocatable array, but you can also increase the stack size using ulimit -s.
Regarding the time, almost all of the runtime is spent in writing the array to the external file.
But even if you remove that, measure only the time spent in the parallel section using omp_get_wtime(), and increase the problem size, it still does not scale too well. This is because there is very little computation for the CPU to do and a lot of array writing to memory (accessing main memory is slow, hence cache misses).
As Jean-Claude Arbaut pointed out, your loop order is wrong and makes accessing the memory even slower. Some compilers can change that for you with higher optimization levels (-O2 or -O3), but only some of them.
And even worse, as Jim Cownie pointed out, you have a race condition. Multiple threads try to use the same s for both reading and writing and the program is invalid. You need to make s private using private(s).
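Putting those fixes together, a minimal sketch of the corrected parallel section might look like this (loop order swapped so the inner index runs over the first, contiguous dimension of SMatrix, and s made private):
!$OMP PARALLEL DO PRIVATE(i, s)
do j = 1, size(y)
   do i = 1, size(x)
      call Addition(x(i), y(j), s)
      SMatrix(i, j) = s
   end do
end do
!$OMP END PARALLEL DO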
With the above fixes I get a roughly two times faster parallel section with four cores and four threads. Don't try to use hyper-threading, it slows the program down.
If you give the CPU more computational work to do, like s = Bessel_J0(x(i))/Bessel_J1(y(j)), it scales pretty well for me, almost four times faster with four threads, and hyper-threading does speed it up a little bit.
Finally, I suggest just removing the manual setting of the number of threads; it is a pain for testing. If you remove it, you can simply use OMP_NUM_THREADS=4 ./a.out.

Bad performance of parallel subroutine

I was trying to parallelize the following code; however, when it was executed in the main program, there didn't seem to be a significant speed-up. I tested the same subroutine in another program, and there it took even longer to run than the serial code.
      SUBROUTINE rotate(r,qt,n,np,i,a,b)
      IMPLICIT NONE
      INTEGER n,np,i
      DOUBLE PRECISION a,b,r(np,np),qt(np,np)
      INTEGER j
      DOUBLE PRECISION c,fact,s,w,y
      if(a.eq.0.d0)then
        c=0.d0
        s=sign(1.d0,b)
      else if(abs(a).gt.abs(b))then
        fact=b/a
        c=sign(1.d0/sqrt(1.d0+fact**2),a)
        s=fact*c
      else
        fact=a/b
        s=sign(1.d0/sqrt(1.d0+fact**2),b)
        c=fact*s
      endif
!$omp parallel shared(i,n,c,s,r,qt) private(y,w,j)
!$omp do schedule(static,2)
      do 11 j=i,n
        y=r(i,j)
        w=r(i+1,j)
        r(i,j)=c*y-s*w
        r(i+1,j)=s*y+c*w
11    continue
!$omp do schedule(static,2)
      do 12 j=1,n
        y=qt(i,j)
        w=qt(i+1,j)
        qt(i,j)=c*y-s*w
        qt(i+1,j)=s*y+c*w
12    continue
!$omp end parallel
      return
      END
C (C) Copr. 1986-92 Numerical Recipes Software Vs94z&):9+X%1j49#:`*.
However, when I used the built-in time command in Linux to measure the runtime, I got:
real 0m12.160s
user 4m49.894s
sys 0m0.880s
which is ridiculous compared to the time of the serial code:
real 0m2.078s
user 0m2.068s
sys 0m0.000s
So you have something like
do i=1,n
   do j=1,n
      do k=1,n
         call rotate()
      end do
   end do
end do
for n = 100 and you are parallelizing two simple loops inside rotate.
That is hopeless. If you want decent performance, you must parallelize the outermost loop for which that is possible.
There is simply not enough work inside the loops within rotate, and it is called far too many times. You call it 1,000,000 times, so the threads must be synchronised or re-launched 2,000,000 times. That takes all of your run time; the run-time increase you see is this synchronisation overhead.
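A schematic sketch of that idea, with the arguments to rotate elided just as in the loop nest above (and assuming the outer iterations are actually independent, which has to be verified for the real algorithm):
!$omp parallel do private(i, j, k)
do i=1,n
   do j=1,n
      do k=1,n
         call rotate()   ! plain serial routine: no OpenMP directives inside
      end do
   end do
end do
!$omp end parallel do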

Calling subroutine in parallel environment

I think my problem is related or even identical to the problem described here. But I don't understand what's actually happening.
I'm using openMP with the gfortran compiler and I have the following task to do: I have a density distribution F(X, Y) on a two-dimensional surface with x-coordinates X and y-coordinates Y. The matrix F has the size Nx x Ny.
I now have a set of coordinates Xp(i) and Yp(i) and I need to interpolate the density F onto these points. This problem is made for parallelization.
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
do i=1, Nmax
   ! Some stuff to be done here
   Fint(i) = interp2d(Xp(i), Yp(i), X, Y, F, Nx, Ny)
   ! Some other stuff to be done here
end do
!$OMP END PARALLEL DO
Everything is shared except for i. The function interp2d is doing some simple linear interpolation.
That works fine with one thread but fails with multithreading. I traced the problem down to the hunt subroutine taken from Numerical Recipes, which gets called by interp2d. The hunt subroutine basically calculates the index ix such that X(ix) <= Xp(i) < X(ix+1), which is needed as the starting point for the interpolation.
With multithreading it happens every now and then that one thread gets the correct index ix from hunt, and the thread that calls hunt next gets the exact same index, even though its Xp(i) is not even close to that point.
I can prevent this by using the CRITICAL environment:
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i)
do i=1, Nmax
   ! Some stuff to be done here
   !$OMP CRITICAL
   Fint(i) = interp2d(Xp(i), Yp(i), X, Y, F, Nx, Ny)
   !$OMP END CRITICAL
   ! Some other stuff to be done here
end do
!$OMP END PARALLEL DO
But this decreases the efficiency. If I use, for example, three threads, I get a load average of 1.5 with the CRITICAL section. Without it I get a load average of 2.75, but wrong results and sometimes even a SIGSEGV runtime error.
What exactly is happening here? It seems to me that all the threads are calling the same hunt subroutine, and if they do so at the same time there is a conflict. Does that make sense?
How can I prevent this?
Combining variable declaration and initialisation in Fortran 90+ has the side effect of giving the variable the SAVE attribute.
integer :: i = 0
is roughly equivalent to:
integer, save :: i
if (first_invocation) then
i = 0
end if
SAVE'd variables retain their value between multiple invocations of the routine and are therefore often implemented as static variables. By the rules governing the implicit data sharing classes in OpenMP, such variables are shared unless listed in a threadprivate directive.
OpenMP mandates that compliant compilers should apply the above semantics even when the underlying language is Fortran 77.
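As an illustration only (hunt_unsafe, hunt_safe and the variable jlo are hypothetical names; the actual saved variable is whatever hunt initialises in its declaration), the difference looks roughly like this:
subroutine hunt_unsafe(x, ix)
   implicit none
   real, intent(in)     :: x
   integer, intent(out) :: ix
   ! Initialisation in the declaration implies SAVE: a single copy is
   ! shared by all threads that call this routine, which is a data race.
   integer :: jlo = 1
   ! ... search for the bracketing index starting from jlo ...
   ix = jlo
end subroutine hunt_unsafe

subroutine hunt_safe(x, ix)
   implicit none
   real, intent(in)     :: x
   integer, intent(out) :: ix
   ! Declare and initialise separately: jlo is now an ordinary automatic
   ! variable, so each invocation (and hence each thread) gets its own copy.
   integer :: jlo
   jlo = 1
   ! ... search for the bracketing index starting from jlo ...
   ix = jlo
end subroutine hunt_safe
Alternatively, if the saved state is genuinely needed between calls, the variable can be listed in a !$omp threadprivate directive so that each thread keeps its own persistent copy.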

programming issue with openmp

I am having issues with openmp, described as follows:
I have the serial code like this
subroutine ...
   ...
   do i=1,N
      ....
   end do
end subroutine ...
and the openmp code is
subroutine ...
   use omp_lib
   ...
   call omp_set_num_threads(omp_get_num_procs())
   !$omp parallel do
   do i=1,N
      ....
   end do
   !$omp end parallel do
end subroutine ...
It compiles with no issues; however, when I run the program, there are two major problems compared to the results of the serial code:
The program runs even slower than the serial code (which is supposed to do matrix multiplications (matmul) in the do-loop).
The numerical accuracy seems to have dropped compared to the serial code (I have a check for it).
Any ideas what might be going on?
Thanks,
Xiaoyu
When parallelizing with OpenMP, you need to specify the number of threads your program should use. You can do so via the environment variable OMP_NUM_THREADS, e.g. by calling your program by means of
OMP_NUM_THREADS=5 ./myprogram
to execute it using 5 threads.
Alternatively, you may set the number of threads at runtime using omp_set_num_threads (documentation).
Side Notes
Don't forget to declare variables as private if there are any temporaries within the loop!
Example:
!$omp parallel do private(prelimRes)
do i = 1, N
   prelimRes = myFunction(i)
   res(i) = prelimRes + someValue
end do
!$omp end parallel do
Note how the variable prelimRes is declared private so that every thread has its own workspace.
Depending on what you actually do within the loop (e.g. use OpenBLAS), your results may indeed vary slightly (variations should be smaller than about 1e-8 for double-precision variables) due to the different order of operations in parallel processing.
If you are unsure about what is happening, you should check the CPU load using htop or a similar program while your program is running.
Addendum: Setting the number of threads to automatically match the number of CPUs
If you would like to use the maximum number of useful threads, e.g. use as many threads as there are CPUs, you can do so by using (just like you stated in your question):
subroutine ...
   use omp_lib
   ...
   call omp_set_num_threads(omp_get_num_procs())
   !$omp parallel do
   do i=1,N
      ....
   end do
   !$omp end parallel do   ! a combined parallel do is closed by a single end parallel do
end subroutine ...

Using OpenMP critical and ordered

I'm quite new to Fortran and OpenMP, but I'm trying to get my bearings. I have a piece of code for calculating variograms which I'm attempting to parallelize. However, I seem to be getting race conditions, as some of the results are off by a thousandth or so.
The problem seems to be the reductions. Using OpenMP reductions works and gives the correct results, but they are not desirable, because the reductions actually happen in another subroutine (I copied the relevant lines into the OpenMP loop for this test). Therefore I put the reductions inside a CRITICAL section, but without success. Interestingly, the problem only occurs for reals, not integers. I have thought about whether the order of the additions makes any difference, but it should not produce errors this big.
Just to check, I put everything in the parallel do in an ORDERED block, which (of course) gave the correct results (albeit without any speedup). I also tried putting everything inside a CRITICAL section, but for some reason that did not give the correct results. My understanding is that OpenMP will flush the shared variables upon entering/exiting CRITICAL sections, so there shouldn't be any cache problems.
So my question is: why doesn't a critical section work in this case?
My code is below. All shared variables except np, tm, hm, gam are read-only.
EDIT: I tried to simulate the randomness induced by multiple threads by replacing the do loops with random integers in the same range (i.e. generate a pair i,j within the ranges of the loops; if it has already been "visited", generate a new one), and to my surprise the results matched. However, upon further inspection it was revealed that I had forgotten to seed the RNG, and the results were correct by coincidence. How embarrassing!
TL;DR: The discrepancies in the results were caused by the order in which the floating-point values are summed. Using double precision instead helps.
!$OMP PARALLEL DEFAULT(none) SHARED(nd, x, y, z, nzlag, nylag, nxlag, &
!$OMP&   dzlag, dylag, dxlag, nvarg, ivhead, ivtail, ivtype, vr, tmin, tmax, np, tm, hm, gam) num_threads(512)
!$OMP DO PRIVATE(i,j,zdis,ydis,xdis,izl,iyl,ixl,indx,vrh,vrt,vrhpr,vrtpr,variogram_type) !reduction(+:np, tm, hm, gam)
DO i=1,nd
   !$OMP CRITICAL (main)
   ! Second loop over the data:
   DO j=1,nd
      ! The lag:
      zdis = z(j) - z(i)
      IF(zdis >= 0.0) THEN
         izl =  INT( zdis/dzlag+0.5)
      ELSE
         izl = -INT(-zdis/dzlag+0.5)
      END IF
      ! ---- SNIP ----
      ! Loop over all variograms for this lag:
      DO cur_variogram=1,nvarg
         variogram_type = ivtype(cur_variogram)
         ! Get the head and tail values:
         indx = i+(ivhead(cur_variogram)-1)*maxdim
         vrh = vr(indx)
         indx = j+(ivtail(cur_variogram)-1)*maxdim
         vrt = vr(indx)
         IF(vrh < tmin .OR. vrh >= tmax .OR. vrt < tmin .OR. vrt >= tmax) CYCLE
         ! ----- PROBLEM AREA -------
         np(ixl,iyl,izl,1)  = np(ixl,iyl,izl,1) + 1.   ! <-- This never fails
         tm(ixl,iyl,izl,1)  = tm(ixl,iyl,izl,1) + vrt
         hm(ixl,iyl,izl,1)  = hm(ixl,iyl,izl,1) + vrh
         gam(ixl,iyl,izl,1) = gam(ixl,iyl,izl,1) + ((vrh-vrt)*(vrh-vrt))
         ! ----- END OF PROBLEM AREA -----
         !CALL updtvarg(ixl,iyl,izl,cur_variogram,variogram_type,vrt,vrh,vrtpr,vrhpr)
      END DO
   END DO
   !$OMP END CRITICAL (main)
END DO
!$OMP END DO
!$OMP END PARALLEL
Thanks very much in advance!
If you are using 32-bit floating-point numbers and arithmetic, the difference between 84.26539 and 84.26538 (that is, a difference of 1 in the least-significant digit) is entirely explicable by the non-determinism of parallel floating-point arithmetic. Bear in mind that a 32-bit floating-point number only has about 7 decimal digits to play with.
Ordinary floating-point arithmetic is not strictly associative. For real numbers (in the mathematical, not Fortran, sense) (a+b)+c == a+(b+c), but there is no such rule for floating-point numbers. This is nicely explained in the Wikipedia article on floating-point arithmetic.
The non-determinism arises because, in using OpenMP you surrender control over the ordering of operations to the run-time. A summation of values across threads (such as a reduction on +) leaves the bracketing of the global sum expression to the run-time. It is not even necessarily true that 2 executions of the same OpenMP program will produce the same-to-the-last-bit results.
I suspect that even running an OpenMP program on one thread may produce different results from the equivalent non-OpenMP program. Since knowledge of the number of threads available to an OpenMP executable may be deferred until run-time the compiler will have to create a parallelised executable whether it is eventually run in parallel or not.
High Performance Mark makes an interesting point about floating point and associativity. This can easily be tested (in C).
float a = -1.0E8f, b = 1.0E8f, c = 1.23456f;
printf("sum %f\n", a+b+c); //output 1.234560
printf("sum %f\n", a+(b+c)); //output 0.000000
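For reference, a tiny Fortran analogue of the same demonstration (single-precision values chosen so the cancellation is obvious):
program fp_assoc
   implicit none
   real :: a = -1.0e8, b = 1.0e8, c = 1.23456
   print *, (a + b) + c   ! prints approximately 1.23456
   print *, a + (b + c)   ! prints 0.0, because c is absorbed when added to b first
end program fp_assoc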
But I would like to point out that it is possible to preserve the order in OpenMP. I discussed this here: C++ OpenMP: Split for loop in even chunks static and join data at the end
Edit:
Actually, I confused commutativity and associativity. If you have an operator which is associative but not commutative, then it's possible to preserve the order with OpenMP as I did in the post above. However, IEEE floating point is commutative but NOT associative, so the only thing that can be done is to break strict IEEE semantics and treat it as associative.