TLDR: task parallel code fails to recognize task dependency on partially overlapping subarrays
I'm trying to parallelize some fortran code using task parallelism. I'm struggling to get the task dependency to work.
The following minimal example illustrates my problem:
program test
integer, parameter :: n = 100
double precision :: A(n,n),B(n,n)
integer :: istart, istop
!$omp parallel
!$omp single
istart = 20
istop = 50
!$omp task shared(A,B) firstprivate(istart,istop)&
!$omp& depend( in: B(istart:istop,istart:istop) )&
!$omp& depend( inout: A(istart:istop,istart:istop) )
write(*,*) "starting task 1"
call sleep(5)
write(*,*) "stopping task 1"
!$omp end task
istart = 30
istop = 60
!$omp task shared(A,B) firstprivate(istart,istop)&
!$omp& depend( in: B(istart:istop,istart:istop) )&
!$omp& depend( inout: A(istart:istop,istart:istop) )
write(*,*) "starting task 2"
call sleep(5)
write(*,*) "stopping task 2"
!$omp end task
!$omp end single nowait
!$omp end parallel
end program
Which I compile using gcc version 9.4.0 and the -fopenmp flag.
The code has two tasks. Task 1 depends on A(20:50,20:50) and task 2 depends on A(30:60,30:60). These subarrays overlap, so I would expect task 2 to wait until task 1 has completed, but the output I get is:
starting task 1
starting task 2
stopping task 1
stopping task 2
If I comment out the lines
istart = 30
istop = 60
so that the subarrays are exactly the same instead of just overlapping, task 2 does wait on task 1.
Are task dependencies with overlapping subarrays not supported in openmp? Or am I defining the dependencies wrong somehow?
These subarrays overlap
This is forbidden by the OpenMP standard as pointed out by #ThijsSteel.
This choice has been done because of the resulting runtime overhead. Indeed, checking overlapping regions of array in dependencies was very expensive in pathological cases (especially for arrays with many dimensions). A typical example of pathological case when many tasks write on the same array (but on different exclusive parts) and then one tasks operate on the whole array. This create a lot of checks and dependencies while this can be optimized using a specific dependency pattern. An even more pathological case is a transposition of the sub-arrays (from n horizontal bands to n vertical bands): this quickly results in n² dependencies which is insanely inefficient when n is large. An optimization consists in aggregating the dependencies so to get 2*n dependencies but this is expensive to do at runtime and a better way would actually not to use dependency but task group synchronization.
From the user point-of-view, there are only few options:
operate at bigger grain even though it means having more synchronization (so possibly poor performance);
try to regroup tasks operating on distinct sub-array by checking that manually. You can cheat OpenMP by using fake dependencies so to do that or add more tasks synchronizations (typically a taskwait). You can also make use of advanced task-dependence-type modifiers like depobj, inoutset
or mutexinoutset now available in recent version of the OpenMP standard.
Most of the time, an algorithm is typically composed of a several waves of tasks and each wave is operating on non-overlapping sub-arrays. In this case, you can just add a task synchronization at the end of the computation (or use recursive tasks).
This is a known limitation of OpenMP and some researchers worked on it (including me). I expect new specific dependency modifiers to be added in the future, and possibly user-defined partitioning-related features in the long run (some research runtimes like StarPU does that for example, but not everyone agree adding that in OpenMP as this is a bit high-level and not simple to include).
Related
I have the following problem and wonder how to achieve this in MPI-Fortran.
Suppose that I have N0 nodes in my cluster, given an interval (1,N), I divide it into N0 segments denoted by (i1,i2) and suppose the calculation time for each (i1,i2) is some T, done in parallel. Now if I make an additional partition of each sub-interval (i1,i2) into N0 smaller ones, let's called (a1,a2), at a given time, I want to work with each (i1,i2) sequentially but with every (a1,a2) in parallel, would the calculation time for (i1,i2) be T/N0, i.e. significantly reduced with respect to the first partition ? and if it does, how can I achieve this idea in MPI ?
Thank you for your suggestions and help.
To update the post, I have come up with the following schematical lines. I hope it is lisible enough to understand, where
i : index of the sub-interval (i1,i2)
rank: this is the rank of the current node
sub_sub_interval(rank, i): variable containing all the subintervals (a1,a2) of the interval (i1,i2).
.
do i = 0, number_machine-1
call MPI_BARRIER(MPI_COMM_WORLD, ierr)
start = MPI_Wtime()
call compute_my_quantity(A, sub_interval(i),
sub_sub_interval(rank,i))
end = MPI_Wtime()
if (rank /= 0) then
call SEND_my_quantity(rank, A)
else
do i_prime = 1, number_machine-1
call RECEV_my_quantity(i_prime, A_sub(i_prime))
end do
end if
end do
What I get is that: time measured for each (a1,a2) is still the same as
start = MPI_Wtime()
call compute_my_quantity(A, sub_interval(rank))
end = MPI_Wtime()
if (rank /= 0) then
call SEND_my_quantity(rank, A)
else
do i_prime = 1, number_machine-1
call RECEV_my_quantity(i_prime, A_sub(i_prime))
end do
end if
Maybe you can tell me what is conceptually wrong in this algorithm such that the time needed for (a1,a2) is the same as for (i1,i2) ?
As I understood from the question,
You have some tasks (say N) that is divided into N0 segments, with each segment taking some N/N0 tasks.
In strategy 1, you distribute this among N0 (or more) ranks and works in parallel.
In strategy 2, you take, a task out of N tasks and distribute this tasks among MPI ranks. Likewise, each tasks are executed sequentially.
Theoretically, strategy1 will always be faster than strategy 2 because of the underlying overheads imposed by the communication. In strategy 2, there will be lot (N0 times more than strategy 1) of communication. So your application performance will be degraded.
Ideal approach will be using hybrid programming model in where you distribute the tasks among nodes using MPI and in each node you use some shared memory paradigm (openMP) to speed up the execution. Basically, it is a combination of your strategy 1 and strategy 2.
I have a nested loop that contains 3 counters (i,j and k). since I run this code on a computer that has multiprocessor CPU with 8 cores. I intent to remove the inner do loop (do k=) and make the code run parallel so each core calculate separate f(i,j,k) if k=N (N=1,2,...,8). could anyone help with this?
do i=1,nx
do j=1,ny
do k=1,8
f(i,j,k)= omegaP*f(i,j,k)+omega*feq(i,j,k)+fi(i,j,k)*dt
end do
end do
end do
I am having issues with openmp, described as follows:
I have the serial code like this
subroutine ...
...
do i=1,N
....
end do
end subroutine ...
and the openmp code is
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end parallel do
end subroutine ...
No issues with compiling, however when I run the program, there are two major issues compared to the result of serial code:
The program is running even slower than the serial code (which supposedly do matrix multiplications (matmul) in the do-loop
The numerical accuracy seems to have dropped compared to the serial code (I have a check for it)
Any ideas what might be going on?
Thanks,
Xiaoyu
In case of an parallelization using OpenMP, you will need to specify the number of threads your program is to use. You can do so by using the environment variable OMP_NUM_THREADS, e.g. calling your program by means of
OMP_NUM_THREADS=5 ./myprogram
to execute it using 5 threads.
Alternatively, you may set the number of threads at runtime omp_set_num_threads (documentation).
Side Notes
Don't forget to set private variables, if there are any within the loop!
Example:
!$omp parallel do private(prelimRes)
do i = 1, N
prelimRes = myFunction(i)
res(i) = prelimRes + someValue
end do
!$omp end parallel do
Note how the variable prelimRes is declared private so that every thread has its own workspace.
Depending on what you actually do within the loop (i.e. use OpenBLAS), your results may indeed vary (variations should be smaller than 1e-8 with regard to double precision variables) due to the differing, parellel processing.
If you are unsure about what is happening, you should check the CPU load using htop or a similar program while your program is running.
Addendum: Setting the number of threads to automatically match the number of CPUs
If you would like to use the maximum number of useful threads, e.g. use as many threads as there are CPUs, you can do so by using (just like you stated in your question):
subroutine ...
use omp_lib
...
call omp_set_num_threads(omp_get_num_procs())
!$omp parallel do
do i=1,N
....
end do
!$omp end do
!$omp end parallel
end subroutine ...
I inherited a piece of Fortran code as am tasked with parallelizing it for the 8-core machine we have. I have two version of the code, and I am trying to use openMP compiler directives to speed it up. It works on one piece of code, but not the other, and I cannot figure out why--They're almost identical!
I ran each piece of code with and without the openMP tags, and the first one showed speed improvements, but not the second one. I hope I am explaining this clearly...
Code sample 1: (significant improvement)
!$OMP PARALLEL DO
DO IN2=1,NN(2)
DO IN1=1,NN(1)
SCATT(IN1,IN2) = DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2))
UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
ENDDO
ENDDO
!$OMP END PARALLEL DO
Code sample 2: (no improvement)
!$OMP PARALLEL DO
DO IN2=1,NN(2)
DO IN1=1,NN(1)
SCATTREL = DATA(2*((IN2-1)*NN(1)+IN1)-1))/NN(1)*NN(2))
SCATTIMG = DATA(2*((IN2-1)*NN(1)+IN1)))/NN(1)*NN(2))
SCATT(IN1,IN2) = DCOMPLX(SCATREL, SCATIMG)
UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
ENDDO
ENDDO
!$OMP END PARALLEL DO
I thought it might be issues with memory ovehead and such, and have tried various combinations of putting variables in shared() and private() clauses, but they either cause segmentations faults or make it even slower.
I also thought it might be that I'm not doing enough work in the loop to see an improvement, but since there's improvement in the smaller loop that doesn't make sense to me.
Can anyone shed some light onto what I can to do see a real speed boost in the second one?
Data on speed boost for code sample 1:
Average runtime (for the whole code not just this snippet)
Without openMP tags: 2m 21.321s
With openMP tags: 2m 20.640s
Average runtime (profile for just this snippet)
Without openMP tags: 6.3s
With openMP tags: 4.75s
Data on speed boost for code sample 2:
Average runtime (for the whole code not just this snippet)
Without openMP tags: 4m 46.659s
With openMP tags: 4m 49.200s
Average runtime (profile for just this snippet)
Without openMP tags: 15.14s
With openMP tags: 46.63s
The observation that the code runs slower in parallel than in serial tells me that the culprit is very likely false sharing.
The SCATT array is shared and each thread accesses a slice of it for both reading and writing. There is no race condition in your code however the threads writing to the same array (albeit different slices) make things slower.
The reason is that each thread loads a portion of the array SCATT in cache and whenever another thread writes in that portion of SCATT this invalidates the data previously stored in cache. Although the input data has not been changed since there is no race condition (the other thread updated a different slice of SCATT) the processor gets a signal that cache is invalid and thus reloads the data (see the link above for details). This causes high data transfer overhead.
The solution is to make each slice private to a given thread. In your case it is even simpler as you do not require reading access to SCATT at all. Just replace
SCATT(IN1,IN2) = DCOMPLX(SCATREL, SCATIMG)
UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0
with
SCATT0 = DCOMPLX(SCATREL, SCATIMG)
UADI(IN1,IN2) = SCATT0+1.0
SCATT(IN1,IN2) = SCATT0
where SCATT0 is a private variable.
And why this does not happen in the first snippet? It certainly does however I suspect that the compiler might have optimized the problem away. When it calculated DATA((IN2-1)*NN(1)+IN1)/(NN(1)*NN(2)) it very likely stored it in a register and used this value instead of SCATT(IN1,IN2) in UADI(IN1,IN2) = SCATT(IN1,IN2)+1.0.
Besides if you want to speed the code up you should make the loops more efficient. The first rule of parallelization is don't do it! Optimize the serial code first. So replace snippet 1 with (you could even through in workshare around the last line)
DATA/(NN(1)*NN(2))
!$OMP PARALLEL DO private(temp)
DO IN2=1,NN(2)
temp = (IN2-1)*NN(1)
SCATT(:,IN2) = DATA(temp+1:temp+NN(1))
ENDDO
!$OMP END PARALLEL DO
UADI = SCATT+1.0
You can do something similar with snippet 2 as well.
I have this sequential code in Fortran. My problem is, when I put Openmp directives, the paralleled code is more slow than the sequential, and I don't see the error.
REAL, DIMENSION(:), ALLOCATABLE :: current, next
ALLOCATE ( current(TOTAL_Z), next(TOTAL_Z))
CALL CPU_TIME(t1)
!$OMP PARALLEL SHARED (current, next) PRIVATE (z)
DO t = 1, TOTAL_TIME
!$OMP DO SCHEDULE(STATIC, 2)
DO z = 2, (TOTAL_Z - 1)
next(z) = current (z) + KAPPA*DELTA_T*((current(z - 1) - 2.0*current(z) + current(z + 1)) / DELTA_Z**2)
END DO
!$OMP END DO
current = next
END DO
CALL CPU_TIME(t2)
!$OMP END PARALLEL
TOTAL_Z, TOTAL_TIME, KAPPA, DELTA_T, DELTA_Z are constants.
When I run the paralleled code, I see in htop and my 2 cores are working at 100%
In sequential code, CPU_TIME is 79 seg and in paralleled is 132 seg
Thank
I've just been experiencing the same problem.
It seems that using cpu_time() is not suitable to measure the performance of multi-threaded code. cpu_time() will add the total time of all the threads which is likely to increase with increasing number of threads.
I've found this in another forum,
http://software.intel.com/en-us/forums/topic/281897
You should use system_clock() or omp_get_wtime() functions to get a more accurate timing of your routine.
It is probably slow because of the threads are contending to access the shared variables. If you can change it to use reduction it would likely be faster. But that might not be easy since the calculation for "current" accesses multiple array elements.
depending on the number of iterations, you might also be facing a problem with false-sharing on the nest array. Since the chunk size for the distribution of the DO loop is rather small, the cache line for nest(z), nest(z+1), nest(z+2), nest(z+3), etc might be thrashing between the L1/L2 caches of the CPU.
Cheers,
-michael