I am trying to understand nested DO loops in OpenMP. There are many forum threads about this, but I did not find an answer to the following question.
Let us consider two scenarios. The question is below the code.
Case a)
!$OMP PARALLEL DO
DO i = 1,N
DO j = 1,M
...code...
END DO
END DO
Case b)
!$OMP PARALLEL DO
DO i = 1,N
!$OMP PARALLEL DO
DO j = 1,M
... code ...
END DO
END DO
Question: I am not sure whether the following statements are correct.
In case (a) the outer loop is shared among the threads: say thread 1 gets i between 1 and 10, thread 2 gets i between 11 and 20, and so on. Within its chunk, each thread then runs j over the full range 1 to M, doesn't it? Or is the inner loop also divided among the threads like the outer one?
In case (b) thread 1 might get i between 1 and 10 and j between, say, 10 and 20 (and not the entire range).
Is this how the pattern works? I am sorry if I have failed to express my thoughts and the question is unclear.
The second version makes a difference only when nested parallelism is enabled in OpenMP. If it is, the inner directive creates another level of thread teams that divides the work further; if it is not, the inner directive is ignored.
You cannot say exactly how the threads divide the indexes, because you do not specify a SCHEDULE and the default is implementation defined. Usually, however, the default is static and the index range is divided evenly.
The first PARALLEL DO divides the i values among teams of threads, and the second one divides the j values among the individual threads inside those teams.
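For completeness, here is a minimal sketch (my own example, not from the question) of case (b) with nested parallelism actually enabled. omp_set_max_active_levels is the portable way to allow more than one active level; the older OMP_NESTED / omp_set_nested mechanism is deprecated in OpenMP 5.0. The team sizes of 2 are arbitrary.
PROGRAM nested_demo
  USE OMP_LIB
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 4, M = 4
  INTEGER :: i, j
  ! Allow two active levels of parallelism; without this the inner
  ! PARALLEL DO is executed by a team of one thread.
  CALL omp_set_max_active_levels(2)
  !$OMP PARALLEL DO NUM_THREADS(2) PRIVATE(j)
  DO i = 1, N
     !$OMP PARALLEL DO NUM_THREADS(2)
     DO j = 1, M
        PRINT *, 'outer thread', omp_get_ancestor_thread_num(1), &
                 'inner thread', omp_get_thread_num(), 'i,j =', i, j
     END DO
     !$OMP END PARALLEL DO
  END DO
  !$OMP END PARALLEL DO
END PROGRAM nested_demo
If the goal is simply to spread all i,j iterations over a single team of threads, applying !$OMP PARALLEL DO COLLAPSE(2) to case (a) is usually the simpler choice.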
TL;DR: task-parallel code fails to recognize a task dependency on partially overlapping subarrays.
I'm trying to parallelize some Fortran code using task parallelism, and I'm struggling to get the task dependencies to work.
The following minimal example illustrates my problem:
program test
integer, parameter :: n = 100
double precision :: A(n,n),B(n,n)
integer :: istart, istop
!$omp parallel
!$omp single
istart = 20
istop = 50
!$omp task shared(A,B) firstprivate(istart,istop)&
!$omp& depend( in: B(istart:istop,istart:istop) )&
!$omp& depend( inout: A(istart:istop,istart:istop) )
write(*,*) "starting task 1"
call sleep(5)
write(*,*) "stopping task 1"
!$omp end task
istart = 30
istop = 60
!$omp task shared(A,B) firstprivate(istart,istop)&
!$omp& depend( in: B(istart:istop,istart:istop) )&
!$omp& depend( inout: A(istart:istop,istart:istop) )
write(*,*) "starting task 2"
call sleep(5)
write(*,*) "stopping task 2"
!$omp end task
!$omp end single nowait
!$omp end parallel
end program
I compile this with gcc version 9.4.0 and the -fopenmp flag.
The code has two tasks. Task 1 depends on A(20:50,20:50) and task 2 depends on A(30:60,30:60). These subarrays overlap, so I would expect task 2 to wait until task 1 has completed, but the output I get is:
starting task 1
starting task 2
stopping task 1
stopping task 2
If I comment out the lines
istart = 30
istop = 60
so that the subarrays are exactly the same instead of just overlapping, task 2 does wait on task 1.
Are task dependencies with overlapping subarrays not supported in OpenMP? Or am I defining the dependencies wrong somehow?
These subarrays overlap
This is forbidden by the OpenMP standard, as pointed out by @ThijsSteel.
This choice was made because of the runtime overhead it would otherwise cause: checking for overlapping array regions in dependencies is very expensive in pathological cases, especially for arrays with many dimensions. A typical pathological case is when many tasks write to the same array (on disjoint, exclusive parts) and one task then operates on the whole array: this creates a lot of checks and dependencies, while the pattern could be handled much more cheaply with a dedicated dependency scheme. An even more pathological case is a transposition of the sub-arrays (from n horizontal bands to n vertical bands): this quickly results in n² dependencies, which is insanely inefficient when n is large. An optimization is to aggregate the dependencies so as to get only 2*n of them, but that is expensive to do at runtime, and a better way is actually not to use dependencies at all but a task-group synchronization.
From the user's point of view, there are only a few options:
operate at a coarser grain, even though it means more synchronization (and so possibly poor performance);
try to group tasks that operate on distinct sub-arrays by checking for overlap manually. You can cheat by giving OpenMP fake dependencies to enforce the ordering (see the sketch just below this list), or add more task synchronizations (typically a taskwait). You can also make use of the advanced task-dependence-type modifiers depobj, inoutset and mutexinoutset, available in recent versions of the OpenMP standard.
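To make the coarse/fake-dependency idea concrete, here is a hedged sketch of mine (not from the question): both tasks name the whole arrays in their depend clauses instead of the partially overlapping sub-blocks, so the list items are identical storage locations and the runtime orders the tasks. The price is that tasks touching genuinely disjoint blocks are serialised too.
! Hedged sketch: conforming replacement for the overlapping-dependency
! tasks from the question. Depending on the whole arrays (identical
! storage in both tasks) forces task 2 to wait for task 1.
!$omp task shared(A,B) firstprivate(istart,istop) &
!$omp& depend( in: B ) depend( inout: A )
   ! ... work on A(istart:istop,istart:istop) using B ...
!$omp end task
A dedicated sentinel variable that appears only in depend clauses achieves the same ordering when depending on the whole array is too coarse for the rest of the task graph.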
Most of the time, an algorithm is composed of several waves of tasks, and each wave operates on non-overlapping sub-arrays. In that case you can just add a task synchronization at the end of the computation (or use recursive tasks).
This is a known limitation of OpenMP and some researchers have worked on it (including me). I expect new specific dependency modifiers to be added in the future, and possibly user-defined partitioning-related features in the long run (some research runtimes, like StarPU, do that, for example, but not everyone agrees on adding it to OpenMP, as it is a bit high-level and not simple to include).
I am trying to vectorize a nested loop using OpenMP 4.0's simd feature, but I'm afraid I'm doing it wrong. My loop looks like this:
do iy = iyfirst, iylast
do ix = ixfirst, ixlast
!$omp simd
do iz = izfirst, izlast
dudx(iz,ix,iy) = ax(1)*( u(iz,ix,iy) - u(iz,ix-1,iy) )
do ishift = 2, ophalf
dudx(iz,ix,iy) = dudx(iz,ix,iy) + ax(ishift)*( u(iz,ix+ishift-1,iy) - u(iz,ix-ishift,iy) )
enddo
dudx(iz,ix,iy) = dudx(iz,ix,iy)*buoy_x(iz,ix,iy)
enddo
!$omp end simd
enddo
enddo
Note that ophalf is a small integer, usually 2 or 4, so it makes sense to vectorize the iz loop rather than the innermost loop.
My question is: Do I have to mark ishift as a private variable?
In standard OpenMP parallel do loops, you certainly do need a private(ishift) to ensure other threads don't stomp over each other's data. Yet when I instead rewrite the first line as !$omp simd private(ishift), I get the ifort compilation error:
error #8592: Within a SIMD region, a DO-loop control-variable must not be specified in a PRIVATE SIMD clause. [ISHIFT]
Looking online, I couldn't find any successful resolution of this question. It seems to me that ishift should be private, but the compiler is not allowing it. Is an inner-loop variable automatically forced to be private?
Follow-up question: Later, when I add an omp parallel do around the iy loop, should I include a private(ishift) clause in the omp parallel do directive, the omp simd directive, or both?
Thanks for any clarifications.
In a SIMD context, the private clause essentially means that the value of ishift is private to each SIMD lane within the SIMD register. That would make sense if we vectorized the innermost loop, since ishift would be its induction variable. But when you vectorize an outer loop, every SIMD lane has a different value of the iz loop index, while for a given iz the variable ishift still ranges from 2 to ophalf. So it does not qualify for a private clause in the SIMD context.
When it comes to multiple threads, you do want each thread to have its own copy of ishift, so that one thread incrementing it doesn't cause another thread to skip an iteration. So a private clause for ishift makes sense in the omp parallel do context. It would be interesting to check the generated code to see whether the innermost loop is completely unrolled and the iz loop vectorized.
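To make the follow-up question concrete, here is a sketch of my reading of the answer above (not a verified recipe for every compiler): ishift goes in the private clause of the omp parallel do, and the omp simd directive is left without it.
! Sketch only: the same loop as above, with ishift made private on the
! parallel do (not on the simd directive, which ifort rejects).
!$omp parallel do private(ix, iz, ishift)
do iy = iyfirst, iylast
  do ix = ixfirst, ixlast
    !$omp simd
    do iz = izfirst, izlast
      dudx(iz,ix,iy) = ax(1)*( u(iz,ix,iy) - u(iz,ix-1,iy) )
      do ishift = 2, ophalf
        dudx(iz,ix,iy) = dudx(iz,ix,iy) + ax(ishift)*( u(iz,ix+ishift-1,iy) - u(iz,ix-ishift,iy) )
      enddo
      dudx(iz,ix,iy) = dudx(iz,ix,iy)*buoy_x(iz,ix,iy)
    enddo
    !$omp end simd
  enddo
enddo
!$omp end parallel do
Listing ix and iz is technically redundant, since sequential DO indices inside a parallel region are predetermined private in Fortran, but it does no harm and makes the intent explicit.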
I am a little bit confused about the race conditions that can occur in OpenMP.
Specifically, I have two arrays A and B that contain data, and I wish to use the data in one, compute something, and store it in the other.
My Fortran code would look like this:
!$OMP PARALLEL DO PRIVATE(tmp)
DO i = 1, 10000
tmp = A(i) !!Extract A(i)
data = Do_Stuff(tmp) !!Compute
B(i)=data !!Store
END DO
!$OMP END PARALLEL DO
Are there any lurking race conditions here?
I'm asking because, on pages 11-12 of the introduction I'm reading, the code below is said to have this problem, even though the index i is different for every iteration.
!$OMP PARALLEL DO
do i = 1, 1000
B(i) = 10 * i
A(i) = A(i) + B(i)
end do
!$OMP END PARALLEL DO
There is a race condition in your first example.
The variable data is not explicitly given a data-sharing attribute and doesn't have a predetermined one; consequently, in a parallel construct it is shared. Multiple threads will read and write to it.
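A minimal sketch of the fix (my wording, assuming Do_Stuff itself is thread-safe): give data, and for clarity tmp as well, an explicit PRIVATE attribute so that every thread works on its own copies.
!$OMP PARALLEL DO PRIVATE(tmp, data)
DO i = 1, 10000
   tmp = A(i)              !! Each thread reads its own element of A
   data = Do_Stuff(tmp)    !! ... and computes on private copies
   B(i) = data             !! Distinct i per iteration, so no conflict on B
END DO
!$OMP END PARALLEL DO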
There is no such condition in your second example.
I'm quite new to Fortran and OpenMP, but I'm trying to get my bearings. I have a piece of code for calculating variograms which I'm attempting to parallelize. However, I seem to be getting race conditions, as some of the results are off by a thousandth or so.
The problem seems to be the reductions. Using OpenMP reductions works and gives the correct results, but they are not desirable, because the reductions actually happen in another subroutine (I copied the relevant lines into the OpenMP loop for the test). Therefore I put the reductions inside a CRITICAL section, but without success. Interestingly, the problem only occurs for reals, not integers. I have thought about whether or not the order of the additions makes any difference, but it should not produce errors this big.
Just to check, I put everything in the parallel do in an ORDERED block, which (of course) gave the correct results (albeit without any speedup). I also tried putting everything inside a CRITICAL section, but for some reason that did not give the correct results. My understanding is that OpenMP will flush the shared variables upon entering/exiting CRITICAL sections, so there shouldn't be any cache problems.
So my question is: why doesn't a critical section work in this case?
My code is below. All shared variables except np, tm, hm, gam are read-only.
EDIT: I tried to simulate the randomness induced by multiple threads by replacing the do loops with random integers in the same range (i.e. generate a pair i,j in the range of the loops; if they have been "visited", generate new ones), and to my surprise the results matched. However, upon further inspection it was revealed that I had forgotten to seed the RNG, and the results were correct by coincidence. How embarrassing!
TL;DR: The discrepancies in the results were caused by the order in which the floating-point values were added. Using double precision instead helps.
!$OMP PARALLEL DEFAULT(none) SHARED(nd, x, y, z, nzlag, nylag, nxlag, &
!$OMP& dzlag, dylag, dxlag, nvarg, ivhead, ivtail, ivtype, vr, tmin, tmax, np, tm, hm, gam) num_threads(512)
!$OMP DO PRIVATE(i,j,zdis,ydis,xdis,izl,iyl,ixl,indx,vrh,vrt,vrhpr,vrtpr,variogram_type) !reduction(+:np, tm, hm, gam)
DO i=1,nd
!$OMP CRITICAL (main)
! Second loop over the data:
DO j=1,nd
! The lag:
zdis = z(j) - z(i)
IF(zdis >= 0.0) THEN
izl = INT( zdis/dzlag+0.5)
ELSE
izl = -INT(-zdis/dzlag+0.5)
END IF
! ---- SNIP ----
! Loop over all variograms for this lag:
DO cur_variogram=1,nvarg
variogram_type = ivtype(cur_variogram)
! Get the head and tail values:
indx = i+(ivhead(cur_variogram)-1)*maxdim
vrh = vr(indx)
indx = j+(ivtail(cur_variogram)-1)*maxdim
vrt = vr(indx)
IF(vrh < tmin.OR.vrh >= tmax.OR. vrt < tmin.OR.vrt >= tmax) CYCLE
! ----- PROBLEM AREA -------
np(ixl,iyl,izl,1) = np(ixl,iyl,izl,1) + 1. ! <-- This never fails
tm(ixl,iyl,izl,1) = tm(ixl,iyl,izl,1) + vrt
hm(ixl,iyl,izl,1) = hm(ixl,iyl,izl,1) + vrh
gam(ixl,iyl,izl,1) = gam(ixl,iyl,izl,1) + ((vrh-vrt)*(vrh-vrt))
! ----- END OF PROBLEM AREA -----
!CALL updtvarg(ixl,iyl,izl,cur_variogram,variogram_type,vrt,vrh,vrtpr,vrhpr)
END DO
END DO
!$OMP END CRITICAL (main)
END DO
!$OMP END DO
!$OMP END PARALLEL
Thanks very much in advance!
If you are using 32-bit floating-point numbers and arithmetic, the difference between 84.26539 and 84.26538, that is a difference of 1 in the least-significant digit, is entirely explicable by the non-determinism of parallel floating-point arithmetic. Bear in mind that a 32-bit floating-point number only has about 7 decimal digits to play with.
Ordinary floating-point arithmetic is not strictly associative. For real (in the mathematical not Fortran sense) numbers (a+b)+c==a+(b+c) but there is no such rule for floating-point numbers. This is nicely explained in the Wikipedia article on floating-point arithmetic.
The non-determinism arises because, in using OpenMP you surrender control over the ordering of operations to the run-time. A summation of values across threads (such as a reduction on +) leaves the bracketing of the global sum expression to the run-time. It is not even necessarily true that 2 executions of the same OpenMP program will produce the same-to-the-last-bit results.
I suspect that even running an OpenMP program on one thread may produce different results from the equivalent non-OpenMP program. Since knowledge of the number of threads available to an OpenMP executable may be deferred until run-time the compiler will have to create a parallelised executable whether it is eventually run in parallel or not.
High Performance Mark makes an interesting point about floating point and associativity. This can easily be tested (in C):
#include <stdio.h>
int main(void) {
    float a = -1.0E8f, b = 1.0E8f, c = 1.23456f;
    printf("sum %f\n", (a+b)+c); // prints 1.234560
    printf("sum %f\n", a+(b+c)); // prints 0.000000
}
But I would like to point out that it is possible to preserve the order in OpenMP. I discussed this here: C++ OpenMP: Split for loop in even chunks static and join data at the end
Edit:
Actually, I confused commutativity and associativity. If you have an operator which is associative but not commutative, then it's possible to preserve the order with OpenMP, as I did in the post above. However, IEEE floating point is commutative but NOT associative, so the only thing that can be done is to break IEEE semantics and treat it as associative.
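As an illustration of the order-preserving idea (a sketch of mine, not the code from the linked post): each thread accumulates a partial sum over its own contiguous block of iterations, and the partials are then combined in thread order by a single thread. For a fixed number of threads the bracketing of the sum is the same on every run, so the result is reproducible, though still not bit-identical to the serial sum.
program ordered_sum
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real :: x(n), total
  real, allocatable :: partial(:)
  integer :: i, t, nt

  call random_number(x)
  nt = omp_get_max_threads()
  allocate(partial(0:nt-1))
  partial = 0.0

  !$omp parallel private(t)
  t = omp_get_thread_num()
  !$omp do schedule(static)          ! one contiguous block per thread
  do i = 1, n
     partial(t) = partial(t) + x(i)  ! each thread writes only partial(t)
  end do
  !$omp end do
  !$omp end parallel

  total = 0.0
  do t = 0, nt - 1                   ! combine partials in thread order
     total = total + partial(t)
  end do
  print *, total
end program ordered_sum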
I have two do-loops inside OpenMP parallel region as follows:
!$OMP PARALLEL
...
!$OMP DO
...
!$OMP END DO
...
!$OMP DO
...
!$OMP END DO
...
!$OMP END PARALLEL
Let's say OMP_NUM_THREADS=6. I want to run the first do-loop with 4 threads and the second do-loop with 3 threads. Can you show how to do it? I want them to be inside one parallel region, though. Also, is it possible to specify which thread numbers should execute each of the do-loops? For example, for the first do-loop I could ask it to use thread numbers 1, 2, 4, and 5. Thanks.
Well, you can add the num_threads clause to an OpenMP parallel directive, but it applies to the whole region, including every construct inside it. In your case you could split your program into two parallel regions, like
!$OMP PARALLEL DO num_threads(4)
...
!$OMP END PARALLEL DO
...
!$OMP PARALLEL DO num_threads(3)
...
!$OMP END PARALLEL DO
This, of course, is precisely what you say you don't want to do: you want only one parallel region. But there is no mechanism for throttling the number of threads in use inside a parallel region. Personally, I can't see why anyone would want to do that.
As for assigning parts of the computation to particular threads: again, no, OpenMP does not provide a mechanism for doing that, and why would you want to?
I suppose that I am dreadfully conventional, but when I see signs of parallel programs where the programmer has tried to take precise control over individual threads, I usually see a program with one or more of the following characteristics:
OpenMP directives are used to ensure that the code runs in serial with the result that run time exceeds that of the original serial code;
the program is incorrect because the programmer has failed to deal correctly with the subtleties of data races;
it has been carefully arranged to run only on a specific number of threads.
None of these is desirable in a parallel program, and if you want that level of control over the number of threads and the allocation of work to individual threads, you will have to use a lower-level approach than OpenMP provides. Such approaches abound, so giving up OpenMP should not limit you.
What you want cannot be achieved with the existing OpenMP constructs but only manually. Imagine that the original parallel loop was:
!$OMP DO
DO i = 1, 100
...
END DO
!$OMP END DO
The modified version with custom selection of the participating threads would be:
USE OMP_LIB
INTEGER, DIMENSION(:), ALLOCATABLE :: threads
INTEGER :: tid, i, imin, imax, tidx
! IDs of threads that should execute the loop
! Make sure no repeated items inside
threads = (/ 0, 1, 3, 4 /)
IF (MAXVAL(threads, 1) >= omp_get_max_threads()) THEN
STOP 'Error: insufficient number of OpenMP threads'
END IF
!$OMP PARALLEL PRIVATE(tid,i,imin,imax,tidx)
! Get current thread's ID
tid = omp_get_thread_num()
...
! Check if current thread should execute part of the loop
IF (ANY(threads == tid)) THEN
! Find out what thread's index is
tidx = MAXLOC(threads, 1, threads == tid)
! Compute iteration range based on the thread index
imin = 1 + ((100-1 + 1)*(tidx-1))/SIZE(threads)
imax = 1 + ((100-1 + 1)*tidx)/SIZE(threads) - 1
PRINT *, 'Thread', tid, imin, imax
DO i = imin, imax
...
END DO
ELSE
PRINT *, 'Thread', tid, 'not taking part'
END IF
! This simulates the barrier at the end of the worksharing construct
! Remove in order to implement the "nowait" clause
!$OMP BARRIER
...
!$OMP END PARALLEL
Here are three example executions:
$ OMP_NUM_THREADS=2 ./custom_loop.x | sort
STOP Error: insufficient number of OpenMP threads
$ OMP_NUM_THREADS=5 ./custom_loop.x | sort
Thread 0 1 33
Thread 1 34 66
Thread 2 not taking part
Thread 3 not taking part
Thread 4 67 100
$ OMP_NUM_THREADS=7 ./custom_loop.x | sort
Thread 0 1 33
Thread 1 34 66
Thread 2 not taking part
Thread 3 not taking part
Thread 4 67 100
Thread 5 not taking part
Thread 6 not taking part
Note that this is an awful hack that goes against the basic premises of the OpenMP model. I would strongly advise against doing it: relying on certain threads to execute certain portions of the code creates highly non-portable programs and hinders runtime optimisations.
If you decide to abandon the idea of explicitly assigning the threads that should execute the loop and only want to dynamically change the number of threads, then the chunk size parameter in the SCHEDULE clause is your friend:
!$OMP PARALLEL
...
! 2 threads = 10 iterations / 5 iterations/chunk
!$OMP DO SCHEDULE(static,5)
DO i = 1, 10
PRINT *, i, omp_get_thread_num()
END DO
!$OMP END DO
...
! 10 threads = 10 iterations / 1 iteration/chunk
!$OMP DO SCHEDULE(static,1)
DO i = 1, 10
PRINT *, i, omp_get_thread_num()
END DO
!$OMP END DO
...
!$OMP END PARALLEL
And the output with 10 threads:
$ OMP_NUM_THREADS=10 ./loop_chunks.x | sort_manually :)
First loop
Iteration Thread ID
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 1
Second loop
Iteration Thread ID
1 0
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
10 9