OpenMP: How to collect an array from different threads? - C++

I am an OpenMP newbie and I am stuck on a problem! I have an array which is summed in a loop, but I am having problems parallelizing this. Can you suggest how to do it? The main loop is sketched as follows:
REAL A(N) ! N is an integer
!$OMP PARALLEL
DO I=1,10 ! just an illustration
  DO J=1,10 ! can't post real code sorry, illegal
!$OMP DO
    DO K=1,100
      call messy_subroutine_that_sums_A(I, J, K, A) ! each thread returns its own sum
      ! each thread returns its own A right? and it needs to be summed quickly
    END DO
!$OMP END DO
  END DO
END DO
!$OMP END PARALLEL

SUBROUTINE messy_subroutine_that_sums_A(I, J, K, A)
  REAL A(N) ! an array
  ! basically adds some stuff to the previous value of A
  A = A + junk ! this is really junk
END SUBROUTINE messy_subroutine_that_sums_A
My problem is that all my attempts to collect A from all the threads have failed. Notice that A is summed over the outer loops as well. What is the correct, and fast, procedure to collect A from all the threads as a sum? Secondly, my question is not just a Fortran question; it applies equally to C and C++. It is a conceptual question.

Actually, OpenMP does support reduction on arrays in C and C++ (declared recently in the OpenMP 4.1 comment draft release), and of course in Fortran.
Not all implementations may support this feature, so you had better first check whether your compiler supports it. To see whether you get correct calculations, you can start by placing A=A+junk into a critical section:
!$omp critical
A=A+junk
!$omp end critical
This should give you the same, correct answer in all threads after the OMP DO loop.
Then you can improve performance by using an array reduction on the OMP DO loop instead of the critical section, again getting the same correct answer in all threads after the loop.
Then you can optimize further by moving the reduction to the OMP PARALLEL construct, but you won't be able to check the values in all threads in that case, because each thread works on its own private copy of the array A and thus holds different partial values. The final correct answer will be available only in the master thread, after the OMP PARALLEL region.
Note that you don't need to declare the loop iteration variables private in Fortran, as they are made private automatically.
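Since the question notes that this applies equally to C/C++, here is a minimal C++ sketch of that last variant (reduction moved to the parallel region); the loop bounds and the a[idx] += 1.0f update are placeholders for the real subroutine, and the array-section reduction syntax needs an OpenMP 4.5 capable compiler:
#include <omp.h>
#include <vector>

int main() {
    const int N = 1000;
    std::vector<float> A(N, 0.0f);
    float* a = A.data();

    // Each thread gets a zero-initialized private copy of a[0:N];
    // the copies are summed into the original array when the region ends.
    #pragma omp parallel reduction(+: a[:N])
    {
        for (int i = 0; i < 10; ++i)
            for (int j = 0; j < 10; ++j) {
                #pragma omp for
                for (int k = 0; k < 100; ++k)
                    for (int idx = 0; idx < N; ++idx)
                        a[idx] += 1.0f;  // stand-in for "A = A + junk"
            }
    }
    return 0;
}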

Related

OMP: Is REDUCTION mandatory for an array when there is no concurrency between cells in a DO loop?

I wrote the following program that works as expected.
program main
  use omp_lib
  implicit none
  integer :: i, j
  integer, dimension(1:9) :: myArray
  myArray = 0
  do j = 0, 3
    !$OMP PARALLEL DO DEFAULT(SHARED)
    do i = 1, 9
      myArray(i) = myArray(i) + i*(10**j)
    end do
    !$OMP END PARALLEL DO
  enddo
  print *, myArray
end program main
As only one thread writes to the i-th cell of myArray, REDUCTION on myArray has not been used. But I wonder whether REDUCTION(+ : myArray) must be added to the OMP DO, or whether it is unnecessary. In other terms, what is important: the array or the cells of the array?
What is important: the array or the cell of the array?
Cells. !$OMP PARALLEL DO DEFAULT(SHARED) is fine as long as the loop is embarrassingly parallel. When threads operate on the same location in memory, this causes a race condition. When a reduction is performed cell-wise, REDUCTION(+ : myArray) can be added. That being said, one should note that the full array will likely be replicated in a per-thread temporary before the final reduction of the whole array is done. Since your loop is embarrassingly parallel, REDUCTION(+ : myArray) is not needed here.
Besides, note that the number of iterations is too small for multiple threads to make the code faster. In fact, it will be slower because of the time spent creating/joining threads, the time to distribute the work amongst threads, and also because of an effect called false sharing (which will nearly serialize the execution of the loop here).
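For contrast, a minimal C++ sketch (in Fortran the corresponding clause would be REDUCTION(+:myArray)) of a loop where several iterations do land on the same cell, so a reduction, critical section or atomic update is required:
#include <omp.h>
#include <cstdio>

int main() {
    int myArray[9] = {0};

    // 9000 iterations map onto 9 cells, so several threads update the same
    // cell; without the reduction (or a critical/atomic) this is a data race.
    #pragma omp parallel for reduction(+: myArray[:9])
    for (int i = 0; i < 9000; ++i)
        myArray[i % 9] += 1;

    for (int b = 0; b < 9; ++b)
        std::printf("%d ", myArray[b]);   // each cell ends up at 1000
    std::printf("\n");
}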

How to collect data for each thread in OpenMP

I'm new to OpenMP and am trying to sort out the issue of collecting data from threads. I am studying the example of applying OpenMP to the Monte-Carlo method (the area of a circle inscribed in a square).
I understood how the following code works:
unsigned pointsInside = 0;
#pragma omp parallel for num_threads(threadNum) shared(threadNum) reduction(+: pointsInside)
for (unsigned i = 0; i < threadNum; i++) { ... }
Am I right that originally pointsInside is a variable, but OpenMP represents it as an array, and then the mantra reduction(+: pointsInside) sums over the elements of the "array"?
But the main question is how to collect information directly into an array or vector. I tried to declare an array or vector, pass a pointer or address into OpenMP via shared, and collect information for each thread at the corresponding index. But it works slower than the approach with the variable and reduction. I need the vector/array approach for my current project. Thanks a lot!
UPD:
When I said above that "it works slower", I meant a comparison of two implementations of the Monte-Carlo method: 1) via shared and a vector/array, and 2) via a scalar variable and reduction. The first case is faster. My guess and question about it are below.
I would like to rephrase my question more clearly. I create a vector/array and pass it into OpenMP via shared. I want to collect data for each thread at the corresponding index in the vector/array. With this approach I don't need any synchronization of access to the vector/array. Is it true that OpenMP enables synchronization by default when I use shared? If so, how do I disable it, or do other approaches exist? If not, how do I share the vector/array into the parallel part correctly and without synchronization of access?
I'd like to apply this technique in my project, where I want to work through different permutations in the parallel part, collect each permutation and its scalar result outside the parallel part, then sort the results and choose the best one.
A partial answer:
Am I right that originally pointsInside is a variable but OpenMP represents it as an array and then the mantra reduction(+: pointsInside) sums over the elements of the "array"?
I think it is better to keep thinking of pointsInside as a scalar. When the parallel region starts, the run-time takes care of creating individual scalars, one for each thread (perhaps you might think of them as myPointsInside). When the parallel region finishes, the run-time reduces the values of all the per-thread scalars onto the original scalar pointsInside. This is just about what OpenMP actually does behind the scenes.
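A rough C++ sketch of that mental model (localPoints and the i % 4 test are only illustrative stand-ins, not what the run-time actually does internally):
#include <omp.h>
#include <cstdio>

int main() {
    const int N = 1000000;
    unsigned pointsInside = 0;

    #pragma omp parallel
    {
        unsigned localPoints = 0;            // the per-thread "myPointsInside"
        #pragma omp for nowait
        for (int i = 0; i < N; ++i)
            if (i % 4 == 0)                  // stand-in for the Monte-Carlo hit test
                ++localPoints;

        #pragma omp atomic                   // combine the partial sums, once per thread
        pointsInside += localPoints;
    }
    std::printf("%u\n", pointsInside);       // same result as reduction(+: pointsInside)
}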
As to the rest of your question:
Yes, you can perform reductions on arrays - but this was only added to OpenMP, for C and C++ programs, in OpenMP 4.5 (I think). What goes on is much the same as for the scalar case. This Q&A provides some assistance - Is it possible to do a reduction on an array with openmp?
As to the speed, it's difficult to answer that without a much clearer understanding of what comparisons you are making. But it's very easy to write parallel reductions on arrays which incur a significant penalty in performance from the phenomenon of false sharing, about which you may wish to inform yourself.
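For completeness, if the compiler at hand predates OpenMP 4.5, a common hand-rolled equivalent of an array reduction, which also sidesteps false sharing on the shared array, looks roughly like this (the histogram is just a toy payload):
#include <omp.h>
#include <cstdio>

int main() {
    const int N = 1600, BINS = 16;
    int hist[BINS] = {0};

    #pragma omp parallel
    {
        int local[BINS] = {0};             // per-thread buffer: no sharing, no races
        #pragma omp for nowait
        for (int i = 0; i < N; ++i)
            ++local[i % BINS];

        #pragma omp critical                // merge each thread's buffer exactly once
        for (int b = 0; b < BINS; ++b)
            hist[b] += local[b];
    }

    for (int b = 0; b < BINS; ++b)
        std::printf("%d ", hist[b]);        // every bin ends up at N / BINS = 100
    std::printf("\n");
}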

Not able to achieve the desired speed-up using OpenMP

I am trying to use OpenMP directives to parallelize a piece of code but am not able to achieve any speed-up. Following is the piece of code that I am trying to parallelize:
#pragma omp parallel private(i,j) shared(a,x,n) default(none)
{
    for (j = n-1; j >= 0; j--)
    {
        x[j] = A(j,n,n)/A(j,j,n);
#pragma omp for schedule(dynamic)
        for (i = 0; i <= j-1; i++)
        {
            A(i,n,n) = A(i,n,n) - A(i,j,n)*x[j];
        }
    }
}
The value of n is 1000. A(i,n,n) is a macro defined to access the array a.
As I increase the number of threads, the execution time increases or remains the same. The machine I am working on has 4 cores. I am surprised that there is no speed-up even when the number of threads is 2.
I am not able to figure out what I am doing wrong.
Since n>>#CPUs (I don't think you have 1000 CPUs), it is not wise to parallelize the inner loop. In your example, you redistribute the work at each iteration.
Instead, it is wiser to parallelize the outer loop. This way, the value of x[j] won't be updated concurrently by different threads (as Zulan mentioned), and you will have much less work re-distribution.
In that case, using dynamic scheduling is wise since the quantity of work changes at each iteration.
Note: you will have to change the order of the calculation; the current implementation does not allow you to move the parallelization to the outer loop, since all of the threads would update the same value (A(i,n,n)).
Although it is true that thread creation takes time, the threads are not re-created at each iteration. They are created only once, at the top #pragma statement, and run concurrently for the entire block that follows.

OpenMP: sequential processing within a subroutine

I would like to carry out calculations in parallel by calling
!$OMP PARALLEL
call sub
!$OMP END PARALLEL
this fires off n (= number of threads) calls to sub, as desired.
Within sub, I am reading from separate data files, with the file and the number of reads determined from the thread number. This works if I enclose everything in sub in !$OMP ORDERED ... !$OMP END ORDERED.
However, this also causes all threads to run strictly sequentially, i.e. thread 0 runs first and all reads in it complete before thread 1 starts, etc.
What I would like to achieve, though, is that all n threads run concurrently and only the processing within a thread is sequential (as the reads are from different data files). Any idea how? Replacing ORDERED with TASK does not help.

Signaling in OpenMP

I am writing computational code that more or less has the following schematic:
#pragma omp parallel
{
#pragma omp for nowait
// Compute elements of some array A[i] in parallel
#pragma omp single
for (i = 0; i < N; ++i) {
// Do some operation with A[i].
// This time it is important that operations are sequential. e.g.:
result = compute_new_result(result, A[i]);
}
}
Both computing A[i] and compute_new_result are rather expensive. So my idea is to compute the array elements in parallel and, if any of the threads becomes free, let it start doing the sequential operations. There is a good chance that the first array elements are already computed, and the remaining ones will be provided by the other threads still working on the first loop.
However, to make the concept work I have to achieve two things:
1. To make OpenMP split the loops in an alternating way, i.e. for two threads: thread 1 computing A[0], A[2], A[4] and thread 2 computing A[1], A[3], A[5], etc.
2. To provide some signaling system. I am thinking about an array of flags indicating that A[i] has already been computed. Then compute_new_result should wait for the flag for the respective A[i] to be released before proceeding.
I would be glad for any hints on how to achieve both goals. I need the solution to be portable across Linux, Windows and Mac. I am writing the whole code in C++11.
Edit:
I have figured out the answer to the first question. It looks like it is sufficient to add a schedule(static,1) clause to the #pragma omp for directive.
However, I am still thinking about an elegant solution to the second issue...
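A tiny sketch of what that clause buys (the owner array here is just instrumentation to show the round-robin distribution):
#include <omp.h>
#include <cstdio>

int main() {
    const int N = 8;
    int owner[N];

    #pragma omp parallel num_threads(2)
    {
        // chunk size 1 deals iterations out round-robin:
        // thread 0 gets i = 0, 2, 4, ...; thread 1 gets i = 1, 3, 5, ...
        #pragma omp for schedule(static, 1)
        for (int i = 0; i < N; ++i)
            owner[i] = omp_get_thread_num();
    }
    for (int i = 0; i < N; ++i)
        std::printf("A[%d] would be computed by thread %d\n", i, owner[i]);
}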
If you don't mind replacing the OpenMP for worksharing construct with a loop that generates tasks instead, you can use OpenMP task to implement both parts of your application.
In the first loop you would create, instead of the loop chunks, tasks that take on the compute load of the iterations. Each iteration of the second loop then also becomes an OpenMP task. The important part will then be to synchronize the tasks between the different phases.
For that you can use task dependencies (introduced with OpenMP 4.0):
#pragma omp task depend(out:A[0])
{ A[0] = a(); }
#pragma omp task depend(in:A[0])
{ b(A[0]); }
This will make sure that task b does not run before task a has completed.
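Sketching the full two-phase idea (compute_A and compute_new_result are stand-ins for the expensive routines in the question, and this dependence pattern is one possible choice, not the only one):
#include <omp.h>
#include <cstdio>
#include <vector>

static double compute_A(int i)                       { return i * 0.5; }   // stand-in
static double compute_new_result(double r, double a) { return r + a; }     // stand-in

int main() {
    const int N = 100;
    std::vector<double> buf(N);
    double* A = buf.data();
    double result = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        // Phase 1: one producer task per element.
        for (int i = 0; i < N; ++i) {
            #pragma omp task depend(out: A[i])
            A[i] = compute_A(i);
        }
        // Phase 2: "in: A[i]" waits for the matching producer task;
        // "inout: result" keeps these tasks in their original, sequential order.
        for (int i = 0; i < N; ++i) {
            #pragma omp task depend(in: A[i]) depend(inout: result)
            result = compute_new_result(result, A[i]);
        }
    }
    std::printf("result = %f\n", result);
}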
Cheers,
-michael
This is probably an extended comment rather than an answer ...
So, you have a two-phase computation. In phase 1 you can compute, independently, each entry in your array A. It is therefore straightforward to parallelise this using an OpenMP parallel for loop. But there is an issue here: a naive allocation of work to threads is likely to lead to a (severely?) unbalanced load across threads.
In phase 2 there is a computation which is not so easily parallelised and which you plan to give to the first thread to finish its share of phase 1.
Personally I'd split this into 2 phases. In the first, use a parallel for loop. In the second, drop OpenMP and just have sequential code. Sort out the load balancing within phase 1 by tuning the arguments to a schedule clause; I'd be tempted to try schedule(guided) first.
If tuning the schedule can't provide the balance you want, then investigate replacing the parallel for with tasking.
Do not complicate the code for phase 2 by rolling your own signalling technique. My concern is not that the complication will overwhelm you, though you might be concerned about that, but that it will fail to deliver any benefit unless you sort out the load balance in phase 1. And once you've done that, you don't need to put phase 2 inside an OpenMP parallel region at all.
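For illustration, a minimal sketch of that two-phase structure (compute_A and compute_new_result again stand in for the expensive routines in the question):
#include <omp.h>
#include <cstdio>
#include <vector>

static double compute_A(int i)                       { return i * 0.5; }   // stand-in
static double compute_new_result(double r, double a) { return r + a; }     // stand-in

int main() {
    const int N = 100;
    std::vector<double> A(N);

    // Phase 1: independent element computations; guided scheduling helps when
    // the per-element cost varies and threads would otherwise finish unevenly.
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < N; ++i)
        A[i] = compute_A(i);

    // Phase 2: the order-dependent accumulation stays sequential, outside OpenMP.
    double result = 0.0;
    for (int i = 0; i < N; ++i)
        result = compute_new_result(result, A[i]);

    std::printf("result = %f\n", result);
}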