Using OpenMP to parallelize a for loop - c++

I'm new to OpenMP. When I parallelize a for loop using
#pragma omp parallel for num_threads(4)
for(i=0;i<4;i++){
//some parallelizable code
}
Is it guaranteed that every thread takes one and only one value of i? How is the loop work divided among the threads in general when num_threads is not equal to or does not evenly divide the total number of times of the for loop? Is there a command I can use to specify that each thread takes only one value of i, or the number of values of i each thread takes?

The work division in a loop construct is decided by the schedule. If no schedule clause is present, the def-sched-var schedule is used, which is implementation defined.
You could use schedule (static, 1), which in your case guarantees that each thread will get exactly one value.
I highly recommend to take a look at the OpenMP specification, Table 2.5 and 2.7.1.1.
There may be legitimate reasons for making this kind of assumptions, but in general the correctness of your loop code should not depend on this. Primarily I would treat this as a performance-hint.
Depending on your use-case you may want to consider tasks or just parallel constructs. If you rely such details for loops, make sure it is well specified in the standard, and not just works in your particular implementation.

Related

Standalone loop collapsing in OpenMP

I have a nested loop that looks as follows -
for(int i=0; i<limit1; i++)
{
for(int j=0; j<limit2; j++){
/*
Code with Custom function calls, branching etc etc.
*/
}
}
The inner and outer loop iterations are independent of each other. Since this loop cannot trivially be vectorized, I would like to collapse it into one huge iteration space so that it can be unrolled more efficiently, in the sense that, instead of unrolling only the inner loop, the combined iteration space can be unrolled.
I do not want to parallelize it. From my understanding, I do not think openmp provides a stand alone collapsing pragma like - #pragma omp collapse(2), rather the collapse clause is used in conjunction with other pragmas like simd or for.
Can acheive this via OpenMP, or maybe some other method as well.
PS - I am not sure about the pros and cons of this method, whether one would get a significant performance increase etc. etc. If someone explain how can one (or for that matter if one can) theoretically(i.e. before benchmarking) ascertain this, it would be great.
TIA
First of all, OpenMP is "a model for parallel programming that is portable across architectures from different vendors" (source: OpenMP 5.1 specification). Thus, if you do not want to parallelize the loops, then it is probably not the right tool.
However, please note that OpenMP provide the directive unroll to fully or partially unrolls the outermost loop of a loop nest. The full clause to fully unroll a loop requires the iteration count (e.g. limit1) to be a compile-time constant (it is not theoretically possible otherwise anyway). The partial clause can be parametrized by an integer to control how many iterations are unrolled. As of now, neither GCC, ICC nor Clang support the two clauses.
There is also the tile directive that could reorder the iterations of your two nested loops so to get 2D fixed-size tiles. It does not specify to the compiler that the loop should be unrolled. Nevertheless, an aggressively-optimizing compiler will likely unroll the fixed-sized tile loop nest generally if it does not contain many branches or function calls (since it is often detrimental for the performance). This directive cannot be used in conjunction with unroll (because the tile loops do not have a canonical loop nest form).
Note that collapsing (manually or automatically using a tool) the two nested loops will certainly not help the compiler if the iteration counts are not a compile-time constants, and typically not if limit2 is not a power of two. Indeed, in such a case, using slow modulus by a variable or a rather-slow branch-less conditional move are certainly required. Unrolling such a loop would be very hard for the compiler (if even possible).

Not able to achieve desired speed up using openmp

I am trying to use openmp directive to parallelize a piece of code but not being able to achieve any speed up. Folowing is the piece of code that I am trying to parllelize:
#pragma omp parallel private(i,j) shared(a,x,n) default(none)
{
for(j=n-1;j>=0;j--)
{
x[j] = A(j,n,n)/A(j,j,n);
#pragma omp for schedule(dynamic)
for (i=0;i<=j-1;i++)
{
A(i,n,n )= A(i,n,n) - A(i,j,n)*x[j];
}
}
}
The value of n is 1000. The A(i,n,n) is defined macro which is used to access to array a.
As I increase the number of threads the execution time increases or it remains the same. The machine I am working on has 4 cores. I am suprised that that there is no speed up even when the number of threads is 2.
I am not able to figure what am I doing wrong?
Since n>>#CPUs (I don't think you have 1000 CPUs), it is not wise to parallelize the inner loop. In your example, you redistribute the work at each iteration.
Instead, it is wiser to parallelize the outer loop. This way, the value of x[j] won't be updated concurrently by different threads (as Zulan mentioned), and you will have much less work re-distribution.
In that case, using dynamic scheduling is wise since the quantity of work change at each iteration.
Note: You will have to change the order of the calculation, the current implementation does not allow you to move the parallelization to the outer loop since all of the threads will update the same value (A[i][n][n]).
Although it is true that threads creating take time, the threads are not recreated at each iteration. They are only created once on the top #pargma statement and running concurrently for the entire following clause.

Performance problems using OpenMP in nested loops

I'm using the following code, which contains an OpenMP parallel for loop nested in another for-loop. Somehow the performance of this code is 4 Times slower than the sequential version (omitting #pragma omp parallel for).
Is it possible that OpenMp has to create Threads every time the method is called? In my test it is called 10000 times directly after each other.
I heard that sometimes OpenMP will keep the threads spinning. I also tried setting OMP_WAIT_POLICY=active and GOMP_SPINCOUNT=INFINITE. When I remove the openMP pragmas, the code is about 10 times faster. Note that the method containing this code will be called 10000 times.
for (round k = 1; k < processor.max; ++k) {
initialise_round(k);
for (std::vector<int> bucket : color_buckets) {
#pragma omp parallel for schedule (dynamic)
for (int i = 0; i < bucket.size(); ++i) {
if (processor.mark.is_marked_item(bucket[i])) {
processor.process(k, bucket[i]);
}
}
processor.finish_round(k);
}
}
You say that your sequential code is much faster so this makes me think that your processor.process function has too few instructions and duration. This leads to the case where passing the data to each thread does not pay off (the data exchange overhead is simply larger than the actual computation on that thread).
Other than that, I think that parallelizing the middle loop won't affect the algorithm but increase the amount of work per thread/
I think you are creating a team of threads on each iteration of the loop... (although I'm not sure what for alone does - I thought it should be parallel for). In this case, it would probably be better to separate the parallel from the for so the work of forking and creating the threads is done just once rather than being repeated in the other loops. So you could try to put a parallel pragma before your outermost loop so the overhead of forking and thread creation is just done once.
The actual problem was not related to OpenMP directly.
As the system has two CPUs, half of the threads where spawned on one and the other half on the other CPU. Therefore there was not shared L3 Cache. This lead in combination that the algorithm doesn't scale well to a performance decrease especially when using 2-4 Threads.
The solution was to use thread pinning for example via the linux tool: taskset

Signaling in OpenMP

I am writing computational code that more-less has the following schematic:
#pragma omp parallel
{
#pragma omp for nowait
// Compute elements of some array A[i] in parallel
#pragma omp single
for (i = 0; i < N; ++i) {
// Do some operation with A[i].
// This time it is important that operations are sequential. e.g.:
result = compute_new_result(result, A[i]);
}
}
Both computing A[i] and compute_new_result are rather expensive. So my idea is to compute the array elements in parallel and if any of the threads gets free, it starts doing sequential operations. There is a good chance that the starting array elements are already computed and the others will be provided by the other threads doing still the first loop.
However, to make the concept work I have to achieve two things:
To make OpenMP split the loops in alternative way, i.e. for two threads: thread 1 computing A[0], A[2], A[4] and thread 2: A[1], A[3], A[5], etc.
To provide some signaling system. I am thinking about an array of flags indicating that A[i] has already been computed. Then compute_new_result should wait for the flag for respective A[i] to be released before proceeding.
I would be glad for any hints how to achieve both goals. I need the solution to be portable across Linux, Windows and Mac. I am writing the whole code in C++11.
Edit:
I have figured out the answer to the fist question. It looks like it is sufficient do add schedule(static,1) clause to the #pragma omp for directive.
However, I am still thinking on the elegant solution of the second issue...
If you don't mind replacing the OpenMP for worksharing construct with a loop that generates tasks instead, you can use OpenMP task to implement both parts of your application.
In the first loop you would create (instead of the loop chunks), tasks that take on the compute load of the iterations. Each iteration of the second loop then also becomes an OpenMP task. The important part then will be to syncronize the tasks between the different phases.
For that you can use task dependencies (introduce with OpenMP 4.0):
#pragma omp task depend(out:A[0])
{ A[0] = a(); }
#pragma omp task depend(in:A[0])
{ b(A[0]); }
Will make sure that task b does not run before task a has completed.
Cheers,
-michael
This is probably an extended comment rather than an answer ...
So, you have a two-phase computation. In phase 1 you can compute, independently, each entry in your array A. It is straightforward therefore to parallelise this using an OpenMP parallel for loop. But there is an issue here, naive allocations of work to threads are likely to lead to a (severely ?) unbalanced load across threads.
In phase 2 there is a computation which is not so easily parallelised and which you plan to give to the first thread to finish its share of phase 1.
Personally I'd split this into 2 phases. In the first, use a parallel for loop. In the second drop OpenMP and just have a sequential code. Sort out the load balancing within phase 1 by tuning the arguments to a schedule clause; I'd be tempted to try schedule(guided) first.
If tuning the schedule can't provide the balance you want then investigate replacing parallel for by task-ing.
Do not complicate the code for phase 2 by rolling your own signalling technique. I'm not concerned that the complication will overwhelm you, though you might be concerned about that, but that the complication will fail to deliver any benefits unless you sort out the load balance in phase 1. And when you've done that you don't need to put phase2 inside an OpenMP parallel region.

Proper use of "atomic directive" to lock STL container

I have a large number of sets of integers, which I have, in turn, put into a vector of pointers. I need to be able to update these sets of integers in parallel without causing a race condition. More specifically. I am using OpenMP's "parallel for" construct.
For dealing with shared resources, OpenMP offers a handy "atomic directive," which allows one to avoid a race condition on a specific piece of memory without using locks. It would be convenient if I could use the "atomic directive" to prevent simultaneous updating to my integer sets, however, I'm not sure whether this is possible.
Basically, I want to know whether the following code could lead to a race condition
vector< set<int>* > membershipDirectory(numSets, new set<int>);
#pragma omp for schedule(guided,expandChunksize)
for(int i=0; i<100; i++)
{
set<int>* sp = membershipDirectory[rand()];
#pragma omp atomic
sp->insert(45);
}
Note that I use a random integer for the index, because in my application, any thread might access any index (there is a random element in my larger application, but I need not go into details).
I have seen a similar example of this for incrementing an integer, but I'm not sure whether it works when working with a pointer to a container as in my case.
After searching around, I found the OpenMP C and C++ API manual on openmp.org, and in section 2.6.4, the limitations of the atomic construct are described.
Basically, the atomic directive can only be used with the following operators:
Unary:
++, -- (prefix and postfix)
Binary:
+,-,*,/,^,&,|,<<,>>
So I will just use locks!
(In some situations critical sections might be preferable, but in my case locks will provide fine grained access to the shared resource, yielding better performance than a critical section.)
you should not use atomic where expression is a function call, it only applies to simple expressions (with possibly built-ins: power, square root).
Instead use critical section (either named or default)
Your code is not clear. Assuming that membershipDirectory[5] is actually membershipDirectory[i], atomic directive is not needed. For two processors, for example, OpenMP produces two threads, one handles i = 0-49, another 50-99 intervals. In this case, there is no need to protect membershipDirectory[i]. atomic directive is required to protect some common resource which does not depend on the loop index, for example, total sum.