How to parallelize updating sums with OpenMP - C++

The following loop iterates over all edges of a graph, determines if the end nodes belong to the same group, and then adds the edge weight to the total edge weight of that group.
// TODO: parallel
FORALL_EDGES_BEGIN((*G)) {
    node u = EDGE_SOURCE;
    node v = EDGE_DEST;
    cluster c = zeta[u];
    cluster d = zeta[v];
    if (c == d) {
        // TODO: critical section
        intraEdgeWeight[c] += G->weight(u, v);
    } // else ignore edge
} FORALL_EDGES_END();
I would like to parallelize it with OpenMP. I think that the code in the if statement is a critical section that might lead to a race condition and wrong results if threads are interrupted in the middle of it (correct?).
If the += operation could be performed atomically, I believe the problem would be solved (correct?). However, I've looked up the atomic directive, and the description I found states that:
Unfortunately, the atomicity setting can only be specified to simple
expressions that usually can be compiled into a single processor
opcode, such as increments, decrements, xors and the such. For
example, it cannot include function calls, array indexing, overloaded
operators, non-POD types, or multiple statements.
What should I use to parallelize this loop correctly?

Actually the accepted syntax for atomic update is:
x++;
x--;
++x;
--x;
x binop= expr;
x = x binop expr;
where x is a scalar l-value expression and expr is any expression, including function calls, with the only restriction being that it must be of scalar type. The compiler would take care to store the intermediate result in a temporary variable.
Sometimes it's better to consult the standards documents instead of reading tutorials on the Internet. See Example A.22.1c in the OpenMP 3.1 standard:
float work1(int i)
{
    return 1.0 * i;
}
...
#pragma omp atomic update
x[index[i]] += work1(i);

I also think your if block is a critical section; you should not write to a vector from multiple threads without serializing the writes. You can use #pragma omp critical to restrict the execution of the += in your if block to a single thread at a time.
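A minimal sketch of that, keeping the rest of your edge loop unchanged:
if (c == d) {
    #pragma omp critical
    {
        intraEdgeWeight[c] += G->weight(u, v);
    }
}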

You can split the expression into two parts: the function call, whose result goes into a temporary, and the addition of that temporary to the accumulator. In that case, the second expression is simple enough for omp atomic, assuming that the addition itself is not a complex overloaded operator for a custom type. Of course you can only do that if G->weight(u, v) is a thread-safe call; otherwise you have to use omp critical or a mutex.
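A sketch of that split, assuming intraEdgeWeight[c] is a plain scalar (e.g. a double) and that G->weight(u, v) is safe to call concurrently:
if (c == d) {
    const double w = G->weight(u, v); // function call kept outside the atomic
    #pragma omp atomic update
    intraEdgeWeight[c] += w;          // simple scalar update, fine for omp atomic
}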

Related

Increasing array index in OpenMP

I am new to using OpenMP. I am trying to parallelize a nested loop, and so far I have something of this form...
#pragma omp parallel for
for (j = 0; j < m; j++) {
    some work;
    for (i = 0; i < n; i++) {
        p = b[i];
        if (p < 0 && k < m) {
            a[k] = c[i];
            k++;
        } else {
            x = c[i];
        }
    }
    some work
}
The outer loop is in parallel, and the inner loop updates k. The current value of k is needed for the other threads to update a[k] correctly. The problem is that all of the threads are updating a[k], but the proper order of k is not kept.
Some threads will update k and a[k], and some will not. How do I communicate the latest k between threads to update a[k] properly, since c[i] will have different values for each thread?
For example, when it runs serially, the program might set the first seven values of a to {1,3,5,7,3,9,13} and terminate with k equal to 7, but when done parallel, produces different results, or results in a different (therefore wrong) order.
How do I keep the same order and ensure parallelism at the same time?
Note: this answer was completely rewritten in light of OP clarifications.
> How do I keep the same order and ensure parallelism at the same time?
Order dependency is antithetical to parallelism, as running operations in parallel inherently entails relaxing the relative order in which they are performed. Not all computations can be effectively parallelized.
Your case is not an exception. The second and each subsequent iteration of your outer loop needs to use the final value of k (among other things) computed by the previous iteration. How can it get that? Only by performing the previous iteration first. What room does that leave for concurrent operation? None. Concurrency is not the same thing as parallelism, but it is one of the main motivations for parallelism, because that's how parallelism yields improvements in elapsed time.
With no scope for concurrency, parallelism is actively counterproductive for you. Suppose you made the whole body of the outer loop a critical section, so that there was no concurrency in fact (as your present code requires) and no data races involving k. Then you would still pay the overhead for parallelism, get no speedup in return, and probably still get the wrong results because of evaluations of the outer-loop body being performed in the wrong order.
It may be that the whole thing can be rewritten to reduce or remove the data dependencies that prevent effective parallelization of the computation, or it may not. We haven't enough information to determine, as it depends in part on the details of "some work" and on the significance of the data. Probably you would need an altogether different algorithm for producing the desired results.
> Instead of giving a[n] = {0,1,2,3,...,n}, it gives me garbage values for a when I use the reduction clause. I need the total sum of k, hence the reduction clause.
There is a closed-form equation for the sum of consecutive integers, and it has especially simple form when the first integer in the list is 0 or 1. In particular, the sum of the integers from 0 to n, inclusive, is n * (n + 1) / 2. You do not need a reduction for this.
If you wanted to use a reduction anyway, then you need to understand that it doesn't work the way you seem to think it does. What you get is a separate, private copy of the reduction variable for each thread executing the parallel construct, with the per-thread (not per-iteration) final values of those independent variables combined according to the reduction operator. Thus, if you really want to do the computation via an OpenMP reduction, then you would need to restructure the loop into something like this:
#pragma omp parallel for reduction (+:k)
for (i = 0; i < 10; i++) {
    a[i] = i;
    k += i;
}
That assumes the value of k is 0 immediately prior to the loop, as indeed seems to be the case in your code. If that were not a safe assumption then you would need something like
type_of_k k0 = k;
k = 0;
#pragma omp parallel for reduction (+:k)
for (i = 0; i < 10; i++) {
    a[k0 + i] = i;
    k += k0 + i;
}
Note that in either case, not only does that set up the reduction correctly, but it also breaks the data dependency between loop iterations that was previously carried by the expression k++.
It sounds like you're essentially filling in a with a filter of entries from c, and want to preserve their order. If this is the only use k has, some other methods spring to mind:
Always write a[i], but use a marker value for the entries where the p < 0 predicate wasn't satisfied. This preserves order, but requires a full-size a that you compact in a second pass.
Write an a_i array storing which index each entry belonged to. This still requires an atomic capture of k (#pragma omp atomic capture around k_local = k++), plus a second pass to sort and restore the order. And you'd need both a and a_i to be full size again, or you might miss entries, so all in all a terrible workaround.
Even with some sequential dependencies you can do optimizations, e.g. a scan to calculate what k would be for each i can run in O(log n) parallel steps rather than O(n) sequential ones (see parallel prefix sum and the related OpenMP discussion on Stack Overflow). This sort of thing is what OpenMP's ordered construct with depend clauses is for, I believe. Anyhow, this leads to the third solution:
Generate a k array holding the value k will have for each iteration, so that the threads that do write go to the correct places. This requires scanning the predicate (a sketch follows below).
It is useful to have higher level constructs like map, scan and reduce when planning out algorithms.
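A hedged sketch of that third approach, using the question's names (b, c, a and the p < 0 predicate), with element types assumed to be double and a serial prefix sum for clarity (the scan itself could be parallelized as discussed above):
#include <vector>
// Two-pass "parallel filter": reproduces the serial ordering of a[]
// without ever sharing k between threads.
void parallel_filter(const std::vector<double> &b,
                     const std::vector<double> &c,
                     std::vector<double> &a)
{
    const int n = static_cast<int>(b.size());
    std::vector<int> keep(n), pos(n);
    // Pass 1: evaluate the predicate independently for every i.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        keep[i] = (b[i] < 0) ? 1 : 0;
    // Exclusive prefix sum gives each kept element its output index
    // (this is the per-iteration "k array").
    int k = 0;
    for (int i = 0; i < n; ++i) {
        pos[i] = k;
        k += keep[i];
    }
    // Pass 2: scatter the kept elements; the order matches the serial loop.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        if (keep[i])
            a[pos[i]] = c[i];
}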

Compiler optimization eliminates effects of false sharing. How?

I'm trying to replicate the effects of false sharing using OpenMP as explained in the OpenMP introduction by Tim Mattson.
My program performs a straightforward numerical integration (see the link for the mathematical details), and I've implemented two versions. The first is supposed to be cache-friendly: each thread keeps a local variable in which it accumulates its portion of the index space,
const auto num_slices = 100000000;
const auto num_threads = 4; // Swept from 1 to 9 threads
const auto slice_thickness = 1.0 / num_slices;
const auto slices_per_thread = num_slices / num_threads;
std::vector<double> partial_sums(num_threads);
#pragma omp parallel num_threads(num_threads)
{
    double local_buffer = 0;
    const auto thread_num = omp_get_thread_num();
    for (auto slice = slices_per_thread * thread_num; slice < slices_per_thread * (thread_num + 1); ++slice)
        local_buffer += func(slice * slice_thickness); // <-- Updates thread-exclusive buffer
    partial_sums[thread_num] = local_buffer;
}
// Sum up partial_sums to receive final result
// ...
while the second version has each thread update an element in a shared std::vector<double>, causing each write to invalidate that cache line in the other cores' caches
// ... as above
#pragma omp parallel num_threads(num_threads)
{
    const auto thread_num = omp_get_thread_num();
    for (auto slice = slices_per_thread * thread_num; slice < slices_per_thread * (thread_num + 1); ++slice)
        partial_sums[thread_num] += func(slice * slice_thickness); // <-- Invalidates caches
}
// Sum up partial_sums to receive final result
// ...
The problem is that I am unable to see any effects of false sharing whatsoever unless I turn off optimization.
Compiling my code (which has to account for a few more details than the snippets above) using GCC 8.1 without optimization (-O0) yields the results I naively expected while using full optimization (-O3) eliminates any difference in terms of performance between the two versions, as shown in the plot.
What's the explanation for this? Does the compiler actually eliminate false sharing? If not, how come the effect is so small when running the optimized code?
I'm on a Core-i7 machine using Fedora. The plot displays mean values whose sample standard deviations don't add any information to this question.
tl;dr: The compiler optimizes your second version into the first.
Consider the code within the loop of your second implementation - ignoring the OMP/multithreaded aspect of it for a moment.
You have increments of a value within an std::vector - whose storage is necessarily located on the heap (well, up until and including C++17 anyway). The compiler sees you're adding to a value on the heap within a loop; this is a typical candidate for optimization: it takes the heap access out of the loop and uses a register as a buffer. It doesn't even need to keep reading from the heap inside the loop, since these are just additions - so it essentially arrives at your first version.
See this happening on GodBolt (with a simplified example) - notice how the code for bar1() and bar2() is almost the same, with accumulation happening in registers.
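In loop-body terms, the rewrite the optimizer effectively performs looks roughly like this (a sketch, not actual compiler output; first and last stand in for the per-thread slice bounds):
// The heap-backed vector element is read once, accumulated in a
// register-resident local, and written back once after the loop --
// which is exactly the cache-friendly first version.
double acc = partial_sums[thread_num];
for (auto slice = first; slice < last; ++slice)
    acc += func(slice * slice_thickness);
partial_sums[thread_num] = acc;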
Now, the fact that there's multi-threading and OMP involved doesn't change the above. If you were to use, say, std::atomic<double> instead of double, then it might have changed (and maybe not even then, if the compiler is smart enough).
Notes:
Thanks go to @Evg for noticing a glaring mistake in the code of a previous version of this answer.
The compiler must be able to know that func() won't also change the value of your vector - or to decide that, for the purposes of addition, it shouldn't really matter.
This optimization could be seen as a Strength Reduction - from an operation on the heap to that on a register - but I'm not sure that term is in use for this case.
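Building on the std::atomic<double> remark above, one way to test whether this hoisting is what hides the effect is to make the element type atomic, so that the per-iteration store is much harder for the optimizer to elide. A sketch under that assumption (note that operator+= on std::atomic<double> only exists since C++20, hence the explicit load/store):
#include <atomic>
#include <omp.h>
#include <vector>
// Same shape as the second version, but with atomic elements so the
// optimizer keeps the store inside the loop. Whether false sharing then
// becomes measurable still depends on the hardware and on how the
// elements are laid out across cache lines.
std::vector<std::atomic<double>> atomic_sums(num_threads);
#pragma omp parallel num_threads(num_threads)
{
    const auto thread_num = omp_get_thread_num();
    for (auto slice = slices_per_thread * thread_num;
         slice < slices_per_thread * (thread_num + 1); ++slice)
    {
        // Relaxed ordering is enough: each element is written by one thread only.
        const double old = atomic_sums[thread_num].load(std::memory_order_relaxed);
        atomic_sums[thread_num].store(old + func(slice * slice_thickness),
                                      std::memory_order_relaxed);
    }
}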

Is it possible to multiply initialize using OpenMP in C++?

I have a for loop which accesses many memory pointers in each iteration. For each of these memory pointers, I created an index. My problem is that when I try to use OpenMP to parallelize this loop, I get the following error:
error: expected iteration declaration or initialization
I thought that the cause of this error would be one of the following:
- OpenMP does not accept an increment other than ++ or --
- OpenMP does not accept multiple initializations in a loop
For performance reasons, it is important to me to use these multiple indices. Does anybody know the answer to my problem?
Here is the code:
#pragma omp parallel default(shared)
{
    int tID = omp_get_thread_num();
    int i, iCF, iPF, iNF, iPJG, iCJG, iNJG, iPRJG, iCRJG;
    #pragma omp for nowait
    for (i = 0, iCF = 0, iPF = 0, iNF = sqrBcksDim, iPJG = 0, iCJG = 0, iNJG = sqrBcksSize, iPRJG = 0, iCRJG = 0;
         iCF < RHSArraySize;
         iPF = iCF, iCF = iNF, iNF += sqrBcksDim, iPJG = iCJG, iCJG = iNJG, iNJG += sqrBcksSize,
             iPRJG = iCRJG, iCRJG += rectBcksSize, ++i)
    {
    }
}
Well, looking at that third clause, you’re doing a lot of inherently sequential computations that depend on the program state at the end of the previous iteration of the loop. You could move all of those operations except the += and ++ updates into the body of the loop, and from the look of things possibly make the loop condition depend on iNF, correct? But some of them look like they might still be order-dependent. For a parallel algorithm, are there closed-form initializers you could use inside the loop body that depend only on i or something loop-invariant?
If not, and the inputs to each iteration really do depend on the results of previous iterations of the loop, then it’s not a parallel algorithm.
One suggestion:
Here’s how I would try to fix this. You can only initialize i and increment it by a constant in the loop clauses; however, you can equivalently move all the rest of those operations into the loop body. For example, I don’t know what else goes on inside the loop body, but if iCF is initialized to 0 and iNF to sqrBcksDim, and at the end of each iteration iCF is set to the previous value of iNF while iNF is incremented by sqrBcksDim, it looks like you could rewrite the loop into something like:
int i;
#pragma omp for nowait
for ( i = 0; i < RHSArraySize/sqrBcksDim; ++i )
{
    const int iCF = i*sqrBcksDim;
    const int iNF = iCF + sqrBcksDim;
    // ...
}
Can you do that for your other variables? If you really have a parallel algorithm here, you should be able to, because each run of the loop should only depend on i and loop invariants, which you can use in your initializers. You’ll need to declare a variable outside the loop if you’re going to refer to it outside the loop body, but for the time being, just declare new local variables and don’t read any variable outside the loop that you also write to inside it. If there are no implicit sequential dependencies, you should be able to initialize them all at the start of the loop body.
You might not end up doing it that way, but it might help you think about how to refactor.
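If the remaining indices really do advance by a fixed stride each iteration, as the increment expression suggests (sqrBcksSize for the *JG indices, rectBcksSize for the *RJG ones), they admit the same closed forms. A sketch under that assumption, also assuming RHSArraySize is a multiple of sqrBcksDim:
#pragma omp for nowait
for (int i = 0; i < RHSArraySize / sqrBcksDim; ++i)
{
    // Closed-form equivalents of the original running indices.
    const int iCF   = i * sqrBcksDim;
    const int iNF   = iCF + sqrBcksDim;
    const int iPF   = (i > 0) ? iCF - sqrBcksDim : 0;
    const int iCJG  = i * sqrBcksSize;
    const int iNJG  = iCJG + sqrBcksSize;
    const int iPJG  = (i > 0) ? iCJG - sqrBcksSize : 0;
    const int iCRJG = i * rectBcksSize;
    const int iPRJG = (i > 0) ? iCRJG - rectBcksSize : 0;
    // ... loop body ...
}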

OpenMP Performance impact: private directive vs. declaring variable inside for construct

Performance wise, which of the following is more efficient?
Assigning in the master thread and copying the value to all threads:
int i = 0;
#pragma omp parallel for firstprivate(i)
for ( ; i < n; i++) {
    ...
}
Declaring and assigning the variable in each thread
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    ...
}
Declaring the variable in the master thread but assigning it in each thread.
int i;
#pragma omp parallel for private(i)
for (i = 0; i < n; i++) {
    ...
}
It may seem a silly question and/or the performance impact may be negligible. But I'm parallelizing a loop that does a small amount of computation and is called a large number of times, so any optimization I can squeeze out of this loop is helpful.
I'm looking for a more low level explanation and how OpenMP handles this.
For example, when parallelizing across a large number of threads, I assume the second implementation would be more efficient, since initializing a variable using xor is far more efficient than copying the variable to all the threads.
There is not much of a difference in terms of performance among the three versions you presented, since each one of them uses #pragma omp parallel for. Hence, OpenMP will automatically assign the for iterations to different threads: the variable i becomes private to each thread, and each thread works on a different range of iterations. The variable i is made private automatically in order to avoid race conditions when updating it. Since i will be private in the parallel for anyway, there is no need to add private(i) to the #pragma omp parallel for.
Nevertheless, your first version will produce an error, since OpenMP expects the loop right underneath #pragma omp parallel for to have the following format:
for (init-expr; test-expr; incr-expr)
in order to precompute the range of work.
The for directive places restrictions on the structure of all associated for-loops. Specifically, all associated for-loops must have the following canonical form:
for (init-expr; test-expr; incr-expr) structured-block
(OpenMP Application Program Interface, pp. 39-40)
Edit: I tested your last two versions and inspected the generated assembly. Both versions produce the same assembly, as you can see for version 2 and version 3.

OpenMP parallel thread

I need to parallelize this loop. I thought that using OpenMP was a good idea, but I have never studied it before.
#pragma omp parallel for
for (std::set<size_t>::const_iterator it = mesh->NEList[vid].begin();
     it != mesh->NEList[vid].end(); ++it) {
    worst_q = std::min(worst_q, mesh->element_quality(*it));
}
In this case the loop is not parallelized because it uses iterators and the compiler cannot work out how to split it.
Can you help me?
OpenMP requires that the controlling predicate in parallel for loops has one of the following relational operators: <, <=, > or >=. Only random access iterators provide these operators and hence OpenMP parallel loops work only with containers that provide random access iterators. std::set provides only bidirectional iterators. You may overcome that limitation using explicit tasks. Reduction can be performed by first partially reducing over private to each thread variables followed by a global reduction over the partial values.
double *t_worst_q;
// Cache line size on x86/x64, in number of t_worst_q[] elements
const int cb = 64 / sizeof(*t_worst_q);
#pragma omp parallel
{
    #pragma omp single
    {
        t_worst_q = new double[omp_get_num_threads() * cb];
        for (int i = 0; i < omp_get_num_threads(); i++)
            t_worst_q[i * cb] = worst_q;
    }
    // Perform partial min reduction using tasks
    #pragma omp single
    {
        for (std::set<size_t>::const_iterator it = mesh->NEList[vid].begin();
             it != mesh->NEList[vid].end(); ++it) {
            size_t elem = *it;
            #pragma omp task
            {
                int tid = omp_get_thread_num();
                t_worst_q[tid * cb] = std::min(t_worst_q[tid * cb],
                                               mesh->element_quality(elem));
            }
        }
    }
    // Perform global reduction
    #pragma omp critical
    {
        int tid = omp_get_thread_num();
        worst_q = std::min(worst_q, t_worst_q[tid * cb]);
    }
}
delete [] t_worst_q;
(I assume that mesh->element_quality() returns double)
Some key points:
The loop is executed serially by one thread only, but each iteration creates a new task. These are most likely queued for execution by the idle threads.
Idle threads waiting at the implicit barrier of the single construct begin consuming tasks as soon as they are created.
The value pointed to by it is dereferenced before the task body. If it were dereferenced inside the task body, the iterator would be firstprivate and a copy of it would be created for each task (i.e. on each iteration). This is not what you want.
Each thread performs partial reduction in its private part of the t_worst_q[].
In order to prevent performance degradation due to false sharing, the elements of t_worst_q[] that each thread accesses are spaced out so as to end up in separate cache lines. On x86/x64 a cache line is 64 bytes, therefore the thread number is multiplied by cb = 64 / sizeof(double).
The global min reduction is performed inside a critical construct to protect worst_q from being accessed by several threads at once. This is for illustrative purposes only since the reduction could also be performed by a loop in the main thread after the parallel region.
Note that explicit tasks require compiler which supports OpenMP 3.0 or 3.1. This rules out all versions of Microsoft C/C++ Compiler (it only supports OpenMP 2.0).
Random-Access Container
The simplest solution is to just throw everything into a random-access container (like std::vector) and use the index-based loops that are favoured by OpenMP:
// Copy elements
std::vector<size_t> neListVector(mesh->NEList[vid].begin(), mesh->NEList[vid].end());
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
    worst_q = std::min(worst_q, mesh->element_quality(neListVector[i]));
}
Apart from being incredibly simple, in your situation (tiny elements of type size_t that can easily be copied) this is also the solution with the best performance and scalability.
Avoiding copies
However, in a different situation than yours you may have elements that aren't copied as easily (larger elements) or cannot be copied at all. In this case you can just throw the corresponding pointers in a random-access container:
// Collect pointers
std::vector<const nonCopiableObjectType *> neListVector;
for (const auto &entry : mesh->NEList[vid]) {
    neListVector.push_back(&entry);
}
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
    worst_q = std::min(worst_q, mesh->element_quality(*neListVector[i]));
}
This is slightly more complex than the first solution, still has the same good performance on small elements and increased performance on larger elements.
Tasks and Dynamic Scheduling
Since someone else brought up OpenMP tasks in his answer, I want to comment on that too. Tasks are a very powerful construct, but they have a huge overhead (that even increases with the number of threads) and in this case just make things more complex.
For the min reduction the use of Tasks is never justified because the creation of a Task in the main thread costs much more than just doing the std::min itself!
For the more complex operation mesh->element_quality you might think that the dynamic nature of tasks can help you with load-balancing problems, in case the execution time of mesh->element_quality varies greatly between iterations and you don't have enough iterations to even it out. But even in that case, there is a simpler solution: simply use dynamic scheduling by adding the schedule(dynamic) clause to the parallel for line in one of my previous solutions. It achieves the same behaviour with far less overhead.
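For concreteness, a sketch of that change applied to the first (copy-into-a-vector) solution above:
// Same index-based reduction loop, with dynamic scheduling to absorb
// per-iteration variation in the cost of mesh->element_quality().
#pragma omp parallel for schedule(dynamic) reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
    worst_q = std::min(worst_q, mesh->element_quality(neListVector[i]));
}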