OpenMP/C++: Parallel for loop with reduction afterwards - best practice?

Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster)
velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f does not use velocity.
A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.
I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j); operation. This gives me a compile error (it might have something to do with the elements being of type Eigen::Vector3d, or with velocity being a class member). Also, I have read that atomic operations are very slow compared to having a private variable for each thread and doing a reduction at the end. So that's what I would like to do, I think.
I have come up with this:
#pragma omp parallel
{
// these variables are local to each thread
std::vector<Eigen::Vector3d> velocity_local(velocity.size());
std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));
#pragma omp for
for (size_t i = 0; i < clusters.size(); ++i)
{
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster)
velocity_local[j] += f(j); // save results from the previous calculations
}
// now each thread can save its results to the global variable
#pragma omp critical
{
for (size_t i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
}
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for loop. Also the question whether this is the best approach still holds.
Edit: As requested in a comment, the reduction clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++)

You're doing an array reduction. I have described this several times (e.g. "reducing an array in openmp" and "fill histograms array reduction in parallel with openmp without using a critical section"). You can do this with and without a critical section.
You have already done this correctly with a critical section (in your recent edit) so let me describe how to do this without a critical section.
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
const int nthreads = omp_get_num_threads();
const int ithread = omp_get_thread_num();
const int vsize = velocity.size();
#pragma omp single
velocitya.resize(vsize*nthreads);
std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
Eigen::Vector3d(0,0,0));
#pragma omp for schedule(static)
for (size_t i = 0; i < clusters.size(); i++) {
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
}
#pragma omp for schedule(static)
for(int i=0; i<vsize; i++) {
for(int t=0; t<nthreads; t++) {
velocity[i] += velocitya[vsize*t + i];
}
}
}
This method requires extra care/tuning to avoid false sharing, which I have not done here.
As to which method is better, you will have to test.
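For testing one variant against the other, the critical-section approach from the question can be written as a self-contained function. This is only a sketch: Vec3 is a hypothetical stand-in for Eigen::Vector3d, f is a made-up function that does not read velocity, and the loop index is signed so that OpenMP 2.0 compilers such as MSVC accept it.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for Eigen::Vector3d so the sketch has no dependencies.
struct Vec3 {
    double x, y, z;
    Vec3& operator+=(const Vec3& o) { x += o.x; y += o.y; z += o.z; return *this; }
};

// Hypothetical f: any function that does not read velocity.
inline Vec3 f(int j) { return Vec3{static_cast<double>(j), 1.0, 0.0}; }

std::vector<Vec3> accumulate(const std::vector<std::set<int>>& clusters, std::size_t n) {
    std::vector<Vec3> velocity(n, Vec3{0.0, 0.0, 0.0});
    #pragma omp parallel
    {
        // One private, zero-initialized buffer per thread.
        std::vector<Vec3> local(n, Vec3{0.0, 0.0, 0.0});
        #pragma omp for
        for (long i = 0; i < static_cast<long>(clusters.size()); ++i) {
            for (int j : clusters[i]) {
                assert(j >= 0 && static_cast<std::size_t>(j) < n);
                local[j] += f(j);
            }
        }
        // Each thread merges its buffer into the shared result, one thread at a time.
        #pragma omp critical
        for (std::size_t k = 0; k < n; ++k) velocity[k] += local[k];
    }
    return velocity;
}
```

Without OpenMP enabled the pragmas are ignored and the function runs serially with the same result, which makes it easy to use as a reference when timing the two variants.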

Related

atomic inside a single construct

In an OpenMP program, suppose I have a series of tasks, each of which should be done by a single thread. Each task is different, so I cannot fit them into a #pragma omp for construct. Inside the single construct, each task updates a variable shared by all tasks. How can I protect the update of such a variable?
A simplified example:
#include <vector>
struct A {
std::vector<double> x, y, z;
};
int main()
{
A r;
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i);
// DANGER
r.x = std::move(res);
}
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i * i);
// DANGER
r.y = std::move(res);
}
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i * i + 2);
// DANGER
r.z = std::move(res);
}
#pragma omp barrier
return 0;
}
The code lines below // DANGER are problematic because they modify the memory contents of a shared variable.
In the example above, it might still work without issues, because I am effectively modifying different members of r. Still, the question is: how can I make sure that tasks do not simultaneously update r? Is there a "sort-of" atomic pragma for the single construct?
There is no data race in your original code, because x, y, and z are different vectors in struct A (as already emphasized by @463035818_is_not_a_number), so in this respect you do not have to change anything in your code.
However, a #pragma omp parallel directive is missing in your code, so at the moment it is a serial program. So, it should look like this:
#pragma omp parallel num_threads(3)
{
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i);
// DANGER
r.x = std::move(res);
}
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i * i);
// DANGER
r.y = std::move(res);
}
#pragma omp single nowait
{
std::vector<double> res;
for (int i = 0; i < 10; ++i)
res.push_back(i * i + 2);
// DANGER
r.z = std::move(res);
}
}
In this case #pragma omp barrier is not necessary as there is an implied barrier at the end of parallel region. Note that I have used num_threads(3) clause to make sure that only 3 threads are assigned to this parallel region. If you skip this clause then all other threads just wait at the barrier.
In the case of an actual data race (i.e. more than one single region/section changes the same struct member), you can use #pragma omp critical (name) to rectify this. But keep in mind that this kind of serialization can negate the benefits of multithreading when there is not enough real parallel work beside the critical section.
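As a rough sketch of that case, assume (hypothetically) that two tasks append their results to the same vector. The function name fill_shared and the critical-section name shared_update are made up for illustration. The expensive part of each task still runs in parallel; only the short append is serialized.

```cpp
#include <cassert>
#include <vector>

// Hypothetical scenario: two tasks append their results to one shared vector.
std::vector<double> fill_shared() {
    std::vector<double> shared;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            std::vector<double> res;
            for (int i = 0; i < 10; ++i) res.push_back(i);
            // Same name in both sections: the two appends exclude each other.
            #pragma omp critical (shared_update)
            shared.insert(shared.end(), res.begin(), res.end());
        }
        #pragma omp section
        {
            std::vector<double> res;
            for (int i = 0; i < 10; ++i) res.push_back(i * i);
            #pragma omp critical (shared_update)
            shared.insert(shared.end(), res.begin(), res.end());
        }
    }
    assert(shared.size() == 20);
    return shared;
}
```

The order in which the two blocks of results end up in the vector is not fixed, so code that consumes it should not rely on that order.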
Note that a much better solution is to use #pragma omp sections (as suggested by @PaulG). If the number of tasks to run in parallel is known at compile time, sections are the typical choice in OpenMP:
#pragma omp parallel sections
{
#pragma omp section
{
//Task 1 here
}
#pragma omp section
{
//Task 2
}
#pragma omp section
{
// Task 3
}
}
For the record, I would like to show that it is easy to do it by #pragma omp for as well:
#pragma omp parallel for
for(int i=0;i<3;i++)
{
if (i==0)
{
// Task 1
} else if (i==1)
{
// Task 2
}
else if (i==2)
{
// Task 3
}
}
Regarding "each task updates a variable shared by all tasks":
Actually they don't. Consider you rewrite the code like this (you don't need the temporary vectors):
void foo( std::vector<double>& x, std::vector<double>& y, std::vector<double>& z) {
#pragma omp single nowait
{
for (int i = 0; i < 10; ++i)
x.push_back(i);
}
#pragma omp single nowait
{
for (int i = 0; i < 10; ++i)
y.push_back(i * i);
}
#pragma omp single nowait
{
for (int i = 0; i < 10; ++i)
z.push_back(i * i + 2);
}
#pragma omp barrier
}
As long as the caller can ensure that x, y and z do not refer to the same object, there is no data race. Each part of the code modifies a separate vector. No synchronization is needed.
Now, it does not matter where those vectors come from. You can call the function like this:
A r;
foo(r.x, r.y, r.z);
PS: I am not familiar with omp anymore, I assumed the annotations correctly do what you want them to do.

OpenMP - "#pragma omp critical" importance

So I started using OpenMP (multithreading) to speed up my matrix multiplication and noticed something odd: when I turn off OpenMP support (in Visual Studio 2019), my nested for loop completes 2x faster. So I removed "#pragma omp critical" to test whether it slows down the process significantly, and the process ran 4x faster than before (with OpenMP support on).
Here's my question: is "#pragma omp critical" important in a nested loop? Can't I just skip it?
#pragma omp parallel for collapse(3)
for (int i = 0; i < this->I; i++)
{
for (int j = 0; j < A.J; j++)
{
m.matrix[i][j] = 0;
for (int k = 0; k < A.I; k++)
{
#pragma omp critical
m.matrix[i][j] += this->matrix[i][k] * A.matrix[k][j];
}
}
}
Regarding the question of whether "#pragma omp critical" is important in a nested loop and whether it can be skipped:
If the matrices m, *this and A are different, you do not need any critical region. Instead, you need to ensure that each thread writes to a different position of the matrix m, as follows:
#pragma omp parallel for collapse(2)
for (int i = 0; i < this->I; i++)
{
for (int j = 0; j < A.J; j++)
{
m.matrix[i][j] = 0;
for (int k = 0; k < A.I; k++)
{
m.matrix[i][j] += this->matrix[i][k] * A.matrix[k][j];
}
}
}
The collapse(2) clause assigns each thread a different set of (i, j) pairs; therefore no two threads write to the same position of the matrix m, and there is no race condition.
In the original code, #pragma omp critical is necessary, as there is a chance that two threads could write to a particular m.matrix[i][j] value (collapse(3) can split the k iterations of one (i, j) pair across threads). It hurts performance because only one thread at a time can execute that protected assignment statement.
This would likely be better without the collapse part (then you can remove the #pragma omp critical). Accumulate the sum in a temporary local variable, then store it in m.matrix[i][j] after the k loop finishes.
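A minimal sketch combining both suggestions: the i and j loops are collapsed, and each (i, j) pair accumulates into a local variable. The Matrix alias and the free function multiply are assumptions for illustration, not the asker's class.

```cpp
#include <cassert>
#include <vector>

// Hypothetical matrix type; the original code uses a class with a matrix member.
using Matrix = std::vector<std::vector<double>>;

Matrix multiply(const Matrix& a, const Matrix& b) {
    const int rows = static_cast<int>(a.size());
    const int inner = static_cast<int>(b.size());
    const int cols = inner > 0 ? static_cast<int>(b[0].size()) : 0;
    assert(rows == 0 || static_cast<int>(a[0].size()) == inner);

    Matrix m(rows, std::vector<double>(cols, 0.0));
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            // sum is private to the thread that owns this (i, j) pair,
            // so no critical section or atomic is needed.
            double sum = 0.0;
            for (int k = 0; k < inner; k++) sum += a[i][k] * b[k][j];
            m[i][j] = sum;
        }
    }
    return m;
}
```

Keeping the running sum in a local variable also avoids repeatedly writing to m[i][j] inside the k loop. Note that collapse needs OpenMP 3.0 or later, so with MSVC's OpenMP 2.0 support the clause would have to be dropped.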

What are the differences between ways of writing OpenMP sections?

What (if any) differences are there between using:
#pragma omp parallel
{
#pragma omp for simd
for (int i = 0; i < 100; ++i)
{
c[i] = a[i] ^ b[i];
}
}
and:
#pragma omp parallel for simd
for (int i = 0; i < 100; ++i)
{
c[i] = a[i] ^ b[i];
}
Or does the compiler (ICC) care?
I know that the first one defines a parallel region and then a for loop to be divided up, and that you can do multiple things after the loop. Please do correct me if I'm wrong; I'm still learning the ways of OpenMP.
But when would you use one way or the other?
Simply put, if you only have 1 for-loop that you want to parallelise use #pragma omp parallel for simd.
If you want to parallelise multiple for-loops or add any other parallel routines before or after the current for-loop, use:
#pragma omp parallel
{
// Other parallel code
#pragma omp for simd
for (int i = 0; i < 100; ++i)
{
c[i] = a[i] ^ b[i];
}
// Other parallel code
}
This way you don't have to reopen the parallel region when adding more parallel routines, which reduces overhead.
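As an illustration, here is a sketch with two work-shared loops inside one parallel region. The function name xor_then_sum and the second loop are made up for this example. The implicit barrier at the end of the first loop guarantees that c is complete before the second loop reads it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: two loops share a single parallel region.
void xor_then_sum(const std::vector<int>& a, const std::vector<int>& b,
                  std::vector<int>& c, std::vector<int>& d) {
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size());
    c.assign(n, 0);
    d.assign(n, 0);
    #pragma omp parallel
    {
        #pragma omp for simd
        for (int i = 0; i < n; ++i) c[i] = a[i] ^ b[i];
        // Implicit barrier here: every element of c has been written.
        #pragma omp for simd
        for (int i = 0; i < n; ++i) d[i] = c[i] + c[n - 1 - i];
    }
}
```

The thread team is created once and reused for both loops. With two separate #pragma omp parallel for simd directives the result would be the same, but the region would be entered twice.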

replacement for #pragma omp critical (C++)

I am using OpenMP on a nested loop which works like this:
#pragma omp parallel shared(vector1) private(i,j)
{
#pragma omp for schedule(dynamic)
for (i = 0; i < vector1.size(); ++i){
//some code here
for (j = 0; j < vector1.size(); ++j){
//some other code goes here
#pragma omp critical
A+=B;
}
C +=A;
}
}
The problem here is that my code does a lot of its computation in the A+=B part. Therefore, by making it critical, I am not achieving the speedup I would like. (In fact, there appears to be some overhead, since my program takes longer to execute than the sequential version.)
I tried using
#pragma omp reduction private(B) reduction(+:A)
A+=B
This speeds up the execution; however, it seems that it does not take care of race conditions the way the critical directive does, since I am not getting the same results for A.
Is there an alternative to this I can try?
Unless you want to go through the trouble of making your Vector3 class thread-safe or rewriting your operations for use with an std::atomic<Vector3>, both of which would still suffer from performance drawbacks (although not as serious as using a critical section), you can actually mimic the behaviour of OpenMP reduction:
#pragma omp parallel private(j) // j is declared outside the region, so it must be made private; the omp for loop variable i is private automatically
{
Vector3 A{}, LocalC{}; // both thread-private
#pragma omp for schedule(dynamic)
for (i = 0; i < vector1.size(); ++i){
//some code here
for (j = 0; j < vector1.size(); ++j){
//some other code goes here
A += B; // A is thread-private, so no critical section is needed
}
LocalC += A; // LocalC is thread-private, so no critical section is needed
}
#pragma omp critical
C += LocalC;
}
NB that this assumes that you don't access A for reading within your "some code" comments, but you shouldn't anyway if you ever thought of using a reduction clause.
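If the compiler supports OpenMP 4.0 or later (MSVC's OpenMP 2.0 support does not), a user-defined reduction is another option worth trying. The sketch below assumes a hypothetical Vector3 with operator+= and uses a simplified single loop rather than the nested loop from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical Vector3; the real class only needs operator+= and a zero value.
struct Vector3 {
    double x, y, z;
    Vector3& operator+=(const Vector3& o) { x += o.x; y += o.y; z += o.z; return *this; }
};

// Tell OpenMP how to combine two partial results and how to
// initialize each thread's private copy (requires OpenMP >= 4.0).
#pragma omp declare reduction(+ : Vector3 : omp_out += omp_in) \
    initializer(omp_priv = Vector3{0.0, 0.0, 0.0})

Vector3 sum_all(const std::vector<Vector3>& values) {
    Vector3 total{0.0, 0.0, 0.0};
    const long n = static_cast<long>(values.size());
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) total += values[i];
    return total;
}
```

This does essentially what the hand-written version does (a private accumulator per thread, combined at the end), but leaves the merge step to the OpenMP runtime. With floating-point data the result can differ slightly from the serial sum, because the order of additions changes.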

Influence on the static scheduling overhead in OpenMP

I thought about which factors would influence the static scheduling overhead in OpenMP.
In my opinion it is influenced by:
CPU performance
specific implementation of the OpenMP run-time library
the number of threads
But am I missing further factors? Maybe the size of the tasks, ...?
And furthermore: Is the overhead linearly dependent on the number of iterations?
In this case I would expect that having static scheduling and 4 cores, the overhead increases linearly with 4*i iterations. Correct so far?
EDIT:
I am only interested in the static (!) scheduling overhead itself. I am not talking about thread start-up overhead and time spent in synchronisation and critical section overhead.
You need to separate the overhead for OpenMP to create a team/pool of threads from the overhead for each thread to work through its own set of iterations in a for loop.
Static scheduling is easy to implement by hand (which is sometimes very useful). Let's look at what I consider the two most important static schedules, schedule(static) and schedule(static,1), and then compare them to schedule(dynamic,chunk).
#pragma omp parallel for schedule(static)
for(int i=0; i<N; i++) foo(i);
is equivalent to (but not necessarily equal to)
#pragma omp parallel
{
int start = omp_get_thread_num()*N/omp_get_num_threads();
int finish = (omp_get_thread_num()+1)*N/omp_get_num_threads();
for(int i=start; i<finish; i++) foo(i);
}
and
#pragma omp parallel for schedule(static,1)
for(int i=0; i<N; i++) foo(i);
is equivalent to
#pragma omp parallel
{
int ithread = omp_get_thread_num();
int nthreads = omp_get_num_threads();
for(int i=ithread; i<N; i+=nthreads) foo(i);
}
From this you can see that it's quite trivial to implement static scheduling and so the overhead is negligible.
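To make the arithmetic concrete, the two hand-written schedules can be expressed as plain functions that need no OpenMP at all. The names static_range and cyclic_indices are made up for this sketch. An actual OpenMP runtime may choose different chunk boundaries for schedule(static), which is why the hand-written loop is equivalent but not necessarily equal.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Half-open range [start, finish) of the hand-written schedule(static) loop.
std::pair<int, int> static_range(int ithread, int nthreads, int N) {
    assert(nthreads > 0 && ithread >= 0 && ithread < nthreads);
    return {ithread * N / nthreads, (ithread + 1) * N / nthreads};
}

// Indices visited by one thread in the hand-written schedule(static,1) loop.
std::vector<int> cyclic_indices(int ithread, int nthreads, int N) {
    assert(nthreads > 0 && ithread >= 0 && ithread < nthreads);
    std::vector<int> out;
    for (int i = ithread; i < N; i += nthreads) out.push_back(i);
    return out;
}
```

For N = 10 and 4 threads, static_range gives [0,2), [2,5), [5,7) and [7,10), so every iteration is assigned exactly once. Each thread computes its bounds with two multiplications and two divisions, independent of N and without any communication with the other threads.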
On the other hand if you want to implement schedule(dynamic) (which is the same as schedule(dynamic,1)) by hand it's more complicated:
int cnt = 0;
#pragma omp parallel
for(int i=0;;) {
#pragma omp atomic capture
i = cnt++;
if(i>=N) break;
foo(i);
}
This requires OpenMP >=3.1. If you wanted to do this with OpenMP 2.0 (for MSVC) you would need to use critical like this
int cnt = 0;
#pragma omp parallel
for(int i=0;;) {
#pragma omp critical
i = cnt++;
if(i>=N) break;
foo(i);
}
Here is an equivalent to schedule(dynamic,chunk) (I have not optimized this using atomic accesses):
int cnt = 0;
int chunk = 5;
#pragma omp parallel
{
int start, finish;
do {
#pragma omp critical
{
start = cnt;
finish = cnt+chunk < N ? cnt+chunk : N;
cnt += chunk;
}
for(int i=start; i<finish; i++) foo(i);
} while(finish<N);
}
Clearly, using atomic accesses is going to cause more overhead than static scheduling. This also shows why using larger chunks for schedule(dynamic,chunk) can reduce the overhead: each chunk costs one synchronized update of the shared counter.
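For completeness, the chunked version can also be written with atomic capture instead of a critical section (this needs OpenMP 3.1 or later). This is a sketch under assumptions: visit_counts is a made-up name, and incrementing visits[i] stands in for foo(i) so the result can be checked.

```cpp
#include <cassert>
#include <vector>

// Counts how often each index is processed by a hand-written
// schedule(dynamic,chunk); every entry should end up as 1.
std::vector<int> visit_counts(int N, int chunk) {
    assert(N >= 0 && chunk > 0);
    std::vector<int> visits(N, 0);
    int cnt = 0;  // shared counter: first index of the next unclaimed chunk
    #pragma omp parallel
    {
        for (;;) {
            int start;
            // Atomically claim the next chunk: read cnt and advance it.
            #pragma omp atomic capture
            { start = cnt; cnt += chunk; }
            if (start >= N) break;
            const int finish = start + chunk < N ? start + chunk : N;
            for (int i = start; i < finish; i++) visits[i] += 1;  // stands in for foo(i)
        }
    }
    return visits;
}
```

Each thread performs one atomic update per chunk, so for a fixed N the number of synchronized operations is roughly N/chunk plus one final failed claim per thread.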