OpenMP - "#pragma omp critical" importance - c++

So I started using OpenMP (multithreading) to increase the speed of my matrix multiplication and I witnessed something weird: when I turn off OpenMP Support (in Visual Studio 2019), my nested for-loop completes 2x faster. So I removed "#pragma omp critical" to test whether it slows down the process significantly, and the process went 4x faster than before (with OpenMP Support on).
Here's my question: is "#pragma omp critical" important in a nested loop? Can't I just skip it?
#pragma omp parallel for collapse(3)
for (int i = 0; i < this->I; i++)
{
    for (int j = 0; j < A.J; j++)
    {
        m.matrix[i][j] = 0;
        for (int k = 0; k < A.I; k++)
        {
            #pragma omp critical
            m.matrix[i][j] += this->matrix[i][k] * A.matrix[k][j];
        }
    }
}

Here's my question: is "#pragma omp critical" important in a nested loop? Can't I just skip it?
If the matrices m, this, and A are different, you do not need any critical region. Instead, you need to ensure that each thread writes to a different position of the matrix m, as follows:
#pragma omp parallel for collapse(2)
for (int i = 0; i < this->I; i++)
{
    for (int j = 0; j < A.J; j++)
    {
        m.matrix[i][j] = 0;
        for (int k = 0; k < A.I; k++)
        {
            m.matrix[i][j] += this->matrix[i][k] * A.matrix[k][j];
        }
    }
}
The collapse clause distributes the (i, j) pairs among the threads, so no two threads ever write to the same position of the matrix m (i.e., there is no race condition).

#pragma omp critical is necessary here, as there is a (remote) chance that two threads could write to a particular m.matrix[i][j] value. It hurts performance because only one thread at a time can access that protected assignment statement.
This would likely be better without the collapse part (then you can remove the #pragma omp critical). Accumulate the sums to a temporary local variable, then store it in m.matrix[i][j] after the k loop finishes.
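A minimal sketch of that suggestion, assuming the matrix elements are double and that m, this, and A are distinct objects (the member names are taken from the question's code):

#pragma omp parallel for
for (int i = 0; i < this->I; i++)
{
    for (int j = 0; j < A.J; j++)
    {
        double sum = 0.0; // private accumulator, one per (i, j) element
        for (int k = 0; k < A.I; k++)
        {
            sum += this->matrix[i][k] * A.matrix[k][j];
        }
        m.matrix[i][j] = sum; // single write per element, by exactly one thread
    }
}

Each m.matrix[i][j] is now written exactly once, by exactly one thread, so neither critical nor atomic is required.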

Related

OpenMP For-Loops yield different Results with and without Multithreading

I am new to multithreading and I found the following issue while trying to parallelize some for-loops in which I manipulate 3D arrays.
When I run the code using only a single thread, I get the value of E_total I would expect. However, when I use the same code with multiple threads and OpenMP, setting #pragma omp parallel for in the following way,
// DO FIRST COMPUTATION STEP ON 3D ARRAY
#pragma omp parallel for
for (size_t ix = 0; ix < N; ix++) {
    for (size_t iy = 0; iy < N; iy++) {
        for (size_t iz = 0; iz < N; iz++) {
            A1[ix][iy][iz] = ...;
        }
    }
}

// DO SECOND COMPUTATION --AFTER-- FIRST COMPUTATION
#pragma omp parallel for
for (size_t ix = 0; ix < N; ix++) {
    for (size_t iy = 0; iy < N; iy++) {
        for (size_t iz = 0; iz < N; iz++) {
            A2[ix][iy][iz] = ...;
            E_pot += something * A1[ix][iy][iz];
            E_int += something * A2[ix][iy][iz];
        }
    }
}
E_total += (E_pot + E_int); // This result changes when 'omp parallel for' is used
I see that I get a different result for E_total.
Since the looped operations are either additive or grid-point specific (independent between different ijk), they should not depend on any ordering inside the loop.
Is it possible that the second for-loop is started before all of the previous first-loop-operations have been finished? If so, how could I prevent that or what other mistakes would I need to watch out for?
Sorry if this is a very basic question, but I could not find related problems online. Thanks in advance!
The problem with this code is that there is a race condition between the different threads: the E_pot and E_int variables are shared between the worker threads, and so the threads overwrite each other's values from time to time.
To fix this, please apply the reduction clause (see Reduction Clauses and Directives in the OpenMP API specification):
// DO SECOND COMPUTATION --AFTER-- FIRST COMPUTATION
#pragma omp parallel for reduction(+:E_pot) reduction(+:E_int)
for (size_t ix = 0; ix < N; ix++) {
    for (size_t iy = 0; iy < N; iy++) {
        for (size_t iz = 0; iz < N; iz++) {
            A2[ix][iy][iz] = ...;
            E_pot += something * A1[ix][iy][iz];
            E_int += something * A2[ix][iy][iz];
        }
    }
}
There are some more changes that you could look into and see if they help:
Depending on the value of N, it might be worth adding a collapse(2) clause (see Worksharing-Loop Construct) to the parallel for directive to merge the two outer loops into a single loop that then runs for N*N iterations. For small N, the threads can then be utilized better, as more iterations can be distributed across the worker threads.
If you add schedule(static) explicitly (it's the default in most OpenMP implementations when you don't specify anything, but it's technically not guaranteed), then you can add nowait to the first loop. With that, there is no implicit barrier at the end of the first parallel loop, and threads that have completed their chunk of work there can proceed to the second loop. The schedule(static) is needed because it gives the first and the second loop the same distribution of iterations across threads, which is what makes this trick work. Note: if you added collapse(2) to the first loop, then the second loop also needs collapse(2) so that the parallelization is the same.
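A sketch of how those two suggestions might be combined. The loop bodies are the question's placeholders, the exact clauses are an assumption rather than tested code, and the nowait is only safe if the second loop reads A1 solely at its own (ix, iy, iz) indices, as in the snippet shown:

#pragma omp parallel
{
    // first computation: collapse(2) + static schedule, with nowait so threads
    // can move on as soon as their own chunk of (ix, iy) pairs is done
    #pragma omp for collapse(2) schedule(static) nowait
    for (size_t ix = 0; ix < N; ix++) {
        for (size_t iy = 0; iy < N; iy++) {
            for (size_t iz = 0; iz < N; iz++) {
                A1[ix][iy][iz] = ...;
            }
        }
    }
    // second computation: same collapse depth and same static schedule, so each
    // thread revisits exactly the (ix, iy) pairs it already finished above
    #pragma omp for collapse(2) schedule(static) reduction(+:E_pot) reduction(+:E_int)
    for (size_t ix = 0; ix < N; ix++) {
        for (size_t iy = 0; iy < N; iy++) {
            for (size_t iz = 0; iz < N; iz++) {
                A2[ix][iy][iz] = ...;
                E_pot += something * A1[ix][iy][iz];
                E_int += something * A2[ix][iy][iz];
            }
        }
    }
}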

What are the differences between ways of writing OpenMP sections?

What (if any) differences are there between using:
#pragma omp parallel
{
    #pragma omp for simd
    for (int i = 0; i < 100; ++i)
    {
        c[i] = a[i] ^ b[i];
    }
}
and:
#pragma omp parallel for simd
for (int i = 0; i < 100; ++i)
{
    c[i] = a[i] ^ b[i];
}
Or does the compiler(ICC) care?
I know that the first one defines a parallel section and then a for loop whose iterations are divided up, and that you can do multiple things after the loop. Please do correct me if I'm wrong, I'm still learning the ways of OpenMP.
But when would you use one way or the other?
Simply put, if you only have one for-loop that you want to parallelise, use #pragma omp parallel for simd.
If you want to parallelise multiple for-loops or add any other parallel routines before or after the current for-loop, use:
#pragma omp parallel
{
    // Other parallel code
    #pragma omp for simd
    for (int i = 0; i < 100; ++i)
    {
        c[i] = a[i] ^ b[i];
    }
    // Other parallel code
}
This way you don't have to reopen the parallel section when adding more parallel routines, reducing overhead time.

How to optimally parallelize nested loops?

I'm writing a program that should run in both serial and parallel versions. Once I got it to actually do what it was supposed to do, I started trying to parallelize it with OpenMP (compulsory).
The thing is I can't find documentation or references on when to use which #pragma, so I am doing my best by guessing and testing. But testing is not going well with nested loops.
How would you parallelize a series of nested loops like these:
for(int i = 0; i < 3; ++i){
    for(int j = 0; j < HEIGHT; ++j){
        for(int k = 0; k < WIDTH; ++k){
            switch(i){
            case 0:
                matrix[j][k].a = matrix[j][k] * someValue1;
                break;
            case 1:
                matrix[j][k].b = matrix[j][k] * someValue2;
                break;
            case 2:
                matrix[j][k].c = matrix[j][k] * someValue3;
                break;
            }
        }
    }
}
HEIGHT and WIDTH are usually the same size in the tests I have to run. Some test examples are 32x32 and 4096x4096.
matrix is an array of custom structs with attributes a, b and c
someValue is a double
I know that OpenMP is not always good for nested loops but any help is welcome.
[UPDATE]:
So far I've tried unrolling the loops. It boosts performance, but am I adding unnecessary overhead here? Am I reusing threads? I tried getting the id of the threads used in each for-loop but didn't get it right.
#pragma omp parallel
{
    #pragma omp for collapse(2)
    for (int j = 0; j < HEIGHT; ++j) {
        for (int k = 0; k < WIDTH; ++k) {
            //my previous code here
        }
    }
    #pragma omp for collapse(2)
    for (int j = 0; j < HEIGHT; ++j) {
        for (int k = 0; k < WIDTH; ++k) {
            //my previous code here
        }
    }
    #pragma omp for collapse(2)
    for (int j = 0; j < HEIGHT; ++j) {
        for (int k = 0; k < WIDTH; ++k) {
            //my previous code here
        }
    }
}
[UPDATE 2]
Apart from unrolling the loop, I have tried parallelizing the outer loop (worse performance boost than unrolling) and collapsing the two inner loops (more or less the same performance boost as unrolling). These are the times I am getting:
Serial: ~130 milliseconds
Loop unrolling: ~49 ms
Collapsing two innermost loops: ~55 ms
Parallel outermost loop: ~83 ms
What do you think is the safest option? I mean, which should be generally the best for most systems, not only my computer?
The problem with OpenMP is that it's very high-level, meaning that you can't access low-level functionality such as spawning a thread and then reusing it. So let me make clear what you can and cannot do:
Assuming you don't need any mutex to protect against race conditions, here are your options:
You parallelize your outermost loop; that will use 3 threads, and it's the most painless solution you're going to get (a sketch follows below).
You parallelize the first inner loop; then you'll have a performance boost only if the overhead of spawning new threads each time is much smaller than the effort required to perform the innermost loop.
You parallelize the innermost loop, but this is the worst solution in the world, because you'll respawn the threads 3*HEIGHT times. Never do that!
You don't use OpenMP at all, and use something low-level, such as std::thread, where you can create your own thread pool and push all the operations you want to do into a queue.
Hope this helps to put things in perspective.
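For reference, a minimal sketch of option 1 from the list above, applied to the question's loops (nothing here beyond the original code plus one pragma):

#pragma omp parallel for
for(int i = 0; i < 3; ++i){
    for(int j = 0; j < HEIGHT; ++j){
        for(int k = 0; k < WIDTH; ++k){
            // switch on i, as in the original code
        }
    }
}

With only 3 outer iterations, at most 3 threads ever do any work, which is why this option is simple but limits scaling.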
Here's another option, one which recognises that distributing the iterations of the outermost loop when there are only 3 of them might lead to very poor load balancing,
i=0
#pragma omp parallel for
for(int j = 0; j < HEIGHT; ++j){
    for(int k = 0; k < WIDTH; ++k){
        ...
    }
}
i=1
#pragma omp parallel for
for(int j = 0; j < HEIGHT; ++j){
    for(int k = 0; k < WIDTH; ++k){
        ...
    }
}
i=2
#pragma omp parallel for
for(int j = 0; j < HEIGHT; ++j){
    for(int k = 0; k < WIDTH; ++k){
        ...
    }
}
Warning -- check the syntax yourself, this is no more than a sketch of manual loop unrolling.
Try combining this and collapsing the j and k loops.
Oh, and don't complain about code duplication, you've told us you're being scored partly on performance improvements.
You probably want to parallelize this example for simd, so the compiler can vectorize, and collapse the loops, because you use j and k only in the expression matrix[j][k] and there are no dependencies on any other element of the matrix. If nothing modifies someValue1, etc., they should be uniform. Time your loop to be sure those changes really do improve your speed.
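One possible reading of that advice, as a hedged sketch: i stays outermost so the switch is loop-invariant inside the collapsed j/k nest, and whether the compiler actually vectorizes the struct accesses is exactly what timing should tell you.

for (int i = 0; i < 3; ++i){
    #pragma omp parallel for simd collapse(2)
    for (int j = 0; j < HEIGHT; ++j){
        for (int k = 0; k < WIDTH; ++k){
            switch(i){ // i does not change inside the parallel loops
            case 0: matrix[j][k].a = matrix[j][k] * someValue1; break;
            case 1: matrix[j][k].b = matrix[j][k] * someValue2; break;
            case 2: matrix[j][k].c = matrix[j][k] * someValue3; break;
            }
        }
    }
}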

OpenMP/C++: Parallel for loop with reduction afterwards - best practice?

Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
    const std::set<int>& cluster = clusters[i];
    // ... expensive calculations ...
    for (int j : cluster)
        velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f does not use velocity.
A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.
I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j); operation. This gives me a compile error (it might have something to do with the elements being of type Eigen::Vector3d, or with velocity being a class member). Also, I read that atomic operations are very slow compared to having a private variable for each thread and doing a reduction at the end. So that's what I would like to do, I think.
I have come up with this:
#pragma omp parallel
{
    // these variables are local to each thread
    std::vector<Eigen::Vector3d> velocity_local(velocity.size());
    std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));

    #pragma omp for
    for (size_t i = 0; i < clusters.size(); ++i)
    {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster)
            velocity_local[j] += f(j); // save results from the previous calculations
    }

    // now each thread can save its results to the global variable
    #pragma omp critical
    {
        for (size_t i = 0; i < velocity_local.size(); ++i)
            velocity[i] += velocity_local[i];
    }
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for loop. Also the question whether this is the best approach still holds.
Edit: As requested in a comment, the reduction clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
    velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++)
You're doing an array reduction. I have described this several times (e.g. reducing an array in openmp and fill histograms array reduction in parallel with openmp without using a critical section). You can do this with and without a critical section.
You have already done this correctly with a critical section (in your recent edit) so let me describe how to do this without a critical section.
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();
    const int vsize = velocity.size();

    #pragma omp single
    velocitya.resize(vsize*nthreads);
    std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
              Eigen::Vector3d(0,0,0));

    #pragma omp for schedule(static)
    for (size_t i = 0; i < clusters.size(); i++) {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
    }

    #pragma omp for schedule(static)
    for(int i=0; i<vsize; i++) {
        for(int t=0; t<nthreads; t++) {
            velocity[i] += velocitya[vsize*t + i];
        }
    }
}
This method requires extra care/tuning due to false sharing, which I have not done here.
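A hedged sketch of one way to attack that false sharing: pad each thread's slab by roughly a cache line so that neighbouring threads' slabs never end and begin inside the same line. The 64-byte line size and the exact padding are assumptions to tune, not part of the answer above.

// inside the parallel region, replacing the plain vsize stride
const int pad    = (64 + (int)sizeof(Eigen::Vector3d) - 1) / (int)sizeof(Eigen::Vector3d);
const int stride = vsize + pad; // at least one full cache line of padding between slabs

#pragma omp single
velocitya.resize((size_t)stride * nthreads);
// ... then index with velocitya[ithread*stride + j] in the accumulation loop
// and velocitya[stride*t + i] in the final reduction loop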
As to which method is better you will have to test.

OpenMP nested for, unequal num. of iterations

I am using OpenMP to parallelize loops. In the normal case, one would use:
#pragma omp for schedule(static, N_CHUNK)
for(int i = 0; i < N; i++) {
    // ...
}
For nested loops, I can put the pragma on the inner or outer loop:
#pragma omp for schedule(static, N_CHUNK) // can be here...
for(int i = 0; i < N; i++) {
    #pragma omp for schedule(static, N_CHUNK) // or here...
    for(int k = 0; k < N; k++) {
        // both loops have a constant number of iterations
        // ...
    }
}
But! I have two loops, where the number of iterations in the 2nd loop depends on the 1st loop:
for(int i = 0; i < N; i++) {
    for(int k = i; k < N; k++) {
        // k starts from i, not from 0...
    }
}
What is the best way to balance CPU usage for this kind of loop?
As always:
it depends
profile.
In this case: see also OMP_NESTED environment variable
The things that are going to make the difference here are not being shown:
(non)linear memory addressing (also watch the order of the loops);
use of shared variables.
As to your last scenario:
for(int i = 0; i < N; i++) {
    for(int k = i; k < N; k++) {
        // k starts from i, not from 0...
    }
}
I suggest parallelizing the outer loop for the following reasons:
all other things being equal, coarse-grained parallelization usually leads to better performance due to
increased cache locality
reduced frequency of locking required
(note that this hinges on assumptions about the loop contents that I can't really make; I'm basing it on my experience of /usual/ parallelized code)
the inner loop might become so short as to be inefficient to parallelize (IOW: the outer loop's range is predictable, the inner loop less so, or doesn't lend itself to static scheduling as well)
nested parallelism rarely scales well
sehe's points -- especially "it depends" and "profile" -- are very much to the point.
Normally, though, you wouldn't want to have nested parallel loops as long as the outer loop is big enough to keep all cores busy. The added overhead of another parallel section inside a loop probably costs more than the benefit from the additional small pieces of work.
The usual way to tackle this is just to schedule the outer loop dynamically, so that the fact that each loop iteration takes a different length of time doesn't cause load-balancing issues (the i==N-1 iteration completes almost immediately, while the i==0 iteration takes forever):
#pragma omp parallel for default(none) shared(N) schedule(dynamic)
for(int i = 0; i < N; i++) {
    for(int k = i; k < N; k++) {
        // k starts from i, not from 0...
    }
}
The collapse pragma is very useful for essentially getting rid of the nesting, and it is particularly valuable if the outer loop is small (e.g., N < num_threads):
#pragma omp parallel for default(none) shared(N) collapse(2)
for(int i = 0; i < N; i++) {
    for(int k = 0; k < N; k++) {
    }
}
This way the two loops are folded into one, and there is less chunking, which means less overhead. But that won't work in this case, because the loop ranges aren't fixed; you can't collapse a loop whose bounds change (e.g., with i).
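If the triangular shape really is the bottleneck, one alternative (not from the answer above, just a common workaround) is to flatten the (i, k) triangle by hand into a single loop of N*(N+1)/2 iterations and recover i and k from the flat index; a sketch:

// needs <cmath> for std::sqrt
const long long total = (long long)N * (N + 1) / 2;
#pragma omp parallel for schedule(static)
for (long long t = 0; t < total; t++) {
    // row i starts at flat offset S(i) = i*N - i*(i-1)/2; invert that with the
    // quadratic formula, then nudge i to correct for floating-point rounding
    long long i = (long long)((2.0 * N + 1.0
                   - std::sqrt((2.0 * N + 1.0) * (2.0 * N + 1.0) - 8.0 * (double)t)) / 2.0);
    while (i > 0 && i * N - i * (i - 1) / 2 > t) i--;
    while ((i + 1) * N - (i + 1) * i / 2 <= t)   i++;
    long long k = i + (t - (i * N - i * (i - 1) / 2));
    // ... loop body using i and k, with k running from i to N-1 as before ...
}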