openMP nested parallel for loops vs inner parallel for - c++

If I use nested parallel for loops like this:
#pragma omp parallel for schedule(dynamic,1)
for (int x = 0; x < x_max; ++x) {
#pragma omp parallel for schedule(dynamic,1)
for (int y = 0; y < y_max; ++y) {
//parallelize this code here
}
//IMPORTANT: no code in here
}
is this equivalent to:
for (int x = 0; x < x_max; ++x) {
#pragma omp parallel for schedule(dynamic,1)
for (int y = 0; y < y_max; ++y) {
//parallelize this code here
}
//IMPORTANT: no code in here
}
Is the outer parallel for doing anything other than creating a new task?

If your compiler supports OpenMP 3.0, you can use the collapse clause:
#pragma omp parallel for schedule(dynamic,1) collapse(2)
for (int x = 0; x < x_max; ++x) {
for (int y = 0; y < y_max; ++y) {
//parallelize this code here
}
//IMPORTANT: no code in here
}
If it doesn't (e.g. only OpenMP 2.5 is supported), there is a simple workaround:
#pragma omp parallel for schedule(dynamic,1)
for (int xy = 0; xy < x_max*y_max; ++xy) {
int x = xy / y_max;
int y = xy % y_max;
//parallelize this code here
}
You can enable nested parallelism with omp_set_nested(1); and your nested omp parallel for code will work but that might not be the best idea.
By the way, why the dynamic scheduling? Is every loop iteration evaluated in non-constant time?

NO.
The first #pragma omp parallel will create a team of parallel threads and the second will then try to create for each of the original threads another team, i.e. a team of teams. However, on almost all existing implementations the second team has just only one thread: the second parallel region is essentially not used. Thus, your code is more like equivalent to
#pragma omp parallel for schedule(dynamic,1)
for (int x = 0; x < x_max; ++x) {
// only one x per thread
for (int y = 0; y < y_max; ++y) {
// code here: each thread loops all y
}
}
If you don't want that, but only parallelise the inner loop, you can do this:
#pragma omp parallel
for (int x = 0; x < x_max; ++x) {
// each thread loops over all x
#pragma omp for schedule(dynamic,1)
for (int y = 0; y < y_max; ++y) {
// code here, only one y per thread
}
}

Related

openmp two consecutive loops, problem with reduction clause

There is two consecutive loops and there is a reduction clause in the second loop.
#pragma opm parallel
{
#pragma omp for
for (size_t i = 0; i < N; ++i)
{
}
#pragma omp barrier
#pragma omp for reduction(+ \
: sumj)
for (size_t i = 0; i < N; ++i)
{
sumj = 0.0;
for (size_t j = 0; j < adjList[i].size(); ++j)
{
sumj += 0;
}
Jac[i, i] = sumj;
}
}
to reduce the creating threads overhead I wand to keep the threads and use them in the second loop, but I get the following error
lib.cpp:131:17: error: reduction variable ‘sumj’ is private in outer context
#pragma omp for reduction(+ \
^~~
how to fix that?
I'm not sure what you are trying to do, but it seems that something like this would do what you expect:
#pragma omp parallel
{
#pragma omp for
for (size_t i = 0; i < N; ++i)
{
}
#pragma omp barrier
#pragma omp for
for (size_t i = 0; i < N; ++i)
{
double sumj = 0.0;
for (size_t j = 0; j < adjList[i].size(); ++j)
{
sumj += 0;
}
Jac[i, i] = sumj;
}
}
Reduce would be useful in the case of an "omp for" in the interior loop.

Efficient, parallel tensor contraction with vectorized data

The time-determining step of my code is a tensor contraction of the following form
#pragma omp parallel for schedule(dynamic)
for(int i = 0; i < no; ++i){
for(int j = 0; j < no; ++j){
X.middleCols(i*nv,nv) += Y.middleCols(j*nv,nv) * this->getIJMatrix(j,i);
}
}
where X and Y are large matrices of dimension (nx,no*nv) and the function getIJMatrix(j,i) returns the (nv*nv) matrix for index pair ij of a rank-four tensor. Also, no < nv << nx. The parallelization here is straigthforward. However, I can exploit symmetry with respect to i and j
#pragma omp parallel for schedule(dynamic)
for(int i = 0; i < no; ++i){
for(int j = i; j < no; ++j){
auto ij = this->getIJMatrix(j,i);
X.middleCols(i*nv,nv) += Y.middleCols(j*nv,nv) * ij;
if(i!=j) X.middleCols(j*nv,nv) += Y.middleCols(i*nv,nv) * ij.transpose();
}
}
leaving me with a race condition. Since X is large, using a reduction here is not feasible.
If I understand it correctly, there is no way around each thread waiting for the other ones within the inner loop. What's a good practice for this which preferably is as fast as possible?
edit: corrected obvious errors

Difference between the several ways to parallelize nested for loops in C, C++ using OpenMP

I've just started studying parallel programming with OpenMP, and there is a subtle point in the nested loop. I wrote a simple matrix multiplication code, and checked the result that is correct. But actually there are several ways to parallelize this for loop, which may be different in terms of low-level detail, and I wanna ask about it.
At first, I wrote code below, which multiply two matrix A, B and assign the result to C.
for(i = 0; i < N; i++)
{
for(j = 0; j < N; j++)
{
sum = 0;
#pragma omp parallel for reduction(+:sum)
for(k = 0; k < N; k++)
{
sum += A[i][k]*B[k][j];
}
C[i][j] = sum;
}
}
It works, but it takes really long time. And I find out that because of the location of parallel directive, it will construct the parallel region N2 time. I found it by huge increase in user time when I used linux time command.
Next time, I tried code below which also worked.
#pragma omp parallel for private(i, j, k, sum)
for(i = 0; i < N; i++)
{
for(j = 0; j < N; j++)
{
sum = 0;
for(k = 0; k < N; k++)
{
sum += A[i][k]*B[k][j];
}
C[i][j] = sum;
}
}
And the elapsed time is decreased from 72.720s in sequential execution to 5.782s in parallel execution with the code above. And it is the reasonable result because I executed it with 16 cores.
But the flow of the second code is not easily drawn in my mind. I know that if we privatize all loop variables, the program will consider that nested loop as one large loop with size N3. It can be easily checked by executing the code below.
#pragma omp parallel for private(i, j, k)
for(i = 0; i < N; i++)
{
for(j = 0; j < N; j++)
{
for(k = 0; k < N; k++)
{
printf("%d, %d, %d\n", i, j, k);
}
}
}
The printf was executed N3 times.
But in my second matrix multiplication code, there is sum right before and after the innermost loop. And It bothers me to unfold the loop in my mind easily. The third code I wrote is easily unfolded in my mind.
To summarize, I want to know what really happens behind the scene in my second matrix multiplication code, especially with the change of the value of sum. Or I'll really thank you for some recommendation of tools to observe the flow of multithreads program written with OpenMP.
omp for by default only applies to the next direct loop. The inner loops are not affected at all. This means, your can think about your second version like this:
// Example for two threads
with one thread execute
{
// declare private variables "locally"
int i, j, k;
for(i = 0; i < N / 2; i++) // loop range changed
{
for(j = 0; j < N; j++)
{
sum = 0;
for(k = 0; k < N; k++)
{
sum += A[i][k]*B[k][j];
}
C[i][j] = sum;
}
}
}
with the other thread execute
{
// declare private variables "locally"
int i, j, k;
for(i = N / 2; i < N; i++) // loop range changed
{
for(j = 0; j < N; j++)
{
sum = 0;
for(k = 0; k < N; k++)
{
sum += A[i][k]*B[k][j];
}
C[i][j] = sum;
}
}
}
You can simply all reasoning about variables with OpenMP by declaring them as locally as possible. I.e. instead of the explicit declaration use:
#pragma omp parallel for
for(int i = 0; i < N; i++)
{
for(int j = 0; j < N; j++)
{
int sum = 0;
for(int k = 0; k < N; k++)
{
sum += A[i][k]*B[k][j];
}
C[i][j] = sum;
}
}
This way you the private scope of variable more easily.
In some cases it can be beneficial to apply parallelism to multiple loops.
This is done by using collapse, i.e.
#pragma omp parallel for collapse(2)
for(int i = 0; i < N; i++)
{
for(int j = 0; j < N; j++)
You can imagine this works with a transformation like:
#pragma omp parallel for
for (int ij = 0; ij < N * N; ij++)
{
int i = ij / N;
int j = ij % N;
A collapse(3) would not work for this loop because of the sum = 0 in-between.
Now is one more detail:
#pragma omp parallel for
is a shorthand for
#pragma omp parallel
#pragma omp for
The first creates the threads - the second shares the work of a loop among all threads reaching this point. This may not be of importance for the understanding now, but there are use-cases for which it matters. For instance you could write:
#pragma omp parallel
for(int i = 0; i < N; i++)
{
#pragma omp for
for(int j = 0; j < N; j++)
{
I hope this sheds some light on what happens there from a logical point of view.

openmp shared or nothing. private vs uninitialized

Is there a difference between these two implementations of openmp?
float dot_prod (float* a, float* b, int N)
{
float sum = 0.0;
#pragma omp parallel for shared(sum)
for (int i = 0; i < N; i++) {
#pragma omp critical
sum += a[i] * b[i];
}
return sum;
}
and the same code but line 4 doesn't have the shared(sum) because sum is already initialized?
#pragma omp parallel for
for(int = 0; ....)
Same question for private in openmp:
Is
void work(float* c, int N)
{
float x, y; int i;
#pragma omp parallel for private(x,y)
for (i = 0; i < N; i++)
{
x = a[i]; y = b[i];
c[i] = x + y;
}
}
the same as without the private(x,y) because x and y aren't initialized?
#pragma omp parallel for
Is there a difference between these two implementations of openmp?
float dot_prod (float* a, float* b, int N)
{
float sum = 0.0;
# pragma omp parallel for shared(sum)
for (int i = 0; i < N; i++) {
#pragma omp critical
sum += a[i] * b[i];
}
return sum;
}
In openMP a variable declared outside the parallel scope is shared, unless it is explicitly rendered private.
Hence the shared declaration can be omitted.
But your code is far from being optimal. It works, but will be by far slower than its sequential counterpart, because critical will force sequential processing and creating a critical section has an important temporal cost.
The proper implementation would use a reduction.
float dot_prod (float* a, float* b, int N)
{
float sum = 0.0;
# pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += a[i] * b[i];
}
return sum;
}
The reduction creates a hidden local variable to accumulate in parallel in every thread and before thread destruction performs an atomic addition of these local sums on the shared variable sum.
Same question for private in openmp:
void work(float* c, int N)
{
float x, y; int i;
# pragma omp parallel for private(x,y)
for (i = 0; i < N; i++)
{
x = a[i]; y = b[i];
c[i] = x + y;
}
}
By default, x and y are shared. So without private the behaviour will be different (and buggy because all threads will modify the same globally accessible vars x and y without an atomic access).
the same as without the private(x,y) because x and y aren't initialized?
Initialization of x and y does not matter, what is important is where they are declared. To insure proper behavior, they must be rendered private and the code will be correct as xand y are set before been used in the loop.

Where should the parallel region start in OpenMP?

I'm trying to learn OpenMP, but the professor moved on to a different subject and I feel like I haven't learned a whole lot (or understood).
After looking at some solved questions here on SO I wrote this bit of code:
Working code now looks like this:
void many_iterations()
{
int it, i, j;
for (it = 0; it < NUM_ITERATIONS; it++)
{
#pragma omp parallel
{
#pragma omp for private(j)
for (i = 0; i < N; i++)
for (j = 0; j < M; j++)
{
if (i == j) B[i][j] = A[i][j] * 2;
else B[i][j] = A[i][j] * 3;
}
}
int **aux = A;
A = B; B = aux;
}
}
I also wrote a serial version (without the #pragma omp bits) and noticed that this version does not actually properly work (outputing A is different between the serial and this version). I then managed to change the two inner for loops to this working bit (correct output as far as I can tell):
for (index = 0; index < N * M; index++)
{
int i = index / M, j = index % M;
// rest of code here
This one does work, but I ran into a problem: running on two threads, it is just as fast as the serial version (with 2 inner fors) and when I tried running this with only one thread the execution time was a lot slower.
Reading online I understood that the parallel section should somehow start before the main for so that it reduces the overhead, but again, my output (A) is wrong.
So my issues are:
How do I set #pragma omp parallel before the first for without ruining the code?
Why is the serial version equal to the 2-thread version of the code with collapsed for loops?
How should I make the code actually more efficient when running on multiple threads?
As a side note, I tried running the serial version with collapsed for loops and I got it to run a lot slower (just like the "parallel" version with 1 thread).
Edit: Trying to use #pragma omp parallel before the it loop:
void many_iterations()
{
int it, i, j;
#pragma omp parallel
{
for (it = 0; it < NUM_ITERATIONS; it++)
{
#pragma omp for private(j)
for (i = 0; i < N; i++)
for (j = 0; j < M; j++)
{
if (i == j) B[i][j] = A[i][j] * 2;
else B[i][j] = A[i][j] * 3;
}
#pragma omp single
{
int **aux = A;
A = B; B = aux;
}
}
}
}