Deadlock on parallel loop - C++

I'm trying to parallelize the code below. It's easy to see that there is a dependency between the values of aux: they are computed after the inner loop, but they are needed inside that inner loop (note that on the first iteration, where j = 0, the code inside the inner loop is not executed). On the other hand, there is no dependency between the values of mu, because we only update mu[k], and the only values needed for other computations are in mu[j], for 0 <= j < k.
My approach is to keep each element of aux locked until it has been computed. As soon as a given value of aux is computed, the lock on that element is released and every thread can use it. However, with this code a deadlock occurs and I can't figure out why. Does anyone have any tips?
Thanks
for (j = 0; j < k; ++j)
    locks[j] = 0;
#pragma omp parallel for num_threads(N_THREADS) private(j, i)
for (j = 0; j < k; ++j)
{
    vals[j] = (long)0;
    for (i = 0; i < j; i++)
    {
        while (!locks[i]);            // spin until aux[i] has been computed
        vals[j] += mu[j][i] * aux[i];
    }
    aux[j] = (s[j] - vals[j]);
    locks[j] = 1;                     // release the lock on aux[j]
    mu[k][j] = aux[j] / c[j];
}

Does it also hang when not optimized?
In optimized code, gcc would not bother reading locks[i] more than once, so this:
for (i = 0; i < j; i++) {
    while (!locks[i]);
would be like writing:
for (i = 0; i < j; i++) {
    if (!locks[i]) for (;;) {}
Try adding a compiler barrier to force gcc to re-read locks[i]:
#define pause() do { asm volatile("pause" ::: "memory"); } while (0)
...
for (i = 0; i < j; i++) {
    while (!locks[i]) pause();
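If C11 atomics are available, another option is to make the flags atomic, which both forces the re-read and orders the stores to aux. A sketch (locks becomes an array of atomic_int, still initialised to 0):

#include <stdatomic.h>

/* inside the parallel loop */
for (i = 0; i < j; i++)
{
    /* acquire load: forces a real re-read and guarantees the
       store to aux[i] is visible once the flag reads as 1 */
    while (!atomic_load_explicit(&locks[i], memory_order_acquire))
        ;
    vals[j] += mu[j][i] * aux[i];
}
aux[j] = (s[j] - vals[j]);
/* release store: publishes aux[j] before the flag flips to 1 */
atomic_store_explicit(&locks[j], 1, memory_order_release);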
HTH

Related

Difference between the several ways to parallelize nested for loops in C, C++ using OpenMP

I've just started studying parallel programming with OpenMP, and there is a subtle point about nested loops. I wrote a simple matrix multiplication code and checked that the result is correct. But there are actually several ways to parallelize this for loop, which may differ in low-level detail, and I want to ask about them.
At first, I wrote the code below, which multiplies two matrices A and B and assigns the result to C.
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for(k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}
It works, but it takes a really long time. I found out that, because of the location of the parallel directive, the parallel region is constructed N^2 times; I noticed this from the huge increase in user time reported by the Linux time command.
Next, I tried the code below, which also worked.
#pragma omp parallel for private(i, j, k, sum)
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        sum = 0;
        for(k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}
The elapsed time decreased from 72.720s for sequential execution to 5.782s for parallel execution with the code above, which is a reasonable result since I ran it on 16 cores.
But the flow of the second code is not easy to picture in my mind. I know that if we privatize all loop variables, the program will consider that nested loop as one large loop of size N^3. This can easily be checked by executing the code below.
#pragma omp parallel for private(i, j, k)
for(i = 0; i < N; i++)
{
    for(j = 0; j < N; j++)
    {
        for(k = 0; k < N; k++)
        {
            printf("%d, %d, %d\n", i, j, k);
        }
    }
}
The printf was executed N^3 times. But in my second matrix multiplication code, sum is set right before the innermost loop and used right after it, which makes it hard for me to unfold the loop in my mind; the third code I wrote is easy to unfold.
To summarize, I want to know what really happens behind the scenes in my second matrix multiplication code, especially with respect to the value of sum. I would also be thankful for recommendations of tools to observe the flow of a multithreaded program written with OpenMP.
omp for by default applies only to the next loop. The inner loops are not affected at all. This means you can think about your second version like this:
// Example for two threads
with one thread execute
{
    // declare private variables "locally"
    int i, j, k, sum;
    for(i = 0; i < N / 2; i++) // loop range changed
    {
        for(j = 0; j < N; j++)
        {
            sum = 0;
            for(k = 0; k < N; k++)
            {
                sum += A[i][k]*B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
with the other thread execute
{
    // declare private variables "locally"
    int i, j, k, sum;
    for(i = N / 2; i < N; i++) // loop range changed
    {
        for(j = 0; j < N; j++)
        {
            sum = 0;
            for(k = 0; k < N; k++)
            {
                sum += A[i][k]*B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
You can simplify all reasoning about variables with OpenMP by declaring them as locally as possible. I.e. instead of the explicit declarations use:
#pragma omp parallel for
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)
    {
        int sum = 0;
        for(int k = 0; k < N; k++)
        {
            sum += A[i][k]*B[k][j];
        }
        C[i][j] = sum;
    }
}
This way you can see the private scope of each variable more easily.
In some cases it can be beneficial to apply parallelism to multiple loops.
This is done by using collapse, i.e.
#pragma omp parallel for collapse(2)
for(int i = 0; i < N; i++)
{
    for(int j = 0; j < N; j++)
You can imagine this works with a transformation like:
#pragma omp parallel for
for (int ij = 0; ij < N * N; ij++)
{
    int i = ij / N;
    int j = ij % N;
    // ... body as before ...
}
A collapse(3) would not work for this loop because of the sum = 0 in-between.
There is one more detail:
#pragma omp parallel for
is a shorthand for
#pragma omp parallel
#pragma omp for
The first creates the threads; the second shares the work of a loop among all threads reaching this point. This may not be important for understanding right now, but there are use cases for which it matters. For instance, you could write:
#pragma omp parallel
for(int i = 0; i < N; i++)
{
    #pragma omp for
    for(int j = 0; j < N; j++)
    {
        // every thread runs the full i loop; the j iterations are shared
    }
}
I hope this sheds some light on what happens there from a logical point of view.

How can I circle back to a starting point using a for loop?

I have a std::vector<std::unique_ptr<object>> myObjects_ptrs. I need to start at one of my objects and circle back around to where I started.
I am doing this as follows:
while(true)
{
    for(int i = 0; i < myObjects_ptrs.size(); ++i)
    {
        myObjects_ptrs[i]->doSomething();
        //and here I need to circle back
        for(int j = i + 1; j < myObjects_ptrs.size(); ++j)
        {
            //do some things with each other object
        }
        for(int j = 0; j < i; ++j)
        {
            //do the same things with the rest of the objects
        }
    }
}
Is this the standard way of doing that? My problem is that once I detect something, I don't need to keep going around. For example, if I find something during the first loop then there is no need to go through the second loop. I can solve this by adding an extra if before the second loop, but is there a better way?
You could use a modulus, i.e. the two inner loops would become:
int numObjects = myObjects_ptrs.size();
for (int j = i + 1; j < numObjects + i; ++j) // wraps around, stopping just before i
{
    // Get object
    auto& obj = myObjects_ptrs[j % numObjects];
}
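A side benefit of the single wrapped loop is that the early exit you mention needs only one break. A sketch (isWhatWeAreLookingFor is a made-up predicate standing in for your detection test):

int numObjects = myObjects_ptrs.size();
for (int j = i + 1; j < numObjects + i; ++j)
{
    auto& obj = myObjects_ptrs[j % numObjects];
    if (obj->isWhatWeAreLookingFor()) // hypothetical predicate
        break;                        // no second loop to skip
}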
You could replace the two inner loops with something like this:
for(int j = i + 1;; j++)
{
    j %= myObjects_ptrs.size();
    if (j == i)
    {
        break;
    }
    // Do stuff
}

Where should the parallel region start in OpenMP?

I'm trying to learn OpenMP, but the professor moved on to a different subject and I feel like I haven't learned (or understood) a whole lot.
After looking at some solved questions here on SO I wrote this bit of code:
Working code now looks like this:
void many_iterations()
{
    int it, i, j;
    for (it = 0; it < NUM_ITERATIONS; it++)
    {
        #pragma omp parallel
        {
            #pragma omp for private(j)
            for (i = 0; i < N; i++)
                for (j = 0; j < M; j++)
                {
                    if (i == j) B[i][j] = A[i][j] * 2;
                    else B[i][j] = A[i][j] * 3;
                }
        }
        int **aux = A;
        A = B; B = aux;
    }
}
I also wrote a serial version (without the #pragma omp bits) and noticed that this version does not actually work properly (the output A differs between the serial version and this one). I then managed to change the two inner for loops into this working bit (correct output as far as I can tell):
for (index = 0; index < N * M; index++)
{
    int i = index / M, j = index % M;
    // rest of code here
}
This one does work, but I ran into a problem: running on two threads it is only as fast as the serial version (with the 2 inner fors), and when I ran it with only one thread the execution time was a lot slower. Reading online, I understood that the parallel region should somehow start before the main for loop to reduce the overhead, but again, my output (A) is wrong.
So my issues are:
How do I set #pragma omp parallel before the first for without ruining the code?
Why is the serial version equal to the 2-thread version of the code with collapsed for loops?
How should I make the code actually more efficient when running on multiple threads?
As a side note, I tried running the serial version with collapsed for loops and I got it to run a lot slower (just like the "parallel" version with 1 thread).
Edit: Trying to use #pragma omp parallel before the it loop:
void many_iterations()
{
    int it, i, j;
    #pragma omp parallel
    {
        for (it = 0; it < NUM_ITERATIONS; it++)
        {
            #pragma omp for private(j)
            for (i = 0; i < N; i++)
                for (j = 0; j < M; j++)
                {
                    if (i == j) B[i][j] = A[i][j] * 2;
                    else B[i][j] = A[i][j] * 3;
                }
            #pragma omp single
            {
                int **aux = A;
                A = B; B = aux;
            }
        }
    }
}
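For reference, one problem with this edit is that it is shared among the threads, so they race on the iteration counter. A minimal sketch of a version that avoids this (assuming the same shared globals A, B, N, M, NUM_ITERATIONS) declares the counters inside the parallel region:

void many_iterations()
{
    #pragma omp parallel
    {
        // counters declared inside the region are private per thread
        for (int it = 0; it < NUM_ITERATIONS; it++)
        {
            #pragma omp for
            for (int i = 0; i < N; i++)
                for (int j = 0; j < M; j++)
                {
                    if (i == j) B[i][j] = A[i][j] * 2;
                    else B[i][j] = A[i][j] * 3;
                }
            // implicit barrier at the end of the omp for;
            // exactly one thread swaps the buffers, and the implicit
            // barrier after single keeps all threads in step
            #pragma omp single
            {
                int **aux = A;
                A = B; B = aux;
            }
        }
    }
}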

OpenMP: Nested for-loop, barely any difference in execution time

I am doing some image processing and have a nested for loop. I want to implement multithreading using OpenMP. The for loop looks like this, where I have added the pragma tags and declared some of the variables private as well.
int a,b,j, idx;
#pragma omp parallel for private(b,j,sumG,sumGI)
for(a = 0; a < ny; ++a)
{
    for(b = 0; b < nx; ++b)
    {
        idx = a*ny+b;
        if (imMask[idx] == 0)
        {
            Wshw[idx] = 0;
            continue;
        }
        sumG = 0;
        sumGI = 0;
        for(j = a; j < ny; ++j)
        {
            sumG += shadowM[j-a];
            sumGI += shadowM[j-a] * imBlurred[nx*j + b];
        }
        Wshw[idx] = sumGI / sumG;
    }
}
Both nx and ny are large, and I thought that using OpenMP would give a decent decrease in execution time, but instead there is almost no difference. Am I doing something wrong in how I implement the multithreading?
You have a race condition on idx. You need to make it private as well.
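That is, the minimal change would be:

#pragma omp parallel for private(b,j,idx,sumG,sumGI)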
However, instead you could try something like this.
int a,b,j, idx;
#pragma omp parallel for private(a,b,j,sumG,sumGI)
for(idx=0; idx<ny*nx; ++idx) {
    if (imMask[idx] == 0)
    {
        Wshw[idx] = 0;
        continue;
    }
    sumG = 0;
    sumGI = 0;
    a=idx/ny;
    b=idx%ny;
    for(j = a; j < ny; ++j) {
        sumG += shadowM[j-a];
        sumGI += shadowM[j-a] * imBlurred[nx*j + b];
    }
    Wshw[idx] = sumGI / sumG;
}
You might be able to simplify the inner loop as well, as a function of idx instead of a and b.

C++ loop unfolding, bounds

I have a loop that I want to unfold:
for(int i = 0; i < N; i++)
    do_stuff_for(i);
Unfolded:
for(int i = 0; i < N; i += CHUNK) {
    do_stuff_for(i + 0);
    do_stuff_for(i + 1);
    ...
    do_stuff_for(i + CHUNK-1);
}
But I should make sure that I do not run past the original N, as when N == 14 and CHUNK == 10. My question is: what is the best/fastest/standard/most elegant (you name it) way to do it?
One solution that comes is:
int i;
for(i = 0; i < (N % CHUNK); i++)
    do_stuff_for(i);
for(; i < N; i += CHUNK) {
    // unfolded, for the rest
}
But maybe there is a better practice.
You could use a switch-case.
It's called Duff's Device.
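For reference, a sketch of Duff's device applied to this loop (this assumes CHUNK == 8 and N > 0):

int i = 0;
int n = (N + 7) / 8;   // number of passes through the switch
switch (N % 8) {
case 0: do { do_stuff_for(i++);
case 7:      do_stuff_for(i++);
case 6:      do_stuff_for(i++);
case 5:      do_stuff_for(i++);
case 4:      do_stuff_for(i++);
case 3:      do_stuff_for(i++);
case 2:      do_stuff_for(i++);
case 1:      do_stuff_for(i++);
        } while (--n > 0);
}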