What (if any) differences are there between using:
#pragma omp parallel
{
#pragma omp for simd
for (int i = 0; i < 100; ++i)
{
c[i] = a[i] ^ b[i];
}
}
and:
#pragma omp parallel for simd
for (int i = 0; i < 100; ++i)
{
c[i] = a[i] ^ b[i];
}
Or does the compiler(ICC) care?
I know that the first one defines a parallel section and than a for loop to be divided up and you can multiple things after the loop. Please do correct me if I'm wrong, still learning the ways of openmp..
But when would you use one way or the other?
Simply put, if you only have 1 for-loop that you want to parallelise use #pragma omp parallel for simd.
If you want to parallelise multiple for-loops or add any other parallel routines before or after the current for-loop, use:
#pragma omp parallel
{
// Other parallel code
#pragma omp for simd
for (int i = 0; i < 100; ++i)
{
c[i] = a[i] ^ b[i];
}
// Other parallel code
}
This way you don't have to reopen the parallel section when adding more parallel routines, reducing overhead time.
Related
So I started using OpenMP (multithreading) to increase the speed of my matrix multiplication and I witnessed weird things: when I turn off OpenMP Support (in Visual Studio 2019) my nested for-loop completes 2x faster. So I removed "#pragma omp critical" to test if it slows down the proccess significantly and the proccess went 4x faster than before (with OpenMP Support On).
Here's my question: is "#pragma omp critical" important in nested loop? Can't I just skip it?
#pragma omp parallel for collapse(3)
for (int i = 0; i < this->I; i++)
{
for (int j = 0; j < A.J; j++)
{
m.matrix[i][j] = 0;
for (int k = 0; k < A.I; k++)
{
#pragma omp critical
m.matrix[i][j] += this->matrix[i][k] * A.matrix[k][j];
}
}
}
Here's my question: is "#pragma omp critical" important in nested
loop? Can't I just skip it?
If the matrices m, this and A are different you do not need any critical region. Instead, you need to ensure that each thread will write to a different position of the matrix m as follows:
#pragma omp parallel for collapse(2)
for (int i = 0; i < this->I; i++)
{
for (int j = 0; j < A.J; j++)
{
m.matrix[i][j] = 0;
for (int k = 0; k < A.I; k++)
{
m.matrix[i][j] += this->matrix[i][k] * A.matrix[k][j];
}
}
}
The collapse clause will assign to each thread a different pair (i, j) therefore there will not be multiple threads writing to the same position of the matrix m (i.e., race-condition).
#pragma omp critical is necessary here, as there is a (remote) chance that two threads could write to a particular m.matrix[i][j] value. It hurts performance because only one thread at a time can access that protected assignment statement.
This would likely be better without the collapse part (then you can remove the #pragma omp critical). Accumulate the sums to a temporary local variable, then store it in m.matrix[i][j] after the k loop finishes.
Is this the correct way to parallel two for loops by using the "#pragma omp single nowait" and "#pragma omp for for two different loops"? Or is there any other way to do it?
#pragma omp single nowait
{
for (i = ; i < N; i += )
{
D[i] = x * A[i] + x * B[i];
}
#pragma omp for
for (i = 0; i < N; i++)
C[i] = c * D[i];
}
} // end omp parallel
You should note that you can actually merge the two for loops into one...since for each i, you compute D[i], and later on, the computation for C[i] is just dependent on D[i].
you should combine the loops that way, and then just use omp for as you did for your second loop.
I'm totally new to openmp and learning how to parallelized loops using task. I made the following loop:
#pragma omp parallel default(none) firstprivate(left) private(i) shared(length, pivot, data)
{
#pragma omp for
for(i = 1; i<length-1; i++)
{
#pragma omp task
{
if(data[left] > pivot)
{
i = length;
}
else
{
left = i;
}
}
}
#pragma omp taskwait
}
I'm not sure if it's parallelized properly as it's taking more time than it's supposed to. How can I improve my code?
In this case, the task directive is totally irrelevant since (#pragma omp for) do the job.
Task is used for unbounded loops.
At the start of #pragma omp parallel a bunch of threads are created, then when we get to #pragma omp for the workload is distributed. What happens if this for loop has a for loop inside it, and I place a #pragma omp for before it as well? Does each thread create new threads? If not, which threads are assigned this task? What exactly happens in this situation?
By default, no threads are spawned for the inner loop. It is done sequentially using the thread that reaches it.
This is because nesting is disabled by default. However, if you enable nesting via omp_set_nested(), then a new set of threads will be spawned.
However, if you aren't careful, this will result in p^2 number of threads (since each of the original p threads will spawn another p threads.) Therefore nesting is disabled by default.
In a situation like the following:
#pragma omp parallel
{
#pragma omp for
for(int ii = 0; ii < n; ii++) {
/* ... */
#pragma omp for
for(int jj = 0; jj < m; jj++) {
/* ... */
}
}
}
what happens is that you trigger an undefined behavior as you violate the OpenMP standard. More precisely you violate the restrictions appearing in section 2.5 (worksharing constructs):
The following restrictions apply to worksharing constructs:
Each worksharing region must be encountered by all threads in a team or by none at all.
The sequence of worksharing regions and barrier regions encountered must be the same for every thread in a team.
This is clearly shown in the examples A.39.1c and A.40.1c:
Example A.39.1c: The following example of loop construct nesting is conforming because the inner and outer loop regions bind to different parallel
regions:
void work(int i, int j) {}
void good_nesting(int n)
{
int i, j;
#pragma omp parallel default(shared)
{
#pragma omp for
for (i=0; i<n; i++) {
#pragma omp parallel shared(i, n)
{
#pragma omp for
for (j=0; j < n; j++)
work(i, j);
}
}
}
}
Example A.40.1c: The following example is non-conforming because the inner and outer loop regions are closely nested
void work(int i, int j) {}
void wrong1(int n)
{
#pragma omp parallel default(shared)
{
int i, j;
#pragma omp for
for (i=0; i<n; i++) {
/* incorrect nesting of loop regions */
#pragma omp for
for (j=0; j<n; j++)
work(i, j);
}
}
}
Notice that this is different from:
#pragma omp parallel for
for(int ii = 0; ii < n; ii++) {
/* ... */
#pragma omp parallel for
for(int jj = 0; jj < m; jj++) {
/* ... */
}
}
in which you try to spawn a nested parallel region. Only in this case the discussion of Mysticial answer holds.
I'm using OpenMP to improve my program efficiency on loops.
But recently I discovered that on small loops the use of this library decreased performances and that using the normal way was better.
In fact, I'd like to use openMP only if a condition is satisfied, my code is
#pragma omp parallel for
for (unsigned i = 0; i < size; ++i)
do_some_stuff ();
But what I want to do is to disable the #pragma if size is small enough i.e.:
if (size > OMP_MIN_VALUE)
#pragma omp parallel for
for (unsigned i = 0; i < size; ++i)
do_some_stuff ();
But does not work, the better way is to write the loop twice but I don't want to do that way...
if (size > OMP_MIN_VALUE)
{
#pragma omp parallel for
for (unsigned i = 0; i < size; ++i)
do_some_stuff ();
}
else
{
for (unsigned i = 0; i < size; ++i)
do_some_stuff ();
}
What is the better way to do that?
I think you should be able to achieve the effect you're looking for by using the optional schedule clause on your parallel for directive:
#pragma omp parallel for schedule(static, OMP_MIN_VALUE)
for (unsigned i = 0; i < size; ++i)
do_some_stuff ();
You might want to play around with different kinds of scheduling though and different chunk sizes to see what suits your library routines best.