In C++ with OpenMP, is there any difference between
#pragma omp parallel for
for(int i=0; i<N; i++) {
...
}
and
#pragma omp parallel
for(int i=0; i<N; i++) {
...
}
?
Thanks!
#pragma omp parallel
for(int i=0; i<N; i++) {
...
}
This code creates a parallel region, and each individual thread executes the whole loop. In other words, with T threads the complete loop runs T times (once per thread), instead of the T threads splitting the N iterations among themselves so that each iteration runs just once.
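To make the difference visible, here is a minimal sketch that counts how many times the loop body runs under each directive. The function names are made up for illustration, and it assumes the compiler is invoked with OpenMP enabled (for example -fopenmp); without it the pragmas are ignored and both functions return n.

```cpp
// Loop inside a bare parallel region: every thread in the team runs all
// n iterations, so the count is n multiplied by the team size.
long count_with_parallel_only(int n) {
    long executions = 0;
    #pragma omp parallel
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        executions++;
    }
    return executions;
}

// Combined construct: the n iterations are divided among the threads,
// so the body runs exactly n times in total.
long count_with_parallel_for(int n) {
    long executions = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        executions++;
    }
    return executions;
}
```

With 4 threads and n = 100 you would expect 400 from the first function and 100 from the second.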
You can do:
#pragma omp parallel
{
    #pragma omp for
    for( int i=0; i < N; ++i )
    {
    }
    #pragma omp for
    for( int i=0; i < N; ++i )
    {
    }
}
This creates one parallel region (i.e., one fork/join, which has a cost, so you don't want to pay it for every loop) and runs multiple loops in parallel within that region. If you are already inside a parallel region, make sure you use #pragma omp for rather than #pragma omp parallel for: the latter opens a nested parallel region in every thread. With nested parallelism disabled (the usual default) each thread then runs the entire loop by itself, and with nesting enabled each of your T threads spawns its own team of threads for the loop.
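As a rough sketch of the pattern above with actual loop bodies: the second loop reads what the first one wrote, which is safe because of the implicit barrier at the end of the first #pragma omp for. The function name and the computation are only illustrative assumptions.

```cpp
#include <vector>

// Hypothetical example: square every element, then add each square to its
// right-hand neighbour (wrapping around at the end).
std::vector<int> square_then_pair_sum(const std::vector<int>& in) {
    const int n = static_cast<int>(in.size());
    std::vector<int> squares(n), out(n);
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            squares[i] = in[i] * in[i];
        }
        // Implicit barrier: all squares are written before any thread goes on.
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            out[i] = squares[i] + squares[(i + 1) % n];
        }
    }
    return out;
}
```

Both loops share one team of threads, so the fork/join cost is paid once.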
Related
What (if any) differences are there between using:
#pragma omp parallel
{
    #pragma omp for simd
    for (int i = 0; i < 100; ++i)
    {
        c[i] = a[i] ^ b[i];
    }
}
and:
#pragma omp parallel for simd
for (int i = 0; i < 100; ++i)
{
    c[i] = a[i] ^ b[i];
}
Or does the compiler (ICC) care?
I know that the first one defines a parallel section and then a for loop to be divided up, and that you can do multiple things after the loop. Please do correct me if I'm wrong, I'm still learning the ways of OpenMP.
But when would you use one way or the other?
Simply put, if you only have 1 for-loop that you want to parallelise use #pragma omp parallel for simd.
If you want to parallelise multiple for-loops or add any other parallel routines before or after the current for-loop, use:
#pragma omp parallel
{
    // Other parallel code
    #pragma omp for simd
    for (int i = 0; i < 100; ++i)
    {
        c[i] = a[i] ^ b[i];
    }
    // Other parallel code
}
This way you don't have to reopen the parallel section when adding more parallel routines, reducing overhead time.
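For what it's worth, here is a small self-contained sketch of both forms side by side. The function names are hypothetical and it assumes both input vectors have the same length; the two versions should produce identical results, the only difference being where the parallel region is opened.

```cpp
#include <vector>

// Combined construct: one directive opens the region and shares the loop.
std::vector<unsigned> xor_combined(const std::vector<unsigned>& a,
                                   const std::vector<unsigned>& b) {
    const int n = static_cast<int>(a.size());
    std::vector<unsigned> c(n);
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] ^ b[i];
    }
    return c;
}

// Split form: the region is opened separately, leaving room for other
// parallel work before or after the loop.
std::vector<unsigned> xor_split(const std::vector<unsigned>& a,
                                const std::vector<unsigned>& b) {
    const int n = static_cast<int>(a.size());
    std::vector<unsigned> c(n);
    #pragma omp parallel
    {
        #pragma omp for simd
        for (int i = 0; i < n; ++i) {
            c[i] = a[i] ^ b[i];
        }
    }
    return c;
}
```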
I have this code:
#pragma omp parallel
{
    #pragma omp single
    {
        for (int i=0; i<given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
    }
    #pragma omp single
    {
        for (int i=0; i<given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
    }
}
// and so on... up to 5 or 6 of myObject_x
// Then I sum up the buffers and do something with them
float result;
for (int i=0; i<given_number; ++i)
result = myBuffer_1[i] + myBuffer_2[i];
// do something with result
If I run this code, I get what I expect but the CPU usage looks quite high. Instead, if I run it normally without OpenMP I get the same results but the CPU usage is much lower, despite running in a single thread.
I don't want to specify a number of threads; I would like the program to pick the maximum number of threads according to the CPU's capabilities, but I want each for loop to run entirely in its own thread. How can I do that?
Also, my expectation is that the for loop for myBuffer_1 runs in one thread, the other for loop runs in another thread, and the rest runs in the 'master' thread. Is this correct?
#pragma omp single has an implicit barrier at the end, so you need to use #pragma omp single nowait if you want the two single blocks to run concurrently.
However, for your requirement, using sections might be a better idea:
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            for (int i=0; i<given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
        }
        #pragma omp section
        {
            for (int i=0; i<given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
        }
    }
}
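For comparison, the single nowait variant mentioned at the start of this answer could look roughly like the sketch below. The function name and the values written are placeholders standing in for the myObject_x->myFunction() calls in the question.

```cpp
#include <cstddef>
#include <vector>

// Each single block is executed by exactly one thread. Because of nowait,
// the other threads do not wait at the end of the first block, so a second
// thread can pick up the second block while the first is still running.
void fill_buffers_single_nowait(std::vector<float>& buffer_1,
                                std::vector<float>& buffer_2) {
    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            for (std::size_t i = 0; i < buffer_1.size(); ++i)
                buffer_1[i] = static_cast<float>(i);         // placeholder work
        }
        #pragma omp single nowait
        {
            for (std::size_t i = 0; i < buffer_2.size(); ++i)
                buffer_2[i] = 2.0f * static_cast<float>(i);  // placeholder work
        }
    } // implicit barrier at the end of the parallel region: both buffers are filled
}
```

If the team only has one thread, the two blocks simply run one after the other.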
I'm totally new to OpenMP and learning how to parallelize loops using tasks. I wrote the following loop:
#pragma omp parallel default(none) firstprivate(left) private(i) shared(length, pivot, data)
{
    #pragma omp for
    for(i = 1; i<length-1; i++)
    {
        #pragma omp task
        {
            if(data[left] > pivot)
            {
                i = length;
            }
            else
            {
                left = i;
            }
        }
    }
    #pragma omp taskwait
}
I'm not sure if it's parallelized properly as it's taking more time than it's supposed to. How can I improve my code?
In this case, the task directive is unnecessary, since #pragma omp for already does the job of distributing the iterations.
Tasks are meant for work that a worksharing loop cannot express, such as unbounded loops.
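As an illustration of the kind of unbounded loop where tasks do fit, here is a minimal sketch that walks a linked list whose length is not known in advance. The Node type and the function name are assumptions made up for this example; one thread walks the list and creates a task per node, and the rest of the team executes those tasks.

```cpp
// Hypothetical singly linked list node.
struct Node {
    int value;
    Node* next;
};

// Doubles the value stored in every node of the list.
void double_all(Node* head) {
    #pragma omp parallel
    {
        #pragma omp single
        {
            // Only one thread traverses the list and generates the tasks.
            for (Node* p = head; p != nullptr; p = p->next) {
                #pragma omp task firstprivate(p)
                {
                    p->value *= 2;
                }
            }
        } // all generated tasks are finished once the team passes this barrier
    }
}
```

A #pragma omp for cannot be used here because the trip count cannot be computed before the loop starts.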
At the start of #pragma omp parallel a bunch of threads are created, then when we get to #pragma omp for the workload is distributed. What happens if this for loop has a for loop inside it, and I place a #pragma omp for before it as well? Does each thread create new threads? If not, which threads are assigned this task? What exactly happens in this situation?
By default, no threads are spawned for the inner loop. It is done sequentially using the thread that reaches it.
This is because nesting is disabled by default. However, if you enable nesting via omp_set_nested() (deprecated since OpenMP 5.0 in favor of omp_set_max_active_levels()), then a new set of threads will be spawned.
However, if you aren't careful, this will result in p^2 threads (since each of the original p threads will spawn another p threads). Therefore nesting is disabled by default.
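If you want to check this on your own setup, a small probe along these lines may help. It is only a sketch: TeamSizes and nested_team_sizes are names invented here, the team size of 2 is arbitrary, and the runtime is free to hand out fewer threads than requested. It uses omp_set_max_active_levels() to limit how many nested parallel regions may be active, and falls back to 1/1 when compiled without OpenMP.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

struct TeamSizes {
    int outer;  // threads seen in the outer parallel region
    int inner;  // threads seen in a nested parallel region
};

// Reports the team sizes observed when at most max_levels nested parallel
// regions are allowed to be active.
TeamSizes nested_team_sizes(int max_levels) {
    TeamSizes sizes{1, 1};
#ifdef _OPENMP
    const int previous = omp_get_max_active_levels();
    omp_set_max_active_levels(max_levels);
    #pragma omp parallel num_threads(2)
    {
        #pragma omp critical
        sizes.outer = omp_get_num_threads();
        #pragma omp parallel num_threads(2)
        {
            #pragma omp critical
            sizes.inner = omp_get_num_threads();
        }
    }
    omp_set_max_active_levels(previous);  // restore the earlier setting
#else
    (void)max_levels;  // pragmas are ignored without OpenMP
#endif
    return sizes;
}
```

With max_levels set to 1, an outer team of more than one thread should leave every inner region with a single thread; with 2, each outer thread may get its own inner team.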
In a situation like the following:
#pragma omp parallel
{
    #pragma omp for
    for(int ii = 0; ii < n; ii++) {
        /* ... */
        #pragma omp for
        for(int jj = 0; jj < m; jj++) {
            /* ... */
        }
    }
}
what happens is that you trigger undefined behavior, because you violate the OpenMP standard. More precisely, you violate the restrictions listed in section 2.5 (worksharing constructs):
The following restrictions apply to worksharing constructs:
Each worksharing region must be encountered by all threads in a team or by none at all.
The sequence of worksharing regions and barrier regions encountered must be the same for every thread in a team.
This is clearly shown in the examples A.39.1c and A.40.1c:
Example A.39.1c: The following example of loop construct nesting is conforming because the inner and outer loop regions bind to different parallel regions:
void work(int i, int j) {}
void good_nesting(int n)
{
    int i, j;
    #pragma omp parallel default(shared)
    {
        #pragma omp for
        for (i=0; i<n; i++) {
            #pragma omp parallel shared(i, n)
            {
                #pragma omp for
                for (j=0; j < n; j++)
                    work(i, j);
            }
        }
    }
}
Example A.40.1c: The following example is non-conforming because the inner and outer loop regions are closely nested
void work(int i, int j) {}
void wrong1(int n)
{
    #pragma omp parallel default(shared)
    {
        int i, j;
        #pragma omp for
        for (i=0; i<n; i++) {
            /* incorrect nesting of loop regions */
            #pragma omp for
            for (j=0; j<n; j++)
                work(i, j);
        }
    }
}
Notice that this is different from:
#pragma omp parallel for
for(int ii = 0; ii < n; ii++) {
    /* ... */
    #pragma omp parallel for
    for(int jj = 0; jj < m; jj++) {
        /* ... */
    }
}
in which you try to spawn a nested parallel region. Only in this case does the discussion in Mysticial's answer hold.
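If the goal is simply to share both loop levels among one team, a conforming alternative (not taken from the standard's examples above) is the collapse clause, which is a different technique from nesting worksharing constructs: it merges perfectly nested loops into a single iteration space. A minimal sketch, with a made-up function name and loop body, and assuming nothing sits between the two for statements:

```cpp
#include <cstddef>
#include <vector>

// Fills an n-by-m table stored row by row, where entry (ii, jj) holds
// its own linear index ii * m + jj.
std::vector<int> fill_table(int n, int m) {
    std::vector<int> table(static_cast<std::size_t>(n) * m);
    #pragma omp parallel
    {
        // One worksharing construct covers both loops, so every thread
        // encounters the same sequence of worksharing regions.
        #pragma omp for collapse(2)
        for (int ii = 0; ii < n; ii++) {
            for (int jj = 0; jj < m; jj++) {
                table[ii * m + jj] = ii * m + jj;
            }
        }
    }
    return table;
}
```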
My algorithm (solving Poisson's equation) is completely parallelizable--provided that all the threads sync at the end of each iteration.
Function f, fNext;
init(f);
#pragma omp parallel
for(int step=0; step<maxITER; step++) {
    #pragma omp for
    for(int i=0; i<N; i++) {
        for(int j=0; j<N; j++) {
            fNext(i,j) = someOperator( f(i,j) );
        }
    }
    f = fNext;
}//Threads must synchronize here
Does #pragma omp for ensure thread synchronization before continuing to the next iteration?
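Yes. From the OpenMP spec (e.g., v3.1, but this has been in since the beginning), under "worksharing constructs":
There is an implicit barrier at the end of a loop construct unless a nowait clause is specified.
That is, at the end of the for loop, unless you do something like #pragma omp for nowait, there is an implied barrier, so no thread will execute f=fNext until all threads are done with the for loop. Note that in the code as posted every thread executes f=fNext itself; wrapping that assignment in #pragma omp single would let one thread do it and would add a barrier before the next iteration starts.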
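To show how the barriers line up across iterations, here is a reduced 1-D sketch of the same structure. It is not the asker's Poisson solver: the relax function, the neighbour-averaging update, and the use of std::vector are stand-ins chosen to keep the example short.

```cpp
#include <vector>

// Repeatedly replaces every interior point by the average of its two
// neighbours; the end points stay fixed.
std::vector<double> relax(std::vector<double> f, int steps) {
    const int n = static_cast<int>(f.size());
    std::vector<double> f_next(f);
    #pragma omp parallel
    for (int step = 0; step < steps; step++) {
        #pragma omp for
        for (int i = 1; i < n - 1; i++) {
            f_next[i] = 0.5 * (f[i - 1] + f[i + 1]);
        }
        // Implicit barrier of the loop construct: f_next is complete here.
        #pragma omp single
        f.swap(f_next);
        // Implicit barrier of single: all threads see the swap before the
        // next step begins.
    }
    return f;
}
```

Swapping instead of copying avoids touching every element between steps; a single thread does it so the shared vectors are never modified concurrently.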