OpenMP code waits on Join Barrier most of the time - c++

I have a piece of code
void parallel_func()
{
#pragma omp parallel
    {
#pragma omp for collapse(2) schedule(dynamic) nowait
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                if (i > j) continue; // hack to allow collapse here
                //...
            }
        }
#pragma omp critical
        {
            //...
        }
    }
}
Using a profiler, I noticed that my code spends most of its time waiting on the OpenMP join barrier. Any idea why? Or how to identify the cause?

Which join barrier caused the huge overhead? In your code, the omp for has nowait, which means it has no implicit barrier. The omp critical is literally a critical section, so it does not introduce a barrier either. (An omp single, in contrast, does end with an implicit join barrier unless it is omp single nowait.)
That leaves only one suspect: the implicit join barrier at the end of the omp parallel region, i.e. at the end of parallel_func.
Finally, how to identify the cause? It is most often workload imbalance: the amount of work per thread can deviate widely, so the faster threads waste their time at the implicit join barrier waiting for the slowest one. Check the workload distribution across your threads.

In OpenMP how can we run in parallel multiple code blocks where each block contains omp single and omp for loops?

In C++ Openmp how could someone run in parallel multiple code blocks where each block contains omp single and omp for loops?
More precisely, I have 3 functions:
block1();
block2();
block3();
I want each of these three functions to run in parallel, but I do not want each of them to be confined to a single thread. If I did, I could enclose each in a "#pragma omp single nowait" followed by a "#pragma omp barrier" at the end. Instead, each of these three functions may look something like this:
#pragma omp single
{
//some code here
}
#pragma omp for nowait
for(std::size_t i=0;i<numloops;i++)
{
//some code here
}
Notice in the above code that I need an omp single region to be executed before each parallel for loop. Without this constraint I could simply have added a "nowait" to the "omp single". Because the "omp single" has no "nowait", I do not want block2() to have to wait for the "omp single" region in block1() to complete, nor block3() for the "omp single" region in block2(). Any ideas? Thanks
The best solution is to use tasks. Run each block() in a different task, so they run in parallel:
#pragma omp parallel
#pragma omp single nowait
{
#pragma omp task
block1();
#pragma omp task
block2();
#pragma omp task
block3();
}
Inside each block() you can put code that executes before the for loop, and use taskloop to distribute the loop's work among the available threads:
void block1()
{
    // single-thread code here:
    // this part runs before the loop, independent of block2 and block3

    #pragma omp taskloop
    for (std::size_t i = 0; i < numloops; i++)
    {
        // some code here - iterations are distributed among the available threads
    }
}

Is there a way to break out of #omp parallel

I've got a situation where I have two #pragma omp tasks inside a #pragma omp parallel block.
The first task is a simple job of just waiting 5 seconds. The second task has the more difficult job of waiting for a complex user input action.
bool timed_out = false;
#pragma omp parallel num_threads(2) shared(timed_out)
{
    #pragma omp task
    {
        sleep(5);
        #pragma omp atomic write
        timed_out = true;
    }
    #pragma omp task
    {
        // wait for user input
    }
    #pragma omp taskwait
}
Basically, what I'd like is that after either the user input is received successfully or the 5-second timeout is hit, execution breaks out of the #pragma omp parallel section and continues with main.
I don't think I can use #pragma omp single after my taskwait because if the user input is received the next thing that would occur is the spawning of two worker threads.
Please note that your initial example does not generate two tasks, but four: each of the two OpenMP threads in the parallel region encounters the task constructs and thus creates tasks. You have to wrap the two task constructs in a master or single construct to avoid this and ensure that only one thread creates the tasks:
bool timed_out = false;
#pragma omp parallel num_threads(2) shared(timed_out)
{
    #pragma omp master
    {
        #pragma omp task
        {
            sleep(5);
            #pragma omp atomic write
            timed_out = true;
        }
        #pragma omp task
        {
            // wait for user input
        }
        #pragma omp taskwait
    }
}
To terminate the waiting second task, you can use OpenMP cancellation:
bool timed_out = false;
#pragma omp parallel master num_threads(2) shared(timed_out)
{
    #pragma omp taskgroup
    {
        #pragma omp task
        {
            sleep(5);
            #pragma omp atomic write
            timed_out = true;
            #pragma omp cancel taskgroup
        }
        #pragma omp task
        {
            while (true) {
                #pragma omp taskyield
                #pragma omp cancellation point taskgroup
            }
        }
        #pragma omp taskwait
    }
}
The taskgroup is needed to define the set of tasks affected by the cancel construct. The cancellation point construct in the waiting task terminates the while loop once the cancel construct has been encountered. Since the second task is spin-waiting, it contains a taskyield to introduce a task scheduling point and permit the OpenMP implementation to schedule another task (this is not needed for your minimal example, though it might be useful in code with more OpenMP tasks). Also note that cancellation is disabled by default: it must be activated by setting the OMP_CANCELLATION environment variable to true before the program starts.

Optimize loop with openmp

I've got the following loop:
while (a != b) {
#pragma omp parallel
{
#pragma omp for
// first for
#pragma omp for
// second for
}
}
In this way the team is created at each iteration of the while loop. Is it possible to rearrange the code in order to have a single team? The variable "a" is accessed with omp atomic inside the loop and "b" is a constant.
The only thing that comes to my mind is something like this:
#pragma omp parallel
{
while (a != b) {
#pragma omp barrier
// This barrier ensures that threads
// wait each other after evaluating the condition
// in the while loop
#pragma omp for
// first for (implicit barrier)
#pragma omp for
// second for (implicit barrier)
// The second implicit barrier ensures that every
// thread will have the same view of a
} // while
} // omp parallel
In this way each thread will evaluate the condition, but every evaluation will be consistent with the others. If you really want a single thread to evaluate the condition, then you should think about transforming your worksharing constructs into task constructs.
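For that alternative, a hedged sketch (function name and loop bodies invented): only the thread that enters the single region runs the while loop, and each worksharing for becomes a taskloop. A taskloop without the nogroup clause waits for its own tasks at the end of the construct, so "a" is up to date each time the single thread re-tests the condition:

```cpp
#include <omp.h>

// Hypothetical stand-in: each pass over the two loops does some
// counted work, then advances a toward b.
int run_until(int b) {
    int a = 0;
    long work = 0;
    #pragma omp parallel shared(a, work)
    #pragma omp single
    {
        while (a != b) {
            #pragma omp taskloop shared(work)
            for (int i = 0; i < 100; i++) {
                #pragma omp atomic
                work++;              // first loop body
            }
            #pragma omp taskloop shared(work)
            for (int i = 0; i < 100; i++) {
                #pragma omp atomic
                work++;              // second loop body
            }
            // each taskloop has waited for its tasks by this point,
            // so only this one thread ever evaluates a != b
            ++a;                     // placeholder update toward b
        }
    }
    return a;
}
```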

OpenMP tasks passing "shared" pointers

I would like to use the task pragmas of OpenMP for the following code:
std::vector<Class*> myVectorClass;
#pragma omp parallel
{
    #pragma omp single nowait
    {
        for (std::list<Class*>::iterator it = myClass.begin(); it != myClass.end();) {
            #pragma omp task firstprivate(it)
            (*it)->function(t, myVectorClass);
            ++it;
        }
    }
    #pragma omp taskwait
}
The problem, or one of them, is that myVectorClass is a vector of pointers to objects, so it is not possible to just set this vector as shared: myVectorClass is modified by function(), and the previous code crashes. So, could you tell me how to modify the previous code (without using the for-loop pragmas)?
Thanks
myVectorClass is a vector of pointers. In your current code you set it as shared. Since your code crashes, I suppose you change the length of myVectorClass in function(). However, std::vector is not thread-safe, so modifying its length from multiple threads will corrupt its data structure.
Depending on what exactly function() does, there may be a simple solution. The basic idea is to use one thread-local vector per thread to collect the results of function() first, then concatenate/merge these vectors into a single one.
The code shown here gives a good example.
C++ OpenMP Parallel For Loop - Alternatives to std::vector
std::vector<int> vec;
#pragma omp parallel
{
std::vector<int> vec_private;
#pragma omp for nowait //fill vec_private in parallel
for(int i=0; i<100; i++) {
vec_private.push_back(i);
}
#pragma omp critical
vec.insert(vec.end(), vec_private.begin(), vec_private.end());
}
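Since the question asked to avoid the for-loop pragmas, the same idea carries over to tasks. In the sketch below (types invented for illustration: int stands in for Class*, and doubling stands in for function()), each task computes its result privately and only the final append to the shared vector is serialized:

```cpp
#include <list>
#include <vector>
#include <omp.h>

// Sketch: each task computes locally, then appends to the shared
// vector under a critical section (std::vector is not thread-safe).
std::vector<int> process(const std::list<int>& items) {
    std::vector<int> out;
    #pragma omp parallel
    #pragma omp single nowait
    {
        for (std::list<int>::const_iterator it = items.begin();
             it != items.end(); ++it) {
            #pragma omp task firstprivate(it) shared(out)
            {
                int result = *it * 2;     // stand-in for (*it)->function(...)
                #pragma omp critical
                out.push_back(result);
            }
        }
    }   // implicit barrier of the parallel region: all tasks are done
    return out;
}
```

Note that the results arrive in whatever order the tasks finish; sort afterwards if order matters.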

Creating threads within a multithreaded for loop using openMP

I am new to OpenMP and I am not able to create threads within each threaded loop iteration. My question may sound naive, please bear with me.
#pragma omp parallel private(a,b) shared(f)
{
    #pragma omp for
    for(...)
    {
        // some operations
        // I want to parallelize the following four lines
        // within the multithreaded for loop:
        int x = func1(a,b);
        int val1 = validate(x);
        int y = func2(a,b);
        int val2 = validate(y);
    }
}
Within the for loop all threads are busy with loop iterations, so there are no resources left to execute anything inside an iteration in parallel; and if the work is well balanced, you wouldn't gain any performance anyway.
If it is hard or impossible to balance the work well with a parallel for, you can try generating tasks within the loop and doing the work afterwards. But be aware of the overhead of task generation:
#pragma omp parallel private(a,b) shared(f)
{
    #pragma omp for nowait
    for(...)
    {
        // some operations
        #pragma omp task
        {
            int x = func1(a,b);
            int val1 = validate(x);
        }
        #pragma omp task
        {
            int y = func2(a,b);
            int val2 = validate(y);
        }
    }
    // wait for all tasks to be finished
    // (also implicit at the end of the parallel region, i.e. here)
    #pragma omp taskwait
}