Never-ending loop in OpenMP section - C++

I've got the following code in C++ using OpenMP 2.5:
total = 500;
counter = 0;
#pragma omp parallel
{
    while (counter != total) {
        #pragma omp barrier
        #pragma omp for
        for (it = vec.begin(); it < vec.end(); ++it) {
            work(*it);
        }
    }
}
In work() I increase/decrease counter in the following way:
void increase() {
    #pragma omp atomic
    counter++;
}

void decrease() {
    #pragma omp atomic
    counter--;
}
Sometimes the loop never ends; however, if I remove the parallel pragma, the code works. I suspect I implemented it the wrong way. Can an OpenMP expert point out what is wrong?
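What most likely happens here (a reading of the snippet, not verified against the full program): every thread evaluates counter != total on its own while other threads may still be changing counter inside work(), so the threads can disagree about the condition. Once one thread leaves the while loop while the others re-enter it, the threads no longer encounter the same sequence of barrier and for constructs, which OpenMP requires, and the program hangs. A minimal sketch of one way to restructure it so that all threads make the same decision in every sweep; keep_going is a name introduced for the sketch, everything else is taken from the question:

bool keep_going = true;
#pragma omp parallel
{
    while (true) {
        #pragma omp single
        keep_going = (counter != total);    // one thread samples counter;
        // the implicit barrier at the end of single makes the value visible
        if (!keep_going)
            break;                          // every thread sees the same value
        #pragma omp for
        for (auto it = vec.begin(); it < vec.end(); ++it) {
            work(*it);
        }
        // implicit barrier at the end of the for: the whole sweep (and all
        // atomic updates of counter) is finished before counter is re-sampled
    }
}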

Related

OpenMP tasks in nested loops

I am trying to write the following piece of code.
#pragma omp parallel
{
    int ...  // some variables
    for (int x : map) {
        int ...
        #pragma omp single
        {
            #pragma omp task firstprivate(x, ...) depend(out: a)
            {
                // assigning the variables some values
            }
            for (/* loop over j */)
            {
                #pragma omp task firstprivate(j) depend(in: a) depend(out: b)
                {
                }
                // third loop over k
                #pragma omp task depend(in: a, b)
                {
                }
            }
        }
    }
}
Is this valid? The threads are created, but they never enter either the loop over j (the second loop) or the third loop (I checked with print statements).
Please suggest how to correct this.
I tried printing inside the two loops and saw that nothing is printed, which implies the threads don't enter the loops at all. I expected the work to be distributed among the threads, but I am unable to achieve this.
Minimal reproducible example (as requested in the comments):
int a, b;
#pragma omp parallel
{
    #pragma omp single
    {
        for (auto &x : map)
        {
            #pragma omp task
            for (int i = 0; i < x.second; ++i)
            {
                vector<int> val = m2[i];
                for (int j = 0; j < val.size(); ++j)
                {
                    #pragma omp critical
                    {
                        // update a global map m3
                    }
                }
            }
        }
    }
}
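For comparison, the pattern usually used for this kind of dependent-task chain is to let one thread create all the tasks (the single wrapping the whole loop) and to put the depend clauses on variables that outlive the tasks. The following is only a self-contained sketch of that pattern; the map contents, the task bodies and the names m, a and b are placeholders, not the original code:

#include <cstdio>
#include <map>

int main()
{
    std::map<int, int> m = { {1, 10}, {2, 20} };  // placeholder input
    int a = 0, b = 0;                             // shared dependence objects

    #pragma omp parallel
    #pragma omp single        // one thread runs the loop and creates the tasks
    {
        for (auto x : m)
        {
            #pragma omp task firstprivate(x) shared(a) depend(out: a)
            { a = x.second; }                     // produces a

            #pragma omp task shared(a, b) depend(in: a) depend(out: b)
            { b = 2 * a; }                        // consumes a, produces b

            #pragma omp task shared(a, b) depend(in: a, b)
            { std::printf("a=%d b=%d\n", a, b); } // consumes both
        }
    }   // all tasks are guaranteed finished at the barrier that ends single
    return 0;
}

Moving the single outside the loop also sidesteps the question of whether every thread in the team encounters the same sequence of single constructs, as in the first snippet above.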

OpenMP: having a complete 'for' loop in each thread

I have this code:
#pragma omp parallel
{
    #pragma omp single
    {
        for (int i=0; i<given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
    }
    #pragma omp single
    {
        for (int i=0; i<given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
    }
}
// and so on... up to 5 or 6 of myObject_x
// Then I sum up the buffers and do something with them
float result;
for (int i=0; i<given_number; ++i)
result = myBuffer_1[i] + myBuffer_2[i];
// do something with result
If I run this code, I get what I expect but the CPU usage looks quite high. Instead, if I run it normally without OpenMP I get the same results but the CPU usage is much lower, despite running in a single thread.
I don't want to specify a number of threads; I want the program to pick the maximum number of threads according to the CPU capabilities, but I want each for loop to run entirely in its own thread. How can I do that?
Also, my expectation is that the for loop for myBuffer_1 runs in one thread, the other for loop runs in another thread, and the rest runs in the 'master' thread. Is this correct?
#pragma omp single has an implicit barrier at the end; the threads that do not execute the block wait at that barrier (often spinning), which is one likely reason CPU usage looks high even though only one thread does the work. You need #pragma omp single nowait if you want the two single blocks to run concurrently.
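A sketch of that nowait variant, reusing the buffers and objects from the question (their declarations are assumed to exist as in the original code):

#pragma omp parallel
{
    #pragma omp single nowait   // no barrier here, so another thread can start
    {                           // the second block while this one is running
        for (int i = 0; i < given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
    }
    #pragma omp single nowait
    {
        for (int i = 0; i < given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
    }
}   // the implicit barrier at the end of the parallel region still synchronizes both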
However, for your requirement, using sections might be a better idea:
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            for (int i=0; i<given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
        }
        #pragma omp section
        {
            for (int i=0; i<given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
        }
    }
}

How to determine if a loop using "task" is parallelized?

I'm totally new to OpenMP and am learning how to parallelize loops using tasks. I made the following loop:
#pragma omp parallel default(none) firstprivate(left) private(i) shared(length, pivot, data)
{
    #pragma omp for
    for (i = 1; i < length-1; i++)
    {
        #pragma omp task
        {
            if (data[left] > pivot)
            {
                i = length;
            }
            else
            {
                left = i;
            }
        }
    }
    #pragma omp taskwait
}
I'm not sure if it's parallelized properly as it's taking more time than it's supposed to. How can I improve my code?
In this case, the task directive is totally irrelevant, since #pragma omp for already does the job of distributing the iterations among the threads.
Tasks are intended for unbounded or irregular loops (while loops, pointer chasing, recursion), where a counted worksharing for loop cannot be used.
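As an illustration of that distinction (a sketch, not code from the question): a loop whose trip count is unknown up front, such as walking a linked list, is the textbook case for tasks:

struct node { node* next; int payload; };

void process(node* p) { p->payload *= 2; }    // placeholder work

void traverse(node* head)
{
    #pragma omp parallel
    #pragma omp single                        // one thread walks the list...
    for (node* p = head; p != nullptr; p = p->next)
    {
        #pragma omp task firstprivate(p)      // ...and spawns a task per node
        process(p);
    }
}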

Decreasing number of iterations in OpenMP parallel for

I have a parallel for in a C++ program that has to loop up to some number of iterations. Each iteration computes a possible solution for an algorithm, and I want to exit the loop once I find a valid one (it is OK if a few extra iterations are done). I know the number of iterations should be fixed from the beginning in the parallel for, but since I'm not increasing the number of iterations in the following code, is there any guarantee that threads check the condition before proceeding with their current iteration?
void fun()
{
    int max_its = 100;
    #pragma omp parallel for schedule(dynamic, 1)
    for (int t = 0; t < max_its; ++t)
    {
        ...
        if (some condition)
            max_its = t; // valid to make threads exit the for?
    }
}
Modifying the loop bound works with most OpenMP implementations of the worksharing constructs, but the program is then no longer conforming to OpenMP and there is no guarantee that it works with other compilers.
Since the OP is OK with some extra iterations, OpenMP cancellation is the way to go. OpenMP 4.0 introduced the cancel construct exactly for this purpose: it requests termination of the worksharing construct and teleports the threads to the end of it. Note that cancellation also has to be enabled at run time (for instance by setting the OMP_CANCELLATION environment variable to true); otherwise the cancel directive has no effect.
void fun()
{
    int max_its = 100;
    #pragma omp parallel for schedule(dynamic, 1)
    for (int t = 0; t < max_its; ++t)
    {
        ...
        if (some condition) {
            #pragma omp cancel for
        }
        #pragma omp cancellation point for
    }
}
Be aware that there might be a price to pay in terms of performance, but you may want to accept it if the overall performance improves when the loop is aborted early.
In pre-4.0 implementations of OpenMP, the only OpenMP-compliant solution is to use an if statement so that the remaining iterations reach the regular end of the loop as quickly as possible without executing the actual loop body:
void fun()
{
    int max_its = 100;
    #pragma omp parallel for schedule(dynamic, 1)
    for (int t = 0; t < max_its; ++t)
    {
        if (!some condition) {
            ... loop body ...
        }
    }
}
Hope that helps!
Cheers,
-michael
You can't modify max_its, as the standard says it must be a loop-invariant expression.
What you can do, though, is use a boolean shared variable as a flag:
void fun()
{
    int max_its = 100;
    bool found = false;
    #pragma omp parallel for schedule(dynamic, 1) shared(found)
    for (int t = 0; t < max_its; ++t)
    {
        if (!found) {
            ...
        }
        if (some condition) {
            #pragma omp atomic write
            found = true; // remaining iterations will skip the body
        }
    }
}
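A small caveat: the plain read of found in if(!found) is, strictly speaking, a data race under the OpenMP memory model. The worst outcome here is just a few extra iterations doing work, but a fully conforming variant would also read the flag atomically, e.g.:

bool stop;
#pragma omp atomic read
stop = found;          // atomic read pairs with the atomic write above
if (!stop) {
    ...
}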
Logic of this kind may also be implemented with tasks instead of a worksharing construct. A sketch of the code would be something like the following:
void algorithm(int t, bool& found) {
    #pragma omp task shared(found)
    {
        if (!found) {
            // Do work
            if ( /* condition */ ) {
                #pragma omp atomic write
                found = true;
            }
        }
    } // task
} // function
void fun()
{
    int max_its = 100;
    bool found = false;
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int t = 0; t < max_its; ++t)
            {
                algorithm(t, found);
            }
        } // single
    } // parallel
}
The idea is that a single thread creates max_its tasks. Each task is picked up by a waiting thread. If one of the tasks finds a valid solution, all the others are informed through the shared variable found.
If some_condition is a logical expression that is "always valid" (i.e. can be evaluated at any point in the iteration), then you could do:
for (int t = 0; t < max_its && !some_condition; ++t)
That way, it's very clear that !some_condition is required to continue the loop, and there is no need to read the rest of the code to find out that "if some_condition, the loop ends".
Otherwise (for example, if some_condition is the result of some calculation inside the loop and it's complicated to move it into the for-loop condition), using break is clearly the right thing to do; note, however, that break is not permitted inside the loop associated with an OpenMP for construct, so this only applies to the sequential version of the loop.

What happens in OpenMP when there's a pragma for inside a pragma for?

At the start of #pragma omp parallel a bunch of threads are created; then, when we get to #pragma omp for, the workload is distributed. What happens if this for loop has another for loop inside it, and I place a #pragma omp for before the inner one as well? Does each thread create new threads? If not, which threads are assigned this task? What exactly happens in this situation?
By default, no threads are spawned for the inner loop. It is done sequentially using the thread that reaches it.
This is because nesting is disabled by default. However, if you enable nesting via omp_set_nested(), then a new set of threads will be spawned.
However, if you aren't careful, this will result in p^2 threads (since each of the original p threads will spawn another p threads), which is why nesting is disabled by default.
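A small sketch of what enabling nesting looks like (the thread counts are arbitrary and only chosen to keep the total bounded):

#include <omp.h>

int main()
{
    omp_set_nested(1);                        // or set OMP_NESTED=true
    #pragma omp parallel num_threads(4)       // outer team: 4 threads
    {
        #pragma omp parallel num_threads(2)   // each outer thread forks an inner team of 2
        {
            // 4 * 2 = 8 threads are active here in total
        }
    }
    return 0;
}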
In a situation like the following:
#pragma omp parallel
{
    #pragma omp for
    for (int ii = 0; ii < n; ii++) {
        /* ... */
        #pragma omp for
        for (int jj = 0; jj < m; jj++) {
            /* ... */
        }
    }
}
what happens is that you trigger undefined behavior, because you violate the OpenMP standard. More precisely, you violate the restrictions in section 2.5 (worksharing constructs):
The following restrictions apply to worksharing constructs:
Each worksharing region must be encountered by all threads in a team or by none at all.
The sequence of worksharing regions and barrier regions encountered must be the same for every thread in a team.
This is clearly shown in the examples A.39.1c and A.40.1c:
Example A.39.1c: The following example of loop construct nesting is conforming because the inner and outer loop regions bind to different parallel regions:
void work(int i, int j) {}

void good_nesting(int n)
{
    int i, j;
    #pragma omp parallel default(shared)
    {
        #pragma omp for
        for (i=0; i<n; i++) {
            #pragma omp parallel shared(i, n)
            {
                #pragma omp for
                for (j=0; j < n; j++)
                    work(i, j);
            }
        }
    }
}
Example A.40.1c: The following example is non-conforming because the inner and outer loop regions are closely nested:
void work(int i, int j) {}

void wrong1(int n)
{
    #pragma omp parallel default(shared)
    {
        int i, j;
        #pragma omp for
        for (i=0; i<n; i++) {
            /* incorrect nesting of loop regions */
            #pragma omp for
            for (j=0; j<n; j++)
                work(i, j);
        }
    }
}
Notice that this is different from:
#pragma omp parallel for
for (int ii = 0; ii < n; ii++) {
    /* ... */
    #pragma omp parallel for
    for (int jj = 0; jj < m; jj++) {
        /* ... */
    }
}
in which you try to spawn a nested parallel region. Only in this case does the discussion in Mysticial's answer hold.
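As a hedged aside, beyond what the answers above cover: if the actual goal is simply to distribute the combined iteration space of two nested loops inside one construct, collapse (available since OpenMP 3.0) is the conforming tool; it requires the loops to be perfectly nested, i.e. no code between them:

#pragma omp parallel for collapse(2)   // one worksharing construct over n*m iterations
for (int ii = 0; ii < n; ii++) {
    for (int jj = 0; jj < m; jj++) {
        /* ... */
    }
}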