openmp tasks in nested loops - c++

I am trying to write the following piece of code.
#pragma omp parallel
{
    // ... some variable declarations ...
    for (auto &x : map) {
        // ... more declarations ...
        #pragma omp single
        {
            #pragma omp task firstprivate(x /*, ...*/) depend(out: a)
            {
                // assign the variables some values
            }
            for (/* loop over j */)
            {
                #pragma omp task firstprivate(j) depend(in: a) depend(out: b)
                {
                }
                // third loop over k
                #pragma omp task depend(in: a, b)
                {
                }
            }
        }
    }
}
Is this valid? The threads are created, but they never enter either the second loop (over j) or the third loop (over k); I checked with print statements.
Please suggest how to correct this.
I put print statements inside the two loops and nothing was printed, which implies the threads never enter the loops at all. I expected the work to be distributed among the threads, but unfortunately I am unable to achieve this.
Minimal reproducible example:
(as requested in the comments, here is an example)
int a, b;
#pragma omp parallel
{
    #pragma omp single
    {
        for (auto &x : map)
        {
            #pragma omp task
            for (int i = 0; i < x.second; ++i)
            {
                vector<int> val = m2[i];
                for (size_t j = 0; j < val.size(); ++j)
                {
                    #pragma omp critical
                    {
                        // update a global map m3
                    }
                }
            }
        }
    }
}
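For reference, here is a fully compilable version of the sketch above; the container contents and the m3 update are invented for illustration, and the map is renamed mp so it does not collide with std::map:

#include <iostream>
#include <map>
#include <vector>
#include <omp.h>
using namespace std;

map<int, int> mp = {{0, 3}, {1, 2}};             // assumed: key -> count
vector<vector<int>> m2 = {{1, 2}, {3}, {4, 5}};  // assumed contents
map<int, int> m3;                                // global map updated under critical

int main() {
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (auto x : mp)                    // by value, so each task gets its own copy
            {
                #pragma omp task
                for (int i = 0; i < x.second; ++i)
                {
                    vector<int> val = m2[i];
                    for (size_t j = 0; j < val.size(); ++j)
                    {
                        #pragma omp critical
                        m3[val[j]]++;            // stand-in for the real update
                    }
                }
            }
        }
    }
    cout << "m3 has " << m3.size() << " entries" << endl;
    return 0;
}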

Related

OpenMP nested loop task parallelism, counter not giving correct result

I am pretty new to OpenMP. I am trying to parallelize a nested loop using tasking, but it doesn't give me the correct counter output. The sequential output is "Total pixel = 100000000". Can anyone help me with that?
Note: I have done this using #pragma omp parallel for reduction (+:pixels_inside) private(i,j), which works fine; now I want to use tasking.
What I have tried so far:
#include <iostream>
#include <omp.h>
using namespace std;

int main() {
    int total_steps = 10000;
    int i, j;
    int pixels_inside = 0;
    omp_set_num_threads(4);
    //#pragma omp parallel for reduction (+:pixels_inside) private(i,j)
    #pragma omp parallel
    #pragma omp single private(i)
    for (i = 0; i < total_steps; i++) {
        #pragma omp task private(j)
        for (j = 0; j < total_steps; j++) {
            pixels_inside++;
        }
    }
    cout << "Total pixel = " << pixels_inside << endl;
    return 0;
}
First of all, you need to declare to OpenMP which variables you are using and how they should be protected. Generally speaking, your code has default(shared) since you didn't specify otherwise. This makes all variables accessible, at the same memory location, to all threads.
You should use something like this:
#pragma omp parallel default(none) shared(total_steps, pixels_inside)
[...]
#pragma omp task private(j) default(none) shared(total_steps, pixels_inside)
Now the threads only have access to what they actually need.
Secondly, the main problem is that you don't have critical section protection. When threads are running, they may want to use a shared variable, and then a race condition happens. For example, say threads A and B both have access to a variable x (a.k.a. a shared-memory variable). Now let's say A adds 2 and B adds 3 to the variable. The threads don't run at the same speed, so this may happen: A reads x=0, B reads x=0, A computes 0+2, B computes 0+3, B writes x=3 back to memory, then A writes x=2 back to memory. In the end x = 2 instead of 5. The same happens with pixels_inside: each thread takes the variable, adds 1, and writes it back to where it got it from. To overcome this, you use a critical section for protection:
#pragma omp critical
{
    // Code with shared memory
    pixels_inside++;
}
You didn't need a critical section with reduction, because variables listed in the reduction clause get this protection automatically.
Now your code should look like this:
#include <iostream>
#include <omp.h>
using namespace std;

int main() {
    int total_steps = 10000;
    int i, j;
    int pixels_inside = 0;
    omp_set_num_threads(4);
    //#pragma omp parallel for reduction (+:pixels_inside) private(i,j)
    #pragma omp parallel default(none) shared(total_steps, pixels_inside)
    #pragma omp single private(i)
    for (i = 0; i < total_steps; i++) {
        #pragma omp task private(j) default(none) shared(total_steps, pixels_inside)
        for (j = 0; j < total_steps; j++) {
            #pragma omp critical
            {
                pixels_inside++;
            }
        }
    }
    cout << "Total pixel = " << pixels_inside << endl;
    return 0;
}
That said, I would suggest using reduction, since it has better performance and methods to optimize that kind of calculation.
As @tartarus already explained, you have a race condition in your code, and it is much better to avoid it by using reduction. If you want to do the same as #pragma omp parallel for reduction (+:pixels_inside) private(i,j) does, but using tasks, you have to use the following:
#pragma omp parallel
#pragma omp single
#pragma omp taskloop reduction (+:pixels_inside) private(i,j)
for (i = 0; i < total_steps; i++) {
    for (j = 0; j < total_steps; j++) {
        pixels_inside++;
    }
}
In this version fewer tasks are created and a reduction is used instead of a critical section, so the performance will be much better (similar to what you can obtain with #pragma omp parallel for).
UPDATE (comment on performance): I guess this is just a simplified example, not the real code you want to parallelize. If the performance gain is not good enough, it most probably means that the parallel overhead is bigger than the work to be done. In that case, try to parallelize a bigger part of your code. Note that parallel overheads are typically bigger with tasks than with #pragma omp parallel for.
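If task overhead is the bottleneck, taskloop also lets you control task granularity explicitly. A minimal sketch of that variant (the grainsize value of 64 is arbitrary; note that taskloop is OpenMP 4.5 and reduction on taskloop requires OpenMP 5.0):

// Each generated task covers at least 64 iterations of the outer loop,
// so far fewer tasks are created and scheduling overhead drops.
#pragma omp parallel
#pragma omp single
#pragma omp taskloop grainsize(64) reduction (+:pixels_inside)
for (int i = 0; i < total_steps; i++) {
    for (int j = 0; j < total_steps; j++) {
        pixels_inside++;
    }
}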

Spanning OpenMP parallel regions across multiple functions/objects

Is there a way to span an OpenMP parallel region across multiple functions?
void run()
{
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        foo();
        #pragma omp for
        for (int i = 0; i < 10; ++i)
        {
            // Do stuff here
        }
    }
}

void foo()
{
    #pragma omp for
    for (int j = 0; j < 10; ++j)
    {
        // Have this code be run as a worksharing loop by the OMP threads
        // spawned in run
    }
}
In this example, I want the threads started in the omp parallel region in the run function to enter foo and execute it as a worksharing loop, the same way they run the for loop in run. Is this what happens by default, or does each thread run the loop independently? How do you test for each?
In my example, foo and run are member functions of separate classes.
Thanks!
What you describe is exactly how OpenMP works. A worksharing construct such as #pragma omp for does not have to appear lexically inside the parallel region: when it sits in a function called from the region (a so-called orphaned directive), it binds to the innermost enclosing parallel region at run time. So the threads spawned in run() share the iterations of the loop in foo(), just as they share the loop in run(). If foo() is called outside of any parallel region, its loop simply runs sequentially on the calling thread.
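You can test it by printing which thread executes which iteration. A minimal sketch (free functions instead of the member functions in the question, purely for illustration):

#include <cstdio>
#include <omp.h>

void foo()
{
    // Orphaned worksharing loop: binds to the parallel region in run().
    #pragma omp for
    for (int j = 0; j < 10; ++j)
        printf("foo: iteration %d on thread %d\n", j, omp_get_thread_num());
}

void run()
{
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        foo();
        #pragma omp for
        for (int i = 0; i < 10; ++i)
            printf("run: iteration %d on thread %d\n", i, omp_get_thread_num());
    }
}

int main() { run(); return 0; }

If each thread ran the loop in foo independently, every iteration would be printed twice (once per thread); with worksharing, each iteration is printed exactly once, split across the two threads.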

openmp critical section inside for loop

I have the following code that updates something inside a for loop, with another for loop coming after it. However, I get the error "expected a declaration" at the beginning of the second loop. The problem seems to be the "critical" part, because the error goes away if I delete it. I'm brand new to OpenMP and I was following an example here: http://www.viva64.com/en/a/0054/#ID0EBUEM (refer to "5. Too many entries to critical sections"). Does anybody have any idea what I'm doing wrong here?
Besides, is it true that "If the comparison is performed before the critical section, the critical section will not be entered during all iterations of the loop"?
Another thing is that I actually want to parallelize the two loops at the same time, but since the operations inside the loops are different, I use two thread teams here, hoping that if there are threads that are not needed in the first loop, they can start executing the second loop immediately. Will this work?
double maxValue = 0.0;
#pragma omp parallel for schedule (dynamic) // first loop
for (int i = 0; i < n; i++) {
    if (some condition satisfied)
    {
        #pragma omp atomic
        count++;
        continue;
    }
    double tmp = getValue(i);
    #pragma omp flush(maxValue)
    if (tmp > maxValue) {
        #pragma omp critical(updateMaxValue){
            if (tmp > maxValue) {
                maxValue = tmp;
                // update some other variables
                ...
            }
        }
    }
}
#pragma omp parallel for schedule (dynamic) // second loop
for (int i = 0; i < m; i++) {
    // some operations...
}
#pragma omp barrier
Sorry that I have so many questions and thanks in advance!
However, I got the error: "expected a declaration" at the beginning of the second loop.
You have a syntax error: an opening brace, if present, must be moved to a new line:
#pragma omp critical(updateMaxValue){
// ~^~
should be changed to:
#pragma omp critical(updateMaxValue)
{
(Actually, you don't need the braces at all, since the if statement that follows is already a structured block.)
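In other words, this is also valid (a sketch reusing the variables from the question):

#pragma omp critical(updateMaxValue)
if (tmp > maxValue) {
    maxValue = tmp;
    // update some other variables
}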
Another thing is that I actually want to parallelize the two loops at the same time, but since the operations inside the loops are different, I use two thread teams here, hoping that if there are threads that are not needed in the first loop, they can start executing the second loop immediately.
Use a single parallel region, and then a nowait clause on the first for-loop:
#pragma omp parallel
{
    #pragma omp for schedule(dynamic) nowait
    //                                ~~~~~^
    for (int i = 0; i < n; i++)
    {
        // ...
    }

    #pragma omp for schedule(dynamic)
    for (int i = 0; i < m; i++)
    {
        // ...
    }
}

OpenMP: having a complete 'for' loop into each thread

I have this code:
#pragma omp parallel
{
    #pragma omp single
    {
        for (int i = 0; i < given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
    }
    #pragma omp single
    {
        for (int i = 0; i < given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
    }
}
// and so on... up to 5 or 6 of myObject_x
// Then I sum up the buffers and do something with them
float result;
for (int i = 0; i < given_number; ++i)
    result = myBuffer_1[i] + myBuffer_2[i];
// do something with result
If I run this code, I get what I expect but the CPU usage looks quite high. Instead, if I run it normally without OpenMP I get the same results but the CPU usage is much lower, despite running in a single thread.
I don't want to specify a number of threads; I want the program to pick the maximum number of threads according to the CPU's capabilities. But I want each for loop to run entirely in its own thread. How can I do that?
Also, my expectation is that the for loop for myBuffer_1 runs in one thread, the other for loop runs in another thread, and the rest runs in the 'master' thread. Is this correct?
#pragma omp single has an implicit barrier at its end; you need #pragma omp single nowait if you want the two single blocks to run concurrently.
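With nowait, the original code would look like this (a minimal sketch of that variant, reusing the names from the question):

#pragma omp parallel
{
    #pragma omp single nowait
    {
        for (int i = 0; i < given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
    }
    #pragma omp single // the implicit barrier here still synchronizes before leaving the region
    {
        for (int i = 0; i < given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
    }
}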
However, for your requirement, using sections might be a better idea:
#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        {
            for (int i = 0; i < given_number; ++i) myBuffer_1[i] = myObject_1->myFunction();
        }
        #pragma omp section
        {
            for (int i = 0; i < given_number; ++i) myBuffer_2[i] = myObject_2->myFunction();
        }
    }
}

How to determine if a loop using "task" is parallelized?

I'm totally new to OpenMP and am learning how to parallelize loops using tasks. I made the following loop:
#pragma omp parallel default(none) firstprivate(left) private(i) shared(length, pivot, data)
{
    #pragma omp for
    for (i = 1; i < length - 1; i++)
    {
        #pragma omp task
        {
            if (data[left] > pivot)
            {
                i = length;
            }
            else
            {
                left = i;
            }
        }
    }
    #pragma omp taskwait
}
I'm not sure whether it's parallelized properly, as it's taking more time than it should. How can I improve my code?
In this case, the task directive is totally irrelevant, since #pragma omp for already does the job of distributing the iterations among the threads.
Tasks are used for unbounded loops, where the trip count is not known in advance.
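The classic example of an unbounded loop is walking a linked list, where #pragma omp for cannot be applied because the number of iterations is unknown. A minimal sketch (the Node type and process() are invented for illustration):

#include <omp.h>

struct Node {
    int value;
    Node *next;
};

void process(Node *n); // assumed: some expensive per-node work

void traverse(Node *head)
{
    #pragma omp parallel
    #pragma omp single
    {
        // One thread walks the list and spawns a task per node;
        // the other threads execute the tasks as they are created.
        for (Node *p = head; p != nullptr; p = p->next)
        {
            #pragma omp task firstprivate(p)
            process(p);
        }
        #pragma omp taskwait // wait for all node tasks to finish
    }
}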