OpenMP taskwait not working - c++

In the following code, I have created a parallel region using the #pragma omp parallel.
Within the parallel region, there is a section of code that needs to be executed by only one thread, which is achieved using #pragma omp single nowait.
Inside the sequential region there is a for loop which can be parallelized, and I am using #pragma omp taskloop to achieve this.
After the loop is done, I have used #pragma omp taskwait so as to make sure that the rest of the code is executed by only one thread. However, it is not behaving as I expect: multiple threads are accessing the section of code after the #pragma omp taskwait, which is declared inside the region defined by #pragma omp single nowait.
std::vector<std::unordered_map<int, int>> vec_ht(n_comp + 1);
vec_ht[0].insert({root_comp_id, root_comp_node});
#pragma omp parallel
{
    #pragma omp single
    {
        int nthreads = omp_get_num_threads();
        for (int l = 0; l < n_comp; ++l) {
            int bucket_count = vec_ht[l].bucket_count();
            #pragma omp taskloop
            for (int bucket_id = 0; bucket_id < bucket_count; ++bucket_id) {
                if (vec_ht[l].bucket_size(bucket_id) == 0) { continue; }
                int thread_id = omp_get_thread_num();
                for (auto it_vec_ht = vec_ht[l].begin(bucket_id); it_vec_ht != vec_ht[l].end(bucket_id); ++it_vec_ht) {
                    // some operation --code removed for minimality
                } // for it_vec_ht[l]
            } // for bucket_id taskloop
            #pragma omp taskwait
            // Expected that henceforth all code will be accessed by one thread only
            for (int tid = 0; tid < nthreads; ++tid) {
                // some operation --code removed for minimality
            } // for tid
        } // for l
    } // pragma omp single nowait
} // pragma parallel

It doesn't look like you necessarily need to use the enclosing parallel/single/taskloop layout. If you aren't going to specify the number of threads, your system should default to using the maximum number of threads available. You can get this value outside of an OpenMP construct using omp_get_max_threads(). Then you can use just the taskloop structure, or just replace it with a #pragma omp parallel for.
I think the issue with your code is the #pragma omp taskwait line. The single thread should fork into many threads when it hits the taskloop construct, and then collapse back to a single thread afterwards. I think you might be triggering a brand new forking of your single thread with the #pragma omp taskwait command. An alternative to #pragma omp taskwait that definitely doesn't trigger thread forking is #pragma omp barrier. I think making this substitution will make your code work in its current form.
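For what it's worth, here is a rough sketch of the flattened layout suggested in the first paragraph, using a plain parallel for over the buckets of each level (vec_ht, n_comp and the bucket iteration are taken from the question; how the removed per-element and per-thread work fits in is an assumption):
for (int l = 0; l < n_comp; ++l) {
    int bucket_count = vec_ht[l].bucket_count();
    #pragma omp parallel for
    for (int bucket_id = 0; bucket_id < bucket_count; ++bucket_id) {
        if (vec_ht[l].bucket_size(bucket_id) == 0) { continue; }
        for (auto it = vec_ht[l].begin(bucket_id); it != vec_ht[l].end(bucket_id); ++it) {
            // per-element work (removed in the question for minimality)
        }
    } // the parallel region ends here, so everything below runs on one thread
    for (int tid = 0; tid < omp_get_max_threads(); ++tid) {
        // per-thread post-processing (removed in the question for minimality)
    }
}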

OMP parallel for is not dividing iterations

I am trying to do a distributed search using omp.h. I am creating 4 threads. The thread with id 0 does not perform the search; instead it oversees which thread has found the number in the array. Below is my code:
int arr[15]; //This array is randomly populated
int process=0,i=0,size=15; bool found=false;
#pragma omp parallel num_threads(4)
{
int thread_id = omp_get_thread_num();
#pragma omp cancellation point parallel
if(thread_id==0){
while(found==false){ continue; }
if(found==true){
cout<<"Number found by thread: "<<process<<endl;
#pragma omp cancel parallel
}
}
else{
#pragma omp parallel for schedule(static,5)
for(i=0;i<size;i++){
if(arr[i]==number){ //number is an int variable and its value is taken from the user
found = true;
process = thread_id;
}
cout<<i<<endl;
}
}
}
The problem I am having is that each thread executes the for loop from i=0 till i=14. According to my understanding OMP divides the iterations of the loop among the threads, but this is not happening here. Can anyone tell me why, and a possible solution?
Your problem is that you have a parallel inside a parallel. That means that each thread from the first parallel region makes a new team. That is called nested parallelism and it is allowed, but by default it's turned off. So each thread creates a team of 1 thread, which then executes its part of the for loop, which is the whole loop.
So your omp parallel for should be omp for.
But now there is another problem: your loop is going to be distributed over all threads, except that thread zero never gets to the loop. So you get deadlock.
.... and the actual solution to your problem is a lot more complicated. It involves creating two tasks, one that spins on the shared variable, and one that does the parallel search.
#pragma omp parallel
{
# pragma omp single
{
int p = omp_get_num_threads();
int found = 0;
# pragma omp taskgroup
{
/*
* Task 1 listens to the shared variable
*/
# pragma omp task shared(found)
{
while (!found) {
if (omp_get_thread_num()<0) printf("spin\n");
continue; }
printf("found!\n");
# pragma omp cancel taskgroup
} // end 1st task
/*
* Task 2 does something in parallel,
* sets `found' to true if found
*/
# pragma omp task shared(found)
{
# pragma omp parallel num_threads(p-1)
# pragma omp for
for (int i=0; i<p; i++)
// silly test
if (omp_get_thread_num()==2) {
printf("two!\n");
found = 1;
}
} // end 2nd task
} // end taskgroup
}
}
(Do you note the printf that is never executed? I needed that to prevent the compiler from "optimizing away" the empty while loop.)
Bonus solution:
#pragma omp parallel num_threads(4)
{
    if(omp_get_thread_num()==0){ spin_on_found; }
    if(omp_get_thread_num()!=0){
        #pragma omp for nowait schedule(dynamic)
        for ( loop ) stuff
    }
}
The combination of dynamic and nowait can somehow deal with the missing thread.
@Victor Eijkhout already explained what happened here; I just want to show you a simpler (and data-race-free) solution.
Note that OpenMP has significant overhead; in your case the overhead is bigger than the gain from parallelization, so the best idea is not to parallelize this case at all.
If you do some expensive work inside the loop, the simplest solution is to skip that expensive work once it is no longer necessary. Note that I have used #pragma omp critical before found = true; to avoid a data race.
#pragma omp parallel for
for(int i=0; i<size;i++){
if(found) continue;
// some expensive work here
if(CONDITION){
#pragma omp critical
found = true;
}
}
Another alternative is to use #pragma omp cancel for. Note that cancellation only takes effect if the OMP_CANCELLATION environment variable is set to true; the same applies to the cancel parallel approach in your original code.
#pragma omp parallel
#pragma omp for
for(int i=0; i<size;i++){
#pragma omp cancellation point for
// some expensive work here
if(CONDITION){
//cancelling the for loop
#pragma omp cancel for
}
}

Is it possible to make thread join to 'parallel for' region after its job?

I have two jobs that need to run simultaneously at first:
1) for loop that can be parallelized
2) function that can be done with one thread
Now, let me describe what I want to do.
If there exist 8 available threads,
job(1) and job(2) have to run simultaneously at first with 7 threads and 1 thread, respectively.
After job(2) finishes, the thread that job(2) was using should be allocated to job(1) which is the parallel for loop.
I'm using omp_get_thread_num to count how many threads are active in each region. I would expect the number of threads in job(1) to increase by 1 when job(2) finishes.
Below is my attempt, which might be wrong or might be OK:
omp_set_nested(1);
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section // job(2)
{ // 'printf' is not real job. It is just used for simplicity.
printf("i'm single: %d\n", omp_get_thread_num());
}
#pragma omp section // job(1)
{
#pragma omp parallel for schedule(dynamic, 32)
for (int i = 0 ; i < 10000000; ++i) {
// 'printf' is not real job. It is just used for simplicity.
printf("%d\n", omp_get_thread_num());
}
}
}
}
How can make the work that I want to achieve be done?
What about something like this?
#pragma omp parallel
{
// note the nowait here so that other threads jump directly to the for loop
#pragma omp single nowait
{
job2();
}
#pragma omp for schedule(dynamic, 32)
for (int i = 0 ; i < 10000000; ++i) {
job1();
}
}
I did not test this, but the single will be executed by only one thread while all others will jump directly to the for loop thanks to the nowait.
Also I think it is easier to read than with sections.
Another way (and potentially the better way) to express this would be to use OpenMP tasks:
#pragma omp parallel master
{
#pragma omp task // job(2)
{ // 'printf' is not real job. It is just used for simplicity.
printf("i'm single: %d\n", omp_get_thread_num());
}
#pragma omp taskloop // job(1)
for (int i = 0 ; i < 10000000; ++i) {
// 'printf' is not real job. It is just used for simplicity.
printf("%d\n", omp_get_thread_num());
}
}
If you have a compiler that does not understand OpenMP version 5.0, then you have to split the parallel and master:
#pragma omp parallel
#pragma omp master
{
#pragma omp task // job(2)
{ // 'printf' is not real job. It is just used for simplicity.
printf("i'm single: %d\n", omp_get_thread_num());
}
#pragma omp taskloop // job(1)
for (int i = 0 ; i < 10000000; ++i) {
// 'printf' is not real job. It is just used for simplicity.
printf("%d\n", omp_get_thread_num());
}
}
The problem comes from synchronization. At the end of the sections construct, OpenMP waits for all threads to finish and cannot release the thread working on job 2 until its completion has been checked.
The solution is to suppress that synchronization with a nowait.
I did not succeed in suppressing the synchronization with sections and nested parallelism. I rarely use nested parallel regions, but I think that, while sections can be nowaited, there is a problem when spawning the new nested parallel region inside a section. There is a mandatory synchronization at the end of a parallel region that cannot be suppressed, and it probably prevents new threads from joining the pool.
What I did is use a single construct without synchronization. This way, OpenMP starts the single work on one thread and does not wait for its completion before starting the parallel for. When that thread finishes its single work, it joins the thread pool to finish processing the for.
#include <omp.h>
#include <stdio.h>
int main() {
int singlethreadid=-1;
// omp_set_nested(1);
#pragma omp parallel
{
#pragma omp single nowait // job(2)
{ // 'printf' is not real job. It is just used for simplicity.
printf("i'm single: %d\n", omp_get_thread_num());
singlethreadid=omp_get_thread_num();
}
#pragma omp for schedule(dynamic, 32)
for (int i = 0 ; i < 100000; ++i) {
// 'printf' is not real job. It is just used for simplicity.
printf("%d\n", omp_get_thread_num());
if (omp_get_thread_num() == singlethreadid)
printf("Hello, I\'m back\n");
}
}
}

All OpenMP Tasks running on the same thread

I have written a recursive parallel function using tasks in OpenMP. While it gives me the correct answer and runs fine, I think there is an issue with the parallelism. The run-time, compared with a serial solution, does not scale the way other parallel problems I have solved without tasks do. When printing the thread number for each task, they are all running on thread 0. I am compiling and running on Visual Studio Express 2013.
int parallelOMP(int n)
{
int a, b, sum = 0;
int alpha = 0, beta = 0;
for (int k = 1; k < n; k++)
{
a = n - (k*(3 * k - 1) / 2);
b = n - (k*(3 * k + 1) / 2);
if (a < 0 && b < 0)
break;
if (a < 0)
alpha = 0;
else if (p[a] != -1)
alpha = p[a];
if (b < 0)
beta = 0;
else if (p[b] != -1)
beta = p[b];
if (a > 0 && b > 0 && p[a] == -1 && p[b] == -1)
{
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[a] = parallelOMP(a);
}
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[b] = parallelOMP(b);
}
#pragma omp taskwait
}
}
alpha = p[a];
beta = p[b];
}
else if (a > 0 && p[a] == -1)
{
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[a] = parallelOMP(a);
}
#pragma omp taskwait
}
}
alpha = p[a];
}
else if (b > 0 && p[b] == -1)
{
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[b] = parallelOMP(b);
}
#pragma omp taskwait
}
}
beta = p[b];
}
if (k % 2 == 0)
sum += -1 * (alpha + beta);
else
sum += alpha + beta;
}
if (sum > 0)
return sum%m;
else
return (m + (sum % m)) % m;
}
Sometimes I wish comments on SO could be as richly formatted as the answers, but alas that's not the case. Therefore, here comes a long comment disguised as an answer.
It appears that a very common mistake in writing recursive OpenMP code is not understanding how exactly parallel regions work. Consider the following code (uses explicit tasks, therefore support for OpenMP 3.0 or newer required):
void par_rec_func (int arg)
{
if (arg <= 0) return;
#pragma omp parallel num_threads(2)
{
#pragma omp task
par_rec_func(arg-1);
#pragma omp task
par_rec_func(arg-1);
}
}
// somewhere in the main function
par_rec_func(10);
There is a problem with this code. The problem is that, except for the top-level invocation of par_rec_func(), in all other invocations the parallel region will be created in the context of an enclosing outer parallel region. This is called nested parallelism and by default is disabled, which means that all parallel regions beneath the top-level one are going to be inactive, i.e. they will execute serially. Since tasks bind to the innermost parallel region, they will also get executed in serial. What will happen with this code is that it will spawn one additional thread (for a total of two) at the top-level invocation of par_rec_func() and each thread will then execute a whole branch of the recursion tree (i.e. one half of the whole tree). If one runs that code on a machine with 64 cores, 62 of them will idle. In order for the nested parallelism to be enabled, one has to either set the environment variable OMP_NESTED to true or call omp_set_nested() and pass it a true flag:
omp_set_nested(1);
Once nested parallelism has been enabled, one faces a new problem. Every time a nested parallel region is encountered, the encountering thread will either spawn an additional one (because of num_threads(2)) or acquire an idle thread from the runtime's thread pool. At every deeper level of recursion, this program will require twice as many threads as at the previous level. An upper limit on the total number of threads can be set via OMP_THREAD_LIMIT (another OpenMP 3.0 feature), but overhead aside, this is not what one really wants in such cases.
The correct solution in that case is to use orphaned tasks in the dynamic scope of a single parallel region:
void par_rec_func (int arg)
{
if (arg <= 0) return;
#pragma omp task
par_rec_func(arg-1);
#pragma omp task
par_rec_func(arg-1);
// Wait for the child tasks to complete if necessary
#pragma omp taskwait
}
// somewhere in the main function
#pragma omp parallel
{
#pragma omp single
par_rec_func(10);
}
The advantages of this method are many. First of all, only a single parallel region is created with as many threads as specified (e.g. by setting OMP_NUM_THREADS or by any other means). When the child tasks call recursively into par_rec_func(), that simply adds new tasks to the parallel region without spawning new threads. This greatly helps in the case where the recursion tree is not balanced, since many quality OpenMP runtimes implement task stealing, e.g. thread i could execute child tasks of a task that executes in thread j, where i != j.
Given an OpenMP 2.0 compiler like VC++, one cannot do much except to approximate the above idea by using nested parallelism and explicitly disabling it at a certain level:
void par_rec_func (int arg)
{
if (arg <= 0) return;
int level = omp_get_level();
#pragma omp parallel sections num_threads(2) if(level < 4)
{
#pragma omp section
par_rec_func(arg-1);
#pragma omp section
par_rec_func(arg-1);
}
}
// somewhere in the main function
int saved_nested = omp_get_nested();
omp_set_nested(1);
par_rec_func(10);
omp_set_nested(saved_nested);
omp_get_level() is used to determine the level of nesting and the if clause is used to selectively deactivate parallel regions at fourth or deeper level of nesting. This solution is dumb and won't work well when the recursion tree is unbalanced.
Actual Problem:
You are using Visual Studio 2013.
Visual Studio has never supported OMP versions beyond 2.0 (see here).
OMP Tasks are a feature of OMP 3.0 (see spec).
Ergo, using VS at all means no OMP tasks for you.
If OMP Tasks are an essential requirement, use a different compiler. If OMP is not an essential requirement, you should consider an alternative parallel task handling library. Visual Studio includes the MS Concurrency Runtime, and the Parallel Patterns Library built on top of it. I have recently moved from OMP to PPL due to the fact I'm using VS for work; it isn't quite a drop-in replacement but it is quite capable.
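To give a flavour of the PPL route, here is a minimal, untested sketch of how the two recursive child computations from the question could be expressed with concurrency::parallel_invoke (p and parallelOMP are the question's names; treat this as an illustration of the API rather than a drop-in port):
#include <ppl.h>   // Parallel Patterns Library, ships with Visual Studio
// inside the (a > 0 && b > 0 && p[a] == -1 && p[b] == -1) branch:
concurrency::parallel_invoke(
    [&] { p[a] = parallelOMP(a); },   // first child computation
    [&] { p[b] = parallelOMP(b); });  // second child computation
// parallel_invoke returns only after both lambdas have finished,
// so it plays the role of the taskwait in the OpenMP version.
alpha = p[a];
beta = p[b];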
My second attempt at solving this, again preserved for historical reasons:
So, the problem is almost certainly that you're defining your omp tasks outside of an omp parallel region.
Here's a contrived example:
void work()
{
#pragma omp parallel
{
#pragma omp single nowait
for (int i = 0; i < 5; i++)
{
#pragma omp task untied
{
std::cout <<
"starting task " << i <<
" on thread " << omp_get_thread_num() << "\n";
sleep(1);
}
}
}
}
If you omit the parallel declaration, the job runs serially:
starting task 0 on thread 0
starting task 1 on thread 0
starting task 2 on thread 0
starting task 3 on thread 0
starting task 4 on thread 0
But if you leave it in:
starting task starting task 3 on thread 1
starting task 0 on thread 3
2 on thread 0
starting task 1 on thread 2
starting task 4 on thread 2
Success, complete with authentic misuse of shared output resources.
(for reference, if you omit the single declaration, each thread will run the loop, resulting in 20 tasks being run on my 4 cpu VM).
Original answer included below for completeness, but no longer relevant!
In every case, your omp task is a single, simple thing. It probably runs and completes immediately:
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
Because you never start one long-running task before firing off the next task, everything will probably run on the first allocated thread.
Perhaps you meant to do something like this?
if (a > 0 && b > 0 && p[a] == -1 && p[b] == -1)
{
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[a] = parallelOMP(a);
}
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[b] = parallelOMP(b);
}
#pragma omp taskwait
alpha = p[a];
beta = p[b];
}

pragma omp for inside pragma omp master or single

I'm sitting here trying to make orphaning work, and to reduce the overhead by reducing the number of #pragma omp parallel calls.
What I'm trying is something like:
#pragma omp parallel default(none) shared(mat,mat2,f,max_iter,tol,N,conv) private(diff,k)
{
#pragma omp master // I'm not against using #pragma omp single or whatever will work
{
while(diff>tol) {
do_work(mat,mat2,f,N);
swap(mat,mat2);
if( !(k%100) ) // Only test stop criteria every 100 iteration
diff = conv[k] = do_more_work(mat,mat2);
k++;
} // end while
} // end master
} // end parallel
The do_work depends on the previous iteration, so the while loop has to be run sequentially.
But I would like to be able to run do_work in parallel, so it would look something like:
void do_work(double *mat, double *mat2, double *f, int N)
{
int i,j;
double scale = 1/4.0;
#pragma omp for schedule(runtime) // Just so I can test different settings without having to recompile
for(i=0;i<N;i++)
for(j=0;j<N;j++)
mat[i*N+j] = scale*(mat2[(i+1)*N+j]+mat2[(i-1)*N+j] + ... + f[i*N+j]);
}
I hope this can be accomplished in some way; I'm just not sure how. So any help I can get is greatly appreciated (also if you're telling me this isn't possible). Btw, I'm working with OpenMP 3.0, the gcc compiler and the Sun Studio compiler.
The outer parallel region in your original code contains only a serial piece (#pragma omp master), which makes no sense and effectively results in purely serial execution (no parallelism). Since do_work() depends on the previous iteration, yet you want to run it in parallel, you must use synchronisation. The OpenMP tool for that is an (explicit or implicit) synchronisation barrier.
For example (code similar to yours):
#pragma omp parallel
for(int j=0; diff>tol; ++j) // must be the same condition for each thread!
#pragma omp for // note: implicit synchronisation after for loop
for(int i=0; i<N; ++i)
work(j,i);
Note that the implicit synchronisation ensures that no thread enters the next j if any thread is still working on the current j.
The alternative
for(int j=0; diff>tol; ++j)
#pragma omp parallel for
for(int i=0; i<N; ++i)
work(j,i);
should be less efficient, as it creates a new team of threads at each iteration, instead of merely synchronising.
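For completeness, here is a sketch of how the orphaned do_work() from the question (keeping its own #pragma omp for) might be driven from a single enclosing parallel region; the placement of the single construct is an assumption rather than tested code:
#pragma omp parallel shared(mat, mat2, f, N, tol, conv, diff, k)
{
    while (diff > tol) {              // every thread evaluates the same shared condition
        do_work(mat, mat2, f, N);     // contains the orphaned "#pragma omp for"; implicit barrier at its end
        #pragma omp single
        {                             // one thread updates the shared state ...
            swap(mat, mat2);
            if (!(k % 100))
                diff = conv[k] = do_more_work(mat, mat2);
            k++;
        }                             // ... and the implicit barrier of the single keeps the team in step
    } // end while
} // end parallel
Note that diff and k must be shared here (they are private in the question's pragma), otherwise the threads cannot agree on when to stop.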

the OpenMP "master" pragma must not be enclosed by the "parallel for" pragma

Why won't the intel compiler let me specify that some actions in an openmp parallel for block should be executed by the master thread only?
And how can I do what I'm trying to achieve without this kind of functionality?
What I'm trying to do is update a progress bar through a callback in a parallel for:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
//update item count
#pragma omp atomic
num_items_computed++;
//update progress bar with number of items computed
//master thread only due to com marshalling
#pragma omp master
set_progressor_callback(num_items_computed);
//actual computation goes here
...blah...
}
I want only the master thread to call the callback, because if I don't enforce that (say by using omp critical instead to ensure only one thread uses the callback at once) I get the following runtime exception:
The application called an interface that was marshalled for a different thread.
...hence the desire to keep all callbacks in the master thread.
Thanks in advance.
#include <omp.h>
void f(){}
int main()
{
#pragma omp parallel for schedule (guided)
for (int i = 0; i < 100; ++i)
{
#pragma omp master
f();
}
return 0;
}
Compiler Error C3034
OpenMP 'master' directive cannot be directly nested within 'parallel for' directive
Visual Studio 2010 OpenMP 2.0
Maybe like this:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
//update item count
#pragma omp atomic
num_items_computed++;
//update progress bar with number of items computed
//master thread only due to com marshalling
//#pragma omp master   -- this is an error here
//#pragma omp critical -- this is valid
if (omp_get_thread_num() == 0) // this may be a good option
set_progressor_callback(num_items_computed);
//actual computation goes here
...blah...
}
The reason you get the error is that the master thread isn't there most of the time when the code reaches the #pragma omp master line.
For example, let's take the code from Artyom:
#include <omp.h>
void f(){}
int main()
{
#pragma omp parallel for schedule (guided)
for (int i = 0; i < 100; ++i)
{
#pragma omp master
f();
}
return 0;
}
If the code would compile, the following could happen:
Let's say thread 0 starts (the master thread). It reaches the pragma that practically says "Master, do the following piece of code". Being the master, it can run the function.
However, what happens when thread 1 or 2 or 3, etc, reaches that piece of code?
The master directive is telling the present/listening team that the master thread has to execute f(). But the team is a single thread and there is no master present. The program wouldn't know what to do past that point.
And that's why, I think, the master isn't allowed to be inside the for-loop.
Substituting the master directive with if (omp_get_thread_num() == 0) works because now the program says, "If you are the master, do this. Otherwise ignore it".
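Applied to the earlier example, the substitution would look roughly like this (an untested sketch; whichever thread happens to be number 0 makes the call):
#include <omp.h>
void f(){}
int main()
{
    #pragma omp parallel for schedule (guided)
    for (int i = 0; i < 100; ++i)
    {
        if (omp_get_thread_num() == 0)  // only the thread numbered 0 calls f()
            f();
    }
    return 0;
}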