openmp distribute threads to certain code blocks - c++

In my program I need to divide n threads in the following way:
1) Thread 1 does its own specific work
2) The other n-1 threads do their own work
If number_of_threads == 1, then only action 1) is performed.
Actions 1) and 2) are computed in parallel.
#include <iostream>
#include <omp.h>

int main(){
    int number_of_threads;
    std::cin >> number_of_threads;
    omp_set_num_threads(number_of_threads);
    #pragma omp parallel if (number_of_threads > 1)
    {
        #pragma omp master // no barrier at the end of the master block
        single_calc();
        #pragma omp ??(number_of_threads-1) // second block
        // this section of code is computed by n-1 threads
    }
}
I came up with the following solutions:
1) hardcode it so that the thread with id == 1 doesn't compute the second block
2) since the master thread has id == 0, I can use #pragma omp for starting with i = 1 in the second block
3) call single_calc() outside of #pragma omp parallel (but I want to control the number of threads computing this block)
Is there any elegant solution to this?
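One pattern that comes close is single nowait for the first block combined with a dynamically scheduled loop for the second. A minimal sketch, assuming single_calc(), other_work(), and the loop bound stand in for the real work:

#include <iostream>
#include <omp.h>

void single_calc() { /* placeholder for the specific work */ }
void other_work(int i) { /* placeholder for work item i */ }

int main() {
    int number_of_threads;
    std::cin >> number_of_threads;
    omp_set_num_threads(number_of_threads);

    #pragma omp parallel if (number_of_threads > 1)
    {
        // one thread (not necessarily the master) runs single_calc();
        // nowait lets the remaining threads move on immediately
        #pragma omp single nowait
        single_calc();

        // the other threads start on this loop right away; dynamic
        // scheduling keeps them busy while single_calc() is running,
        // and the single thread joins in once it is done
        #pragma omp for schedule(dynamic)
        for (int i = 0; i < 100; ++i)
            other_work(i);
    }
}

If the second block must be skipped entirely when number_of_threads == 1, it still needs an explicit guard around the loop.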

Related

OMP parallel for is not dividing iterations

I am trying to do a distributed search using omp.h. I am creating 4 threads. The thread with id 0 does not perform the search; instead it oversees which thread has found the number in the array. Below is my code:
int arr[15]; // this array is randomly populated
int process = 0, i = 0, size = 15;
bool found = false;
#pragma omp parallel num_threads(4)
{
    int thread_id = omp_get_thread_num();
    #pragma omp cancellation point parallel
    if (thread_id == 0) {
        while (found == false) { continue; }
        if (found == true) {
            cout << "Number found by thread: " << process << endl;
            #pragma omp cancel parallel
        }
    }
    else {
        #pragma omp parallel for schedule(static,5)
        for (i = 0; i < size; i++) {
            if (arr[i] == number) { // number is an int variable; its value is taken from the user
                found = true;
                process = thread_id;
            }
            cout << i << endl;
        }
    }
}
The problem I am having is that each thread executes the for loop from i=0 to i=14. According to my understanding, OpenMP should divide the iterations of the loop, but this is not happening here. Can anyone tell me why, and what a possible solution is?
Your problem is that you have a parallel inside a parallel. That means that each thread from the first parallel region makes a new team. That is called nested parallelism, and it is allowed, but by default it's turned off. So each thread creates a team of 1 thread, which then executes its part of the for loop, which is the whole loop.
So your omp parallel for should be omp for.
But now there is another problem: your loop is going to be distributed over all threads of the team, except that thread zero never gets to the loop (or to its implicit barrier). So you get deadlock.
The actual solution to your problem is a lot more complicated. It involves creating two tasks, one that spins on the shared variable, and one that does the parallel search.
#pragma omp parallel
{
    #pragma omp single
    {
        int p = omp_get_num_threads();
        int found = 0;
        #pragma omp taskgroup
        {
            /*
             * Task 1 listens to the shared variable
             */
            #pragma omp task shared(found)
            {
                while (!found) {
                    if (omp_get_thread_num() < 0) printf("spin\n");
                    continue;
                }
                printf("found!\n");
                #pragma omp cancel taskgroup
            } // end 1st task
            /*
             * Task 2 does something in parallel,
             * sets `found' to true if found
             */
            #pragma omp task shared(found)
            {
                #pragma omp parallel num_threads(p-1)
                #pragma omp for
                for (int i = 0; i < p; i++)
                    // silly test
                    if (omp_get_thread_num() == 2) {
                        printf("two!\n");
                        found = 1;
                    }
            } // end 2nd task
        } // end taskgroup
    }
}
(Do you note the printf that is never executed? I needed it to prevent the compiler from "optimizing away" the empty while loop. Also note that cancellation only takes effect if the OMP_CANCELLATION environment variable is set to true when the program runs.)
Bonus solution:
#pragma omp parallel num_threads(4)
{
    if (omp_get_thread_num() == 0) { /* spin on `found' as above */ }
    if (omp_get_thread_num() != 0) {
        #pragma omp for nowait schedule(dynamic)
        for ( /* the search loop */ ) { /* test and set `found' */ }
    }
}
The combination of dynamic and nowait can somehow deal with the missing thread: nowait removes the implicit barrier that thread zero would never reach, and dynamic scheduling hands out chunks only to the threads that actually arrive at the loop.
@Victor Eijkhout already explained what happened here; I just want to show you a simpler (and data-race-free) solution.
Note that OpenMP has significant overhead; in your case the overheads are bigger than the gain from parallelization. So the best idea is not to parallelize this case at all.
If you do some expensive work inside the loop, the simplest solution is to skip this expensive work when it is not necessary. Note that I have used #pragma omp critical before found = true; to avoid a data race.
#pragma omp parallel for
for(int i = 0; i < size; i++){
    if(found) continue;
    // some expensive work here
    if(CONDITION){
        #pragma omp critical
        found = true;
    }
}
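For completeness, here is a runnable version of this idea applied to the asker's search (a sketch of mine; the array contents and the searched number are made up):

#include <iostream>
#include <omp.h>

int main() {
    const int size = 15;
    int arr[size];
    for (int i = 0; i < size; ++i) arr[i] = i * 2; // made-up data
    int number = 8;                                // made-up target

    bool found = false;
    int process = -1;

    #pragma omp parallel for
    for (int i = 0; i < size; ++i) {
        if (found) continue; // skip the expensive work once found
        if (arr[i] == number) {
            #pragma omp critical
            {
                found = true;
                process = omp_get_thread_num();
            }
        }
    }
    if (found) std::cout << "Number found by thread: " << process << "\n";
}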
Another alternative is to use #pragma omp cancel for (note that cancellation has to be activated by setting the OMP_CANCELLATION environment variable to true, otherwise the cancel directive has no effect):
#pragma omp parallel
#pragma omp for
for(int i = 0; i < size; i++){
    #pragma omp cancellation point for
    // some expensive work here
    if(CONDITION){
        // cancel the for loop
        #pragma omp cancel for
    }
}

set_num_threads inside parallel not working

I'm struggling to set the number of threads to 1 inside a parallel region. I put a barrier so that all threads stop at that point and I can freely set the number of threads to 1 (so no other threads will be executing). But wherever I placed omp_set_num_threads(1), the reported number of threads always stayed 3. Is it possible to change the number of threads during runtime? How can I do that?
#include <iostream>
#include <omp.h>
#include <stdio.h>

int main(){
    int num_of_threads;
    std::cin >> num_of_threads;
    omp_set_dynamic(0);
    #pragma omp parallel if(num_of_threads>1) num_threads(3)
    {
        int t_id = omp_get_thread_num();
        int t_total = omp_get_num_threads();
        printf("Current thread id: %d \n Total number_of_threads: %d \n", t_id, t_total);
        #pragma omp barrier
        #pragma omp single
        {
            omp_set_num_threads(1);
            t_id = omp_get_thread_num();
            t_total = omp_get_num_threads();
            printf("Single section \n Current thread id: %d \n Total number_of_threads: %d \n", t_id, t_total);
        }
    }
}
TL;DR You can't change the number of threads in a parallel region.
Remember, this is a pool of threads which get forked at the beginning of the parallel region. Inside the region they are not even synchronized (unless you tell them to be), thus OpenMP would need to terminate some of them at an unknown position - obviously a bad idea.
Your #pragma omp single makes the following code section execute on a single thread, so there is no need to set this via omp_set_num_threads.
BUT it doesn't change your pool; it just tells the runtime to schedule the following section onto one thread, while the rest skip it.
To show this behavior, e.g. for university purposes, I would suggest printing only the thread id in the parallel and single parts. That way you can already tell whether it's working or not.
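A minimal sketch of that suggestion (my own illustration; the variable names are mine):

#include <stdio.h>
#include <omp.h>

int main(){
    #pragma omp parallel num_threads(3)
    {
        // every thread of the pool prints its id
        printf("parallel: thread %d\n", omp_get_thread_num());
        #pragma omp barrier
        #pragma omp single
        {
            // exactly one thread (not necessarily id 0) prints here,
            // even though the pool still contains 3 threads
            printf("single: thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
    }
}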

Labeling data for Bag Of Words

I've been looking at this tutorial and the labeling part confuses me. Not the act of labeling itself, but the way the process is shown in the tutorial.
More specifically, the #pragma omp sections:
#pragma omp parallel for schedule(dynamic,3)
for(..loop a directory?..) {
    ...
    #pragma omp critical
    {
        if(classes_training_data.count(class_) == 0) { // not yet created...
            classes_training_data[class_].create(0, response_hist.cols, response_hist.type());
            classes_names.push_back(class_);
        }
        classes_training_data[class_].push_back(response_hist);
    }
    total_samples++;
}
As well as the following code below it.
Could anyone explain what is going on here?
The pragmas are from OpenMP, a specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in Fortran and C/C++ programs.
The #pragma omp parallel for schedule(dynamic,3) is a shorthand that combines several other pragmas. Let's see them:
#pragma omp parallel starts a parallel block with a set of threads that will execute the next statement in parallel.
You can also specify "parallel loops", like a for loop: #pragma omp parallel for. This pragma will split the for-loop among all the threads inside the parallel block, and each thread will execute its portion of the loop.
For example:
#pragma omp parallel
{
    #pragma omp for
    for(int n(0); n < 5; ++n) {
        std::cout << "Hello\n";
    }
}
This will create a parallel block that will execute a for-loop. The threads will print Hello to the standard output five times, in no specified order (that is, thread #3 can print its "Hello" before thread #1, and so on).
Now, you can also schedule which chunk of work each thread will receive. There are several policies: static (the default) and dynamic. Check this awesome answer regarding scheduling policies.
Now, all of these pragmas can be combined into one:
#pragma omp parallel for schedule(dynamic,3)
which will create a parallel block that executes a for-loop with dynamic scheduling, where each thread in the block will execute 3 iterations of the loop before asking the scheduler for more chunks.
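A small self-contained illustration of that clause (a sketch of mine; the loop body is just a print):

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel for schedule(dynamic, 3)
    for (int i = 0; i < 12; ++i) {
        // chunks of 3 consecutive iterations are handed out on demand,
        // so a fast thread may grab several chunks while a slow one
        // is still busy with its first
        printf("iteration %d on thread %d\n", i, omp_get_thread_num());
    }
}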
The critical pragma will restrict the execution of the next block to a single thread at a time. In your example, only one thread at a time will execute this:
{
    if(classes_training_data.count(class_) == 0) { // not yet created...
        classes_training_data[class_].create(0, response_hist.cols, response_hist.type());
        classes_names.push_back(class_);
    }
    classes_training_data[class_].push_back(response_hist);
}
Here you have an introduction to OpenMP 3.0.
Finally, the variables you mention are defined in the tutorial; just look above your posted code:
vector<KeyPoint> keypoints;
Mat response_hist;
Mat img;
string filepath;
map<string,Mat> classes_training_data;
Ptr<FeatureDetector > detector(new SurfFeatureDetector());
Ptr<DescriptorMatcher > matcher(new BruteForceMatcher<L2<float> >());
Ptr<DescriptorExtractor > extractor(new OpponentColorDescriptorExtractor(Ptr<DescriptorExtractor>(new SurfDescriptorExtractor())));
Ptr<BOWImgDescriptorExtractor> bowide(new BOWImgDescriptorExtractor(extractor,matcher));
bowide->setVocabulary(vocabulary);

All OpenMP Tasks running on the same thread

I have written a recursive parallel function using tasks in OpenMP. While it gives me the correct answer and runs fine, I think there is an issue with the parallelism. The run-time does not scale the way it does for other parallel problems I have solved without tasks. When printing the thread id for each task, they are all running on thread 0. I am compiling and running on Visual Studio Express 2013.
// note: p (a memo table initialized to -1) and m (the modulus) are
// globals defined elsewhere in the asker's code
int parallelOMP(int n)
{
    int a, b, sum = 0;
    int alpha = 0, beta = 0;

    for (int k = 1; k < n; k++)
    {
        a = n - (k * (3 * k - 1) / 2);
        b = n - (k * (3 * k + 1) / 2);

        if (a < 0 && b < 0)
            break;

        if (a < 0)
            alpha = 0;
        else if (p[a] != -1)
            alpha = p[a];

        if (b < 0)
            beta = 0;
        else if (p[b] != -1)
            beta = p[b];

        if (a > 0 && b > 0 && p[a] == -1 && p[b] == -1)
        {
            #pragma omp parallel
            {
                #pragma omp single
                {
                    #pragma omp task shared(p), untied
                    {
                        cout << omp_get_thread_num();
                        p[a] = parallelOMP(a);
                    }
                    #pragma omp task shared(p), untied
                    {
                        cout << omp_get_thread_num();
                        p[b] = parallelOMP(b);
                    }
                    #pragma omp taskwait
                }
            }
            alpha = p[a];
            beta = p[b];
        }
        else if (a > 0 && p[a] == -1)
        {
            #pragma omp parallel
            {
                #pragma omp single
                {
                    #pragma omp task shared(p), untied
                    {
                        cout << omp_get_thread_num();
                        p[a] = parallelOMP(a);
                    }
                    #pragma omp taskwait
                }
            }
            alpha = p[a];
        }
        else if (b > 0 && p[b] == -1)
        {
            #pragma omp parallel
            {
                #pragma omp single
                {
                    #pragma omp task shared(p), untied
                    {
                        cout << omp_get_thread_num();
                        p[b] = parallelOMP(b);
                    }
                    #pragma omp taskwait
                }
            }
            beta = p[b];
        }

        if (k % 2 == 0)
            sum += -1 * (alpha + beta);
        else
            sum += alpha + beta;
    }

    if (sum > 0)
        return sum % m;
    else
        return (m + (sum % m)) % m;
}
Sometimes I wish comments on SO could be as richly formatted as the answers, but alas that's not the case. Therefore, here comes a long comment disguised as an answer.
It appears that a very common mistake in writing recursive OpenMP code is not understanding how exactly parallel regions work. Consider the following code (it uses explicit tasks, therefore support for OpenMP 3.0 or newer is required):
void par_rec_func (int arg)
{
    if (arg <= 0) return;

    #pragma omp parallel num_threads(2)
    {
        #pragma omp task
        par_rec_func(arg-1);

        #pragma omp task
        par_rec_func(arg-1);
    }
}

// somewhere in the main function
par_rec_func(10);
There is a problem with this code. The problem is that, except for the top-level invocation of par_rec_func(), in all other invocations the parallel region will be created in the context of an enclosing outer parallel region. This is called nested parallelism and by default is disabled, which means that all parallel regions beneath the top-level one are going to be inactive, i.e. they will execute serially. Since tasks bind to the innermost parallel region, they will also get executed in serial.

What will happen with this code is that it will spawn one additional thread (for a total of two) at the top-level invocation of par_rec_func() and each thread will then execute a whole branch of the recursion tree (i.e. one half of the whole tree). If one runs that code on a machine with 64 cores, 62 of them will idle. In order for nested parallelism to be enabled, one has to either set the environment variable OMP_NESTED to true or call omp_set_nested() and pass it a true flag:
omp_set_nested(1);
Once nested parallelism has been enabled, one faces a new problem. Every time a nested parallel region is encountered, the encountering thread will either spawn an additional thread (because of num_threads(2)) or acquire an idle thread from the runtime's thread pool. At every deeper level of recursion, this program will require twice as many threads as at the previous level. An upper limit on the total number of threads could be set via OMP_THREAD_LIMIT (another OpenMP 3.0 feature), but, overhead aside, this is not what one really wants in such cases.
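To see the explosion concretely, here is a little probe of my own (not from the original answer; it assumes an OpenMP 3.0 runtime for omp_get_level()):

#include <stdio.h>
#include <omp.h>

void probe(int depth) {
    if (depth == 0) return;
    #pragma omp parallel num_threads(2)
    {
        #pragma omp single
        printf("level %d: team of %d thread(s)\n",
               omp_get_level(), omp_get_num_threads());
        probe(depth - 1);
    }
}

int main() {
    omp_set_nested(1); // without this, every inner team has just 1 thread
    probe(3);          // the number of threads doubles at each level: 2, 4, 8
}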
The correct solution in that case is to use orphaned tasks in the dynamic scope of a single parallel region:
void par_rec_func (int arg)
{
    if (arg <= 0) return;

    #pragma omp task
    par_rec_func(arg-1);

    #pragma omp task
    par_rec_func(arg-1);

    // Wait for the child tasks to complete if necessary
    #pragma omp taskwait
}

// somewhere in the main function
#pragma omp parallel
{
    #pragma omp single
    par_rec_func(10);
}
The advantages of this method are many. First of all, only a single parallel region is created with as many threads as specified (e.g. by setting OMP_NUM_THREADS or by any other means). When the child tasks call recursively into par_rec_func(), that simply adds new tasks to the parallel region without spawning new threads. This greatly helps in the case where the recursion tree is not balanced, since many quality OpenMP runtimes implement task stealing, e.g. thread i could execute child tasks of a task that executes in thread j, where i != j.
Given an OpenMP 2.0 compiler like VC++, one cannot do much except to approximate the above idea by using nested parallelism and explicitly disabling it at a certain level:
void par_rec_func (int arg)
{
    if (arg <= 0) return;

    int level = omp_get_level();

    #pragma omp parallel sections num_threads(2) if(level < 4)
    {
        #pragma omp section
        par_rec_func(arg-1);

        #pragma omp section
        par_rec_func(arg-1);
    }
}

// somewhere in the main function
int saved_nested = omp_get_nested();
omp_set_nested(1);
par_rec_func(10);
omp_set_nested(saved_nested);
omp_get_level() is used to determine the level of nesting, and the if clause is used to selectively deactivate parallel regions at the fourth or deeper levels of nesting. This solution is dumb and won't work well when the recursion tree is unbalanced.
Actual Problem:
You are using Visual Studio 2013.
Visual Studio has never supported OMP versions beyond 2.0 (see here).
OMP Tasks are a feature of OMP 3.0 (see spec).
Ergo, using VS at all means no OMP tasks for you.
If OMP tasks are an essential requirement, use a different compiler. If OMP is not an essential requirement, you should consider an alternative parallel task handling library. Visual Studio includes the MS Concurrency Runtime, and the Parallel Patterns Library built on top of it. I have recently moved from OMP to PPL because I'm using VS for work; it isn't quite a drop-in replacement, but it is quite capable.
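For a flavor of the equivalent task pattern in PPL, here is a hedged sketch of mine (parallel_invoke runs its callables potentially in parallel and waits for all of them, roughly playing the role of two omp tasks plus a taskwait):

#include <ppl.h>
#include <iostream>

int main() {
    int x = 0, y = 0;
    // the two lambdas correspond to the two recursive omp tasks
    concurrency::parallel_invoke(
        [&x] { x = 21; /* e.g. p[a] = parallelOMP(a); */ },
        [&y] { y = 21; /* e.g. p[b] = parallelOMP(b); */ }
    );
    std::cout << x + y << "\n"; // prints 42 after both tasks complete
}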
My second attempt at solving this, again preserved for historical reasons:
So, the problem is almost certainly that you're defining your omp tasks outside of an omp parallel region.
Here's a contrived example:
void work()
{
    #pragma omp parallel
    {
        #pragma omp single nowait
        for (int i = 0; i < 5; i++)
        {
            #pragma omp task untied
            {
                std::cout <<
                    "starting task " << i <<
                    " on thread " << omp_get_thread_num() << "\n";
                sleep(1);
            }
        }
    }
}
If you omit the parallel declaration, the job runs serially:
starting task 0 on thread 0
starting task 1 on thread 0
starting task 2 on thread 0
starting task 3 on thread 0
starting task 4 on thread 0
But if you leave it in:
starting task starting task 3 on thread 1
starting task 0 on thread 3
2 on thread 0
starting task 1 on thread 2
starting task 4 on thread 2
Success, complete with authentic misuse of shared output resources.
(for reference, if you omit the single declaration, each thread will run the loop, resulting in 20 tasks being run on my 4 cpu VM).
Original answer included below for completeness, but no longer relevant!
In every case, your omp task is a single, simple thing. It probably runs and completes immediately:
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
Because you never start one long-running task before firing off the next task, everything will probably run on the first allocated thread.
Perhaps you meant to do something like this?
if (a > 0 && b > 0 && p[a] == -1 && p[b] == -1)
{
    #pragma omp task shared(p), untied
    {
        cout << omp_get_thread_num();
        p[a] = parallelOMP(a);
    }
    #pragma omp task shared(p), untied
    {
        cout << omp_get_thread_num();
        p[b] = parallelOMP(b);
    }
    #pragma omp taskwait

    alpha = p[a];
    beta = p[b];
}

the OpenMP "master" pragma must not be enclosed by the "parallel for" pragma

Why won't the Intel compiler let me specify that some actions in an OpenMP parallel for block should be executed by the master thread only?
And how can I do what I'm trying to achieve without this kind of functionality?
What I'm trying to do is update a progress bar through a callback in a parallel for:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
    // update item count
    #pragma omp atomic
    num_items_computed++;

    // update progress bar with number of items computed
    // (master thread only, due to COM marshalling)
    #pragma omp master
    set_progressor_callback(num_items_computed);

    // actual computation goes here
    ...blah...
}
I want only the master thread to call the callback, because if I don't enforce that (say by using omp critical instead to ensure only one thread uses the callback at once) I get the following runtime exception:
The application called an interface that was marshalled for a different thread.
...hence the desire to keep all callbacks in the master thread.
Thanks in advance.
#include <omp.h>

void f(){}

int main()
{
    #pragma omp parallel for schedule (guided)
    for (int i = 0; i < 100; ++i)
    {
        #pragma omp master
        f();
    }
    return 0;
}
Compiler Error C3034
OpenMP 'master' directive cannot be directly nested within 'parallel for' directive
Visual Studio 2010 OpenMP 2.0
Maybe like this:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
    // update item count
    #pragma omp atomic
    num_items_computed++;

    // update progress bar with number of items computed
    // (master thread only, due to COM marshalling)
    //#pragma omp master   // this is an error here
    //#pragma omp critical // this is correct
    if (omp_get_thread_num() == 0) // this may be better
        set_progressor_callback(num_items_computed);

    // actual computation goes here
    ...blah...
}
The reason you get the error is that the master thread isn't there most of the time when the code reaches the #pragma omp master line.
For example, let's take the code from Artyom:
#include <omp.h>

void f(){}

int main()
{
    #pragma omp parallel for schedule (guided)
    for (int i = 0; i < 100; ++i)
    {
        #pragma omp master
        f();
    }
    return 0;
}
If the code compiled, the following could happen:
Let's say thread 0 starts (the master thread). It reaches the pragma that practically says "Master, do the following piece of code". Being the master, it can run the function.
However, what happens when thread 1 or 2 or 3, etc., reaches that piece of code?
The master directive is telling the present/listening team that the master thread has to execute f(). But each such team is a single thread, and there is no master present. The program wouldn't know what to do past that point.
And that's why, I think, the master directive isn't allowed inside the for-loop.
Substituting the master directive with if (omp_get_thread_num() == 0) works because now the program says, "If you are the master, do this. Otherwise, ignore it."
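Putting it together, a minimal compilable sketch of that workaround (f() stands in for the real callback):

#include <stdio.h>
#include <omp.h>

void f() { printf("progress update by thread %d\n", omp_get_thread_num()); }

int main() {
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < 100; ++i)
    {
        // legal inside a worksharing loop, unlike #pragma omp master
        if (omp_get_thread_num() == 0)
            f();
        // ...actual computation for iteration i...
    }
    return 0;
}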