set_num_threads inside parallel not working

I'm struggling to set the number of threads to 1 inside a parallel region. I added a barrier so that all threads stop at that point and I can safely set the number of threads to 1 (no threads will be executing past it). But wherever I place omp_set_num_threads(1), omp_get_num_threads() still returns 3. Is it possible to change the number of threads at runtime? How can I do that?
#include <iostream>
#include <omp.h>
#include <stdio.h>
int main(){
int num_of_threads;
std::cin>>num_of_threads;
omp_set_dynamic(0);
#pragma omp parallel if(num_of_threads>1) num_threads(3)
{
int t_id = omp_get_thread_num();
int t_total = omp_get_num_threads();
printf("Current thread id: %d \n Total number_of_threads: %d \n",t_id,t_total);
#pragma omp barrier
#pragma omp single
{
omp_set_num_threads(1);
t_id = omp_get_thread_num();
t_total = omp_get_num_threads();
printf("Single section \n Current thread id: %d \n Total number_of_threads: %d \n",t_id,t_total);
}
}
}

TL;DR You can't change the number of threads inside a parallel region.
Remember that this is a team of threads that gets forked at the beginning of the parallel region. Inside the region the threads are not even synchronized (unless you tell them to be), so OpenMP would have to terminate some of them at an unknown position, which is obviously a bad idea.
Your #pragma omp single already makes the following block execute on a single thread, so there is no need to call omp_set_num_threads for that.
BUT it doesn't change the team: it only tells the runtime to have one thread execute that block while the others skip it. omp_get_num_threads() therefore still reports the full team size.
To demonstrate this behavior, e.g. for university purposes, I would suggest printing only the thread id in both the parallel and the single part. That way you can tell whether it is working.
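A minimal sketch of the point above, assuming a compiler with OpenMP enabled (for example g++ -fopenmp). The names TeamSizes and team_sizes are made up for illustration. It records the team size before and after calling omp_set_num_threads(1) inside a running region, then sets the thread count from serial code and opens a second region. The #else branch is only a serial fallback so the sketch still builds when OpenMP is switched off.

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (assumption: only used when OpenMP is disabled).
inline void omp_set_dynamic(int) {}
inline void omp_set_num_threads(int) {}
inline int omp_get_num_threads() { return 1; }
#endif

// Hypothetical helper type: team sizes observed at three points.
struct TeamSizes {
    int before;       // inside region 1, before omp_set_num_threads(1)
    int after;        // inside region 1, after omp_set_num_threads(1)
    int next_region;  // inside a new region opened after the serial call
};

TeamSizes team_sizes() {
    TeamSizes s{0, 0, 0};
    omp_set_dynamic(0);

    #pragma omp parallel num_threads(3)
    {
        #pragma omp single
        {
            s.before = omp_get_num_threads();
            omp_set_num_threads(1);          // does not resize the running team
            s.after = omp_get_num_threads();
        }
    }

    omp_set_num_threads(1);                  // serial code: applies to later regions
    #pragma omp parallel
    {
        #pragma omp single
        s.next_region = omp_get_num_threads();
    }
    return s;
}
```

Printing the three fields of team_sizes() should show the same value twice (3 if the runtime grants the request) followed by 1. In other words, to run part of the work on a different number of threads, end the region and start a new one.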

Related

How is OpenMP communicating between threads with what should be a private variable?

I'm writing some code in C++ using OpenMP to parallelize some chunks. I run into some strange behavior that I can't quite explain. I've rewritten my code such that it replicates the issue minimally.
First, here is a function I wrote that is to be run in a parallel region.
void foo()
{
#pragma omp for
for (int i = 0; i < 3; i++)
{
#pragma omp critical
printf("Hello %d from thread %d.\n", i, omp_get_thread_num());
}
}
Then here is my whole program.
int main()
{
omp_set_num_threads(4);
#pragma omp parallel
{
for (int i = 0; i < 2; i++)
{
foo();
#pragma omp critical
printf("%d\n", i);
}
}
return 0;
}
When I compile and run this code (with g++ -std=c++17), I get the following output on the terminal:
Hello 0 from thread 0.
Hello 1 from thread 1.
Hello 2 from thread 2.
0
0
Hello 2 from thread 2.
Hello 1 from thread 1.
0
Hello 0 from thread 0.
0
1
1
1
1
i is a private variable. I would expect the function foo to be run twice per thread, so I would expect to see eight "Hello %d from thread %d.\n" statements in the terminal, just as I see eight numbers printed when printing i. So what gives here? Why does OpenMP behave so differently within the same loop?

It is because #pragma omp for is a worksharing construct: it distributes the loop iterations among the threads of the team, so the number of threads does not matter here, only the total number of iterations (2*3=6).
If you use omp_set_num_threads(1); you also see 6 outputs. If you use more threads than loop iterations, some threads will be idle in the inner loop, but you still see exactly 6 outputs.
On the other hand, if you remove the #pragma omp for line you will see (number of threads)*2*3 (=24) outputs.

From the documentation of omp parallel:
Each thread in the team executes all statements within a parallel region except for work-sharing constructs.
Emphasis mine. Since the omp for in foo is a work-sharing construct, its iterations are divided among the threads, so the loop as a whole is executed only once per outer iteration, no matter how many threads run the parallel block in main.
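The counting argument can be checked with a small sketch. The names LoopCounts and count_loop_bodies are invented for this illustration. It runs the same 2 x 3 loop nest twice inside one parallel region, once with #pragma omp for on the inner loop and once without, and counts how often each loop body runs.

```cpp
// Hypothetical helper type: counts collected from one parallel region.
struct LoopCounts {
    int threads;     // threads in the team
    int shared;      // inner-loop bodies run with "#pragma omp for"
    int replicated;  // inner-loop bodies run without it
};

LoopCounts count_loop_bodies() {
    int threads = 0, shared = 0, replicated = 0;
    #pragma omp parallel num_threads(4)
    {
        #pragma omp atomic
        threads++;                          // executed once per thread

        for (int i = 0; i < 2; i++) {       // every thread runs the outer loop
            #pragma omp for
            for (int j = 0; j < 3; j++) {   // 3 iterations split across the team
                #pragma omp atomic
                shared++;
            }
            for (int j = 0; j < 3; j++) {   // no worksharing: each thread runs all 3
                #pragma omp atomic
                replicated++;
            }
        }
    }
    return LoopCounts{threads, shared, replicated};
}
```

Whatever team size the runtime provides, shared ends up at 6 while replicated ends up at 6 times the number of threads.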

reduction with string type in OpenMP

I am using OpenMP to parallelize a for loop like so:
std::string stringType = "somevalue";
#pragma omp parallel for reduction(+ : stringType)
// a for loop here in which every iteration appends a string to stringType
The only way I can think of to do this is to convert to an int representation in some way first and then convert back at the end, but this has obvious overhead. Is there a better way to perform this kind of operation?

As mentioned in the comments, reduction assumes that the operation is associative and commutative: the values may be computed in any order and "accumulated" through any kind of partial results, and the final result will be the same. String concatenation is not commutative, so it does not fit that model.
There is no guarantee that an OpenMP for loop will distribute contiguous iterations to each thread unless the loop schedule explicitly requests that. There is no guarantee either that contiguous blocks will be distributed by increasing thread number (i.e. thread #0 might go through iterations 1000-1999 while thread #1 goes through 0-999). If you need that behavior, then you should define your own schedule.
Something like:
int N=1000;
std::string globalString("initial value");
#pragma omp parallel shared(N,globalString)
{
std::string localString; //Empty string
// Set schedule
int iterTo, iterFrom;
iterFrom = omp_get_thread_num() * (N / omp_get_num_threads());
if (omp_get_num_threads() == omp_get_thread_num()+1)
iterTo = N;
else
iterTo = (1+omp_get_thread_num()) * (N / omp_get_num_threads());
// Loop - concatenate a number of neighboring values in the right order
// No #pragma omp for: each thread goes through the loop, but loop
// boundaries change according to the thread ID
for (int ii=iterFrom; ii<iterTo ; ii++){
localString += get_some_string(ii);
}
// Dirty trick to concatenate strings from all threads in the good order
for (int ii=0;ii<omp_get_num_threads();ii++){
#pragma omp barrier
if (ii==omp_get_thread_num())
globalString += localString;
}
}
A better way would be to have a shared array of std::string, each thread using one as a local accumulator. At the end, a single thread can run the concatenation part (and avoid the dirty trick and all its overhead-heavy barrier calls).
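A minimal sketch of that shared-array variant, under a few assumptions: get_some_string is a stand-in defined here (the original does not show it), concat_in_order is an invented name, and the #else branch is only a serial fallback for builds without OpenMP. An explicit chunk size is passed to schedule(static, chunk) because chunks are then handed out round-robin in thread-number order, which makes thread k own the k-th contiguous block.

```cpp
#include <string>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (assumption: only used when OpenMP is disabled).
inline int omp_get_thread_num() { return 0; }
inline int omp_get_num_threads() { return 1; }
#endif

// Hypothetical stand-in for the per-iteration string.
std::string get_some_string(int i) { return std::to_string(i) + ","; }

std::string concat_in_order(int n) {
    std::string result;
    std::vector<std::string> partial;  // one accumulator per thread (shared)
    int chunk = 1;                     // block size per thread (shared)

    #pragma omp parallel
    {
        #pragma omp single
        {
            const int nthreads = omp_get_num_threads();
            partial.resize(nthreads);
            chunk = (n + nthreads - 1) / nthreads;  // ceil(n / nthreads)
            if (chunk < 1) chunk = 1;               // chunk size must be positive
        }
        // implicit barrier after single: partial and chunk are ready

        std::string& local = partial[omp_get_thread_num()];
        #pragma omp for schedule(static, chunk)
        for (int i = 0; i < n; i++)
            local += get_some_string(i);
        // implicit barrier after the loop: all accumulators are complete

        #pragma omp single
        for (const std::string& s : partial)
            result += s;               // one thread joins the pieces in thread order
    }
    return result;
}
```

With this stand-in, concat_in_order(5) yields "0,1,2,3,4," regardless of the number of threads, and the only synchronization is the implicit barriers of the single and for constructs.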

openmp distribute threads to certain code blocks

In my program I need to divide n threads in the following way:
1) Thread 1 does its specific work
2) the n-1 other threads do their own work
If number_of_threads == 1 then only action 1) is done.
Actions 1) and 2) are computed in parallel.
int main(){
int number_of_threads;
std::cin>>number_of_threads;
omp_set_num_threads(number_of_threads);
#pragma omp parallel if (number_of_threads>1)
{
#pragma omp master // no barrier at the end of master block
single_calc();
#pragma omp ??(number_of_threads-1) //second block
// this section of code is computed by n-1 processes
}
}
I came up with the following solutions:
1) hardcode a check so that the master thread (id == 0) doesn't compute the second block
2) since the master thread has id = 0, I can use #pragma omp for starting with i=1 in the second block
3) call single_calc() outside #pragma omp parallel (but I want to control the number of threads calculating this block)
Is there an elegant solution to this?
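For reference, the first idea (branching on the thread id) could be sketched roughly as follows. This is only an illustration under stated assumptions, not an established answer to the question: SplitResult and split_work are invented names, a counter stands in for single_calc(), and the second block is assumed to be a list of independent items that the n-1 workers divide by hand. The #else branch is a serial fallback for builds without OpenMP.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (assumption: only used when OpenMP is disabled).
inline int omp_get_thread_num() { return 0; }
inline int omp_get_num_threads() { return 1; }
#endif

// Hypothetical result type for the illustration.
struct SplitResult {
    int master_calls;        // how often the "single_calc" stand-in ran
    std::vector<int> owner;  // owner[i] = id of the thread that handled item i, -1 if none
};

SplitResult split_work(int number_of_threads, int n_items) {
    SplitResult r{0, std::vector<int>(n_items, -1)};

    #pragma omp parallel num_threads(number_of_threads) if (number_of_threads > 1)
    {
        const int tid = omp_get_thread_num();
        const int workers = omp_get_num_threads() - 1;

        if (tid == 0) {
            r.master_calls++;  // stands in for single_calc(); only thread 0 writes it
        } else {
            // worker number (tid - 1) takes items tid-1, tid-1+workers, ...
            for (int i = tid - 1; i < n_items; i += workers)
                r.owner[i] = tid;  // each item is written by exactly one thread
        }
    }
    return r;
}
```

With a single thread only the master branch runs, which matches the requirement that only action 1) is done in that case. The price is that the worker loop is divided manually instead of by #pragma omp for.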

Labeling data for Bag Of Words

I've been looking at this tutorial and the labeling part confuses me. Not the act of labeling itself, but the way the process is shown in the tutorial.
More specifically the #pragma omp sections:
#pragma omp parallel for schedule(dynamic,3)
for(..loop a directory?..) {
...
#pragma omp critical
{
if(classes_training_data.count(class_) == 0) { //not yet created...
classes_training_data[class_].create(0,response_hist.cols,response_hist.type());
classes_names.push_back(class_);
}
classes_training_data[class_].push_back(response_hist);
}
total_samples++;
}
As well as the following code below it.
Could anyone explain what is going on here?

The pragmas are from OpenMP, a specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in Fortran and C/C++ programs.
The #pragma omp parallel for schedule(dynamic,3) is a shorthand that combines several other pragmas. Let's look at them:
#pragma omp parallel starts a parallel block with a team of threads that will execute the next statement in parallel.
You can also specify "parallel loops", like a for loop: #pragma omp parallel for. This pragma will split the for-loop between all the threads inside the parallel block, and each thread will execute its portion of the loop.
For example:
#pragma omp parallel
{
#pragma omp for
for(int n(0); n < 5; ++n) {
std::cout << "Hello\n";
}
}
This will create a parallel block that will execute a for-loop. Between them, the threads will print Hello to the standard output five times, in no specified order (I mean, thread #3 can print its "Hello" before thread #1, and so on).
Now, you can also schedule which chunk of work each thread will receive. There are several policies, among them static (usually the default), dynamic and guided. Check this awesome answer in regards to scheduling policies.
Now, all of these pragmas can be shortened to one:
#pragma omp parallel for schedule(dynamic,3)
which will create a parallel block that will execute a for-loop, with dynamic scheduling and each thread in the block will execute 3 iterations of the loop before asking the scheduler for more chunks.
The critical pragma will restrict the execution of the next block to a single thread at a time. In your example, only one thread at a time will execute this:
{
if(classes_training_data.count(class_) == 0) { //not yet created...
classes_training_data[class_].create(0,response_hist.cols,response_hist.type());
classes_names.push_back(class_);
}
classes_training_data[class_].push_back(response_hist);
}
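A self-contained sketch of the same pattern, with the OpenCV parts swapped for plain integers so it can be compiled on its own. Everything in it is a stand-in chosen for illustration: group_by_class is an invented name, the "files" are the numbers 0..n-1, and the class label is just "even" or "odd".

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the tutorial loop: groups values by a class label.
std::map<std::string, std::vector<int>> group_by_class(int n) {
    std::map<std::string, std::vector<int>> classes;

    // Threads grab 3 iterations at a time and come back for more when done.
    #pragma omp parallel for schedule(dynamic, 3)
    for (int i = 0; i < n; i++) {
        // Per-item work on local variables can run concurrently.
        const std::string class_ = (i % 2 == 0) ? "even" : "odd";
        const int value = i * i;

        // The shared map is not thread-safe, so updates go one thread at a time.
        #pragma omp critical
        {
            classes[class_].push_back(value);  // creates the entry on first use
        }
    }
    return classes;
}
```

Each class ends up with all of its values, although their order inside the vector can differ from run to run. A counter such as total_samples would need the same kind of protection (critical, atomic or a reduction) if it is shared between the threads.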
Here you have an introduction to OpenMP 3.0.
Finally, the variables you mention are specified in the tutorial, just look before your posted code:
vector<KeyPoint> keypoints;
Mat response_hist;
Mat img;
string filepath;
map<string,Mat> classes_training_data;
Ptr<FeatureDetector > detector(new SurfFeatureDetector());
Ptr<DescriptorMatcher > matcher(new BruteForceMatcher<L2<float> >());
Ptr<DescriptorExtractor > extractor(new OpponentColorDescriptorExtractor(Ptr<DescriptorExtractor>(new SurfDescriptorExtractor())));
Ptr<BOWImgDescriptorExtractor> bowide(new BOWImgDescriptorExtractor(extractor,matcher));
bowide->setVocabulary(vocabulary);

the OpenMP "master" pragma must not be enclosed by the "parallel for" pragma

Why won't the intel compiler let me specify that some actions in an openmp parallel for block should be executed by the master thread only?
And how can I do what I'm trying to achieve without this kind of functionality?
What I'm trying to do is update a progress bar through a callback in a parallel for:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
//update item count
#pragma omp atomic
num_items_computed++;
//update progress bar with number of items computed
//master thread only due to com marshalling
#pragma omp master
set_progressor_callback(num_items_computed);
//actual computation goes here
...blah...
}
I want only the master thread to call the callback, because if I don't enforce that (say by using omp critical instead to ensure only one thread uses the callback at once) I get the following runtime exception:
The application called an interface that was marshalled for a different thread.
...hence the desire to keep all callbacks in the master thread.
Thanks in advance.

#include <omp.h>
void f(){}
int main()
{
#pragma omp parallel for schedule (guided)
for (int i = 0; i < 100; ++i)
{
#pragma omp master
f();
}
return 0;
}
Compiler Error C3034
OpenMP 'master' directive cannot be directly nested within 'parallel for' directive
Visual Studio 2010 OpenMP 2.0
Maybe something like this:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
//update item count
#pragma omp atomic
num_items_computed++;
//update progress bar with number of items computed
//master thread only due to com marshalling
// #pragma omp master: not allowed here (error C3034)
// #pragma omp critical: compiles, but any thread may end up making the call
if (omp_get_thread_num() == 0) // only thread 0 makes the call
set_progressor_callback(num_items_computed);
//actual computation goes here
...blah...
}

The reason you get the error is that the master thread isn't there most of the time when the code reaches the #pragma omp master line.
For example, let's take the code from Artyom:
#include <omp.h>
void f(){}
int main()
{
#pragma omp parallel for schedule (guided)
for (int i = 0; i < 100; ++i)
{
#pragma omp master
f();
}
return 0;
}
If the code compiled, the following could happen:
Let's say thread 0 (the master thread) starts. It reaches the pragma that practically says "Master, do the following piece of code". Being the master, it can run the function.
However, what happens when thread 1 or 2 or 3, etc., reaches that piece of code?
The master directive is telling the present/listening team that the master thread has to execute f(). But the team is a single thread and there is no master present. The program wouldn't know what to do past that point.
And that's why, I think, master isn't allowed inside the for-loop. Formally, the OpenMP specification does not allow a master region to be closely nested inside a worksharing region, and the loop of a parallel for is one.
Replacing the master directive with if (omp_get_thread_num() == 0) works because now the program says, "If you are the master, do this. Otherwise ignore it".
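The thread-id check can be combined with an atomic counter roughly like this. It is a sketch under assumptions, not the original application: ProgressLog and run_with_progress are invented names, a vector stands in for set_progressor_callback, and #pragma omp atomic capture needs OpenMP 3.1 or later (so it is not available with the OpenMP 2.0 support mentioned above). The #else branch is a serial fallback for builds without OpenMP.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallback (assumption: only used when OpenMP is disabled).
inline int omp_get_thread_num() { return 0; }
#endif

// Hypothetical record of what the progress callback would have received.
struct ProgressLog {
    long items_done;            // final value of the shared counter
    std::vector<long> reports;  // counts reported by thread 0, in order
};

ProgressLog run_with_progress(int n_items) {
    ProgressLog log{0, {}};
    long num_items_computed = 0;

    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n_items; i++) {
        long done;
        // Increment the shared counter and read the new value in one atomic step.
        #pragma omp atomic capture
        done = ++num_items_computed;

        if (omp_get_thread_num() == 0) {
            // Stands in for set_progressor_callback(done); only thread 0 gets here,
            // so the vector needs no extra locking.
            log.reports.push_back(done);
        }
        // actual computation would go here
    }
    log.items_done = num_items_computed;
    return log;
}
```

Thread 0 only reports when it happens to run an iteration, so the progress updates can be irregular, and they stop early if thread 0 finishes its share before the other threads do.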
