OpenMP set_num_threads() is not working - C++

I am writing a parallel program using OpenMP in C++.
I want to control the number of threads in the program using omp_set_num_threads(), but it does not work.
#include <iostream>
#include <omp.h>
#include "mpi.h"
using namespace std;

int myrank;
int groupsize;
double sum;
double t1, t2;
int n = 10000000;

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &groupsize);

    omp_set_num_threads(4);
    sum = 0;
    #pragma omp for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += i / (n / 10);

    cout << "sum=" << sum << endl;
    cout << "threads=" << omp_get_num_threads() << endl;

    MPI_Finalize();
    return 0;
}
The program outputs:
sum = 4.5e+007
threads=1
How to control the number of threads?

Besides the fact that you are calling omp_get_num_threads() outside of a parallel region, calling omp_set_num_threads() still doesn't guarantee that the OpenMP runtime will use exactly the specified number of threads. omp_set_num_threads() overrides the value of the environment variable OMP_NUM_THREADS, and both control the upper limit on the size of the thread team that OpenMP spawns for all parallel regions (in the case of OMP_NUM_THREADS) or for any subsequent parallel region (after a call to omp_set_num_threads()). There is also a feature called dynamic teams that may still pick a smaller number of threads if the runtime system deems it more appropriate. You can disable dynamic teams by calling omp_set_dynamic(0) or by setting the environment variable OMP_DYNAMIC to false.
To enforce a given number of threads, disable dynamic teams and specify the desired number of threads either with omp_set_num_threads():
omp_set_dynamic(0); // Explicitly disable dynamic teams
omp_set_num_threads(4); // Use 4 threads for all consecutive parallel regions
#pragma omp parallel ...
{
... 4 threads used here ...
}
or with the num_threads OpenMP clause:
omp_set_dynamic(0); // Explicitly disable dynamic teams
// Spawn 4 threads for this parallel region only
#pragma omp parallel ... num_threads(4)
{
... 4 threads used here ...
}
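Putting the pieces together, one way to restructure the program from the question might look like the following sketch (it keeps the MPI calls from the original; the key points are that the worksharing for construct must sit inside a parallel region, dynamic teams are disabled, and the thread count is queried from inside the region):
#include <iostream>
#include <omp.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    const int n = 10000000;
    double sum = 0;
    int threads = 0;

    omp_set_dynamic(0);     // disable dynamic teams
    omp_set_num_threads(4); // request exactly 4 threads

    #pragma omp parallel reduction(+:sum)
    {
        #pragma omp single
        threads = omp_get_num_threads(); // queried inside the parallel region

        #pragma omp for
        for (int i = 0; i < n; i++)
            sum += i / (n / 10);
    }

    std::cout << "sum=" << sum << std::endl;
    std::cout << "threads=" << threads << std::endl;

    MPI_Finalize();
    return 0;
}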

The omp_get_num_threads() function returns the number of threads that are currently in the team executing the parallel region from which it is called. You are calling it outside of the parallel region, which is why it returns 1.

According to the GCC manual for omp_get_num_threads:
In a sequential section of the program omp_get_num_threads returns 1
So this:
cout<<"sum="<<sum<<endl;
cout<<"threads="<<omp_get_num_threads()<<endl;
Should be changed to something like:
#pragma omp parallel
{
cout<<"sum="<<sum<<endl;
cout<<"threads="<<omp_get_num_threads()<<endl;
}
The code I use follows Hristo's advice of disabling dynamic teams, too.
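A related point: if you only need the team size from serial code, omp_get_max_threads() reports how many threads the next parallel region would use (assuming dynamic teams are disabled); a small sketch:
#include <iostream>
#include <omp.h>

int main()
{
    omp_set_dynamic(0);
    omp_set_num_threads(4);

    // In serial code omp_get_num_threads() returns 1, but
    // omp_get_max_threads() reports how many threads the next
    // parallel region would get (here: 4).
    std::cout << "max threads = " << omp_get_max_threads() << std::endl;

    #pragma omp parallel
    {
        #pragma omp single
        std::cout << "threads in region = " << omp_get_num_threads() << std::endl;
    }
    return 0;
}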

I was facing the same problem. The solution is given below:
Right-click on the project > Properties > Configuration Properties > C/C++ > Language > change the OpenMP Support flag to Yes.
You will then get the desired result.

Related

warning #2901: [omp] OpenMP is not active; all OpenMP directives will be ignored

I'm currently trying to use OpenMP for parallel computing.
I've written the following basic code.
However it returns the following warning:
warning #2901: [omp] OpenMP is not active; all OpenMP directives will be ignored.
Changing the number of threads does not change the required running time since omp.h is ignored for some reason which is unclear to me.
Can someone help me out?
#include <stdio.h>
#include <time.h>
#include <omp.h>
#include <math.h>

int main(void)
{
    double ts;
    double something;
    clock_t begin = clock();

    #pragma omp parallel num_threads(4)
    #pragma omp parallel for
    for (int i = 0; i < pow(10,7); i++)
    {
        something = sqrt(123456);
    }

    clock_t end = clock();
    ts = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Time elapsed is %f seconds", ts);
}
In order to get OpenMP support you need to tell your compiler explicitly:
g++, gcc and clang need the option -fopenmp
MSVC needs the option /openmp (more info here if you use Visual Studio)
Aside from the obvious need to compile with the -fopenmp flag, your code has some problems worth pointing out, namely:
To measure time, use omp_get_wtime() instead of clock(): clock() returns processor time accumulated across all threads, whereas omp_get_wtime() returns the elapsed wall-clock time.
The other problem is:
#pragma omp parallel num_threads(4)
#pragma omp parallel for
for (int i = 0; i < pow(10,7); i++)
{
    something = sqrt(123456);
}
The iterations of the loop are not being assigned to threads as you wanted. Because you have added the parallel clause again to the #pragma omp for, and assuming that nested parallelism is disabled (which it is by default), each of the threads created in the outer parallel region will execute the code within that region "sequentially". Consequently, taking n = 6 for illustration (instead of pow(10,7)) and 4 threads, you would have the following block of code:
for (int i = 0; i < n; i++) {
    something = sqrt(123456);
}
being executed 6 x 4 = 24 times (i.e., the total number of loop iterations multiplied by the total number of threads). For a more in-depth explanation, check this SO thread about a similar issue.
To fix this, adapt your code to the following:
#pragma omp parallel for num_threads(4)
for (int i = 0; i < pow(10,7); i++)
{
    something = sqrt(123456);
}
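To illustrate the timing point as well, here is a sketch that measures the loop with omp_get_wtime() instead of clock() (the reduction and the sqrt work are only stand-ins to give the threads something measurable to do):
#include <stdio.h>
#include <math.h>
#include <omp.h>

int main(void)
{
    double sum = 0.0;

    double t_start = omp_get_wtime();   // wall-clock time, not accumulated CPU time

    #pragma omp parallel for num_threads(4) reduction(+:sum)
    for (int i = 0; i < 10000000; i++)
    {
        sum += sqrt((double)i);         // dummy work that cannot be optimized away
    }

    double t_end = omp_get_wtime();

    printf("Time elapsed is %f seconds (sum = %f)\n", t_end - t_start, sum);
    return 0;
}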

Labeling data for Bag Of Words

I've been looking at this tutorial and the labeling part confuses me. Not the act of labeling itself, but the way the process is shown in the tutorial.
More specifically the #pragma omp sections:
#pragma omp parallel for schedule(dynamic,3)
for(..loop a directory?..) {
    ...
    #pragma omp critical
    {
        if(classes_training_data.count(class_) == 0) { //not yet created...
            classes_training_data[class_].create(0,response_hist.cols,response_hist.type());
            classes_names.push_back(class_);
        }
        classes_training_data[class_].push_back(response_hist);
    }
    total_samples++;
}
As well as the following code below it.
Could anyone explain what is going on here?
The pragmas are from OpenMP, a specification for a set of compiler directives, library routines, and environment variables that can be used to specify high-level parallelism in Fortran and C/C++ programs.
The #pragma omp parallel for schedule(dynamic,3) is a shorthand that combines several other pragmas. Let's see them:
#pragma omp parallel starts a parallel block with a team of threads that will execute the next statement in parallel.
You can also specify "parallel loops", like a for loop: #pragma omp parallel for. This pragma will split the iterations of the for loop between all the threads in the parallel block, and each thread will execute its portion of the loop.
For example:
#pragma omp parallel
{
    #pragma omp for
    for(int n(0); n < 5; ++n) {
        std::cout << "Hello\n";
    }
}
This will create a parallel block that will execute a for loop. The threads will print "Hello" to the standard output five times, in no specified order (thread #3 can print its "Hello" before thread #1, and so on).
Now, you can also schedule which chunk of work each thread will receive. There are several policies, for example static (the default) and dynamic. Check this awesome answer regarding scheduling policies.
Now, all of these pragmas can be combined into one:
#pragma omp parallel for schedule(dynamic,3)
which will create a parallel block that executes the for loop with dynamic scheduling; each thread in the block will execute a chunk of 3 iterations of the loop before asking the scheduler for more chunks.
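If it helps to see the chunking concretely, here is a small made-up example (the loop bound is arbitrary) that prints which thread handles which iteration under schedule(dynamic,3):
#include <cstdio>
#include <omp.h>

int main()
{
    // Each thread grabs a chunk of 3 consecutive iterations, then asks for more.
    #pragma omp parallel for schedule(dynamic, 3)
    for (int i = 0; i < 12; ++i)
    {
        std::printf("iteration %d handled by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}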
The critical pragma restricts the execution of the next block to a single thread at a time. In your example, only one thread at a time will execute this:
{
    if(classes_training_data.count(class_) == 0) { //not yet created...
        classes_training_data[class_].create(0,response_hist.cols,response_hist.type());
        classes_names.push_back(class_);
    }
    classes_training_data[class_].push_back(response_hist);
}
Here you have an introduction to OpenMP 3.0.
Finally, the variables you mention are specified in the tutorial, just look before your posted code:
vector<KeyPoint> keypoints;
Mat response_hist;
Mat img;
string filepath;
map<string,Mat> classes_training_data;
Ptr<FeatureDetector > detector(new SurfFeatureDetector());
Ptr<DescriptorMatcher > matcher(new BruteForceMatcher<L2<float> >());
Ptr<DescriptorExtractor > extractor(new OpponentColorDescriptorExtractor(Ptr<DescriptorExtractor>(new SurfDescriptorExtractor())));
Ptr<BOWImgDescriptorExtractor> bowide(new BOWImgDescriptorExtractor(extractor,matcher));
bowide->setVocabulary(vocabulary);

Strange behavior when mixing openMP with openMPI

I have some code that is parallelized using openMP (on a for loop). I wanted to now repeat the functionality several times and use MPI to submit to a cluster of machines, keeping the intra node stuff to all still be openMP.
When I only use openMP, I get the speed up I expect (using twice the number of processors/cores finishes in half the time). When I add in the MPI and submit to only one MPI process, I do not get this speed up. I created a toy problem to check this and still have the same issue. Here is the code
#include <iostream>
#include <stdio.h>
#include <unistd.h>
#include "mpi.h"
#include <omp.h>

int main(int argc, char *argv[]) {
    int iam = 0, np = 1;
    long i;
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    double t1 = MPI_Wtime();
    std::cout << "!!!Hello World!!!" << std::endl; // prints !!!Hello World!!!

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    int nThread = omp_get_num_procs(); //omp_get_num_threads here returns 1??
    printf("nThread = %d\n", nThread);

    int *total = new int[nThread];
    for (int j = 0; j < nThread; j++) {
        total[j] = 0;
    }

    #pragma omp parallel num_threads(nThread) default(shared) private(iam, i)
    {
        np = omp_get_num_threads();

        #pragma omp for schedule(dynamic, 1)
        for (i = 0; i < 10000000; i++) {
            iam = omp_get_thread_num();
            total[iam]++;
        }
        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    int grandTotal = 0;
    for (int j = 0; j < nThread; j++) {
        printf("Total=%d\n", total[j]);
        grandTotal += total[j];
    }
    printf("GrandTotal= %d\n", grandTotal);

    MPI_Finalize();
    double t2 = MPI_Wtime();
    printf("time elapsed with MPI clock=%f\n", t2 - t1);
    return 0;
}
I am compiling with openmpi-1.8/bin/mpic++, using the -fopenmp flag. Here is my PBS script
#PBS -l select=1:ncpus=12
setenv OMP_NUM_THREADS 12
/util/mpi/openmpi-1.8/bin/mpirun -np 1 -hostfile $PBS_NODEFILE --map-by node:pe=$OMP_NUM_THREADS /workspace/HelloWorldMPI/HelloWorldMPI
I have also tried with #PBS -l nodes=1:ppn=12, get the same results.
When using half the cores, the program is actually faster (twice as fast!). When I reduce the number of cores, I change both ncpus and OMP_NUM_THREADS. I have tried increasing the actual work (adding 10^10 numbers instead of the 10^7 shown here in the code). I have tried removing the printf statements, wondering if they were somehow slowing things down; still the same problem. top shows that I am using all the CPUs (as set in ncpus) at close to 100%. If I submit with -np 2, it parallelizes beautifully across two machines, so the MPI seems to be working as expected, but the OpenMP is broken.
I am out of ideas now; is there anything I can try? What am I doing wrong?
I hate to say it, but there's a lot wrong and you should probably just familiarize
yourself more with OpenMP and MPI. Nevertheless, I'll try to go through your code
and point out the errors I saw.
double t1 = MPI_Wtime();
Starting out: calling MPI_Wtime() before MPI_Init() is undefined. I'll also add that if you want to do this sort of benchmark with MPI, a good idea is to put an MPI_Barrier() before the call to Wtime so that all the tasks enter the section at the same time.
//omp_get_num_threads here returns 1??
The reason why omp_get_num_threads() returns 1 is that you are not in a
parallel region.
#pragma omp parallel num_threads(nThread)
You set num_threads to nThread here, which, as Hristo Iliev mentioned, effectively ignores any input through the OMP_NUM_THREADS environment variable. You can usually just leave num_threads out and be fine for this sort of simplified problem.
default(shared)
The behavior of variables in the parallel region is by default shared, so there's
no reason to have default(shared) here.
private(iam, i)
I guess it's your coding style, but instead of making iam and i private, you could
just declare them within the parallel region, which will automatically make them private
(and considering you don't really use them outside of it, there's not much reason not to).
#pragma omp for schedule(dynamic, 1)
Also as Hristo Iliev mentioned, using schedule(dynamic, 1) for this problem set in particular
is not the best of ideas, since each iteration of your loop takes virtually no time
and the total problem size is fixed.
int grandTotal=0;
for (int j=0;j<nThread;j++) {
printf("Total=%d\n",total[j]);
grandTotal += total[j];
}
Not necessarily an error, but your allocation of the total array and summation at the end
is better accomplished using the OpenMP reduction directive.
double t2 = MPI_Wtime();
Similar to what you did with MPI_Init(), calling MPI_Wtime() after you've
called MPI_Finalize() is undefined, and should be avoided if possible.
Note: If you are somewhat familiar with what OpenMP is, this
is a good reference and basically everything I explained here about OpenMP is in there.
With that out of the way, I have to note you didn't actually do anything with MPI,
besides output the rank and comm size. Which is to say, all the MPI tasks
do a fixed amount of work each, regardless of the number of tasks. Since there's
no decrease in work-per-task for an increasing number of MPI tasks, you wouldn't expect
to have any scaling, would you? (Note: this is actually what's called Weak Scaling, but since you have no communication via MPI, there's no reason to expect it to not
scale perfectly).
Here's your code rewritten with some of the changes I mentioned:
#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int world_size,
        world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    int name_len;
    char proc_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Get_processor_name(proc_name, &name_len);

    MPI_Barrier(MPI_COMM_WORLD);
    double t_start = MPI_Wtime();

    // we need to scale the work per task by the number of MPI tasks,
    // otherwise we actually do more work the more tasks we have
    const int n_iterations = 1e7 / world_size;

    // actually we also need some dummy data to add so the compiler doesn't just
    // optimize out the work loop with -O3 on
    int data[16];
    for (int i = 0; i < 16; ++i)
        data[i] = rand() % 16;

    // reduction(+:total) means that all threads will make a private
    // copy of total at the beginning of this construct and then
    // do a reduction operation with the + operator at the end (aka sum them
    // all together)
    unsigned int total = 0;
    #pragma omp parallel reduction(+:total)
    {
        // both of these calls will execute properly since we
        // are in an omp parallel region
        int n_threads = omp_get_num_threads(),
            thread_id = omp_get_thread_num();

        // note: this code will only execute on a single thread (per mpi task)
        #pragma omp master
        {
            printf("nThread = %d\n", n_threads);
        }

        #pragma omp for
        for (int i = 0; i < n_iterations; i++)
            total += data[i % 16];

        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
               thread_id, n_threads, world_rank, world_size, proc_name);
    }

    // do a reduction with MPI, otherwise the data we just calculated is useless
    unsigned int grand_total;
    MPI_Allreduce(&total, &grand_total, 1, MPI_UNSIGNED, MPI_SUM, MPI_COMM_WORLD);

    // another barrier to make sure we wait for the slowest task
    MPI_Barrier(MPI_COMM_WORLD);
    double t_end = MPI_Wtime();

    // output the per-task total (already reduced over the OpenMP threads)
    printf("Thread total = %u\n", total);

    // output results from a single task
    if (world_rank == 0)
    {
        printf("Grand Total = %u\n", grand_total);
        printf("Time elapsed with MPI clock = %f\n", t_end - t_start);
    }

    MPI_Finalize();
    return 0;
}
Another thing to note: my version of your code ran 22 times slower with schedule(dynamic, 1) added, just to show you how much it can impact performance when used incorrectly.
Unfortunately I'm not too familiar with PBS, as the clusters I use run SLURM, but an example sbatch file for a job running on 3 nodes, on a system with two 6-core processors per node, might look something like this:
#!/bin/bash
#SBATCH --job-name=TestOrSomething
#SBATCH --export=ALL
#SBATCH --partition=debug
#SBATCH --nodes=3
#SBATCH --ntasks-per-socket=1
# 6 OpenMP threads per MPI task
export OMP_NUM_THREADS=6
# note that this will end up running 3 * (however many cpus
# are on a single node) mpi tasks, not just 3. Additionally
# the below line might use `mpirun` instead depending on the
# cluster
srun ./a.out
For fun, I also ran my version on a cluster to test the scaling for MPI and OMP, and got the following results (scaling plot not reproduced here; both axes were on log scales):
As you can see, it's basically perfect. Actually, 1-16 is 1 MPI task with 1-16 OMP threads, and 16-256 is 1-16 MPI tasks with 16 threads per task, so you can also see that there's no change in behavior between the MPI scaling and the OMP scaling.

C++ + openmp for parallel computing: how to set up in visual studio?

I have a c++ program that creates an object and then calls 2 functions of this object that are independent from one another. So it looks like this:
Object myobject(arg1, arg2);
double answer1 = myobject.function1();
double answer2 = myobject.function2();
I would like to have those 2 computations run in parallel to save computation time. I've seen that this could be done using openmp, but couldn't figure out how to set it up. The only examples I found were sending the same calculation (i.e. "hello world!" for example) to the different cores and the output was 2 times "hello world!". How can I do it in this situation?
I use Windows XP with Visual Studio 2005.
You should look into the sections construct of OpenMP. It works like this:
#pragma omp parallel sections
{
    #pragma omp section
    {
        ... section 1 block ...
    }
    #pragma omp section
    {
        ... section 2 block ...
    }
}
Both blocks might execute in parallel given that there are at least two threads in the team but it is up to the implementation to decide how and where to execute each section.
There is a cleaner solution using OpenMP tasks, but it requires that your compiler supports OpenMP 3.0. MSVC only supports OpenMP 2.0 (even in VS 11!).
You should explicitly enable OpenMP support in your project's settings. If you are doing compilation from the command line, the option is /openmp.
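Applied to the code in the question, a sketch could look like the following (Object, function1 and function2 are the ones you already have; the sketch assumes the two methods really don't touch any shared state inside myobject):
Object myobject(arg1, arg2);

double answer1 = 0.0, answer2 = 0.0;

#pragma omp parallel sections
{
    #pragma omp section
    { answer1 = myobject.function1(); }

    #pragma omp section
    { answer2 = myobject.function2(); }
}
// both answers are available here, after the implicit barrier at the end of the sections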
If the memory required by your code is not large, you can use the MPI library too. For this purpose, first install MPI for your Visual Studio by following this tutorial: Compiling MPI Programs in Visual Studio
or from here: MS-MPI with Visual Studio 2008
Then use this MPI hello-world code:
#include <iostream>
#include <mpi.h>
using namespace std;

int main(int argc, char** argv) {
    int mynode, totalnodes;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);
    cout << "Hello world from process " << mynode;
    cout << " of " << totalnodes << endl;
    MPI_Finalize();
    return 0;
}
For your base code, add your functions to it and assign the job of each process with an if statement, as in this example:
if (mynode == 0) { function1(); }
if (mynode == 1) { function2(); }
function1 and function2 can be anything you like that executes at the same time, but be careful that these two functions are independent of each other.
That's it!
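Putting the two pieces together, a sketch of the whole program might look like this (function1 and function2 are placeholders for your own independent computations):
#include <iostream>
#include <mpi.h>
using namespace std;

// placeholders for your own independent computations
double function1() { return 1.0; }
double function2() { return 2.0; }

int main(int argc, char** argv) {
    int mynode, totalnodes;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

    double answer = 0.0;
    if (mynode == 0) { answer = function1(); }
    if (mynode == 1) { answer = function2(); }

    cout << "process " << mynode << " of " << totalnodes
         << " computed " << answer << endl;

    MPI_Finalize();
    return 0;
}
Run it with at least two processes (for example mpiexec -n 2 yourprogram.exe) so that both branches execute.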
The first part of this is getting OpenMP up and running with Visual Studio 2005, which is quite old; it takes some doing, but it's described in the answer to this question.
Once that's done, it's fairly easy to do this simple form of task parallelism if you have two methods which are genuinely completely independent. Note that qualifier: if the methods are reading the same data, that's ok, but if they're updating any state that the other method uses, or calling any other routines that do so, then things will break.
As long as the methods are completely independent, you can use sections for these (tasks are actually the more modern, OpenMP 3.0 way of doing this, but you probably won't be able to get OpenMP 3.0 support for such an old compiler); you will also see people misusing parallel for loops to achieve this, which at least has the advantage of letting you control the thread assignments, so I include that here for completeness even though I can't really recommend it:
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int f1() {
    int tid = omp_get_thread_num();
    printf("Thread %d in function f1.\n", tid);
    sleep(rand()%10);
    return 1;
}

int f2() {
    int tid = omp_get_thread_num();
    printf("Thread %d in function f2.\n", tid);
    sleep(rand()%10);
    return 2;
}

int main (int argc, char **argv) {
    int answer;
    int ans1, ans2;

    /* using sections */
    #pragma omp parallel num_threads(2) shared(ans1, ans2, answer) default(none)
    {
        #pragma omp sections
        {
            #pragma omp section
            ans1 = f1();
            #pragma omp section
            ans2 = f2();
        }

        #pragma omp single
        answer = ans1 + ans2;
    }
    printf("Answer = %d\n", answer);

    /* hacky approach, mis-using a for loop */
    answer = 0;
    #pragma omp parallel for schedule(static,1) num_threads(2) reduction(+:answer) default(none)
    for (int i = 0; i < 2; i++) {
        if (i == 0)
            answer += f1();
        if (i == 1)
            answer += f2();
    }
    printf("Answer = %d\n", answer);

    return 0;
}
Running this gives
$ ./sections
Thread 0 in function f1.
Thread 1 in function f2.
Answer = 3
Thread 0 in function f1.
Thread 1 in function f2.
Answer = 3

the OpenMP "master" pragma must not be enclosed by the "parallel for" pragma

Why won't the Intel compiler let me specify that some actions in an OpenMP parallel for block should be executed by the master thread only?
And how can I do what I'm trying to achieve without this kind of functionality?
What I'm trying to do is update a progress bar through a callback in a parallel for:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
    //update item count
    #pragma omp atomic
    num_items_computed++;

    //update progress bar with number of items computed
    //master thread only due to com marshalling
    #pragma omp master
    set_progressor_callback(num_items_computed);

    //actual computation goes here
    ...blah...
}
I want only the master thread to call the callback, because if I don't enforce that (say by using omp critical instead to ensure only one thread uses the callback at once) I get the following runtime exception:
The application called an interface that was marshalled for a different thread.
...hence the desire to keep all callbacks in the master thread.
Thanks in advance.
#include <omp.h>
void f(){}
int main()
{
    #pragma omp parallel for schedule (guided)
    for (int i = 0; i < 100; ++i)
    {
        #pragma omp master
        f();
    }
    return 0;
}
Compiler Error C3034
OpenMP 'master' directive cannot be directly nested within 'parallel for' directive
Visual Studio 2010 OpenMP 2.0
Maybe like this:
long num_items_computed = 0;
#pragma omp parallel for schedule (guided)
for (...a range of items...)
{
    //update item count
    #pragma omp atomic
    num_items_computed++;

    //update progress bar with number of items computed
    //master thread only due to com marshalling
    //#pragma omp master   -- it is an error here
    //#pragma omp critical -- it is valid
    if (omp_get_thread_num() == 0) // may be good
        set_progressor_callback(num_items_computed);

    //actual computation goes here
    ...blah...
}
The reason you get the error is that the master thread isn't there most of the time when the code reaches the #pragma omp master line.
For example, let's take the code from Artyom:
#include <omp.h>
void f(){}
int main()
{
    #pragma omp parallel for schedule (guided)
    for (int i = 0; i < 100; ++i)
    {
        #pragma omp master
        f();
    }
    return 0;
}
If the code compiled, the following could happen:
Let's say thread 0 starts (the master thread). It reaches the pragma that practically says "Master, do the following piece of code". Being the master, it can run the function.
However, what happens when thread 1 or 2 or 3, etc., reaches that piece of code?
The master directive is telling the present/listening team that the master thread has to execute f(). But that team is a single thread and there is no master present. The program wouldn't know what to do past that point.
And that's why, I think, the master directive isn't allowed to be inside the for loop.
Substituting the master directive with if (omp_get_thread_num() == 0) works because now the program says, "If you are the master, do this; otherwise, ignore it".
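For completeness, here is Artyom's toy example again with the master directive replaced by the explicit thread-number check; this sketch compiles under OpenMP 2.0 (in a parallel for, the thread numbered 0 is the master thread of the team):
#include <omp.h>

void f() {}

int main()
{
    #pragma omp parallel for schedule (guided)
    for (int i = 0; i < 100; ++i)
    {
        if (omp_get_thread_num() == 0) // only the master thread (number 0) calls f()
            f();
    }
    return 0;
}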