no parallel threads with openMP - c++

My problem is that I get no parallelization with openMP.
My system:
Ubuntu 11.04
Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz
Compiler:
g++ Version: 4.5.2
with flag -fopenmp
With this code I see that there is only one thread:
int nthreads, tid, procs, maxt, inpar, dynamic, nested;
// Start parallel region
#pragma omp parallel private(nthreads, tid) {
// Obtain thread number
tid = omp_get_thread_num();
// Only master thread does this
if (tid == 0)
{
printf("Thread %d getting environment info...\n", tid);
// Get environment information
procs = omp_get_num_procs();
nthreads = omp_get_num_threads();
maxt = omp_get_max_threads();
inpar = omp_in_parallel();
dynamic = omp_get_dynamic();
nested = omp_get_nested();
// Print environment information
printf("Number of processors = %d\n", procs);
printf("Number of threads = %d\n", nthreads);
printf("Max threads = %d\n", maxt);
printf("In parallel? = %d\n", inpar);
printf("Dynamic threads enabled? = %d\n", dynamic);
printf("Nested parallelism supported? = %d\n", nested);
}
}
because I see the following output:
Number of processors = 4
Number of threads = 1
Max threads = 4
In parallel? = 0
Dynamic threads enabled? = 0
Nested parallelism supported? = 0
What is the problem?
Can someone help, please?

Your code works for me on Ubuntu 11.04 with g++ 4.5.2; however, I had to change
#pragma omp parallel private(nthreads, tid) {
to
#pragma omp parallel private(nthreads, tid)
{
for it to compile successfully.
EDIT: If fixing the syntax doesn't help, what is the exact command you are using to compile the code?

#pragma omp parallel private(nthreads, tid) {
is incorrect syntax, as noted by hrandjet.
The pragma must end with a newline, so the { has to go on the next line:
#pragma omp parallel private(nthreads, tid)
{
This works for me on Windows XP.

Is the output prefaced by
Thread 0 getting environment info...
If not, the problem is as stated above: the opening brace ( { ) must be on a new line. To prove this further, try initializing
int tid = 1
and see if the output still shows up. If not, the #pragma is being ignored by your compiler (probably because of the bracket issue).
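A more direct check is to count how many threads actually enter the region. Here is a minimal sketch; the helper name count_team_threads is made up for illustration. It also builds without -fopenmp, in which case the pragmas are ignored and it reports 1.

```cpp
#include <cassert>

// Hypothetical helper: returns how many threads entered the parallel region.
// Built without -fopenmp, the pragmas are ignored and the result is 1.
int count_team_threads()
{
    int entered = 0;
    // Note that the brace opens on the line AFTER the pragma.
    #pragma omp parallel
    {
        // Each thread of the team bumps the shared counter exactly once.
        #pragma omp atomic
        ++entered;
    }
    assert(entered >= 1);
    return entered;
}
```

If this returns 1 on a multi-core machine, check that -fopenmp is passed when compiling and when linking, and that OMP_NUM_THREADS is not set to 1.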

Related

How is OpenMP communicating between threads with what should be a private variable?

I'm writing some code in C++ using OpenMP to parallelize some chunks. I run into some strange behavior that I can't quite explain. I've rewritten my code such that it replicates the issue minimally.
First, here is a function I wrote that is to be run in a parallel region.
void foo()
{
#pragma omp for
for (int i = 0; i < 3; i++)
{
#pragma omp critical
printf("Hello %d from thread %d.\n", i, omp_get_thread_num());
}
}
Then here is my whole program.
int main()
{
omp_set_num_threads(4);
#pragma omp parallel
{
for (int i = 0; i < 2; i++)
{
foo();
#pragma omp critical
printf("%d\n", i);
}
}
return 0;
}
When I compile and run this code (with g++ -std=c++17), I get the following output on the terminal:
Hello 0 from thread 0.
Hello 1 from thread 1.
Hello 2 from thread 2.
0
0
Hello 2 from thread 2.
Hello 1 from thread 1.
0
Hello 0 from thread 0.
0
1
1
1
1
i is a private variable. I would expect the function foo to be run twice per thread. So I would expect to see eight "Hello %d from thread %d.\n" lines in the terminal, just as I see eight numbers printed for i. So what gives here? Why does OpenMP behave so differently within the same loop?
It is because #pragma omp for is a worksharing construct: it distributes the loop iterations among the threads of the team, so the number of threads does not matter here, only the total number of iterations (2*3=6).
If you use omp_set_num_threads(1); you also see 6 outputs. If you use more threads than loop iterations, some threads will be idle in the inner loop, but you still see exactly 6 outputs.
On the other hand, if you remove the #pragma omp for line, you will see (number of threads)*2*3 (=24) outputs.
From the documentation of omp parallel:
Each thread in the team executes all statements within a parallel region except for work-sharing constructs.
Note the exception for work-sharing constructs. Since the omp for in foo is a work-sharing construct, its iterations are divided among the threads instead of being repeated by each of them, no matter how many threads run the parallel block in main.
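To see the difference in numbers rather than in interleaved printf output, one can count loop bodies. This is only a sketch under my own naming (LoopCounts and compare_loops are not from the question); it runs the same loop once under omp for and once without it, inside one parallel region.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical result type for this illustration.
struct LoopCounts {
    int threads;     // team size observed inside the region
    int workshared;  // loop bodies executed under "omp for"
    int replicated;  // loop bodies executed by the plain loop
};

LoopCounts compare_loops(int n)
{
    assert(n >= 0);
    LoopCounts c = {1, 0, 0};
    #pragma omp parallel
    {
#ifdef _OPENMP
        // One thread records the team size; the others wait at the
        // implicit barrier that ends the single construct.
        #pragma omp single
        c.threads = omp_get_num_threads();
#endif
        // Worksharing: the n iterations are split across the team.
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            #pragma omp atomic
            ++c.workshared;
        }
        // No worksharing: every thread runs all n iterations itself.
        for (int i = 0; i < n; ++i) {
            #pragma omp atomic
            ++c.replicated;
        }
    }
    return c;
}
```

For n = 3, workshared is always 3, while replicated is 3 times the team size.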

warning #2901: [omp] OpenMP is not active; all OpenMP directives will be ignored

I'm currently trying to use OpenMP for parallel computing.
I've written the following basic code.
However it returns the following warning:
warning #2901: [omp] OpenMP is not active; all OpenMP directives will be ignored.
Changing the number of threads does not change the required running time since omp.h is ignored for some reason which is unclear to me.
Can someone help me out?
#include <stdio.h>
#include <omp.h>
#include <math.h>
#include <time.h>
int main(void)
{
double ts;
double something;
clock_t begin = clock();
#pragma omp parallel num_threads(4)
#pragma omp parallel for
for (int i = 0; i<pow(10,7);i++)
{
something=sqrt(123456);
}
clock_t end = clock();
ts = (double)(end - begin) / CLOCKS_PER_SEC;
printf("Time elapsed is %f seconds", ts);
}
In order to get OpenMP support you need to explicitly tell your compiler.
g++, gcc and clang need the option -fopenmp
MSVC needs the option /openmp (more info here if you use Visual Studio)
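A quick way to confirm that the option took effect is the _OPENMP macro, which a conforming compiler defines (as a yyyymm date) only when OpenMP is enabled. A small sketch; the function names are mine, not part of any library:

```cpp
#include <cassert>
#include <cstdio>

// Returns the OpenMP version date (yyyymm) the compiler implements,
// or 0 when this file was built without OpenMP support.
long openmp_version_date()
{
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}

// Prints a one-line diagnostic; handy at program start-up.
void report_openmp()
{
    const long v = openmp_version_date();
    // 199810 is the date of the first C/C++ OpenMP specification.
    assert(v == 0 || v >= 199810);
    if (v == 0)
        std::printf("OpenMP is not active: pragmas are being ignored\n");
    else
        std::printf("OpenMP is active, version date %ld\n", v);
}
```

Calling report_openmp() at the top of main shows immediately whether the build is using OpenMP at all.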
Aside from the obvious need to compile with the -fopenmp flag, your code has some problems worth pointing out, namely:
To measure time use omp_get_wtime() instead of clock(), because clock() returns the CPU time accumulated across all threads rather than the elapsed wall-clock time.
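As a rough sketch of what wall-clock timing could look like (wall_seconds and measure_seconds are hypothetical helper names; the std::chrono fallback is my own addition so that the code still builds when OpenMP is not enabled):

```cpp
#include <cassert>
#include <chrono>
#ifdef _OPENMP
#include <omp.h>
#endif

// Wall-clock seconds measured from an arbitrary origin.
double wall_seconds()
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
#endif
}

// Hypothetical helper: runs work() once and returns the elapsed wall time.
template <typename Work>
double measure_seconds(Work work)
{
    const double start = wall_seconds();
    work();
    const double elapsed = wall_seconds() - start;
    assert(elapsed >= 0.0);
    return elapsed;
}
```

Usage would be something like double ts = measure_seconds([&] { /* the parallel loop */ }); the value then shrinks as threads are added, which the clock() figure does not.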
The other problem is:
#pragma omp parallel num_threads(4)
#pragma omp parallel for
for (int i = 0; i<pow(10,7);i++)
{
something=sqrt(123456);
}
The iterations of the loop are not being assigned to threads as you wanted. Because you have added the parallel clause again to #pragma omp for, and assuming that nested parallelism is disabled (which it is by default), each of the threads created in the outer parallel region will execute the whole loop "sequentially". Consequently, taking n = 6 as a small stand-in for pow(10,7) and 4 threads, you would have the following block of code:
for (int i=0; i<n; i++) {
something=sqrt(123456);
}
being executed once by each of the 4 threads, for a total of 6 x 4 = 24 loop iterations (i.e., the number of loop iterations multiplied by the number of threads). For a more in-depth explanation, check this SO thread about a similar issue.
To fix this adapt your code to the following:
#pragma omp parallel for num_threads(4)
for (int i = 0; i<pow(10,7);i++)
{
something=sqrt(123456);
}
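The two patterns can be compared by counting loop bodies. The following is only an illustrative sketch (NestedCounts and compare_nested are names I made up), and it asks for 4 threads without assuming the runtime grants them:

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical result type for this illustration.
struct NestedCounts {
    int outer_threads;  // team size of the outer parallel region
    long nested;        // bodies run by "parallel" followed by "parallel for"
    long combined;      // bodies run by a single "parallel for"
};

NestedCounts compare_nested(int n)
{
    assert(n >= 0);
    NestedCounts c = {1, 0, 0};

    // Pattern from the question: every outer thread starts its own full loop.
    #pragma omp parallel num_threads(4)
    {
#ifdef _OPENMP
        #pragma omp single
        c.outer_threads = omp_get_num_threads();
#endif
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            #pragma omp atomic
            ++c.nested;
        }
    }

    // Fixed pattern: one team shares the n iterations.
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic
        ++c.combined;
    }
    return c;
}
```

With n = 6 and a team of 4, nested comes out as 24 and combined as 6.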

Is it possible to make thread join to 'parallel for' region after its job?

I have two jobs that need to run simultaneously at first:
1) for loop that can be parallelized
2) function that can be done with one thread
Now, let me describe what I want to do.
If there exist 8 available threads,
job(1) and job(2) have to run simultaneously at first with 7 threads and 1 thread, respectively.
After job(2) finishes, the thread that job(2) was using should be allocated to job(1) which is the parallel for loop.
I'm using omp_get_thread_num to count how many threads are active in each region. I would expect the number of threads in job(1) to increase by 1 when job(2) finishes.
Below is my attempt, which may or may not be correct:
omp_set_nested(1);
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section // job(2)
{ // 'printf' is not real job. It is just used for simplicity.
printf("i'm single: %d\n", omp_get_thread_num());
}
#pragma omp section // job(1)
{
#pragma omp parallel for schedule(dynamic, 32)
for (int i = 0 ; i < 10000000; ++i) {
// 'printf' is not real job. It is just used for simplicity.
printf("%d\n", omp_get_thread_num());
}
}
}
}
How can I achieve what I described above?
What about something like this?
#pragma omp parallel
{
// note the nowait here so that other threads jump directly to the for loop
#pragma omp single nowait
{
job2();
}
#pragma omp for schedule(dynamic, 32)
for (int i = 0 ; i < 10000000; ++i) {
job1();
}
}
I did not test this, but the single will be executed by only one thread while all the others jump directly to the for loop thanks to the nowait.
Also I think it is easier to read than with sections.
Another way (and potentially the better way) to express this would be to use OpenMP tasks:
#pragma omp parallel master
{
#pragma omp task // job(2)
{ // 'printf' is not real job. It is just used for simplicity.
printf("i'm single: %d\n", omp_get_thread_num());
}
#pragma omp taskloop // job(1)
for (int i = 0 ; i < 10000000; ++i) {
// 'printf' is not real job. It is just used for simplicity.
printf("%d\n", omp_get_thread_num());
}
}
If you have a compiler that does not understand OpenMP version 5.0, then you have to split the parallel and master:
#pragma omp parallel
#pragma omp master
{
#pragma omp task // job(2)
{ // 'printf' is not real job. It is just used for simplicity.
printf("i'm single: %d\n", omp_get_thread_num());
}
#pragma omp taskloop // job(1)
for (int i = 0 ; i < 10000000; ++i) {
// 'printf' is not real job. It is just used for simplicity.
printf("%d\n", omp_get_thread_num());
}
}
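To convince yourself that the task version runs job(2) exactly once and every loop iteration exactly once, the printf calls can be swapped for counters. A sketch with made-up names (JobCounts, run_jobs), using the split parallel/master form; taskloop needs a compiler with OpenMP 4.5 support:

```cpp
#include <cassert>

// Hypothetical result type for this illustration.
struct JobCounts {
    int job2_runs;    // how many times the single-threaded job ran
    long job1_iters;  // how many loop iterations ran in total
};

JobCounts run_jobs(long n)
{
    assert(n >= 0);
    JobCounts c = {0, 0};
    #pragma omp parallel
    #pragma omp master
    {
        // job(2): one task, picked up by whichever thread is free
        #pragma omp task shared(c)
        {
            #pragma omp atomic
            ++c.job2_runs;
        }
        // job(1): the loop is cut into tasks executed by the whole team
        #pragma omp taskloop shared(c)
        for (long i = 0; i < n; ++i) {
            #pragma omp atomic
            ++c.job1_iters;
        }
    }
    // The barrier at the end of the parallel region guarantees that
    // all tasks have finished before the counters are read.
    return c;
}
```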
The problem comes from synchronization. At the end of the sections construct, OpenMP waits for all threads to finish and cannot release the thread used for job 2 until that completion has been checked.
The solution is to suppress the synchronization with nowait.
I did not manage to suppress the synchronization with sections and nested parallelism. I rarely use nested parallel regions, but I think that, while sections can take nowait, there is a problem when a new nested parallel region is spawned inside a section. There is a mandatory synchronization at the end of a parallel region that cannot be suppressed, and it probably prevents new threads from joining the pool.
What I did instead is use a single construct without synchronization. This way, OpenMP starts the single block and does not wait for its completion before starting the parallel for. When the thread finishes its single work, it joins the rest of the team to finish processing the for.
#include <omp.h>
#include <stdio.h>
int main() {
int singlethreadid=-1;
// omp_set_nested(1);
#pragma omp parallel
{
#pragma omp single nowait // job(2)
{ // 'printf' is not real job. It is just used for simplicity.
printf("i'm single: %d\n", omp_get_thread_num());
singlethreadid=omp_get_thread_num();
}
#pragma omp for schedule(dynamic, 32)
for (int i = 0 ; i < 100000; ++i) {
// 'printf' is not real job. It is just used for simplicity.
printf("%d\n", omp_get_thread_num());
if (omp_get_thread_num() == singlethreadid)
printf("Hello, I\'m back\n");
}
}
}

OpenMP: pragma cancel for ON NUMA

---------------------EDIT-------------------------
I have edited the code as follows:
#pragma omp parallel for private(i, piold, err) shared(threshold_err) reduction(+:pi) schedule (static)
{
for (i = 0; i < 10000000000; i++){ //1000000000//705035067
piold = pi;
pi += (((i&1) == false) ? 1.0 : -1.0)/(2*i+1);
err = fabs(pi-piold);
if ( err < threshold_err){
#pragma omp cancel for
}
}
}
pi = 4*pi;
I compile it with LLVM3.9/Clang4.0. When I run it with one thread, I get the expected results with the pragma cancel in effect (checked against the version without pragma cancel; it runs faster).
But when I run it with two or more threads, the program seems to loop forever. I am running the code on NUMA machines. What is happening? Perhaps the cancel condition is never satisfied! But then the code takes longer than the single-threaded version without pragma cancel! FYI, it runs fine when OMP_CANCELLATION=false.
I have the following OpenMP code. I am using LLVM-3.9/Clang-4.0 to compile it.
#pragma omp parallel private(i, piold, err) shared(pi, threshold_err)
{
#pragma omp for reduction(+:pi) schedule (static)
for (i = 0; i < 10000000 ; i++){
piold = pi;
pi += (((i&1) == false) ? 1.0 : -1.0)/(2*i+1);
#pragma omp critical
{
err = fabs(pi-piold);// printf("Err: %0.11f\n", err);
}
if ( err < threshold_err){
printf("Cancelling!\n");
#pragma omp cancel for
}
}
}
Unfortunately, I do not think #pragma omp cancel for is terminating the whole for loop. I print the err value at the end, but with parallelism it is unclear which thread's value is being printed. The final value of err is smaller than threshold_err. "Cancelling!" is printed, but at the very beginning of the program, which is surprising. The program keeps running after that!
How can I make sure that this is a correct implementation? BTW, OMP_CANCELLATION is set to true, and a small test program returns '1' from the corresponding function, omp_get_cancellation().
As I understand it, omp cancel is just a break signal: it requests that the loop stop, so that no further work is started. Threads that are still running continue until they reach a cancellation point or the end of their work. See http://bisqwit.iki.fi/story/howto/openmp/ and http://jakascorner.com/blog/2016/08/omp-cancel.html
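For illustration, here is a small search loop where the other threads are given an explicit cancellation point, so they notice the request at the start of their next iteration. This is a sketch, not the code from the question; SearchResult and find_with_cancel are names I invented. Cancellation only happens when the program is run with OMP_CANCELLATION=true; otherwise the loop simply visits every element.

```cpp
#include <cassert>
#include <vector>

// Hypothetical result type for this illustration.
struct SearchResult {
    bool found;    // whether target occurs in data
    long visited;  // how many elements were actually inspected
};

SearchResult find_with_cancel(const std::vector<int>& data, int target)
{
    const long n = static_cast<long>(data.size());
    bool found = false;
    long visited = 0;
    #pragma omp parallel for shared(found, visited)
    for (long i = 0; i < n; ++i) {
        // Threads check here whether another thread requested cancellation.
        #pragma omp cancellation point for
        #pragma omp atomic
        ++visited;
        if (data[i] == target) {
            #pragma omp atomic write
            found = true;
            // Request cancellation; this thread leaves the loop right away.
            #pragma omp cancel for
        }
    }
    return SearchResult{found, visited};
}
```

With cancellation enabled, visited is usually well below data.size() when the target exists; with it disabled, visited equals data.size().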
In fact, in my opinion, your program produces an acceptable approximation. However, some variables can be kept in a smaller scope. This is my suggestion:
#include <iostream>
#include <cmath>
#include <iomanip>
int main() {
long double pi = 0.0;
long double threshold_err = 1e-7;
int cancelFre = 0;
#pragma omp parallel shared(pi, threshold_err, cancelFre)
{
#pragma omp for reduction(+:pi) schedule (static)
for (int i = 0; i < 100000000; i++){
long double piold = pi;
pi += (((i&1) == false) ? 1.0 : -1.0)/(2*i+1);
long double err = std::fabs(pi-piold);
if ( err < threshold_err){
#pragma omp cancel for
cancelFre++;
}
}
}
std::cout << std::setprecision(10) << pi * 4 << " " << cancelFre;
return 0;
}
Okay so I solved it. In my code above the problem was here:
err = fabs(pi-piold);
In the above line, pi has already been changed before the following if condition is checked, and multiple threads do the same. As I understand it, this makes the program deadlock.
I solved it by forcing only one thread, the master, to do this check:
if(omp_get_thread_num()==0){
err = fabs(pi-piold);
if ( err < threshold_err){
#pragma omp cancel for
}
}
I could have used #pragma omp single, but it gave an error about nested pragmas.
Here the performance suffers with a low number of threads (1-4 threads are slower than the plain sequential code). After that the performance improves. This is not the best solution, and someone can surely improve upon it.
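One possible improvement, which is a different technique from the one above and sidesteps cancellation entirely: inside a reduction each thread only sees its own partial sum in pi, so the per-thread difference is just the size of the term being added, 1/(2i+1). That size is known in advance, so the number of terms can be computed before the loop. A sketch under that assumption (terms_for_threshold and leibniz_pi are hypothetical names):

```cpp
#include <cassert>
#include <cmath>

// Number of Leibniz terms to sum so that the last one is below `threshold`.
long terms_for_threshold(double threshold)
{
    assert(threshold > 0.0 && threshold < 1.0);
    // Term i has magnitude 1/(2i+1); solve 1/(2i+1) < threshold for i.
    return static_cast<long>(std::ceil((1.0 / threshold - 1.0) / 2.0)) + 1;
}

double leibniz_pi(double threshold)
{
    const long n = terms_for_threshold(threshold);
    double sum = 0.0;
    // Plain reduction: no cancel, no critical section, no shared err.
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < n; ++i) {
        sum += ((i & 1) == 0 ? 1.0 : -1.0) / (2.0 * i + 1.0);
    }
    return 4.0 * sum;
}
```

Since the trip count is fixed up front, the loop scales like any other reduction.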

All OpenMP Tasks running on the same thread

I have written a recursive parallel function using tasks in OpenMP. While it gives me the correct answer and runs fine, I think there is an issue with the parallelism. Compared with a serial solution, the run time does not scale the way it does for other parallel problems I have solved without tasks. When I print the thread number for each task, they are all running on thread 0. I am compiling and running with Visual Studio Express 2013.
int parallelOMP(int n)
{
int a, b, sum = 0;
int alpha = 0, beta = 0;
for (int k = 1; k < n; k++)
{
a = n - (k*(3 * k - 1) / 2);
b = n - (k*(3 * k + 1) / 2);
if (a < 0 && b < 0)
break;
if (a < 0)
alpha = 0;
else if (p[a] != -1)
alpha = p[a];
if (b < 0)
beta = 0;
else if (p[b] != -1)
beta = p[b];
if (a > 0 && b > 0 && p[a] == -1 && p[b] == -1)
{
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[a] = parallelOMP(a);
}
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[b] = parallelOMP(b);
}
#pragma omp taskwait
}
}
alpha = p[a];
beta = p[b];
}
else if (a > 0 && p[a] == -1)
{
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[a] = parallelOMP(a);
}
#pragma omp taskwait
}
}
alpha = p[a];
}
else if (b > 0 && p[b] == -1)
{
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[b] = parallelOMP(b);
}
#pragma omp taskwait
}
}
beta = p[b];
}
if (k % 2 == 0)
sum += -1 * (alpha + beta);
else
sum += alpha + beta;
}
if (sum > 0)
return sum%m;
else
return (m + (sum % m)) % m;
}
Sometimes I wish comments on SO could be as richly formatted as the answers, but alas that's not the case. Therefore, here comes a long comment disguised as an answer.
It appears that a very common mistake in writing recursive OpenMP code is not understanding how exactly parallel regions work. Consider the following code (uses explicit tasks, therefore support for OpenMP 3.0 or newer required):
void par_rec_func (int arg)
{
if (arg <= 0) return;
#pragma omp parallel num_threads(2)
{
#pragma omp task
par_rec_func(arg-1);
#pragma omp task
par_rec_func(arg-1);
}
}
// somewhere in the main function
par_rec_func(10);
There is a problem with this code. The problem is that, except for the top-level invocation of par_rec_func(), in all other invocations the parallel region will be created in the context of an enclosing outer parallel region. This is called nested parallelism and by default is disabled, which means that all parallel regions beneath the top-level one are going to be inactive, i.e. they will execute serially. Since tasks bind to the innermost parallel region, they will also get executed in serial. What will happen with this code is that it will spawn one additional thread (for a total of two) at the top-level invocation of par_rec_func() and each thread will then execute a whole branch of the recursion tree (i.e. one half of the whole tree). If one runs that code on a machine with 64 cores, 62 of them will idle. In order for the nested parallelism to be enabled, one has to either set the environment variable OMP_NESTED to true or call omp_set_nested() and pass it a true flag:
omp_set_nested(1);
Once nested parallelism has been enabled, one faces a new problem. Every time a nested parallel region is encountered, the encountering thread will either spawn an additional one (because of num_threads(2)) or acquire an idle thread from the runtime's thread pool. At every deeper level of recursion, this program will require twice as many threads as at the previous level. Though an upper limit of the total number of threads could be set via OMP_THREAD_LIMIT (another OpenMP 3.0 feature) and with the overhead aside, this is not what one really wants in such cases.
The correct solution in that case is to use orphaned tasks in the dynamic scope of a single parallel region:
void par_rec_func (int arg)
{
if (arg <= 0) return;
#pragma omp task
par_rec_func(arg-1);
#pragma omp task
par_rec_func(arg-1);
// Wait for the child tasks to complete if necessary
#pragma omp taskwait
}
// somewhere in the main function
#pragma omp parallel
{
#pragma omp single
par_rec_func(10);
}
The advantages of this method are many. First of all, only a single parallel region is created with as many threads as specified (e.g. by setting OMP_NUM_THREADS or by any other means). When the child tasks call recursively into par_rec_func(), that simply adds new tasks to the parallel region without spawning new threads. This greatly helps in the case where the recursion tree is not balanced, since many quality OpenMP runtimes implement task stealing, e.g. thread i could execute child tasks of a task that executes in thread j, where i != j.
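Here is a version of the same pattern that returns a value, so the result can be checked. It is only a sketch: count_leaves and count_leaves_parallel are made-up names, and the function simply counts the leaves of the binary recursion tree (2^depth).

```cpp
#include <cassert>

// Counts the leaves of a full binary recursion tree of the given depth.
// Uses orphaned tasks: it must be called from inside a parallel region
// to run in parallel, but it is also correct when called serially.
long count_leaves(int depth)
{
    assert(depth >= 0 && depth < 62);
    if (depth == 0) return 1;
    long left = 0, right = 0;
    #pragma omp task shared(left)
    left = count_leaves(depth - 1);
    #pragma omp task shared(right)
    right = count_leaves(depth - 1);
    // The children write to this frame's locals, so wait before reading them.
    #pragma omp taskwait
    return left + right;
}

long count_leaves_parallel(int depth)
{
    long result = 0;
    // One parallel region for the whole recursion; one thread seeds it.
    #pragma omp parallel
    {
        #pragma omp single
        result = count_leaves(depth);
    }
    return result;
}
```

In real code one would normally stop creating tasks below some depth, since a task per leaf costs more than the work it does here.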
Given an OpenMP 2.0 compiler like VC++, one cannot do much except to approximate the above idea by using nested parallelism and explicitly disabling it at a certain level:
void par_rec_func (int arg)
{
if (arg <= 0) return;
int level = omp_get_level();
#pragma omp parallel sections num_threads(2) if(level < 4)
{
#pragma omp section
par_rec_func(arg-1);
#pragma omp section
par_rec_func(arg-1);
}
}
// somewhere in the main function
int saved_nested = omp_get_nested();
omp_set_nested(1);
par_rec_func(10);
omp_set_nested(saved_nested);
omp_get_level() is used to determine the level of nesting and the if clause is used to selectively deactivate parallel regions at fourth or deeper level of nesting. This solution is dumb and won't work well when the recursion tree is unbalanced.
Actual Problem:
You are using Visual Studio 2013.
Visual Studio has never supported OMP versions beyond 2.0 (see here).
OMP Tasks are a feature of OMP 3.0 (see spec).
Ergo, using VS at all means no OMP tasks for you.
If OMP Tasks are an essential requirement, use a different compiler. If OMP is not an essential requirement, you should consider an alternative parallel task handling library. Visual Studio includes the MS Concurrency Runtime, and the Parallel Patterns Library built on top of it. I have recently moved from OMP to PPL due to the fact I'm using VS for work; it isn't quite a drop-in replacement but it is quite capable.
My second attempt at solving this, again preserved for historical reasons:
So, the problem is almost certainly that you're defining your omp tasks outside of an omp parallel region.
Here's a contrived example:
void work()
{
#pragma omp parallel
{
#pragma omp single nowait
for (int i = 0; i < 5; i++)
{
#pragma omp task untied
{
std::cout <<
"starting task " << i <<
" on thread " << omp_get_thread_num() << "\n";
sleep(1);
}
}
}
}
If you omit the parallel declaration, the job runs serially:
starting task 0 on thread 0
starting task 1 on thread 0
starting task 2 on thread 0
starting task 3 on thread 0
starting task 4 on thread 0
But if you leave it in:
starting task starting task 3 on thread 1
starting task 0 on thread 3
2 on thread 0
starting task 1 on thread 2
starting task 4 on thread 2
Success, complete with authentic misuse of shared output resources.
(for reference, if you omit the single declaration, each thread will run the loop, resulting in 20 tasks being run on my 4 cpu VM).
Original answer included below for completeness, but no longer relevant!
In every case, your omp task is a single, simple thing. It probably runs and completes immediately:
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
Because you never start one long-running task before firing off the next task, everything will probably run on the first allocated thread.
Perhaps you meant to do something like this?
if (a > 0 && b > 0 && p[a] == -1 && p[b] == -1)
{
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[a] = parallelOMP(a);
}
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[b] = parallelOMP(b);
}
#pragma omp taskwait
alpha = p[a];
beta = p[b];
}