OpenMP: pragma cancel for ON NUMA - c++

---------------------EDIT-------------------------
I have edited the code as follows:
#pragma omp parallel for private(i, piold, err) shared(threshold_err) reduction(+:pi) schedule (static)
{
for (i = 0; i < 10000000000; i++){ //1000000000//705035067
piold = pi;
pi += (((i&1) == false) ? 1.0 : -1.0)/(2*i+1);
err = fabs(pi-piold);
if ( err < threshold_err){
#pragma omp cancel for
}
}
}
pi = 4*pi;
I compile it with LLVM3.9/Clang4.0. When I run it with one thread I get expected results with pragma cancel action (checked against non pragma cancel version, resulted in faster run).
But when I run it with threads >=2, the program goes into loop. I am run the code on NUMA machines. What is happening? Perhaps the cancel condition is not being satisfied! But then code takes longer than single thread non-pragma-cancel version!! FYI, it runs file when OMP_CANCELLATION=false.
I have following OpenMP code. I am using LLVM-3.9/Clang-4.0 to compile this code.
#pragma omp parallel private(i, piold, err) shared(pi, threshold_err)
{
#pragma omp for reduction(+:pi) schedule (static)
for (i = 0; i < 10000000 ; i++){
piold = pi;
pi += (((i&1) == false) ? 1.0 : -1.0)/(2*i+1);
#pragma omp critical
{
err = fabs(pi-piold);// printf("Err: %0.11f\n", err);
}
if ( err < threshold_err){
printf("Cancelling!\n");
#pragma omp cancel for
}
}
}
Unfortunately I do not think the #pragma omp cancel for is terminating the whole for loop. I am printing out the err value in the end, but again with parallelism it is confusing which value is being printed. The final value of err is smaller than threshold_err. The print cancelling is printing but in the very beginning of the program, which is surprising. The program keeps running after that!
How to make sure that this is correct implementation? BTW OMP_CANCELLATION is set to true and a small test program returns '1' for the corresponding function, omp_get_cancellation().

I understand that the omp cancel is just a break signal, it notify so that no thread is created later. Threads which are still running will continue until the end. See http://bisqwit.iki.fi/story/howto/openmp/ and http://jakascorner.com/blog/2016/08/omp-cancel.html
In fact, in my opinion, I see your program product acceptable approximation. However, some variable can be keep in smaller scope. This is my suggestion
#include <iostream>
#include <cmath>
#include <iomanip>
int main() {
long double pi = 0.0;
long double threshold_err = 1e-7;
int cancelFre = 0;
#pragma omp parallel shared(pi, threshold_err, cancelFre)
{
#pragma omp for reduction(+:pi) schedule (static)
for (int i = 0; i < 100000000; i++){
long double piold = pi;
pi += (((i&1) == false) ? 1.0 : -1.0)/(2*i+1);
long double err = std::fabs(pi-piold);
if ( err < threshold_err){
#pragma omp cancel for
cancelFre++;
}
}
}
std::cout << std::setprecision(10) << pi * 4 << " " << cancelFre;
return 0;
}

Okay so I solved it. In my code above the problem was here:
err = fabs(pi-piold);
In the above line pi is changed before the following if condition is changed. Also multiple threads do the same. As I understand this makes program go in a deadlock.
I solved it by forcing only one thread, master, to do this check:
if(omp_get_thread_num()==0){
err = fabs(pi-piold);
if ( err < threshold_err){
#pragma omp cancel for
}
}
I could have used #pragma omp single but it gave error about nested pragmas.
Here the performance suffers on low number of threads (1-4 are worse than normal sequential code). After that the performance improves. This is not the best solution and someone can surely improve upon this one.

Related

Data race with OpenMP and Eigen but data is read-only?

I am trying to find the data race in my code but I just can't seem to grasp why it happens. The data in the threads is used read-only and the only variable that is written to is protected by a critical region.
I tried using the Intel Inspector but I am compiling with g++ 9.3.0 and apparently even the 2021 version can't deal with the OpenMP implementation for it. The release notes do not explicitly state it as exception as it was for older versions but there is a warning about false positives because it is not supported. It also always shows a data race for the pragma statements which isn't helpful at all.
My current suspects are either Eigen or the fact that I use a reference to a std::vector. Eigen itself I compile with EIGEN_DONT_PARALLELIZE to not mess with nested parallelism although I think I don't use anything that would use it anyway.
Edit:
Not sure if it is really a "data race" (or wrong memory access?) but the example produces non-deterministic output in the form of that the result differs for the same input. If this happens the loop in the main breaks. With more than one thread this happens early (after 5-12 iterations usually). If I run it with one thread only or compile without OpenMP, I have to manually end the example program.
Minimal (not) working example below.
#include <Eigen/Dense>
#include <vector>
#include <iostream>
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_set_num_threads(number)
#endif
typedef Eigen::Matrix<double, 9, 1> Vector9d;
typedef std::vector<Vector9d, Eigen::aligned_allocator<Vector9d>> Vector9dList;
Vector9d derivPath(const Vector9dList& pathPositions, int index){
int n = pathPositions.size()-1;
if(index >= 0 && index < n+1){
// path is one point, no derivative possible
if(n == 0){
return Vector9d::Zero();
}
else if(index == n){
return Vector9d::Zero();
}
// path is a line, derivative is in the direction of start to end
else {
return n * (pathPositions[index+1] - pathPositions[index]);
}
}
else{
return Vector9d::Zero();
}
}
// ********************************
// data race occurs here somewhere
double errorFunc(const Vector9dList& pathPositions){
int n = pathPositions.size()-1;
double err = 0.0;
#pragma omp parallel default(none) shared(pathPositions, err, n)
{
double err_private = 0;
#pragma omp for schedule(static)
for(int i = 0; i < n+1; ++i){
Vector9d derivX_i = derivPath(pathPositions, i);
// when I replace this with pathPositions[i][0] the loop in the main doesn't break
// (or at least I always had to manually end the program)
// but it does break if I use derivX_i[0];
double err_i = derivX_i.norm();
err_private = err_private + err_i;
}
#pragma omp critical
{
err += err_private;
}
}
err = err / static_cast<double>(n);
return err;
}
// ***************************************
int main(int argc, char **argv){
// setup data
int n = 100;
Vector9dList pathPositions;
pathPositions.reserve(n+1);
double a = 5.0;
double b = 1.0;
double c = 1.0;
Eigen::Vector3d f, u;
f << 0, 0, -1;//-p;
u << 0, 1, 0;
for(int i = 0; i<n+1; ++i){
double t = static_cast<double>(i)/static_cast<double>(n);
Eigen::Vector3d p;
double x = 2*t*a - a;
double z = -b/(a*a) * x*x + b + c;
p << x, 0, z;
Vector9d cam;
cam << p, f, u;
pathPositions.push_back(cam);
}
omp_set_num_threads(8);
//reference value
double pe = errorFunc(pathPositions);
int i = 0;
do{
double pe_i = errorFunc(pathPositions);
// there is a data race
if(std::abs(pe-pe_i) > std::numeric_limits<double>::epsilon()){
std::cout << "Difference detected at iteration " << i << " diff:" << std::abs(pe-pe_i);
break;
}
i++;
}
while(true);
}
Output for running the example multiple times
Difference detected at iteration 13 diff:1.77636e-15
Difference detected at iteration 1 diff:1.77636e-15
Difference detected at iteration 0 diff:1.77636e-15
Difference detected at iteration 0 diff:1.77636e-15
Difference detected at iteration 0 diff:1.77636e-15
Difference detected at iteration 7 diff:1.77636e-15
Difference detected at iteration 8 diff:1.77636e-15
Difference detected at iteration 6 diff:1.77636e-15
As you can see, the difference is minor but there and it doesn't always happen in the same iteration which makes it non-deterministic. There is no output if I run it single threaded as I usually end the program after letting it run for a couple of minutes. Therefore, it has to have to do with the parallelization somehow.
I know I could use a reduction in this case but in the original code in my project I have to compute other things in the parallel region as well and I wanted to keep the minimal example as close to the original structure as possible.
I use OpenMP in other parts of my program too where I am not sure if I have a data race there too but the structure is similar (except that I use #pragma omp parallel for and the collapse statement). I have some variable or vector I write to but it's always either in a critical region or each thread only writes to it's own subset of the vector. Data that is used by multiple threads is always read-only. The read-only data is always a std::vector, a reference to a std::vector or a numerical data type like int or double. The vectors always contain an Eigen type or double.
There are no race conditions. You are observing a natural consequence of the non-commutative algebra of truncated floating-point representations. (A + B) + C is not always the same as A + (B + C) when A, B, and C are finite-precision floating-point numbers due to rounding errors. 1.77636E-15 x 100 (the absolute error when commenting out err = err / static_cast<double>(n);) in binary is:
0 | 01010101 | 00000000000000000001100
S exponent mantissa
As you can see, the error is in the least significant bits of the mantissa, hinting at it being the result of accumulation of rounding errors.
The problem occurs here:
#pragma omp parallel default(none) shared(pathPositions, err, n)
{
...
#pragma omp critical
{
err += err_private;
}
}
The final value of err depends on the order in which the different threads arrive at the critical section and their contributions get added, which is why sometimes you see discrepancy right away and sometimes it takes a couple of iterations.
To demonstrate that it is not an OpenMP problem per se, simply modify the function to read:
double errorFunc(const Vector9dList& pathPositions){
int n = pathPositions.size()-1;
double err = 0.0;
std::vector<double> errs(n+1);
#pragma omp parallel default(none) shared(pathPositions, errs, n)
{
#pragma omp for schedule(static)
for(int i = 0; i < n+1; ++i){
Vector9d derivX_i = derivPath(pathPositions, i);
errs[i] = derivX_i.norm();
}
}
for (int i = 0; i < n+1; ++i)
err += errs[i];
err = err / static_cast<double>(n);
return err;
}
This removes the dependency on how the sub-sums are computed and added together and the return value will always be the same no matter the number of OpenMP threads.
Another version only fixes the order in which err_private are reduced into err:
double errorFunc(const Vector9dList& pathPositions){
int n = pathPositions.size()-1;
double err = 0.0;
std::vector<double> errs(omp_get_max_threads());
int nthreads;
#pragma omp parallel default(none) shared(pathPositions, errs, n, nthreads)
{
#pragma omp master
nthreads = omp_get_num_threads();
double err_private = 0;
#pragma omp for schedule(static)
for(int i = 0; i < n+1; ++i){
Vector9d derivX_i = derivPath(pathPositions, i);
double err_i = derivX_i.norm();
err_private = err_private + err_i;
}
errs[omp_get_thread_num()] = err_private;
}
for (int i = 0; i < nthreads; i++)
err += errs[i];
err = err / static_cast<double>(n);
return err;
}
Again, this code produces the same result each and every time as long as the number of threads is kept constant. The value may differ slightly (in the LSBs) with different number of threads.
You can't get easily around such discrepancy and only learn to live with it and take precautions to minimise its influence on the rest of the computation. In fact, you are really lucky to stumble upon it in 2021, a year in the post-x87 era, when virtually all commodity FPUs use 64-bit IEEE 754 operands and not in the 1990's when x87 FPUs used 80-bit operands and the result of a repeated accumulation would depend on whether you keep the value in an FPU register all the time or periodically store it in and then load it back from memory, which rounds the 80-bit representation to a 64-bit one.
In the mean time, mandatory reading for anyone dealing with math on digital computers.
P.S. Although it is 2021 and we've been living for 21 years in the post-x87 era (started when Pentium 4 introduced the SSE2 instruction set back in 2000), if your CPU is an x86 one, you can still partake in the x87 madness. Just compile your code with -mfpmath=387 :)

OpenMP parallel-for efficiency query

Please consider the following simple code for summing up values in a parallel for loop:
int nMaxThreads = omp_get_max_threads();
int nTotalSum = 0;
#pragma omp parallel for num_threads(nMaxThreads) \
reduction(+:nTotalSum)
for (int i = 0; i < 4; i++)
{
nTotalSum += i;
cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
}
When I run this on a two-core machine, the output I get is
0: nTotalSum is 0
0: nTotalSum is 1
1: nTotalSum is 2
1: nTotalSum is 5
This suggests to me that the critical section, i.e. the update of nTotalSum, is being executed on each loop. This seems like a waste, when all each thread has to do is calculate a 'local' sum of the values it is adding then update nTotalSum with this 'local sum' after it has done so.
Is my interpretation of the output correct, and if so, how can I make it more efficient? Note I tried the following:
#pragma omp parallel for num_threads(nMaxThreads) \
reduction(+:nTotalSum)
int nLocalSum = 0;
for (int i = 0; i < 4; i++)
{
nLocalSum += i;
}
nTotalSum += nLocalSum;
...but the compiler complained stating that it was expecting a for loop following the pragma omp parallel for statement...
Your output does in fact not indicate a critical section during the loop. Each thread has its own zero-initialized copy, thread 0 working on i = 0,1, thread 1 working on i = 2,3. At the end OpenMP takes care of adding the local copies to the original.
You should not try to implement it yourself unless you have specific evidence that you can do it more efficiently. See for example this question / answer.
Your manual version would work if you split the parallel / for into two directives:
int nTotalSum = 0;
#pragma omp parallel
{
// Declare the local variable it here!
// Then it's private implicitly and properly initialized
int localSum = 0;
#pragma omp for
for (int i = 0; i < 4; i++) {
localSum += i;
cout << omp_get_thread_num() << ": nTotalSum is " << nTotalSum << endl;
}
// Do not forget the atomic, or it would be a race condition!
// Alternative would be a critical, but that's less efficient
#pragma omp atomic
nTotalSum += localSum;
}
I think it's likely that your OpenMP implementation does the reduction just like that.
Each OMP thread has its own copy of nTotalSum. At the end of the OMP section these are combined back into the original nTotalSum. The output you're seeing comes from running loop iterations (0,1) in one thread, and (2,3) in another thread. If you output nTotalSum at the end of your loop, you should see the expected result of 6.
In you nLocalSum example, move the declaration of nLocalSum to before the #pragma omp line. The for loop must be on the line immediately following the pragma.
from my parallel programming in openmp book:
reduction clause can be trickier to understand, has both private and shared storage behavior. The reduction attribute is used on objects that are the target of an arithmetic reduction. This can be important in many applications...reduction allows it to be implemented by the compiler efficiently... this is such a common operation that openmp has the reduction data scope clause just to handle them...most common example is final summation of temporary local variables at the end of the parallel construct.
correction to your second example:
total_sum = 0; /* do all variable initialization prior to omp pragma */
#pragma omp parallel for \
private(i) \
reduction(+:total_sum)
for (int i = 0; i < 4; i++)
{
total_sum += i; /* you used nLocalSum here */
}
#pragma omp end parallel for
/* at this point in the code,
all threads will have done your `for` loop where total_sum is local to each thread,
openmp will then '+" together the values in `total_sum` coming from each thread because we used reduction,
do not do an explicit nTotalSum += nLocalSum after the omp for loop, it's not needed the reduction clause takes care of this
*/
In your first example, I'm not sure of your use of #pragma omp parallel for num_threads(nMaxThreads) reduction(+:nTotalSum) of what num_threads(nMaxThreads) is doing. But i suspect the weird output might be caused by print buffering.
In any case, the reduction clause is very useful and very efficient if used properly. It would be more obvious in a more complicated, real-world example.
Your posted example is so simple that it doesn't show off the usefulness of the reduction clause, and strictly speaking for your example since all threads are doing a summation the most efficient way to do it would just make total_sum a shared variable in the parallel section and have all threads pump in to it. At the end the answer would still be correct. would work if using critical directive.

OpenMP loop runs code slower than serial loop

I'm running this neat little gravity simulation and in serial execution it takes a little more than 4 minutes, when i parallelize one loop inside a it increases to about 7 minutes and if i try parallelizing more loops it increases to more than 20 minutes. I'm posting a slightly shortened version without some initializations but I think they don't matter. I'm posting the 7 minute version however with some comments where i wanted to add parallelization to loops. Thank you for helping me with my messy code.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#define numb 1000
int main(){
double pos[numb][3],a[numb][3],a_local[3],v[numb][3];
memset(v, 0.0, numb*3*sizeof(double));
double richtung[3];
double t,deltat=0.0,r12 = 0.0,endt=10.;
unsigned seed;
int tcount=0;
#pragma omp parallel private(seed) shared(pos)
{
seed = 25235 + 16*omp_get_thread_num();
#pragma omp for
for(int i=0;i<numb;i++){
for(int j=0;j<3;j++){
pos[i][j] = (double) (rand_r(&seed) % 100000 - 50000);
}
}
}
for(t=0.;t<endt;t+=deltat){
printf("\r%le", t);
tcount++;
#pragma omp parallel for shared(pos,v)
for(int id=0; id<numb; id++){
for(int l=0;l<3;l++){
pos[id][l] = pos[id][l]+(0.5*deltat*v[id][l]);
v[id][l] = v[id][l]+a[id][l]*(deltat);
}
}
memset(a, 0.0, numb*3*sizeof(double));
memset(a_local, 0.0, 3*sizeof(double));
#pragma omp parallel for private(r12,richtung) shared(a,pos)
for(int id=0; id <numb; ++id){
for(int id2=0; id2<id; id2++){
for(int k=0;k<3;k++){
r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
}
for(int k=0; k<3;k++){
richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
a[id][k] += richtung[k]/(((r12)*(r12)));
a_local[k] += (-1.0)*richtung[k]/(((r12)*(r12)));
#pragma omp critical
{
a[id2][k] += a_local[k];
}
}
r12=0.0;
}
}
#pragma omp parallel for shared(pos)
for(int id =0; id<numb; id++){
for(int k=0;k<3;k++){
pos[id][k] = pos[id][k]+(0.5*deltat*v[id][k]);
}
}
deltat= 0.01;
}
return 0;
}
I'm using
g++ -fopenmp -o test_grav test_grav.c
to compile the code and I'm measuring time in the shell just by
time ./test_grav.
When I used
get_numb_threads()
to get the number of threads it displayed 4. top also shows more than 300% (sometimes ~380%) cpu usage. Interesting little fact if I start the parallel region before the time-loop (meaning the most outer for-loop) and without any actual #pragma omp for it is equivalent to making one parallel region for every major (the three second to most outer loops) loop. So I think it is an optimization thing, but I don't know how to solve it. Can anyone help me?
Edit: I made the example verifiable and lowered numbers like numb to make it better testable but the problem still occurs. Even when I remove the critical region as suggested by TheQuantumPhysicist, just not as severely.
I believe that critical section is the cause of the problem. Consider taking all critical sections outside the parallelized loop and running them after the parallelization is over.
Try this:
#pragma omp parallel shared(a,pos)
{
#pragma omp for private(id2,k,r12,richtung,a_local)
for(id=0; id <numb; ++id){
for(id2=0; id2<id; id2++){
for(k=0;k<3;k++){
r12 += sqrt((pos[id][k]-pos[id2][k])*(pos[id][k]-pos[id2][k]));
}
for(k =0; k<3;k++){
richtung[k] = (-1.e10)*(pos[id][k]-pos[id2][k])/r12;
a[id][k] += richtung[k]/(((r12)*(r12))+epsilon);
a_local[k]+= richtung[k]/(((r12)*(r12))+epsilon)*(-1.0);
}
}
}
}
for(id=0; id <numb; ++id){
for(id2=0; id2<id; id2++){
for(k=0;k<3;k++){
a[id2][k] += a_local[k];
}
}
}
Critical sections will lead to locking and blocking. If you can keep these sections linear, you'll win a lot in performance.
Notice that I'm talking about a syntactic solution, which I don't know whether it works for your case. But to be clear: If every point in your series depends on the next one, then parallelizing is not a solution for you; at least simple parallelization using OpenMP.

All OpenMP Tasks running on the same thread

I have wrote a recursive parallel function using tasks in OpenMP. While it gives me the correct answer and runs fine I think there is an issue with the parallelism.The run-time in comparison with a serial solution does not scale in the same other parallel problem I have solved without tasks have. When printing each thread for the tasks they are all running on thread 0. I am compiling and running on Visual Studio Express 2013.
int parallelOMP(int n)
{
int a, b, sum = 0;
int alpha = 0, beta = 0;
for (int k = 1; k < n; k++)
{
a = n - (k*(3 * k - 1) / 2);
b = n - (k*(3 * k + 1) / 2);
if (a < 0 && b < 0)
break;
if (a < 0)
alpha = 0;
else if (p[a] != -1)
alpha = p[a];
if (b < 0)
beta = 0;
else if (p[b] != -1)
beta = p[b];
if (a > 0 && b > 0 && p[a] == -1 && p[b] == -1)
{
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[a] = parallelOMP(a);
}
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[b] = parallelOMP(b);
}
#pragma omp taskwait
}
}
alpha = p[a];
beta = p[b];
}
else if (a > 0 && p[a] == -1)
{
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[a] = parallelOMP(a);
}
#pragma omp taskwait
}
}
alpha = p[a];
}
else if (b > 0 && p[b] == -1)
{
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[b] = parallelOMP(b);
}
#pragma omp taskwait
}
}
beta = p[b];
}
if (k % 2 == 0)
sum += -1 * (alpha + beta);
else
sum += alpha + beta;
}
if (sum > 0)
return sum%m;
else
return (m + (sum % m)) % m;
}
Sometimes I wish comments on SO could be as richly formatted as the answers, but alas that's not the case. Therefore, here comes a long comment disguised as an answer.
It appears that a very common mistake in writing recursive OpenMP code is not understanding how exactly parallel regions work. Consider the following code (uses explicit tasks, therefore support for OpenMP 3.0 or newer required):
void par_rec_func (int arg)
{
if (arg <= 0) return;
#pragma omp parallel num_threads(2)
{
#pragma omp task
par_rec_func(arg-1);
#pragma omp task
par_rec_func(arg-1);
}
}
// somewhere in the main function
par_rec_func(10);
There is a problem with this code. The problem is that, except for the top-level invocation of par_rec_func(), in all other invocations the parallel region will be created in the context of an enclosing outer parallel region. This is called nested parallelism and by default is disabled, which means that all parallel regions beneath the top-level one are going to be inactive, i.e. they will execute serially. Since tasks bind to the innermost parallel region, they will also get executed in serial. What will happen with this code is that it will spawn one additional thread (for a total of two) at the top-level invocation of par_rec_func() and each thread will then execute a whole branch of the recursion tree (i.e. one half of the whole tree). If one runs that code on a machine with 64 cores, 62 of them will idle. In order for the nested parallelism to be enabled, one has to either set the environment variable OMP_NESTED to true or call omp_set_nested() and pass it a true flag:
omp_set_nested(1);
Once nested parallelism has been enabled, one faces a new problem. Every time a nested parallel region is encountered, the encountering thread will either spawn an additional one (because of num_threads(2)) or acquire an idle thread from the runtime's thread pool. At every deeper level of recursion, this program will require twice as many threads as at the previous level. Though an upper limit of the total number of threads could be set via OMP_THREAD_LIMIT (another OpenMP 3.0 feature) and with the overhead aside, this is not what one really wants in such cases.
The correct solution in that case is to use orphaned tasks in the dynamic scope of a single parallel region:
void par_rec_func (int arg)
{
if (arg <= 0) return;
#pragma omp task
par_rec_func(arg-1);
#pragma omp task
par_rec_func(arg-1);
// Wait for the child tasks to complete if necessary
#pragma omp taskwait
}
// somewhere in the main function
#pragma omp parallel
{
#pragma omp single
par_rec_func(10);
}
The advantages of this method are many. First of all, only a single parallel region is created with as many threads as specified (e.g. by setting OMP_NUM_THREADS or by any other means). When the child tasks call recursively into par_rec_func(), that simply adds new tasks to the parallel region without spawning new threads. This greatly helps in the case where the recursion tree is not balanced, since many quality OpenMP runtimes implement task stealing, e.g. thread i could execute child tasks of a task that executes in thread j, where i != j.
Given an OpenMP 2.0 compiler like VC++, one cannot do much except to approximate the above idea by using nested parallelism and explicitly disabling it at a certain level:
void par_rec_func (int arg)
{
if (arg <= 0) return;
int level = omp_get_level();
#pragma omp parallel sections num_threads(2) if(level < 4)
{
#pragma omp section
par_rec_func(arg-1);
#pragma omp section
par_rec_func(arg-1);
}
}
// somewhere in the main function
int saved_nested = omp_get_nested();
omp_set_nested(1);
par_rec_func(10);
omp_set_nested(saved_nested);
omp_get_level() is used to determine the level of nesting and the if clause is used to selectively deactivate parallel regions at fourth or deeper level of nesting. This solution is dumb and won't work well when the recursion tree is unbalanced.
Actual Problem:
You are using Visual Studio 2013.
Visual Studio has never supported OMP versions beyond 2.0 (see here).
OMP Tasks are a feature of OMP 3.0 (see spec).
Ergo, using VS at all means no OMP tasks for you.
If OMP Tasks are an essential requirement, use a different compiler. If OMP is not an essential requirement, you should consider an alternative parallel task handling library. Visual Studio includes the MS Concurrency Runtime, and the Parallel Patterns Library built on top of it. I have recently moved from OMP to PPL due to the fact I'm using VS for work; it isn't quite a drop-in replacement but it is quite capable.
My second attempt at solving this, again preserved for historical reasons:
So, the problem is almost certainly that you're defining your omp tasks outside of a omp parallel region.
Here's a contrived example:
void work()
{
#pragma omp parallel
{
#pragma omp single nowait
for (int i = 0; i < 5; i++)
{
#pragma omp task untied
{
std::cout <<
"starting task " << i <<
" on thread " << omp_get_thread_num() << "\n";
sleep(1);
}
}
}
}
If you omit the parallel declaration, the job runs serially:
starting task 0 on thread 0
starting task 1 on thread 0
starting task 2 on thread 0
starting task 3 on thread 0
starting task 4 on thread 0
But if you leave it in:
starting task starting task 3 on thread 1
starting task 0 on thread 3
2 on thread 0
starting task 1 on thread 2
starting task 4 on thread 2
Success, complete with authentic misuse of shared output resources.
(for reference, if you omit the single declaration, each thread will run the loop, resulting in 20 tasks being run on my 4 cpu VM).
Original answer included below for completeness, but no longer relevant!
In every case, your omp task is a single, simple thing. It probably runs and completes immediately:
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
#pragma omp task shared(p), untied
cout << omp_get_thread_num();
Because you never start one long-running task before firing off the next task, everything will probably run on the first allocated thread.
Perhaps you meant to do something like this?
if (a > 0 && b > 0 && p[a] == -1 && p[b] == -1)
{
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[a] = parallelOMP(a);
}
#pragma omp task shared(p), untied
{
cout << omp_get_thread_num();
p[b] = parallelOMP(b);
}
#pragma omp taskwait
alpha = p[a];
beta = p[b];
}

Parallel Processing Collision Pairs

I am writing some code for parallel processing of collisions, the expected result would be to have an acceleration for each thread, but I'm not getting any acceleration on the data processing because I have a critical section inside parallel_reduce() and I believe its serializing too much the access to the objects. This is how the code looks:
do {
totalVel = 0.;
#pragma omp parallel for
for (unsigned long i = 0; i < bodyContact.size(); i++) {
totalVel += bodyContact.at(i).bodyA()->parallel_reduce();
totalVel += bodyContact.at(i).bodyB()->parallel_reduce();
}
} while (totalVel >= 0.00001);
Is there any way to gain more speed by making it parallel or the serialization of access is too much?
Observations:
bodyA() and bodyB() are objects that repeat themselves a lot inside the bodyContact container.
For now parallel_reduce() only does one multiplication (the critical section), but will get more complex.
double parallel_reduce(){
#pragma omp critical
this->vel_ *= 0.99;
return vel_.length();
}
Actual timings:
serial, 25.635
parallel, 123.559
There is always cost of using OpenMP constructs, so avoid using a parallel inside a loop, following the implementation it could launch at each time new threads, instead of rewaking the previous launched threads.
In fact, if bodyContact.size() is small and the do {} while in number of step is big and parallel_reduce is very quick is very hard to have scalability with just a few OpenMP pragma.
#pragma omp parallel shared(totalVel) shared(bodyContact)
{
do {
totalVel = 0.;
#pragma omp for reduce(+:totalVel)
for (unsigned long i = 0; i < bodyContact.size(); i++) {
totalVel += bodyContact.at(i).bodyA()->parallel_reduce();
totalVel += bodyContact.at(i).bodyB()->parallel_reduce();
}
} while (totalVel >= 0.00001);
}
The above is likely not only slower, but very likely wrong; all the threads are trying to update the same totalVel. Tonnes of race conditions, but also contention, cache invalidation, etc.
Assuming the parallel_reduce() stuff is ok, you'd like something more like
do {
totalVel = 0.;
#pragma omp parallel for default(none) shared(bodyContact) reduction(+:totalVel)
for (unsigned long i = 0; i < bodyContact.size(); i++) {
totalVel += bodyContact.at(i).bodyA()->parallel_reduce();
totalVel += bodyContact.at(i).bodyB()->parallel_reduce();
}
} while (totalVel >= 0.00001);
which will do the reduction on totalVel correctly.