I am using the OpenMP taskloop construct inside a task construct:
#include <cmath>
#include <cstdio>
#include <omp.h>

double compute(int input) {
    int array[4] = {0};
    double value = input;
    #pragma omp taskloop private(value)
    for (int i = 0; i < 5000000; i++) {
        // random computation, the result is not meaningful
        value *= std::tgamma(std::exp(std::cos(std::sin(value)*std::cos(value))));
        int tid = omp_get_thread_num();
        array[tid]++;
    }
    for (int i = 0; i < 4; i++) {
        printf("array[%d] = %d ", i, array[i]);
    }
    printf("\n");
    return value;
}

int main(int argc, char *argv[]) {
    omp_set_nested(1);
    omp_set_num_threads(4); // 4 cores on my machine
    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task
            { compute(omp_get_thread_num()); }
        }
    }
}
The resulting array is all 0. However, if I change the taskloop to parallel for:
#pragma omp parallel for private(value)
for (int i = 0; i < 5000000; i++) {
    value *= std::tgamma(std::exp(std::cos(std::sin(value)*std::cos(value))));
    int tid = omp_get_thread_num();
    array[tid]++;
}
Then the result of the array is 1250000 for each index. Is there anything wrong in my use of taskloop construct?
As @Cimbali confirmed, the issue is that array is not being shared among the tasks. Since you did not explicitly say whether array is shared or private, OpenMP determines its data-sharing attribute by its default rules, and task-generating constructs have different defaults than parallel for. I couldn't find anything that spells the rules out more explicitly; this was the best I could find. Try specifying a default clause and marking the array variable as shared.
According to the Data-Sharing Attribute Rules, this is the expected behaviour:
"In a task generating construct, if no default clause is present, a variable for which the data-sharing attribute is not determined by the rules above is firstprivate."
BTW: it is always recommended to use default(none), so that you are forced to define the data-sharing attributes explicitly.
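For example, here is a minimal sketch of the fix, mirroring the loop from the question (array made explicitly shared, value kept private as before, and default(none) forcing every attribute to be spelled out):

#pragma omp taskloop default(none) shared(array) private(value)
for (int i = 0; i < 5000000; i++) {
    value *= std::tgamma(std::exp(std::cos(std::sin(value)*std::cos(value))));
    array[omp_get_thread_num()]++; // all tasks now update the same shared array
}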
In an OpenMP program, suppose I have a series of tasks, each of which should be executed by a single thread. The tasks are all different, so I cannot fit them into a #pragma omp for construct. Inside each single construct, the task updates a variable shared by all tasks. How can I protect the update of such a variable?
A simplified example:
#include <vector>

struct A {
    std::vector<double> x, y, z;
};

int main()
{
    A r;

    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i);
        // DANGER
        r.x = std::move(res);
    }
    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i * i);
        // DANGER
        r.y = std::move(res);
    }
    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i * i + 2);
        // DANGER
        r.z = std::move(res);
    }

    #pragma omp barrier
    return 0;
}
The code lines below // DANGER are problematic because they modify the memory contents of a shared variable.
In the example above, it might be that it still works without issues, because I am effectively modifying different members of r. Still, the problem remains: how can I make sure that tasks do not simultaneously update r? Is there a "sort-of" atomic pragma for the single construct?
There is no data race in your original code, because x, y and z are different vectors in struct A (as already emphasized by @463035818_is_not_a_number), so in this respect you do not have to change anything in your code.
However, a #pragma omp parallel directive is missing, so at the moment it is a serial program. It should look like this:
#pragma omp parallel num_threads(3)
{
    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i);
        // DANGER
        r.x = std::move(res);
    }
    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i * i);
        // DANGER
        r.y = std::move(res);
    }
    #pragma omp single nowait
    {
        std::vector<double> res;
        for (int i = 0; i < 10; ++i)
            res.push_back(i * i + 2);
        // DANGER
        r.z = std::move(res);
    }
}
In this case #pragma omp barrier is not necessary, as there is an implied barrier at the end of the parallel region. Note that I have used the num_threads(3) clause to make sure that only 3 threads are assigned to this parallel region. If you skip this clause, the extra threads simply wait at the implied barrier.
In the case of an actual data race (i.e. more than one single region/section changing the same struct member), you can use #pragma omp critical (name) to rectify it. But keep in mind that this kind of serialization can negate the benefits of multithreading when there is not enough real parallel work besides the critical section.
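As an illustration, here is a sketch of what that would look like if two of the blocks really did write the same member, say r.x (the critical name update_r is arbitrary):

#pragma omp single nowait
{
    std::vector<double> res;
    for (int i = 0; i < 10; ++i)
        res.push_back(i);
    // Only one thread at a time may execute a critical region with this
    // name, so concurrent writers to r.x are serialized.
    #pragma omp critical (update_r)
    r.x = std::move(res);
}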
Note that a much better solution is to use #pragma omp sections (as suggested by @PaulG). If the number of tasks to run in parallel is known at compile time, sections are the typical choice in OpenMP:
#pragma omp parallel sections
{
    #pragma omp section
    {
        // Task 1 here
    }
    #pragma omp section
    {
        // Task 2
    }
    #pragma omp section
    {
        // Task 3
    }
}
For the record, I would like to show that it is easy to do this with #pragma omp for as well:
#pragma omp parallel for
for (int i = 0; i < 3; i++)
{
    if (i == 0)
    {
        // Task 1
    }
    else if (i == 1)
    {
        // Task 2
    }
    else if (i == 2)
    {
        // Task 3
    }
}
"each task updates a variable shared by all tasks."
Actually, they don't. Consider rewriting the code like this (you don't need the temporary vectors):
void foo(std::vector<double>& x, std::vector<double>& y, std::vector<double>& z) {
    #pragma omp single nowait
    {
        for (int i = 0; i < 10; ++i)
            x.push_back(i);
    }
    #pragma omp single nowait
    {
        for (int i = 0; i < 10; ++i)
            y.push_back(i * i);
    }
    #pragma omp single nowait
    {
        for (int i = 0; i < 10; ++i)
            z.push_back(i * i + 2);
    }
    #pragma omp barrier
}
As long as the caller can ensure that x, y and z do not refer to the same object, there is no data race. Each part of the code modifies a separate vector. No synchronization is needed.
Now, it does not matter where those vectors come from. You can call the function like this:
A r;
foo(r.x, r.y, r.z);
PS: I am not that familiar with OpenMP anymore; I assumed the annotations correctly do what you want them to do.
I want to use Eigen matrices in combination with OpenMP reduction.
Below is a small example of how I do it (and it works). The object myclass has three attributes (an Eigen matrix and two integers corresponding to its dimensions) and a member function do_something that uses an OpenMP reduction on a sum, which I declare myself because Eigen matrices are not standard types.
#include "Eigen/Core"

class myclass {
public:
    Eigen::MatrixXd m_mat;
    int m_n; // number of rows in m_mat
    int m_p; // number of cols in m_mat
    myclass(int n, int p); // constructor
    void do_something();   // omp reduction on `m_mat`
};

myclass::myclass(int n, int p) {
    m_n = n;
    m_p = p;
    m_mat = Eigen::MatrixXd::Zero(m_n, m_p); // init m_mat with null values
}

#pragma omp declare reduction (+: Eigen::MatrixXd: omp_out=omp_out+omp_in) \
    initializer(omp_priv=Eigen::MatrixXd::Zero(omp_orig.rows(), omp_orig.cols()))

void myclass::do_something() {
    Eigen::MatrixXd tmp = Eigen::MatrixXd::Zero(m_n, m_p); // temporary matrix
    #pragma omp parallel for reduction(+:tmp)
    for (int i = 0; i < m_n; i++) {
        for (int l = 0; l < m_n; l++) {
            for (int j = 0; j < m_p; j++) {
                tmp(l,j) += 10;
            }
        }
    }
    m_mat = tmp;
}
Problem: OpenMP does not allow (or at least not all implementations allow) reductions on class members, only on plain variables. Thus, I do the reduction on a temporary matrix and have the copy m_mat = tmp at the end, which I would like to avoid (because m_mat can be a big matrix and I use this reduction a lot in my code).
Wrong fix: I tried to use an Eigen::Map so that tmp corresponds to the data stored in m_mat. Thus, I replaced the omp reduction declaration and the do_something member function definition in the previous code with:
#pragma omp declare reduction (+: Eigen::Map<Eigen::MatrixXd>: omp_out=omp_out+omp_in) \
    initializer(omp_priv=Eigen::MatrixXd::Zero(omp_orig.rows(), omp_orig.cols()))

void myclass::do_something() {
    Eigen::Map<Eigen::MatrixXd> tmp = Eigen::Map<Eigen::MatrixXd>(m_mat.data(), m_n, m_p);
    #pragma omp parallel for reduction(+:tmp)
    for (int i = 0; i < m_n; i++) {
        for (int l = 0; l < m_n; l++) {
            for (int j = 0; j < m_p; j++) {
                tmp(l,j) += 10;
            }
        }
    }
}
However, it does not work anymore and I get the following error at compilation:
error: conversion from 'const ConstantReturnType {aka const Eigen::CwiseNullaryOp<…, Eigen::Matrix<…> >}' to non-scalar type 'Eigen::Map<…, 0, Eigen::Stride<0, 0> >' requested
    initializer(omp_priv=Eigen::MatrixXd::Zero(omp_orig.rows(), omp_orig.cols()))
I get that the implicit conversion from Eigen::MatrixXd to Eigen::Map<Eigen::MatrixXd> does not work in the omp reduction but I don't know how to make it work.
Thanks in advance
Edit 1: I forgot to mention that I use gcc v5.4 on an Ubuntu machine (I tried both 16.04 and 18.04).
Edit 2: I modified my example because there was no reduction in the first one. This example is not exactly what I do in my code; it is just a minimal "dumb" example.
As @ggael mentioned in their answer, Eigen::Map can't be used for this because it needs to map to existing storage. If you did make it work, all threads would use the same underlying memory, which would create a race condition.
The most likely way to avoid the temporary you create in the initial thread is to bind the member variable to a reference, which should always be valid for use in a reduction. That would look something like this:
void myclass::do_something() {
    Eigen::MatrixXd &loc_ref = m_mat; // local binding
    #pragma omp parallel for reduction(+:loc_ref)
    for (int i = 0; i < m_n; i++) {
        for (int l = 0; l < m_n; l++) {
            for (int j = 0; j < m_p; j++) {
                loc_ref(l,j) += 10;
            }
        }
    }
    // m_mat = tmp; is no longer necessary, we reduce into the original
}
That said, note that this still creates a local copy of the zero matrix in each and every thread, much like @ggael showed in their example, so using a reduction this way remains quite expensive. If the actual code adds values in nested loops like this snippet does, the reduction can be avoided by dividing the work so that either:
- each thread touches a different part of the matrix, or
- an atomic is used to update the individual value.
Solution 1 example:
void myclass::do_something() {
    // loop order transposed so the threads split the work across l:
    // each thread owns a distinct set of rows, so m_mat can be updated
    // directly without any race
    #pragma omp parallel for
    for (int l = 0; l < m_n; l++) {
        for (int i = 0; i < m_n; i++) {
            for (int j = 0; j < m_p; j++) {
                m_mat(l,j) += 10;
            }
        }
    }
}
Solution 2 example:
void myclass::do_something() {
    #pragma omp parallel for
    for (int i = 0; i < m_n; i++) {
        for (int l = 0; l < m_n; l++) {
            for (int j = 0; j < m_p; j++) {
                // use the reference to get a variable binding through operator()
                auto &target = m_mat(l,j);
                #pragma omp atomic
                target += 10;
            }
        }
    }
}
The problem is that an Eigen::Map can only be created on top of an existing memory buffer. In your example, the underlying OpenMP implementation will try to do something like this:
Eigen::Map<MatrixXd> tmp_0 = MatrixXd::Zero(r,c);
Eigen::Map<MatrixXd> tmp_1 = MatrixXd::Zero(r,c);
...
/* parallel code, thread #i accumulates into tmp_i */
...
tmp = tmp_0 + tmp_1 + ...;
and stuff like Map<MatrixXd> tmp_0 = MatrixXd::Zero(r,c) is of course not possible: omp_priv has to be a MatrixXd. I don't know whether it is possible to customize the type of the private temporaries created by OpenMP. If not, you can do the job by hand by creating a std::vector<Eigen::MatrixXd> tmps(omp_get_num_threads()) and doing the final reduction yourself, or better: don't bother about making a single additional copy, it will be largely negligible compared to all the other work and copies done by OpenMP itself.
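A minimal sketch of that manual approach, reusing the myclass from the question (the accumulator names are my own, and it assumes <vector> and <omp.h> are included):

void myclass::do_something() {
    std::vector<Eigen::MatrixXd> tmps; // one accumulator per thread
    #pragma omp parallel
    {
        #pragma omp single
        tmps.resize(omp_get_num_threads(), Eigen::MatrixXd::Zero(m_n, m_p));
        // implicit barrier at the end of the single: tmps is ready for everyone

        Eigen::MatrixXd &my_tmp = tmps[omp_get_thread_num()];
        #pragma omp for
        for (int i = 0; i < m_n; i++)
            for (int l = 0; l < m_n; l++)
                for (int j = 0; j < m_p; j++)
                    my_tmp(l,j) += 10;
    }
    // final reduction done by hand, sequentially
    m_mat = Eigen::MatrixXd::Zero(m_n, m_p);
    for (const auto &t : tmps)
        m_mat += t;
}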
I've just started programming with OpenMP and I'm trying to parallelize a for loop with a variable that I need outside the loop. Something like this:
float a = 0;
for (int i = 0; i < x; i++)
{
    int x = algorithm();
    /* each iteration, x has a different value */
    a = a + x;
}
cout << a;
I think the variable a has to be a local variable for each thread. After those threads have finished their work, all the local copies of a should be added into one final result.
How can I do that?
Use the #pragma omp parallel for reduction(+:a) clause before the for loop
Variables declared within the for loop are local to each thread, as are the loop counters.
Variables declared outside the #pragma omp parallel block are shared by default, unless otherwise specified (see the shared, private and firstprivate clauses). Care should be taken when updating shared variables, as a race condition may occur.
In this case, the reduction(+:a) clause indicates that a is a shared variable on which an addition is performed at each iteration. Threads will automatically keep track of the total amount to be added and safely increment a at the end of the loop.
Both codes below are equivalent:
float a = 0.0f;
int n = 1000;
#pragma omp parallel shared(a) // spawn the threads
{
    float acc = 0; // accumulator local to each thread
    #pragma omp for // iterations will be shared among the threads
    for (int i = 0; i < n; i++) {
        float x = algorithm(i); // do something
        acc += x;               // increment the local accumulator
    }
    #pragma omp atomic
    a += acc; // atomic global accumulator increment: one thread at a time
} // end of the parallel region, back to a single thread
cout << a;
Is equivalent to:
float a = 0.0f;
int n = 1000;
#pragma omp parallel for reduction(+:a)
for (int i = 0; i < n; i++) {
    int x = algorithm(i);
    a += x;
} // parallel for
cout << a;
Note that you can't write a for loop whose stop condition is i < x when x is also a local variable defined inside the loop body.
There are many ways to achieve your goal, but the simplest is to employ an OpenMP parallel reduction:
float a = 0.0f;
#pragma omp parallel for reduction(+:a)
for (int i = 0; i < x; i++)
    a += algorithm();
cout << a;
You can use the following structure to perform a parallel reduction with thread-private accumulators, since your update is an associative scalar operation.
float a = 0; // global, will be shared
#pragma omp parallel
{
    float y = 0; // private to each thread
    #pragma omp for
    for (int i = 0; i < x; i++)
        y += algorithm(); // better practice: don't reuse the loop-termination variable x inside the loop
    // still inside the parallel region
    #pragma omp atomic
    a += y;
}
cout << a;
Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
    const std::set<int>& cluster = clusters[i];
    // ... expensive calculations ...
    for (int j : cluster)
        velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f does not use velocity.
A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.
I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j); operation. This gives me a compile error (it might have something to do with the elements being of type Eigen::Vector3d, or with velocity being a class member). Also, I read that atomic operations are very slow compared to having a private variable for each thread and doing a reduction at the end. So that's what I would like to do, I think.
I have come up with this:
#pragma omp parallel
{
    // these variables are local to each thread
    std::vector<Eigen::Vector3d> velocity_local(velocity.size());
    std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));

    #pragma omp for
    for (size_t i = 0; i < clusters.size(); ++i)
    {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster)
            velocity_local[j] += f(j); // save results from the previous calculations
    }

    // now each thread can save its results to the global variable
    #pragma omp critical
    {
        for (size_t i = 0; i < velocity_local.size(); ++i)
            velocity[i] += velocity_local[i];
    }
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for loop. Also the question whether this is the best approach still holds.
Edit: As requested in a comment, the reduction clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++)
You're doing an array reduction. I have described this several times (e.g. in "reducing an array in openmp" and "fill histograms array reduction in parallel with openmp without using a critical section"). You can do this with and without a critical section.
You have already done it correctly with a critical section (in your recent edit), so let me describe how to do it without one.
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread  = omp_get_thread_num();
    const int vsize    = velocity.size();

    #pragma omp single
    velocitya.resize(vsize*nthreads);
    std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
              Eigen::Vector3d(0,0,0));

    #pragma omp for schedule(static)
    for (size_t i = 0; i < clusters.size(); i++) {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster)
            velocitya[ithread*vsize+j] += f(j);
    }

    #pragma omp for schedule(static)
    for (int i = 0; i < vsize; i++) {
        for (int t = 0; t < nthreads; t++) {
            velocity[i] += velocitya[vsize*t + i];
        }
    }
}
This method requires extra care/tuning due to false sharing which I have not done.
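For example, one possible mitigation, sketched here without any tuning and reusing the names (velocitya, vsize, nthreads, ithread) from the code above, is to pad each per-thread slice so that neighbouring threads do not write to the same cache line:

// Assumed cache-line size of 64 bytes; Eigen::Vector3d is 3 doubles (24 bytes),
// so a gap of ceil(64/24) = 3 elements keeps adjacent slices at least one
// cache line apart.
const int gap   = (64 + sizeof(Eigen::Vector3d) - 1) / sizeof(Eigen::Vector3d);
const int slice = vsize + gap;          // padded slice length
#pragma omp single
velocitya.resize(slice * nthreads);
// ... then use ithread*slice (instead of ithread*vsize) everywhere the
// per-thread offset is computed, including the final merge loop.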
As to which method is better you will have to test.
At the start of #pragma omp parallel a bunch of threads are created; then, when we get to #pragma omp for, the workload is distributed among them. What happens if this for loop has another for loop inside it, and I place a #pragma omp for before it as well? Does each thread create new threads? If not, which threads are assigned that task? What exactly happens in this situation?
By default, no threads are spawned for the inner loop. It is done sequentially by the thread that reaches it.
This is because nesting is disabled by default. However, if you enable nesting via omp_set_nested(), then a new team of threads will be spawned.
However, if you aren't careful, this will result in p^2 threads (since each of the original p threads will spawn p more). That is why nesting is disabled by default.
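For instance, a sketch of what enabling nesting looks like (the thread counts are illustrative; the actual numbers also depend on implementation limits and environment settings):

omp_set_nested(1);                      // nested parallelism is off by default
#pragma omp parallel num_threads(4)     // outer team: 4 threads
{
    #pragma omp parallel num_threads(4) // each outer thread spawns its own inner team
    {
        // up to 4 * 4 = 16 threads may be active here
    }
}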
In a situation like the following:
#pragma omp parallel
{
    #pragma omp for
    for (int ii = 0; ii < n; ii++) {
        /* ... */
        #pragma omp for
        for (int jj = 0; jj < m; jj++) {
            /* ... */
        }
    }
}
what happens is that you trigger undefined behavior, because you violate the OpenMP standard. More precisely, you violate the restrictions appearing in section 2.5 (worksharing constructs):
The following restrictions apply to worksharing constructs:
Each worksharing region must be encountered by all threads in a team or by none at all.
The sequence of worksharing regions and barrier regions encountered must be the same for every thread in a team.
This is clearly shown in examples A.39.1c and A.40.1c:
Example A.39.1c: The following example of loop construct nesting is conforming because the inner and outer loop regions bind to different parallel regions:
void work(int i, int j) {}

void good_nesting(int n)
{
    int i, j;
    #pragma omp parallel default(shared)
    {
        #pragma omp for
        for (i=0; i<n; i++) {
            #pragma omp parallel shared(i, n)
            {
                #pragma omp for
                for (j=0; j < n; j++)
                    work(i, j);
            }
        }
    }
}
Example A.40.1c: The following example is non-conforming because the inner and outer loop regions are closely nested:
void work(int i, int j) {}

void wrong1(int n)
{
    #pragma omp parallel default(shared)
    {
        int i, j;
        #pragma omp for
        for (i=0; i<n; i++) {
            /* incorrect nesting of loop regions */
            #pragma omp for
            for (j=0; j<n; j++)
                work(i, j);
        }
    }
}
Notice that this is different from:
#pragma omp parallel for
for (int ii = 0; ii < n; ii++) {
    /* ... */
    #pragma omp parallel for
    for (int jj = 0; jj < m; jj++) {
        /* ... */
    }
}
in which you try to spawn nested parallel regions. Only in that case does the discussion in Mysticial's answer apply.