Use Eigen Map in OpenMP reduction - c++

I want to use Eigen matrices in combination with OpenMP reduction.
The following is a small example of how I do it (and it works). The object myclass has three attributes (an Eigen matrix and two integers corresponding to its dimensions) and a member function do_something that uses an omp reduction on a sum, which I have to declare myself because Eigen matrices are not standard types.
#include "Eigen/Core"
class myclass {
public:
Eigen::MatrixXd m_mat;
int m_n; // number of rows in m_mat
int m_p; // number of cols in m_mat
myclass(int n, int p); // constructor
void do_something(); // omp reduction on `m_mat`
};
myclass::myclass(int n, int p) {
m_n = n;
m_p = p;
m_mat = Eigen::MatrixXd::Zero(m_n,m_p); // init m_mat with null values
}
#pragma omp declare reduction (+: Eigen::MatrixXd: omp_out=omp_out+omp_in)\
initializer(omp_priv=Eigen::MatrixXd::Zero(omp_orig.rows(), omp_orig.cols()))
void myclass::do_something() {
Eigen::MatrixXd tmp = Eigen::MatrixXd::Zero(m_n, m_p); // temporary matrix
#pragma omp parallel for reduction(+:tmp)
for(int i=0; i<m_n;i++) {
for(int l=0; l<m_n; l++) {
for(int j=0; j<m_p; j++) {
tmp(l,j) += 10;
}
}
}
m_mat = tmp;
}
Problem: OpenMP does not allow (or at least not all implementations allow) reductions on class members, only on variables. Thus, I do the reduction on a temporary matrix and I am left with the copy m_mat = tmp at the end, which I would like to avoid (because m_mat can be a big matrix and I use this reduction a lot in my code).
Wrong fix: I tried to use Eigen Map so that tmp corresponds to the data stored in m_mat. To do so, I replaced the omp reduction declaration and the do_something member function definition in the previous code with:
#pragma omp declare reduction (+: Eigen::Map<Eigen::MatrixXd>: omp_out=omp_out+omp_in)\
initializer(omp_priv=Eigen::MatrixXd::Zero(omp_orig.rows(), omp_orig.cols()))
void myclass::do_something() {
Eigen::Map<Eigen::MatrixXd> tmp = Eigen::Map<Eigen::MatrixXd>(m_mat.data(), m_n, m_p);
#pragma omp parallel for reduction(+:tmp)
for(int i=0; i<m_n;i++) {
for(int l=0; l<m_n; l++) {
for(int j=0; j<m_p; j++) {
tmp(l,j) += 10;
}
}
}
}
However, it does not work anymore and I get the following error at compilation:
error: conversion from ‘const ConstantReturnType {aka const
Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<double>,
Eigen::Matrix<double, -1, -1> >}’ to non-scalar type
‘Eigen::Map<Eigen::Matrix<double, -1, -1>, 0, Eigen::Stride<0, 0> >’
requested
initializer(omp_priv=Eigen::MatrixXd::Zero(omp_orig.rows(), omp_orig.cols()))
I understand that the implicit conversion from Eigen::MatrixXd to Eigen::Map<Eigen::MatrixXd> does not work in the omp reduction, but I don't know how to make it work.
Thanks in advance
Edit 1: I forgot to mention that I use gcc v5.4 on an Ubuntu machine (I tried both 16.04 and 18.04).
Edit 2: I modified my example as there was no reduction in the first one. This example is not exactly what I do in my code; it is just a minimal "dumb" example.

As #ggael mentioned in their answer, Eigen::Map can't be used for this because it needs to map to existing storage. If you did make it work, all threads would be using the same underlying memory, which would create a race condition.
The most likely way to avoid the temporary created in the initial thread is to bind the member variable to a reference, which should always be valid for use in a reduction. That would look something like this:
void myclass::do_something() {
Eigen::MatrixXd &loc_ref = m_mat; // local binding
#pragma omp parallel for reduction(+:loc_ref)
for(int i=0; i<m_n;i++) {
for(int l=0; l<m_n; l++) {
for(int j=0; j<m_p; j++) {
loc_ref(l,j) += 10;
}
}
}
// m_mat = tmp; no longer necessary, reducing into the original
}
That said, note that this still creates a local copy of the zero matrix in each and every thread, much like #ggael showed in their example, so using the reduction this way will be quite expensive. If the actual code accumulates values in nested loops like this snippet does, the reduction can be avoided by dividing the work such that either:
each thread touches a different part of the matrix
an atomic is used to update the individual value
Solution 1 example:
void myclass::do_something() {
// loop transposed so threads split across l
#pragma omp parallel for
for(int l=0; l<m_n; l++) {
for(int i=0; i<m_n;i++) {
for(int j=0; j<m_p; j++) {
m_mat(l,j) += 10;
}
}
}
}
Solution 2 example:
void myclass::do_something() {
#pragma omp parallel for
for(int i=0; i<m_n;i++) {
for(int l=0; l<m_n; l++) {
for(int j=0; j<m_p; j++) {
auto &target = m_mat(l,j);
// use the ref to get a variable binding through the operator()
#pragma omp atomic
target += 10;
}
}
}
}

The problem is that Eigen::Map can only be created over an existing memory buffer. In your example, the underlying OpenMP implementation will try to do something like this:
Eigen::Map<MatrixXd> tmp_0 = MatrixXd::Zero(r,c);
Eigen::Map<MatrixXd> tmp_1 = MatrixXd::Zero(r,c);
...
/* parallel code, thread #i accumulate in tmp_i */
...
tmp = tmp_0 + tmp_1 + ...;
and stuff like Map<MatrixXd> tmp_0 = MatrixXd::Zero(r,c) is of course not possible: omp_priv has to be a MatrixXd. I don't know whether it is possible to customize the type of the private temporaries created by OpenMP. If not, you can do the job by hand by creating a std::vector<MatrixXd> tmps(omp_get_max_threads()); and doing the final reduction yourself, or better: don't bother about making a single additional copy, it will be largely negligible compared to all the other work and copies done by OpenMP itself.
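For completeness, here is a rough sketch of that by-hand variant, assuming the myclass from the question (not benchmarked; omp_get_max_threads() bounds the number of threads the parallel region may use):
#include <omp.h>
#include <vector>
void myclass::do_something() {
    // one private accumulator per thread, all initialized to zero
    std::vector<Eigen::MatrixXd> tmps(omp_get_max_threads(),
                                      Eigen::MatrixXd::Zero(m_n, m_p));
    #pragma omp parallel
    {
        Eigen::MatrixXd& tmp = tmps[omp_get_thread_num()]; // this thread's accumulator
        #pragma omp for
        for(int i=0; i<m_n; i++)
            for(int l=0; l<m_n; l++)
                for(int j=0; j<m_p; j++)
                    tmp(l,j) += 10;
    }
    m_mat.setZero();
    for(std::size_t t=0; t<tmps.size(); t++) // final reduction, done serially by hand
        m_mat += tmps[t];
}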

Related

OpenMP taskloop inside task

I am using the OpenMP taskloop construct inside a task construct:
double compute(int input) {
int array[4] = {0};
double value = input;
#pragma omp taskloop private(value)
for(int i=0; i<5000000; i++) {
// random computation, the result is not meaningful
value *= std::tgamma(std::exp(std::cos(std::sin(value)*std::cos(value))));
int tid = omp_get_thread_num();
array[tid] ++;
}
for(int i=0; i<4; i++) {
printf("array[%d] = %d ", i, array[i]);
}
printf("\n");
return value;
}
int main (int argc, char *argv[]) {
omp_set_nested(1);
omp_set_num_threads(4); // 4 cores on my machine
#pragma omp parallel
{
#pragma omp single
{
#pragma omp task
{ compute(omp_get_thread_num()); }
}
}
}
The resulting array is all 0. However, if I change the taskloop to parallel for:
#pragma omp parallel for private(value)
for(int i=0; i<5000000; i++) {
value *= std::tgamma(std::exp(std::cos(std::sin(value)*std::cos(value))));
int tid = omp_get_thread_num();
array[tid] ++;
}
Then the result of the array is 1250000 for each index. Is there anything wrong in my use of taskloop construct?
As #Cimbali confirmed, your issue is that array is not being shared among the threads. Since you did not explicitly say whether the variable array is shared or private, OpenMP determines its data-sharing attribute by its rules, and tasks have different default data-sharing attributes than parallel for. I couldn't find anything that spells the rules out explicitly; this was the best I could find. Try specifying a default clause and declaring the array variable as shared.
According to Data-Sharing Attribute Rules this is the expected behaviour:
"In a task generating construct, if no default clause is present, a variable for which the data-sharing attribute is not determined by the rules above is firstprivate."
BTW: it is always recommended to use default(none); then you are forced to define the data-sharing attributes explicitly.
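For illustration, a minimal sketch of what that could look like on the loop from the question (I use firstprivate(value) rather than private(value), since the latter leaves value uninitialized inside the loop; that change is my adjustment, not part of the original code):
#pragma omp taskloop default(none) shared(array) firstprivate(value)
for(int i=0; i<5000000; i++) {
    value *= std::tgamma(std::exp(std::cos(std::sin(value)*std::cos(value))));
    int tid = omp_get_thread_num();
    array[tid]++; // array is now explicitly shared, so the counts survive the taskloop
}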

OpenMP/C++: Parallel for loop with reduction afterwards - best practice?

Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster)
velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f does not use velocity.
A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.
I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j); operation. This gives me a compile error (which might have something to do with the elements being of type Eigen::Vector3d, or with velocity being a class member). Also, I read that atomic operations are very slow compared to having a private variable for each thread and doing a reduction at the end. So that's what I would like to do, I think.
I have come up with this:
#pragma omp parallel
{
// these variables are local to each thread
std::vector<Eigen::Vector3d> velocity_local(velocity.size());
std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));
#pragma omp for
for (size_t i = 0; i < clusters.size(); ++i)
{
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster)
velocity_local[j] += f(j); // save results from the previous calculations
}
// now each thread can save its results to the global variable
#pragma omp critical
{
for (size_t i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
}
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for loop. Also the question whether this is the best approach still holds.
Edit: As requested in a comment: The reduction clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++)
You're doing an array reduction. I have described this several times before (e.g. "reducing an array in openmp" and "fill histograms (array reduction) in parallel with openmp without using a critical section"). You can do this with and without a critical section.
You have already done this correctly with a critical section (in your recent edit) so let me describe how to do this without a critical section.
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
const int nthreads = omp_get_num_threads();
const int ithread = omp_get_thread_num();
const int vsize = velocity.size();
#pragma omp single
velocitya.resize(vsize*nthreads);
std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
Eigen::Vector3d(0,0,0));
#pragma omp for schedule(static)
for (size_t i = 0; i < clusters.size(); i++) {
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
}
#pragma omp for schedule(static)
for(int i=0; i<vsize; i++) {
for(int t=0; t<nthreads; t++) {
velocity[i] += velocitya[vsize*t + i];
}
}
}
This method requires extra care/tuning due to false sharing, which I have not addressed here.
As to which method is better you will have to test.
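For what it's worth, one way to reduce that false sharing is to pad each thread's strip of velocitya so that neighbouring strips never share a cache line. A sketch under the assumption of 64-byte cache lines (f, clusters and velocity are the same as above; only the allocation and indexing change):
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();
    const int vsize = velocity.size();
    // at least one cache line of padding between consecutive strips
    const int pad = int(64/sizeof(Eigen::Vector3d)) + 1;
    const int stride = vsize + pad;
    #pragma omp single
    velocitya.assign(stride*nthreads, Eigen::Vector3d(0,0,0));
    #pragma omp for schedule(static)
    for (size_t i = 0; i < clusters.size(); i++) {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster) velocitya[ithread*stride + j] += f(j);
    }
    #pragma omp for schedule(static)
    for(int i=0; i<vsize; i++) {
        for(int t=0; t<nthreads; t++) {
            velocity[i] += velocitya[stride*t + i];
        }
    }
}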

Dynamically Readjustable Arrays and OpenMP

I have a function that uses realloc to dynamically adjust the memory of a 1D array, as the initial size of the array cannot be predetermined. I want to parallelize this code by dividing the task across multiple threads, whereby each thread works on a smaller 1D array that it dynamically readjusts according to the memory required. As part of the process each thread also produces a private variable containing the final size of its small array.
In OpenMP I want to access the private copy of these arrays (through the master thread) and put all the small arrays together to obtain a final array based on the sizes computed by each thread.
Is it possible?
That can be done with a dynamic array such as std::vector. For example, let's assume that you have an array called data with n elements holding values between zero and one, and you want to select the values greater than 0.5 and store them in a new array vec. You could do exactly what you want like this:
double data[n];
for(int i=0; i<n; i++) data[i] = 1.0*rand()/RAND_MAX;
std::vector<double> vec;
#pragma omp parallel
{
std::vector<double> vec_private;
#pragma omp for nowait
for(int i=0; i<n; i++) {
if(data[i]>0.5) vec_private.push_back(data[i]);
}
#pragma omp critical
vec.insert(vec.end(), vec_private.begin(), vec_private.end());
}
To do this without a critical section requires a bit more work. It requires saving the size of each thread's array in an array and then doing a cumulative sum (aka prefix sum) on that array in a single section. Once we have the cumulative sum we can use it to merge the arrays in parallel.
int *sizea;
#pragma omp parallel
{
int nthreads = omp_get_num_threads();
#pragma omp single
{
sizea = new int [nthreads+1];
sizea[0] = 0;
}
std::vector<double> vec_private;
#pragma omp for schedule(static) nowait
for(int i=0; i<n; i++) {
if(data[i]>0.5) vec_private.push_back(data[i]);
}
sizea[omp_get_thread_num()+1] = vec_private.size();
#pragma omp barrier
#pragma omp single
{
int size = 0;
for(int i=0; i<nthreads+1; i++) {
size += sizea[i];
sizea[i] = size;
}
vec.resize(size);
}
std::copy(vec_private.begin(), vec_private.end(), vec.begin()+sizea[omp_get_thread_num()]);
}
delete[] sizea;

#pragma omp parallel for schedule crashes my program

I am building a plugin for Autodesk Maya 2013 in C++. I have to solve a set of optimization problems as fast as I can. I am using OpenMP for this task. The problem is I don't have very much experience with parallel computing. I tried to use:
#pragma omp parallel for schedule (static)
on my for loops (without enough understanding of how it's supposed to work) and it worked very well for some of my code, but crashed another portion of my code.
Here is an example of a function that crashes because of the omp directive:
void PlanarizationConstraint::fillSparseMatrix(const Optimizer& opt, vector<T>& elements, double mu)
{
int size = 3;
#pragma omp parallel for schedule (static)
for(int i = 0; i < opt.FVIc.outerSize(); i++)
{
int index = 3*i;
Eigen::Matrix<double,3,3> Qxyz = Eigen::Matrix<double,3,3>::Zero();
for(SpMat::InnerIterator it(opt.FVIc,i); it; ++it)
{
int face = it.row();
for(int n = 0; n < size; n++)
{
Qxyz.row(n) += N(face,n)*N.row(face);
elements.push_back(T(index+n,offset+face,(1 - mu)*N(face,n)));
}
}
for(int n = 0; n < size; n++)
{
for(int k = 0; k < size; k++)
{
elements.push_back(T(index+n,index+k,(1-mu)*Qxyz(n,k)));
}
}
}
#pragma omp parallel for schedule (static)
for(int j = 0; j < opt.VFIc.outerSize(); j++)
{
elements.push_back(T(offset+j,offset+j,opt.fvi[j]));
for(SpMat::InnerIterator it(opt.VFIc,j); it; ++it)
{
int index = 3*it.row();
for(int n = 0; n < size; n++)
{
elements.push_back(T(offset+j,index+n,N(j,n)));
}
}
}
}
And here is an example of code that works very well with those directives (and is faster because of it):
Eigen::MatrixXd Optimizer::OptimizeLLGeneral()
{
ConstraintsManager manager;
SurfaceConstraint surface(1,true);
PlanarizationConstraint planarization(1,true,3^Nv,Nf);
manager.addConstraint(&surface);
manager.addConstraint(&planarization);
double mu = mu0;
for(int k = 0; k < iterations; k++)
{
#pragma omp parallel for schedule (static)
for(int j = 0; j < VFIc.outerSize(); j++)
{
manager.calcVariableMatrix(*this,j);
}
#pragma omp parallel for schedule (static)
for(int i = 0; i < FVIc.outerSize(); i++)
{
Eigen::MatrixXd A = Eigen::Matrix<double, 3, 3>::Zero();
Eigen::MatrixXd b = Eigen::Matrix<double, 1, 3>::Zero();
manager.addLocalMatrixComponent(*this,i,A,b,mu);
Eigen::VectorXd temp = b.transpose();
Q.row(i) = A.colPivHouseholderQr().solve(temp);
}
mu = r*mu;
}
return Q;
}
My question is: what makes one function work so well with the omp directive, and what makes the other function crash? What is the difference that makes the omp directive act differently?
Before using OpenMP, you pushed data back to the vector elements one by one. With OpenMP, however, several threads run the code in the for loop in parallel. When more than one thread pushes data to the vector elements at the same time, and there is no code to ensure that one thread does not start pushing before another one finishes, problems happen. That's why your code crashes.
To solve this problem, you could use local buffer vectors. Each thread first pushes data to its private local buffer vector; then you concatenate these buffer vectors into a single vector.
You will notice that this method cannot maintain the original order of the data elements in the vector elements. If you want to do that, you could calculate the expected index of each data element and assign the data to the right position directly.
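A generic sketch of that local-buffer pattern (the double payload and the loop body here are placeholders, not the code from the question):
#include <vector>
std::vector<double> collect(int n) {
    std::vector<double> elements;
    #pragma omp parallel
    {
        // private buffer, filled without any sharing
        std::vector<double> local;
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++)
            local.push_back(2.0*i); // stand-in for the real computation
        // serialize only the final concatenation
        #pragma omp critical
        elements.insert(elements.end(), local.begin(), local.end());
    }
    return elements;
}
Note that, as said above, the final order of the elements then depends on the order in which the threads reach the critical section.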
update
OpenMP provides APIs to let you know how many threads you use and which thread you are using. See omp_get_max_threads() and omp_get_thread_num() for more info.

Updating a maximum value from multiple threads

Is there a way to update a maximum from multiple threads using atomic operations?
Illustrative example:
std::vector<float> coord_max(128);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
int j = get_coord(i); // can return any value in range [0,128)
float x = compute_value(j, i);
#pragma omp critical (coord_max_update)
coord_max[j] = std::max(coord_max[j], x);
}
In the above case, the critical section synchronizes access to the entire vector, whereas we only need to synchronize access to each of the values independently.
Following a suggestion in a comment, I found a solution that does not require locking and instead uses the compare-and-exchange functionality found in std::atomic / boost::atomic. I am limited to C++03 so I would use boost::atomic in this case.
BOOST_STATIC_ASSERT(sizeof(int) == sizeof(float));
union FloatPun { float f; int i; };
std::vector< boost::atomic<int> > coord_max(128);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
int j = get_coord(i);
FloatPun x, maxval;
x.f = compute_value(j, i);
maxval.i = coord_max[j].load(boost::memory_order_relaxed);
do {
if (maxval.f >= x.f) break;
} while (!coord_max[j].compare_exchange_weak(maxval.i, x.i,
boost::memory_order_relaxed));
}
There is some boilerplate involved in putting the float values into ints, since it seems that atomic floats are not lock-free. I am not 100% sure about the memory order, but the least restrictive level, 'relaxed', seems to be OK, since non-atomic memory is not involved.
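For readers not tied to C++03: since C++11 the same compare-and-exchange loop works directly on std::atomic<float>, without the int punning. A hypothetical helper (the name atomic_max is mine, not from any library):
#include <atomic>
// lock-free "update maximum" for a shared float (C++11)
inline void atomic_max(std::atomic<float>& target, float value) {
    float current = target.load(std::memory_order_relaxed);
    // compare_exchange_weak reloads `current` on failure, so just keep retrying
    while (current < value &&
           !target.compare_exchange_weak(current, value, std::memory_order_relaxed)) {
    }
}
Inside the loop this would be called as atomic_max(coord_max[j], x); with a std::vector<std::atomic<float>> in place of the boost vector above.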
Not sure about the syntax, but algorithmically, you have three choices:
Lock down the entire vector to guarantee atomic access (which is what you are currently doing).
Lock down individual elements, so that every element can be updated independent of others. Pros: maximum parallelism; Cons: lots of locks required!
Something in-between! Conceptually think of partitioning your vector into 16 (or 32/64/...) "banks" as follows:
bank0 consists of vector elements 0, 16, 32, 48, 64, ...
bank1 consists of vector elements 1, 17, 33, 49, 65, ...
bank2 consists of vector elements 2, 18, 34, 50, 66, ...
...
Now, use 16 explicit locks before you access an element and you can have up to 16-way parallelism. To access element n, acquire lock (n%16), finish the access, then release the same lock.
How about declaring, say, a std::vector<std::mutex> (or boost::mutex) of length 128 and then creating a lock object using the jth element?
I mean, something like:
std::vector<float> coord_max(128);
std::vector<std::mutex> coord_mutex(128);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
int j = get_coord(i); // can return any value in range [0,128)
float x = compute_value(j, i);
std::scoped_lock lock(coord_mutex[j]);
coord_max[j] = std::max(coord_max[j], x);
}
Or, as per Rahul Banerjee's suggestion #3:
std::vector<float> coord_max(128);
const int parallelism = 16;
std::vector<std::mutex> coord_mutex(parallelism);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
int j = get_coord(i); // can return any value in range [0,128)
float x = compute_value(j, i);
std::scoped_lock lock(coord_mutex[j % parallelism]);
coord_max[j] = std::max(coord_max[j], x);
}
Just to add my two cents, before starting more fine-grained optimizations I would try the following approach that removes the need for omp critical:
std::vector<float> coord_max(128);
float fbuffer(0);
#pragma omp parallel
{
std::vector<float> thread_local_buffer(128);
// Assume limit is a really big number
#pragma omp for
for (int ii = 0; ii < limit; ++ii) {
int jj = get_coord(ii); // can return any value in range [0,128)
float x = compute_value(jj,ii);
thread_local_buffer[jj] = std::max(thread_local_buffer[jj], x);
}
// At this point each thread has a partial local vector
// containing the maximum of the part of the problem
// it has explored
// Reduce the results
for( int ii = 0; ii < 128; ii++){
// Find the max for position ii
#pragma omp for schedule(static,1) reduction(max:fbuffer)
for( int jj = 0; jj < omp_get_num_threads(); jj++) {
fbuffer = thread_local_buffer[ii];
} // Barrier implied here
// Write it in the vector at correct position
#pragma omp single
{
coord_max[ii] = fbuffer;
fbuffer = 0;
} // Barrier implied here
}
}
Notice that I didn't compile the snippet, so I might have left some syntax errors inside. Anyhow, I hope I have conveyed the idea.