OpenMP: increment an int counter shared among threads? - C++

Hey,
I'm trying to multithread the function below, but I can't get the counter to be shared properly among the OpenMP threads. I tried both an atomic and a plain int, and neither seems to work. I'm lost; how can I solve this?
std::vector<myStruct> _myData(100);
int counter = 0;
int index;

#pragma omp parallel for private(index)
for (index = 0; index < 500; ++index) {
    if (data[index].type == "xx") {
        myStruct s;
        s.innerData = data[index].rawData;
        processDataA(s); // processDataA(myStruct &data)
        processDataB(s);
        _myData[counter++] = s; // each thread should get a unique index, never exceeding the 100 items initially allocated in _myData
    }
}
Edit: updated bad syntax/missing parts.

If you cannot use OpenMP atomic capture, I would try:
std::vector<myStruct> _myData(100);
int counter = 0;

#pragma omp parallel for schedule(dynamic)
for (int index = 0; index < 500; ++index) {
    if (data[index].type == "xx") {
        myStruct s;
        s.innerData = data[index].rawData;
        processDataA(s);
        processDataB(s);

        int temp;
        #pragma omp critical
        temp = counter++;

        assert(temp < (int)_myData.size());
        _myData[temp] = s;
    }
}
Or:
#pragma omp parallel for schedule(dynamic,c)
and experiment with chunk size c.
However, atomics would likely be more efficient than critical sections. Your compiler should support some form of atomics.
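For reference, a minimal sketch of what the atomic variant might look like with OpenMP 3.1's atomic capture (same hypothetical data/_myData/processDataA/processDataB as above; only the counter update changes):
int temp;
#pragma omp atomic capture
temp = counter++;          // fetch the old value and increment atomically

assert(temp < (int)_myData.size());
_myData[temp] = s;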
Note that your solution is somewhat fragile: it only works if the condition inside the loop evaluates to true at most 100 times. That's why I added an assertion to the code. Maybe a better solution:
std::vector<myStruct> _myData;
size_t size = 0;

#pragma omp parallel for reduction(+:size)
for (int index = 0; index < data.size(); ++index)
    if (data[index].type == "xx") size++;

_myData.resize(size);
...
Then you don't need to worry about the vector size, and you don't waste memory either.

Related

OpenMP/C++: Parallel for loop with reduction afterwards - best practice?

Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
    const std::set<int>& cluster = clusters[i];
    // ... expensive calculations ...
    for (int j : cluster)
        velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f does not use velocity.
A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.
I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j); operation. This gives me a compile error (it might have something to do with the elements being of type Eigen::Vector3d, or with velocity being a class member). Also, I read that atomic operations are very slow compared to having a private variable for each thread and doing a reduction at the end. So that's what I would like to do, I think.
I have come up with this:
#pragma omp parallel
{
    // these variables are local to each thread
    std::vector<Eigen::Vector3d> velocity_local(velocity.size());
    std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0, 0, 0));

    #pragma omp for
    for (size_t i = 0; i < clusters.size(); ++i)
    {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster)
            velocity_local[j] += f(j); // save results from the previous calculations
    }

    // now each thread can save its results to the global variable
    #pragma omp critical
    {
        for (size_t i = 0; i < velocity_local.size(); ++i)
            velocity[i] += velocity_local[i];
    }
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduction clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for loop. Also, the question of whether this is the best approach still holds.
Edit: As requested in a comment, the reduction clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
    velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++)
You're doing an array reduction. I have described this several times (e.g. in "reducing an array in openmp" and "fill histograms array reduction in parallel with openmp without using a critical section"). You can do this with or without a critical section.
You have already done this correctly with a critical section (in your recent edit), so let me describe how to do this without a critical section.
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();
    const int vsize = velocity.size();

    #pragma omp single
    velocitya.resize(vsize*nthreads);
    std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
              Eigen::Vector3d(0,0,0));

    #pragma omp for schedule(static)
    for (size_t i = 0; i < clusters.size(); i++) {
        const std::set<int>& cluster = clusters[i];
        // ... expensive calculations ...
        for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
    }

    #pragma omp for schedule(static)
    for (int i = 0; i < vsize; i++) {
        for (int t = 0; t < nthreads; t++) {
            velocity[i] += velocitya[vsize*t + i];
        }
    }
}
This method requires extra care/tuning due to false sharing, which I have not done here.
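For illustration only, a rough sketch (not part of the original answer) of one way to limit false sharing, assuming a 64-byte cache line: pad each thread's slice so that neighbouring slices do not share a cache line. It reuses velocitya, vsize and nthreads from the code above.
// 8 Eigen::Vector3d = 8 * 24 bytes = 192 bytes = 3 full cache lines, so rounding
// the per-thread stride up to a multiple of 8 keeps slices on separate lines
// (provided the buffer itself is suitably aligned).
const int pad = 8;
const int stride = ((vsize + pad - 1) / pad) * pad;
#pragma omp single
velocitya.resize(stride * nthreads);
// ...then index with velocitya[ithread * stride + j] and velocitya[stride * t + i]
// instead of the vsize-based offsets above.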
As to which method is better, you will have to test.

Is this for-loop valid using OpenMP

I am in the process of learning OpenMP. This is a for loop I am using:
std::string result;
#pragma omp parallel
{
    #pragma omp parallel for public(local_arg) reduction(+:result)
    for (int i = 0; i < Myvector.size(); i++)
    {
        result = result + someMethod(urn, Myvector[i]);
    }
}
Now, someMethod(urn, Myvector[i]), which will be called by multiple threads in the above code, returns a string. This string needs to be appended to the result string. My question is: do I need to put a lock on that statement in the for loop? Is there a better approach? Any suggestions?
This isn't perfect (and it's been a while since I've used OpenMP), but the idea is basic divide-and-conquer.
std::vector<std::string> results;
int n = 2*omp_get_num_threads(); // For reliability, ask the OS about the # of cores and double that.
results.reserve(n);

// Reserve a small string for each prospective worker
for (int i = 0; i < n; ++i) {
    std::string str{};
    str.reserve(worker_reserve);
    results.push_back(std::move(str));
}

// Let each worker grab and mutate the string
// corresponding to its worker ID
//
#pragma omp parallel for
for (int i = 0; i < Myvector.size(); ++i)
{
    auto &str = results[omp_get_thread_num()];
    str.append(someMethod(urn, Myvector[i]));
}

// Measure the total size of the result
std::string end_result;
size_t total_len = 0;
for (auto &res : results) {
    total_len += res.length();
}

// Reserve and combine
end_result.reserve(total_len + 1);
for (auto &res : results) {
    end_result.append(res);
}
However, there is still the issue of heap contention.
Also, omp_get_num_threads() returns 1 when called outside a parallel region, so it isn't guaranteed to match the actual number of worker threads; omp_get_max_threads() is usually the safer call there.
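As a side note, if an OpenMP 4.0 compiler is available, a user-defined reduction on std::string is another option. A minimal sketch, assuming the final ordering of the appended pieces does not matter (the order in which partial strings are combined is unspecified):
#pragma omp declare reduction(str_concat : std::string : omp_out += omp_in)

std::string result;
#pragma omp parallel for reduction(str_concat : result)
for (int i = 0; i < (int)Myvector.size(); ++i)
{
    result += someMethod(urn, Myvector[i]);  // each thread appends to its private copy
}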

#pragma omp parallel for schedule crashes my program

I am building a plugin for Autodesk Maya 2013 in C++. I have to solve a set of optimization problems as fast as I can, and I am using OpenMP for this task. The problem is that I don't have much experience with parallel computing. I tried to use:
#pragma omp parallel for schedule (static)
on my for loops (without really understanding how it's supposed to work), and it worked very well for parts of my code but crashed another portion.
Here is an example of a function that crashes because of the omp directive:
void PlanarizationConstraint::fillSparseMatrix(const Optimizer& opt, vector<T>& elements, double mu)
{
    int size = 3;
    #pragma omp parallel for schedule (static)
    for (int i = 0; i < opt.FVIc.outerSize(); i++)
    {
        int index = 3*i;
        Eigen::Matrix<double,3,3> Qxyz = Eigen::Matrix<double,3,3>::Zero();
        for (SpMat::InnerIterator it(opt.FVIc,i); it; ++it)
        {
            int face = it.row();
            for (int n = 0; n < size; n++)
            {
                Qxyz.row(n) += N(face,n)*N.row(face);
                elements.push_back(T(index+n,offset+face,(1 - mu)*N(face,n)));
            }
        }
        for (int n = 0; n < size; n++)
        {
            for (int k = 0; k < size; k++)
            {
                elements.push_back(T(index+n,index+k,(1-mu)*Qxyz(n,k)));
            }
        }
    }
    #pragma omp parallel for schedule (static)
    for (int j = 0; j < opt.VFIc.outerSize(); j++)
    {
        elements.push_back(T(offset+j,offset+j,opt.fvi[j]));
        for (SpMat::InnerIterator it(opt.VFIc,j); it; ++it)
        {
            int index = 3*it.row();
            for (int n = 0; n < size; n++)
            {
                elements.push_back(T(offset+j,index+n,N(j,n)));
            }
        }
    }
}
And here is an example of code that works very well with those directives (and is faster because of it)
Eigen::MatrixXd Optimizer::OptimizeLLGeneral()
{
    ConstraintsManager manager;
    SurfaceConstraint surface(1,true);
    PlanarizationConstraint planarization(1,true,3^Nv,Nf);
    manager.addConstraint(&surface);
    manager.addConstraint(&planarization);
    double mu = mu0;
    for (int k = 0; k < iterations; k++)
    {
        #pragma omp parallel for schedule (static)
        for (int j = 0; j < VFIc.outerSize(); j++)
        {
            manager.calcVariableMatrix(*this,j);
        }
        #pragma omp parallel for schedule (static)
        for (int i = 0; i < FVIc.outerSize(); i++)
        {
            Eigen::MatrixXd A = Eigen::Matrix<double, 3, 3>::Zero();
            Eigen::MatrixXd b = Eigen::Matrix<double, 1, 3>::Zero();
            manager.addLocalMatrixComponent(*this,i,A,b,mu);
            Eigen::VectorXd temp = b.transpose();
            Q.row(i) = A.colPivHouseholderQr().solve(temp);
        }
        mu = r*mu;
    }
    return Q;
}
My question is: what makes one function work so well with the OpenMP directive, and what makes the other function crash? What is the difference that makes the directive behave differently?
Before using OpenMP, you pushed data into the vector elements one element at a time. With OpenMP, however, several threads run the code in the for loop in parallel. When more than one thread pushes data into the vector elements at the same time, and there is no code ensuring that one thread finishes pushing before another starts, problems will happen. That's why your code crashes.
To solve this problem, you could use local buffer vectors: each thread first pushes data into its own private buffer vector, and afterwards you concatenate these buffers into a single vector, as sketched below.
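For illustration, a minimal sketch of that idea applied to the first loop of fillSparseMatrix (reusing opt, N, offset, T and SpMat from your code; not a drop-in replacement, just the structure):
#pragma omp parallel
{
    std::vector<T> local;                  // private buffer for this thread
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < opt.FVIc.outerSize(); i++)
    {
        // ... build the triplets exactly as before, but push into `local` ...
        // local.push_back(T(...));
    }
    #pragma omp critical
    elements.insert(elements.end(), local.begin(), local.end()); // concatenate once per thread
}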
You will notice that this method does not maintain the original order of the data in the vector elements. If you need that, you could compute each data element's expected index up front and assign it to the right position directly.
Update
OpenMP provides APIs that tell you how many threads are in use and which thread is currently executing. See omp_get_max_threads() and omp_get_thread_num() for more info.

Updating a maximum value from multiple threads

Is there a way to update a maximum from multiple threads using atomic operations?
Illustrative example:
std::vector<float> coord_max(128);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
    int j = get_coord(i); // can return any value in range [0,128)
    float x = compute_value(j, i);

    #pragma omp critical (coord_max_update)
    coord_max[j] = std::max(coord_max[j], x);
}
In the above case, the critical section synchronizes access to the entire vector, whereas we only need to synchronize access to each of the values independently.
Following a suggestion in a comment, I found a solution that does not require locking and instead uses the compare-and-exchange functionality found in std::atomic / boost::atomic. I am limited to C++03 so I would use boost::atomic in this case.
BOOST_STATIC_ASSERT(sizeof(int) == sizeof(float));
union FloatPun { float f; int i; };

std::vector< boost::atomic<int> > coord_max(128);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
    int j = get_coord(i);
    FloatPun x, maxval;
    x.f = compute_value(j, i);

    maxval.i = coord_max[j].load(boost::memory_order_relaxed);
    do {
        if (maxval.f >= x.f) break;
    } while (!coord_max[j].compare_exchange_weak(maxval.i, x.i,
                                                 boost::memory_order_relaxed));
}
There is some boilerplate involved in putting the float values into ints, since it seems that atomic floats are not lock-free. I am not 100% sure about the memory order, but the least restrictive level, 'relaxed', seems to be OK, since non-atomic memory is not involved.
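For readers not restricted to C++03: a minimal sketch of the same compare-exchange loop with C++11's std::atomic<float>, which avoids the union trick (whether it is lock-free is platform-dependent; check is_lock_free()):
#include <atomic>
#include <vector>

std::vector< std::atomic<float> > coord_max(128);  // note: make sure the elements are initialized
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
    int j = get_coord(i);
    float x = compute_value(j, i);

    float cur = coord_max[j].load(std::memory_order_relaxed);
    while (cur < x &&
           !coord_max[j].compare_exchange_weak(cur, x, std::memory_order_relaxed)) {
        // on failure, cur is reloaded with the latest value and the loop re-checks
    }
}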
Not sure about the syntax, but algorithmically, you have three choices:
1. Lock down the entire vector to guarantee atomic access (which is what you are currently doing).
2. Lock down individual elements, so that every element can be updated independently of the others. Pros: maximum parallelism; cons: lots of locks required!
3. Something in between! Conceptually, think of partitioning your vector into 16 (or 32/64/...) "banks" as follows:
bank0 consists of vector elements 0, 16, 32, 48, 64, ...
bank1 consists of vector elements 1, 17, 33, 49, 65, ...
bank2 consists of vector elements 2, 18, 34, 50, 66, ...
...
Now, use 16 explicit locks and you can have up to 16-way parallelism. To access element n, acquire lock (n % 16), perform the access, then release the same lock.
How about declaring, say, a std::vector<std::mutex> (or boost::mutex) of length 128 and then creating a lock object using the jth element?
I mean, something like:
std::vector<float> coord_max(128);
std::vector<std::mutex> coord_mutex(128);

#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
    int j = get_coord(i); // can return any value in range [0,128)
    float x = compute_value(j, i);

    std::scoped_lock lock(coord_mutex[j]);
    coord_max[j] = std::max(coord_max[j], x);
}
Or, as per Rahul Banerjee's suggestion #3:
std::vector<float> coord_max(128);
const int parallelism = 16;
std::vector<std::mutex> coord_mutex(parallelism);

#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
    int j = get_coord(i); // can return any value in range [0,128)
    float x = compute_value(j, i);

    std::scoped_lock lock(coord_mutex[j % parallelism]);
    coord_max[j] = std::max(coord_max[j], x);
}
Just to add my two cents, before starting more fine-grained optimizations I would try the following approach that removes the need for omp critical:
std::vector<float> coord_max(128);
float fbuffer(0);
#pragma omp parallel
{
    // these variables are local to each thread
    std::vector<float> thread_local_buffer(128);

    // Assume limit is a really big number
    #pragma omp for
    for (int ii = 0; ii < limit; ++ii) {
        int jj = get_coord(ii); // can return any value in range [0,128)
        float x = compute_value(jj, ii);
        thread_local_buffer[jj] = std::max(thread_local_buffer[jj], x);
    }

    // At this point each thread has a partial local vector
    // containing the maximum of the part of the problem
    // it has explored

    // Reduce the results
    for (int ii = 0; ii < 128; ii++) {
        // Find the max for position ii
        #pragma omp for schedule(static,1) reduction(max:fbuffer)
        for (int jj = 0; jj < omp_get_num_threads(); jj++) {
            fbuffer = thread_local_buffer[ii];
        } // Barrier implied here

        // Write it into the vector at the correct position
        #pragma omp single
        {
            coord_max[ii] = fbuffer;
            fbuffer = 0;
        } // Barrier implied here
    }
}
Note that I didn't compile the snippet, so I might have left some syntax errors inside. Anyhow, I hope I have conveyed the idea.
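As an aside (not from any of the answers above): compilers with OpenMP 4.5 support allow array sections in reduction clauses, which expresses this per-element max reduction directly. A hedged sketch:
#include <algorithm>
#include <vector>

std::vector<float> coord_max(128);
float* cmax = coord_max.data();           // the reduction needs a pointer/array, not a std::vector

#pragma omp parallel for reduction(max : cmax[:128])
for (int i = 0; i < limit; ++i) {
    int j = get_coord(i);                 // can return any value in range [0,128)
    float x = compute_value(j, i);
    cmax[j] = std::max(cmax[j], x);       // each thread updates its private copy of the section
}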

Parallel OpenMP loop with break statement

I know that you cannot have a break statement in an OpenMP loop, but I was wondering if there is any workaround that still benefits from parallelism. Basically, I have a 'for' loop that goes through the elements of a large vector looking for the one element that satisfies a certain condition. Only one element will satisfy the condition, so once it is found we can break out of the loop. Thanks in advance.
for (int i = 0; i <= 100000; ++i)
{
    if (element[i] ...)
    {
        ....
        break;
    }
}
See this snippet:
volatile bool flag = false;
#pragma omp parallel for shared(flag)
for (int i = 0; i <= 100000; ++i)
{
    if (flag) continue;
    if (element[i] ...)
    {
        ...
        flag = true;
    }
}
This kind of situation is more suitable for pthreads.
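As an additional note (not part of the original answer): if OpenMP 4.0 cancellation is available, the early exit can be expressed directly. A minimal sketch, which only takes effect when cancellation is activated at run time (e.g. OMP_CANCELLATION=true); satisfiesCondition() is a hypothetical stand-in for your test:
int found = -1;
#pragma omp parallel for
for (int i = 0; i <= 100000; ++i)
{
    if (satisfiesCondition(element[i]))   // stand-in for your condition
    {
        #pragma omp critical
        found = i;                        // record the match
        #pragma omp cancel for            // request that the worksharing loop stop
    }
    #pragma omp cancellation point for    // other threads notice the request here
}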
You could try to manually do what the OpenMP for loop does, using a while loop:
const int N = 100000;
std::atomic<bool> go(true);
unsigned int give = 0;

#pragma omp parallel
{
    unsigned int i, stop;

    #pragma omp critical
    {
        i = give;
        give += N/omp_get_num_threads();
        stop = give;
        if (omp_get_thread_num() == omp_get_num_threads()-1)
            stop = N;
    }

    while (i < stop && go)
    {
        ...
        if (element[i]...)
        {
            go = false;
        }
        i++;
    }
}
This way you have to test "go" each cycle, but that should not matter much. More importantly, this corresponds to a "static" omp for loop, which is only useful if you can expect all iterations to take a similar amount of time. Otherwise, three threads may already be finished while one still has half the way to go...
I would probably do this (copied a bit from yyfn):
volatile bool flag = false;
for (int j = 0; j <= 100 && !flag; ++j) {
    int base = 1000*j;

    #pragma omp parallel for shared(flag)
    for (int i = 0; i < 1000; ++i) // clamp the last chunk to the real range in production code
    {
        if (flag) continue;
        if (element[i+base] ...)
        {
            ....
            flag = true;
        }
    }
}
Here is a simpler version of the accepted answer.
int ielement = -1;
#pragma omp parallel
{
    int i = omp_get_thread_num()*n/omp_get_num_threads();
    int stop = (omp_get_thread_num()+1)*n/omp_get_num_threads();
    for (; i < stop && ielement < 0; ++i) {
        if (element[i]) {
            ielement = i;
        }
    }
}
bool foundCondition = false;
#pragma omp parallel for
for (int i = 0; i <= 100000; i++)
{
    // We can't break out of a parallel for loop, so this is the next best thing.
    if (foundCondition == false && satisfiesComplicatedCondition(element[i]))
    {
        // This is definitely needed if more than one element could satisfy the
        // condition and you are looking for the first one. Probably still a
        // good idea even if there can only be one.
        #pragma omp critical
        {
            // do something, store element[i], or whatever you need to do here
            ....
            foundCondition = true;
        }
    }
}