Is there a way to update a maximum from multiple threads using atomic operations?
Illustrative example:
std::vector<float> coord_max(128);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
int j = get_coord(i); // can return any value in range [0,128)
float x = compute_value(j, i);
#pragma omp critical (coord_max_update)
coord_max[j] = std::max(coord_max[j], x);
}
In the above case, the critical section synchronizes access to the entire vector, whereas we only need to synchronize access to each of the values independently.
Following a suggestion in a comment, I found a solution that does not require locking and instead uses the compare-and-exchange functionality found in std::atomic / boost::atomic. I am limited to C++03 so I would use boost::atomic in this case.
BOOST_STATIC_ASSERT(sizeof(int) == sizeof(float));
union FloatPun { float f; int i; };
std::vector< boost::atomic<int> > coord_max(128);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
int j = get_coord(i);
FloatPun x, maxval;
x.f = compute_value(j, i);
maxval.i = coord_max[j].load(boost::memory_order_relaxed);
do {
if (maxval.f >= x.f) break;
} while (!coord_max[j].compare_exchange_weak(maxval.i, x.i,
boost::memory_order_relaxed));
}
There is some boilerplate involved in putting float values in ints, since it seems that atomic floats are not lock-free. I am not 100% use about the memory order, but the least restrictive level 'relaxed' seems to be OK, since non-atomic memory is not involved.
Not sure about the syntax, but algorithmically, you have three choices:
Lock down the entire vector to guarantee atomic access (which is what you are currently doing).
Lock down individual elements, so that every element can be updated independent of others. Pros: maximum parallelism; Cons: lots of locks required!
Something in-between! Conceptually think of partitioning your vector into 16 (or 32/64/...) "banks" as follows:
bank0 consists of vector elements 0, 16, 32, 48, 64, ...
bank1 consists of vector elements 1, 17, 33, 49, 65, ...
bank2 consists of vector elements 2, 18, 34, 50, 66, ...
...
Now, use 16 explicit locks before you access the element and you can have upto 16-way parallelism. To access element n, acquire lock (n%16), finish the access, then release the same lock.
How about declaring, say, a std::vector<std::mutex> (or boost::mutex) of length 128 and then creating a lock object using the jth element?
I mean, something like:
std::vector<float> coord_max(128);
std::vector<std::mutex> coord_mutex(128);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
int j = get_coord(i); // can return any value in range [0,128)
float x = compute_value(j, i);
std::scoped_lock lock(coord_mutex[j]);
coord_max[j] = std::max(coord_max[j], x);
}
Or, as per Rahul Banerjee's suggestion #3:
std::vector<float> coord_max(128);
const int parallelism = 16;
std::vector<std::mutex> coord_mutex(parallelism);
#pragma omp parallel for
for (int i = 0; i < limit; ++i) {
int j = get_coord(i); // can return any value in range [0,128)
float x = compute_value(j, i);
std::scoped_lock lock(coord_mutex[j % parallelism]);
coord_max[j] = std::max(coord_max[j], x);
}
Just to add my two cents, before starting more fine-grained optimizations I would try the following approach that removes the need for omp critical:
std::vector<float> coord_max(128);
float fbuffer(0);
#pragma omp parallel
{
std::vector<float> thread_local_buffer(128);
// Assume limit is a really big number
#pragma omp for
for (int ii = 0; ii < limit; ++ii) {
int jj = get_coord(ii); // can return any value in range [0,128)
float x = compute_value(jj,ii);
thread_local_buffer[jj] = std::max(thread_local_buffer[jj], x);
}
// At this point each thread has a partial local vector
// containing the maximum of the part of the problem
// it has explored
// Reduce the results
for( int ii = 0; ii < 128; ii++){
// Find the max for position ii
#pragma omp for schedule(static,1) reduction(max:fbuffer)
for( int jj = 0; jj < omp_get_thread_num(); jj++) {
fbuffer = thread_local_buffer[ii];
} // Barrier implied here
// Write it in the vector at correct position
#pragma omp single
{
coord_max[ii] = fbuffer;
fbuffer = 0;
} // Barrier implied here
}
}
Notice that I didn't compile the snippet, so I might have left some syntax error inside. Anyhow I hope I have conveyed the idea.
Related
Hye
I'm trying to multithread the function below. I fail to get the counter to be properly shared among OpenMP threads, I tried atomic and int, atomic seem to not be working, neither do INT. Not sure, I'm lost, how can I solve this?
std::vector<myStruct> _myData(100);
int counter;
counter =0
int index;
#pragma omp parallel for private(index)
for (index = 0; index < 500; ++index) {
if (data[index].type == "xx") {
myStruct s;
s.innerData = data[index].rawData
processDataA(s); // processDataA(myStruct &data)
processDataB(s);
_myData[counter++] = s; // each thread should have unique int not going over 100 of initially allocated items in _myData
}
}
Edit. Update bad syntax/missing parts
If you cannot use OpenMP atomic capture, I would try:
std::vector<myStruct> _myData(100);
int counter = 0;
#pragma omp parallel for schedule(dynamic)
for (int index = 0; index < 500; ++index) {
if (data[index].type == "xx") {
myStruct s;
s.innerData = data[index].rawData
processDataA(s);
processDataB(s);
int temp;
#pragma omp critical
temp = counter++;
assert(temp < _myData.size());
_myData[temp] = s;
}
}
Or:
#pragma omp parallel for schedule(dynamic,c)
and experiment with chunk size c.
However, atomics would be likely more efficient than critical sections. There should be some form of atomics supported by your compiler.
Note that your solution is kind of fragile, since it works only if the condition inside the loop is evaluated to true less than 101x. That's why I added assertion into the code. Maybe a better solution:
std::vector<myStruct> _myData;
size_t size = 0;
#pragma omp parallel for reduction(+,size)
for (int index = 0; index < data.size(); ++index)
if (data[index].type == "xx") size++;
v.resize(size);
...
Then, you don't need to care about the vector size and also don't waste memory space.
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
XY[2*nind] = i;
XY[2*nind + 1] = j;
nind++;
}
}
}
here x = 512 and z = 512 and nind = 0 initially
and XY[2*x*y].
I want to optimize this for loops with openMP but 'nind' variable is closely binded serially to for loop. I have no clue because I am also checking a condition and so some of the time it will not enter in if and will skip increment or it will enter increment nind. openMP threads will increment nind variable as first come will increment nind firstly. Is there any way to unbind it. ('binding' I mean only can be implemented serially).
A typical cache-friendly solution in that case is to collect the (i,j) pairs in private arrays, then concatenate those private arrays at the end, and finally sort the result if needed:
#pragma omp parallel
{
uint myXY[2*z*x];
uint mynind = 0;
#pragma omp for collapse(2) schedule(dynamic,N)
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
myXY[2*mynind] = i;
myXY[2*mynind + 1] = j;
mynind++;
}
}
}
#pragma omp critical(concat_arrays)
{
memcpy(&XY[2*nind], myXY, 2*mynind*sizeof(uint));
nind += mynind;
}
}
// Sort the pairs if needed
qsort(XY, nind, 2*sizeof(uint), compar);
int compar(const uint *p1, const uint *p2)
{
if (p1[0] < p2[0])
return -1;
else if (p1[0] > p2[0])
return 1;
else
{
if (p1[1] < p2[1])
return -1;
else if (p1[1] > p2[1])
return 1;
}
return 0;
}
You should experiment with different values of N in the schedule(dynamic,N) clause in order to achieve the best trade-off between overhead (for small values of N) and load imbalance (for large values of N). The comparison function compar could probably be written in a more optimal way.
The assumption here is that the overhead from merging and sorting the array is small. Whether that will be the case depends on many factors.
Here is a variation on Hristo Iliev's good answer.
The important parameter to act on here is the index of the pairs rather than the pairs themselves.
We can fill private arrays of the pair indices in parallel for each thread. The arrays for each thread will be sorted (irrespective of the scheduling).
The following function merges two sorted arrays
void merge(int *a, int *b, int*c, int na, int nb) {
int i=0, j=0, k=0;
while(i<na && j<nb) c[k++] = a[i] < b[j] ? a[i++] : b[j++];
while(i<na) c[k++] = a[i++];
while(j<nb) c[k++] = b[j++];
}
Here is the remaining code
uint nind = 0;
uint *P;
#pragma omp parallel
{
uint myP[x*z];
uint mynind = 0;
#pragma omp for schedule(dynamic) nowait
for(uint k = 0 ; k < x*z; k++) {
if (inFunc(p, index)) myP[mynind++] = k;
}
#pragma omp critical
{
uint *t = (uint*)malloc(sizeof *P * (nind+mynind));
merge(P, myP, t, nind, mynind);
free(P);
P = t;
nind += mynind;
}
}
Then given an index k in P the pair is (k/z, k%z).
The merging can be improved. Right now it goes at O(omp_get_num_threads()) but it could be done in O(log2(omp_get_num_threads())). I did not bother with this.
Hristo Iliev's pointed out that dynamic scheduling does not guarantee that the iterations per thread increase monotonically. I think in practice they are but it's not guaranteed in principle.
If you want to be 100% sure that the iterations increase monotonically you can implement dynamic scheduling by hand.
The code you provide looks like you are trying to fill the XY data in sequential order. In this case OMP multithreading is probably not the tool for the job as threads (in a best case) should avoid communication as much as possible. You could introduce an atomic counter, but then again, it is probably going to be faster just doing it sequentially.
Also what do you want to achieve by optimizing it? The x and z are not too big, so I doubt that you will get a substantial speed increase even if you reformulate your problem in a parallel fashion.
If you do want parallel execution - map your indexes to the array, e.g. (not tested, but should do)
#pragma omp parallel for shared(XY)
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
uint idx = (2 * i) * x + 2 * j;
XY[idx] = i;
XY[idx + 1] = j;
}
}
}
However, you will have gaps in your array XY then. Which may or may not be a problem for you.
Given the following code...
for (size_t i = 0; i < clusters.size(); ++i)
{
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster)
velocity[j] += f(j);
}
...which I would like to run on multiple CPUs/cores. The function f does not use velocity.
A simple #pragma omp parallel for before the first for loop will produce unpredictable/wrong results, because the std::vector<T> velocity is modified in the inner loop. Multiple threads may access and (try to) modify the same element of velocity at the same time.
I think the first solution would be to write #pragma omp atomic before the velocity[j] += f(j);operation. This gives me a compile error (might have something to do with the elements being of type Eigen::Vector3d or velocity being a class member). Also, I read atomic operations are very slow compared to having a private variable for each thread and doing a reduction in the end. So that's what I would like to do, I think.
I have come up with this:
#pragma omp parallel
{
// these variables are local to each thread
std::vector<Eigen::Vector3d> velocity_local(velocity.size());
std::fill(velocity_local.begin(), velocity_local.end(), Eigen::Vector3d(0,0,0));
#pragma omp for
for (size_t i = 0; i < clusters.size(); ++i)
{
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster)
velocity_local[j] += f(j); // save results from the previous calculations
}
// now each thread can save its results to the global variable
#pragma omp critical
{
for (size_t i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
}
}
Is this a good solution? Is it the best solution? (Is it even correct?)
Further thoughts: Using the reduce clause (instead of the critical section) throws a compiler error. I think this is because velocity is a class member.
I have tried to find a question with a similar problem, and this question looks like it's almost the same. But I think my case might differ because the last step includes a for loop. Also the question whether this is the best approach still holds.
Edit: As request per comment: The reduction clause...
#pragma omp parallel reduction(+:velocity)
for (omp_int i = 0; i < velocity_local.size(); ++i)
velocity[i] += velocity_local[i];
...throws the following error:
error C3028: 'ShapeMatching::velocity' : only a variable or static data member can be used in a data-sharing clause
(similar error with g++)
You're doing an array reduction. I have described this several times (e.g. reducing an array in openmp and fill histograms array reduction in parallel with openmp without using a critical section). You can do this with and without a critical section.
You have already done this correctly with a critical section (in your recent edit) so let me describe how to do this without a critical section.
std::vector<Eigen::Vector3d> velocitya;
#pragma omp parallel
{
const int nthreads = omp_get_num_threads();
const int ithread = omp_get_thread_num();
const int vsize = velocity.size();
#pragma omp single
velocitya.resize(vsize*nthreads);
std::fill(velocitya.begin()+vsize*ithread, velocitya.begin()+vsize*(ithread+1),
Eigen::Vector3d(0,0,0));
#pragma omp for schedule(static)
for (size_t i = 0; i < clusters.size(); i++) {
const std::set<int>& cluster = clusters[i];
// ... expensive calculations ...
for (int j : cluster) velocitya[ithread*vsize+j] += f(j);
}
#pragma omp for schedule(static)
for(int i=0; i<vsize; i++) {
for(int t=0; t<nthreads; t++) {
velocity[i] += velocitya[vsize*t + i];
}
}
}
This method requires extra care/tuning due to false sharing which I have not done.
As to which method is better you will have to test.
I have part in my code that could be done parallel, so I started to read about openMP and did these introduction examples. Now I am trying to apply it to the following problem, schematically presented here:
Grid.h
class Grid
{
public:
// has a grid member variable
std::vector<std::vector<int>> 2Dgrid;
// modifies the components of the 2Dgrid, no push_back() etc. used what could possibly disturbe the use of openMP
update_grid(int,int,int,in);
};
Test.h
class Test
{
public:
Grid grid1;
Grid grid2;
update();
repeat_update();
};
Test.cc
.
.
.
Test::repeat_update() {
for(int i=0;i<100000;i++)
update();
}
Test::update() {
int colIndex = 0;
int rowIndex = 0;
int rowIndexPlusOne = rowIndex + 1;
int colIndexPlusOne = colIndex + 1;
// DIRECTION_X (grid[0].size()), DIRECTION_Y (grid.size) are the size of the grid
for (int i = 0; i < DIRECTION_Y; i++) {
// periodic boundry conditions
if (rowIndexPlusOne > DIRECTION_Y - 1)
rowIndexPlusOne = 0;
// The following could be done parallel!!!
for (int j = 0; j < DIRECTION_X - 1; j++) {
grid1.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
grid2.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
colIndexPlusOne++;
colIndex++;
}
colIndex = 0;
colIndexPlusOne = 1;
rowIndex++;
rowIndexPlusOne++;
}
}
.
.
.
The thing is, the updates done in Test::update(...) could be done in a parallel manner, since the Grid::update(...) only depends on the nearest neighbour of the grid. So for example in the inner loop multiple threads could do the work for colIndex = 0,2,4,..., independetly, that would be the even decomposition. After That the odd indices colIndex=1,3,5,... could be updated. Then the outerloop iterates one forward and the updates in direction x could again be done parallel. I have 16 cores at disposel and doing the parallelization could be a nice time save. But I totally dont have the perspective to see how this could be done, mainly because I dont know how to keep track of the colIndex, rowIndex, etc, since #pragma omp parallel for is applied to the i,j indices. I Would be grateful if somebody can show me the path out of the darkness.
Without knowing exactly what update_grid(int,int,int,int) does, it's kinda tricky to give a definitive answer. You show an embedded pair of loops of the type
for(int i = 0; i < Y; i++)
{
for(int j = 0; j < X; j++)
{
//...
}
}
and assert that the j loop can be done in parallel. This would be an example of fine grained parallelism. You could alternatively parallelize the i loop, in what would be a more coarse grained parallelization. If the amount of work of each individual thread is roughly equal, the coarse graining method has the advantage of less overhead (assuming that the parallelization of the two loops is equivalent).
There are a few things that you have to be careful of when parallelizing the loops. For starters, you increment colIndexPlusOne and colIndex in the inner loop. If you have multiple threads and a single variable for colIndexPlusOne and colIndex, then each thread will increment the variable and/or have race conditions. You can bypass that in several manners, either giving each thread a copy of the variable, or making the increment atomic or critical, or by removing the dependency of the variable altogether and calculating what it should be for each step of the loop on the fly.
I would start with parallelizing the entire update function as such:
Test::update()
{
#pragma omp parallel
{
int colIndex = 0;
int colIndexPlusOne = colIndex + 1;
// DIRECTION_X (grid[0].size()), DIRECTION_Y (grid.size) are the size of the grid
#pragma omp for
for (int i = 0; i < DIRECTION_Y; i++)
{
int rowIndex = i;
int rowIndexPlusOne = rowIndex + 1;
// periodic boundary conditions
if (rowIndexPlusOne > DIRECTION_Y - 1)
rowIndexPlusOne = 0;
// The following could be done parallel!!!
for (int j = 0; j < DIRECTION_X - 1; j++)
{
grid1.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
grid2.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
// The following two can be replaced by j and j+1...
colIndexPlusOne++;
colIndex++;
}
colIndex = 0;
colIndexPlusOne = 1;
// No longer needed:
// rowIndex++;
// rowIndexPlusOne++;
}
}
}
By placing #pragma omp parallel at the beginning, all the variables are local to each thread. Also, at the beginning of the i loop, I assigned rowIndex = i, as at least in the code shown, that is the case. The same could be done for the j loop and colIndex.
I am building a plugin for autodesk maya 2013 in c++. I have to solve a set of optimization problems as fast as i can. I am using open MP for this task. the problem is I don't have very much experience with parallel computing. I tried to use:
#pragma omp parallel for schedule (static)
on my for loops (without enough understanding of how it's supposed to work) and it worked very well for some of my code, but crashed another portion of my code.
Here is an example of a function that crashes because of the omp directive:
void PlanarizationConstraint::fillSparseMatrix(const Optimizer& opt, vector<T>& elements, double mu)
{
int size = 3;
#pragma omp parallel for schedule (static)
for(int i = 0; i < opt.FVIc.outerSize(); i++)
{
int index = 3*i;
Eigen::Matrix<double,3,3> Qxyz = Eigen::Matrix<double,3,3>::Zero();
for(SpMat::InnerIterator it(opt.FVIc,i); it; ++it)
{
int face = it.row();
for(int n = 0; n < size; n++)
{
Qxyz.row(n) += N(face,n)*N.row(face);
elements.push_back(T(index+n,offset+face,(1 - mu)*N(face,n)));
}
}
for(int n = 0; n < size; n++)
{
for(int k = 0; k < size; k++)
{
elements.push_back(T(index+n,index+k,(1-mu)*Qxyz(n,k)));
}
}
}
#pragma omp parallel for schedule (static)
for(int j = 0; j < opt.VFIc.outerSize(); j++)
{
elements.push_back(T(offset+j,offset+j,opt.fvi[j]));
for(SpMat::InnerIterator it(opt.VFIc,j); it; ++it)
{
int index = 3*it.row();
for(int n = 0; n < size; n++)
{
elements.push_back(T(offset+j,index+n,N(j,n)));
}
}
}
}
And here is an example of code that works very well with those directives (and is faster because of it)
Eigen::MatrixXd Optimizer::OptimizeLLGeneral()
{
ConstraintsManager manager;
SurfaceConstraint surface(1,true);
PlanarizationConstraint planarization(1,true,3^Nv,Nf);
manager.addConstraint(&surface);
manager.addConstraint(&planarization);
double mu = mu0;
for(int k = 0; k < iterations; k++)
{
#pragma omp parallel for schedule (static)
for(int j = 0; j < VFIc.outerSize(); j++)
{
manager.calcVariableMatrix(*this,j);
}
#pragma omp parallel for schedule (static)
for(int i = 0; i < FVIc.outerSize(); i++)
{
Eigen::MatrixXd A = Eigen::Matrix<double, 3, 3>::Zero();
Eigen::MatrixXd b = Eigen::Matrix<double, 1, 3>::Zero();
manager.addLocalMatrixComponent(*this,i,A,b,mu);
Eigen::VectorXd temp = b.transpose();
Q.row(i) = A.colPivHouseholderQr().solve(temp);
}
mu = r*mu;
}
return Q;
}
My question is what makes one function work so well with the omp directive and what makes the other function crash? what is the difference that makes the omp directive act differently?
Before using openmp, you pushed back some data to the vector elements one by one. However, with openmp, there will be several threads running the code in the for loop in parallel. When more than one thread are pushing back data to the vector elements at the same time, and when there's no code to ensure that one thread will not start pushing before another one finishes, problem will happen. That's why your code crashes.
To solve this problem, you could use local buff vectors. Each thread first push data to its private local buffer vector, then you can concatenate these buffer vectors together into a single vector.
You will notice that this method can not maintain the original order of the data elements in the vector elements. If you want to do that, you could calculate each expected index of the data element and assign the data to the right position directly.
update
OpenMP provides APIs to let you know how many threads you use and which thread you are using. See omp_get_max_threads() and omp_get_thread_num() for more info.