OpenMP Ordered Parallelization - c++

I'm trying to parallelize the following function (pseudocode):
vector<int32> out;
for (int32 i = 0; i < 10; ++i)
{
int32 result = multiplyStuffByTwo(i);
// Push to results
out.push_back(result);
}
When I parallelize the for loop and mark the push_back part as a critical section, the order of the results in out is (of course) not always right. How can I make the threads execute the last line of the loop body in the right order? Thanks!

You can set the size of the out vector by calling out.resize() and then assign each value by index instead of calling push_back().
Pseudo-code:
vector<int32> out; out.resize(10);
for (int32 i = 0; i < 10; ++i)
{
int32 result = multiplyStuffByTwo(i);
// set the result
out[i] = result;
}
A "classic" array would work too and is not really harder to manage, though with optimizations enabled indexing a pre-sized vector is typically just as fast.
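A minimal sketch of the index-based approach with the loop actually parallelized. The body of multiplyStuffByTwo and the wrapper computeAll are assumptions made for illustration; only the idea of pre-sizing the vector and writing to out[i] comes from the answer above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the question's multiplyStuffByTwo().
std::int32_t multiplyStuffByTwo(std::int32_t i) { return 2 * i; }

// Fills out[i] from iteration i, so the order of the results does not
// depend on which thread happens to run which iteration.
std::vector<std::int32_t> computeAll(std::int32_t n) {
    assert(n >= 0);
    std::vector<std::int32_t> out(static_cast<std::size_t>(n));  // pre-sized
    #pragma omp parallel for
    for (std::int32_t i = 0; i < n; ++i) {
        out[i] = multiplyStuffByTwo(i);  // each iteration writes its own slot
    }
    return out;
}
```

Calling computeAll(10) should give the same vector as the serial push_back loop. No critical or ordered section is needed because no two iterations touch the same element.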

vector<int32> out;
#pragma omp parallel for ordered
for (int32 i = 0; i < 10; ++i)
{
int32 result = multiplyStuffByTwo(i); // this will be run in parallel
#pragma omp ordered
// Push to results
out.push_back(result); // this will be run sequentially, in iteration order
}
This can be helpful:
http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

Related

Proper openmp directives for nested for loop operating on 1D array

I need to iterate over an array and assign each element according to a calculation that requires some iteration itself. Removing all unnecessary details the program boils down to something like this.
float output[n];
const float input[n] = ...;
for (int i = 0; i < n; ++i) {
output[i] = 0.0f;
for (int j = 0; j < n; ++j) {
output[i] += some_calculation(i, input[j]);
}
}
some_calculation does not alter its arguments, nor does it have internal state, so it's thread-safe. Looking at the loops, I understand that the outer loop is thread-safe because different iterations write to different memory locations (different output[i]) and the shared elements of input are never altered while the loop runs. The inner loop is not thread-safe, because every iteration modifies output[i], which is a race condition.
Consequently, I'd like to spawn threads and get them working for different values of i but the whole iteration over input should be local to each thread so as not to introduce a race condition on output[i]. I think the following achieves this.
std::array<float, n> output;
const std::array<float, n> input = ...;
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
output[i] = 0.0f;
for (int j = 0; j < n; ++j) {
output[i] += some_calculation(i, input[j]);
}
}
I'm not sure how this handles the inner loop. Threads working on different values of i should be able to run the inner loop in parallel, but I don't understand whether I'm allowing them to without another #pragma omp directive. On the other hand, I don't want to accidentally allow threads to run different values of j for the same i, because that introduces a race condition. I'm also not sure whether I need some extra specification of how the two arrays should be handled.
Lastly, if this code is in a function that is going to be called repeatedly, does it need the parallel directive, or can that be issued once before my main loop begins, like so?
void iterative_step(const std::array<float, n> &input, std::array<float, n> &output) {
// Threads have already been spawned
#pragma omp for
for (int i = 0; i < n; ++i) {
output[i] = 0.0f;
for (int j = 0; j < n; ++j) {
output[i] += some_calculation(i, input[j]);
}
}
}
int main() {
...
// spawn threads once, but not for this loop
#pragma omp parallel
while (...) {
iterative_step(input, output);
}
...
}
I looked through various other questions but they were about different problems with different race conditions and I'm confused as to how to generalize the answers.
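You don't want the omp parallel in main. Put #pragma omp parallel for on the for (int i ...) loop inside iterative_step instead: a bare omp for only shares work among the threads of an enclosing parallel region, and OpenMP runtimes typically keep their worker threads around between regions, so the threads are reused rather than re-created on each call. For any particular value of i, the j loop will run entirely on one thread.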
One other thing that would help a little is to compute your output[i] result into a local variable, then store that into output[i] once you're done with the j loop.
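A rough sketch of the local-accumulator suggestion. It assumes std::vector instead of the question's fixed-size std::array so that it is self-contained, and the body of some_calculation is a made-up placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's some_calculation().
float some_calculation(int i, float x) { return static_cast<float>(i) * x; }

void iterative_step(const std::vector<float>& input, std::vector<float>& output) {
    assert(input.size() == output.size());
    const int n = static_cast<int>(input.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;  // private to the thread running iteration i
        for (int j = 0; j < n; ++j) {
            sum += some_calculation(i, input[j]);
        }
        output[i] = sum;   // one write to shared memory per value of i
    }
}
```

Accumulating in sum lets the compiler keep the running total in a register and avoids repeated writes to neighbouring elements of output from different threads.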

openMP increment add int among threads?

Hi,
I'm trying to multithread the function below, but I can't get the counter to be shared properly among OpenMP threads. I tried making it atomic and I tried a plain int; neither seems to work. I'm lost; how can I solve this?
std::vector<myStruct> _myData(100);
int counter;
counter = 0;
int index;
#pragma omp parallel for private(index)
for (index = 0; index < 500; ++index) {
if (data[index].type == "xx") {
myStruct s;
s.innerData = data[index].rawData;
processDataA(s); // processDataA(myStruct &data)
processDataB(s);
_myData[counter++] = s; // each thread should have unique int not going over 100 of initially allocated items in _myData
}
}
Edit: fixed bad syntax and added missing parts.
If you cannot use OpenMP atomic capture, I would try:
std::vector<myStruct> _myData(100);
int counter = 0;
#pragma omp parallel for schedule(dynamic)
for (int index = 0; index < 500; ++index) {
if (data[index].type == "xx") {
myStruct s;
s.innerData = data[index].rawData;
processDataA(s);
processDataB(s);
int temp;
#pragma omp critical
temp = counter++;
assert(temp < _myData.size());
_myData[temp] = s;
}
}
Or:
#pragma omp parallel for schedule(dynamic,c)
and experiment with chunk size c.
However, atomics would likely be more efficient than critical sections. Your compiler should support some form of atomics.
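For comparison, a sketch of the atomic capture variant mentioned at the top of this answer. The function collectEven and its int data are hypothetical simplifications of the question's myStruct processing; the pattern of reserving a slot with an atomic read-and-increment is the point.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies the even values of `data` into a pre-sized buffer. Each thread
// reserves its own slot by atomically reading and incrementing the counter.
std::vector<int> collectEven(const std::vector<int>& data) {
    std::vector<int> kept(data.size());  // worst case: every element matches
    int counter = 0;
    const int n = static_cast<int>(data.size());
    #pragma omp parallel for
    for (int index = 0; index < n; ++index) {
        if (data[index] % 2 == 0) {
            int slot;
            #pragma omp atomic capture
            slot = counter++;            // old value captured, counter incremented
            assert(slot < n);
            kept[slot] = data[index];
        }
    }
    kept.resize(static_cast<std::size_t>(counter));  // drop the unused tail
    return kept;  // element order depends on thread timing
}
```

The result contains the same elements as a serial loop would produce, but not necessarily in the same order.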
Note that your solution is rather fragile, since it works only if the condition inside the loop evaluates to true at most 100 times. That's why I added the assertion to the code. Maybe a better solution:
std::vector<myStruct> _myData;
size_t size = 0;
#pragma omp parallel for reduction(+:size)
for (int index = 0; index < data.size(); ++index)
if (data[index].type == "xx") size++;
_myData.resize(size);
...
Then, you don't need to care about the vector size and also don't waste memory space.

C++ OpenMP: Writing to a matrix inside of for loop slows down the for loop significantly

I have the following code. The bitCount function simply counts the set bits in a 64-bit integer. The test function is a reduced version of something I do in a more complicated piece of code; it reproduces how writing to a matrix significantly slows down the for loop. I am trying to figure out why that happens and whether there is a fix.
#include <cstdint>
#include <vector>
#include <cmath>
#include <omp.h>
// Count the number of bits
inline int bitCount(uint64_t n){
int count = 0;
while(n){
n &= (n-1);
count++;
}
return count;
}
void test(){
int nthreads = omp_get_max_threads();
omp_set_dynamic(0);
omp_set_num_threads(nthreads);
// I need a priority queue per thread
std::vector<std::vector<double> > mat(nthreads, std::vector<double>(1000,-INFINITY));
std::vector<uint64_t> vals(100,1);
# pragma omp parallel for shared(mat,vals)
for(int i = 0; i < 100000000; i++){
std::vector<double> &tid_vec = mat[omp_get_thread_num()];
int total_count = 0;
for(unsigned int j = 0; j < vals.size(); j++){
total_count += bitCount(vals[j]);
tid_vec[j] = total_count; // if I comment out this line, performance increase drastically
}
}
}
This code runs in about 11 seconds. If I comment out the following line:
tid_vec[j] = total_count;
the code runs in about 2 seconds. Is there a reason why writing to a matrix in my case costs so much in performance?
Since you said nothing about your compiler/system specs, I'm assuming you are compiling with GCC and flags -O2 -fopenmp.
If you comment out the line:
tid_vec[j] = total_count;
The compiler will optimize away all the computations whose result is not used. Therefore:
total_count += bitCount(vals[j]);
is optimized away too. If the main kernel of your application is never executed, it makes sense that the program runs much faster.
On the other hand, I would not implement a bit count function myself but rather rely on functionality that is already provided. For example, the GCC builtin functions include __builtin_popcount (and __builtin_popcountll for 64-bit values such as your uint64_t), which does exactly what you are trying to do.
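A small sketch comparing the hand-written loop with the builtin, assuming GCC or Clang. The names bitCountLoop and bitCountBuiltin are made up for this example.

```cpp
#include <cstdint>

// Hand-written loop from the question: clears the lowest set bit each pass.
inline int bitCountLoop(std::uint64_t n) {
    int count = 0;
    while (n) {
        n &= (n - 1);
        ++count;
    }
    return count;
}

// GCC/Clang builtin. The "ll" variant takes unsigned long long, which is
// wide enough for a 64-bit value; plain __builtin_popcount takes unsigned int.
inline int bitCountBuiltin(std::uint64_t n) {
    return __builtin_popcountll(n);
}
```

Both functions should return the same count for any input. On targets with a hardware popcount instruction the builtin can compile down to that single instruction when the matching architecture flag is enabled.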
As a bonus: it is much better to work on private data than on a shared array in which each thread uses different elements. It improves locality (especially important when memory access is not uniform, i.e. NUMA) and may reduce access contention.
# pragma omp parallel shared(mat,vals)
{
std::vector<double> local_vec(1000,-INFINITY);
#pragma omp for
for(int i = 0; i < 100000000; i++) {
int total_count = 0;
for(unsigned int j = 0; j < vals.size(); j++){
total_count += bitCount(vals[j]);
local_vec[j] = total_count;
}
}
// Copy local_vec to mat[omp_get_thread_num()]
}

OpenMP even/odd decomposition of a nested loop

Part of my code could be done in parallel, so I started to read about OpenMP and worked through some introductory examples. Now I am trying to apply it to the following problem, presented schematically here:
Grid.h
class Grid
{
public:
// has a grid member variable
std::vector<std::vector<int>> grid2D;
// modifies the components of grid2D; no push_back() etc. is used that could interfere with OpenMP
void update_grid(int,int,int,int);
};
Test.h
class Test
{
public:
Grid grid1;
Grid grid2;
void update();
void repeat_update();
};
Test.cc
.
.
.
void Test::repeat_update() {
for(int i=0;i<100000;i++)
update();
}
void Test::update() {
int colIndex = 0;
int rowIndex = 0;
int rowIndexPlusOne = rowIndex + 1;
int colIndexPlusOne = colIndex + 1;
// DIRECTION_X (grid[0].size()), DIRECTION_Y (grid.size) are the size of the grid
for (int i = 0; i < DIRECTION_Y; i++) {
// periodic boundary conditions
if (rowIndexPlusOne > DIRECTION_Y - 1)
rowIndexPlusOne = 0;
// The following could be done parallel!!!
for (int j = 0; j < DIRECTION_X - 1; j++) {
grid1.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
grid2.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
colIndexPlusOne++;
colIndex++;
}
colIndex = 0;
colIndexPlusOne = 1;
rowIndex++;
rowIndexPlusOne++;
}
}
.
.
.
The thing is, the updates done in Test::update() could be done in parallel, since Grid::update_grid(...) only depends on the nearest neighbours in the grid. For example, in the inner loop multiple threads could do the work for colIndex = 0, 2, 4, ... independently; that would be the even decomposition. After that, the odd indices colIndex = 1, 3, 5, ... could be updated. Then the outer loop moves one step forward and the updates in the x direction could again be done in parallel. I have 16 cores at my disposal, and parallelizing this could save a lot of time. But I can't see how this could be done, mainly because I don't know how to keep track of colIndex, rowIndex, etc., since #pragma omp parallel for is applied to the i and j indices. I would be grateful if somebody could show me the path out of the darkness.
Without knowing exactly what update_grid(int,int,int,int) does, it's kinda tricky to give a definitive answer. You show an embedded pair of loops of the type
for(int i = 0; i < Y; i++)
{
for(int j = 0; j < X; j++)
{
//...
}
}
and assert that the j loop can be done in parallel. This would be an example of fine grained parallelism. You could alternatively parallelize the i loop, in what would be a more coarse grained parallelization. If the amount of work of each individual thread is roughly equal, the coarse graining method has the advantage of less overhead (assuming that the parallelization of the two loops is equivalent).
There are a few things that you have to be careful of when parallelizing the loops. For starters, you increment colIndexPlusOne and colIndex in the inner loop. If you have multiple threads and a single variable for colIndexPlusOne and colIndex, then each thread will increment the variable and/or have race conditions. You can bypass that in several manners, either giving each thread a copy of the variable, or making the increment atomic or critical, or by removing the dependency of the variable altogether and calculating what it should be for each step of the loop on the fly.
I would start with parallelizing the entire update function as such:
void Test::update()
{
#pragma omp parallel
{
int colIndex = 0;
int colIndexPlusOne = colIndex + 1;
// DIRECTION_X (grid[0].size()), DIRECTION_Y (grid.size) are the size of the grid
#pragma omp for
for (int i = 0; i < DIRECTION_Y; i++)
{
int rowIndex = i;
int rowIndexPlusOne = rowIndex + 1;
// periodic boundary conditions
if (rowIndexPlusOne > DIRECTION_Y - 1)
rowIndexPlusOne = 0;
// The following could be done parallel!!!
for (int j = 0; j < DIRECTION_X - 1; j++)
{
grid1.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
grid2.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
// The following two can be replaced by j and j+1...
colIndexPlusOne++;
colIndex++;
}
colIndex = 0;
colIndexPlusOne = 1;
// No longer needed:
// rowIndex++;
// rowIndexPlusOne++;
}
}
}
Because #pragma omp parallel is placed at the beginning, all the variables declared inside the region are local to each thread. Also, at the beginning of the i loop, I assigned rowIndex = i, since that holds at least in the code shown. The same could be done for the j loop and colIndex.
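As a sketch of that last point, the index bookkeeping can be derived entirely from i and j, which leaves no shared counters to protect. The Neighbours struct and computeNeighbours are hypothetical stand-ins for the update_grid calls; they only record which indices each cell would use. Whether the rows may really be updated concurrently still depends on what update_grid reads and writes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record of the neighbour indices a cell update would receive.
struct Neighbours {
    int rowPlusOne;
    int colPlusOne;
};

std::vector<std::vector<Neighbours>> computeNeighbours(int rows, int cols) {
    assert(rows > 0 && cols > 1);
    std::vector<std::vector<Neighbours>> out(
        rows, std::vector<Neighbours>(cols - 1));
    #pragma omp parallel for
    for (int i = 0; i < rows; ++i) {
        // periodic boundary condition in the row direction
        const int rowIndexPlusOne = (i + 1 == rows) ? 0 : i + 1;
        for (int j = 0; j < cols - 1; ++j) {
            // colIndex is j and colIndexPlusOne is j + 1: nothing to increment
            out[i][j] = Neighbours{rowIndexPlusOne, j + 1};
        }
    }
    return out;
}
```

With the indices computed this way, the per-thread copies of colIndex and colIndexPlusOne and their resets at the end of each row are no longer needed.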

#pragma omp parallel for schedule crashes my program

I am building a plugin for Autodesk Maya 2013 in C++. I have to solve a set of optimization problems as fast as I can, and I am using OpenMP for this task. The problem is that I don't have much experience with parallel computing. I tried to use:
#pragma omp parallel for schedule (static)
on my for loops (without enough understanding of how it's supposed to work) and it worked very well for some of my code, but crashed another portion of my code.
Here is an example of a function that crashes because of the omp directive:
void PlanarizationConstraint::fillSparseMatrix(const Optimizer& opt, vector<T>& elements, double mu)
{
int size = 3;
#pragma omp parallel for schedule (static)
for(int i = 0; i < opt.FVIc.outerSize(); i++)
{
int index = 3*i;
Eigen::Matrix<double,3,3> Qxyz = Eigen::Matrix<double,3,3>::Zero();
for(SpMat::InnerIterator it(opt.FVIc,i); it; ++it)
{
int face = it.row();
for(int n = 0; n < size; n++)
{
Qxyz.row(n) += N(face,n)*N.row(face);
elements.push_back(T(index+n,offset+face,(1 - mu)*N(face,n)));
}
}
for(int n = 0; n < size; n++)
{
for(int k = 0; k < size; k++)
{
elements.push_back(T(index+n,index+k,(1-mu)*Qxyz(n,k)));
}
}
}
#pragma omp parallel for schedule (static)
for(int j = 0; j < opt.VFIc.outerSize(); j++)
{
elements.push_back(T(offset+j,offset+j,opt.fvi[j]));
for(SpMat::InnerIterator it(opt.VFIc,j); it; ++it)
{
int index = 3*it.row();
for(int n = 0; n < size; n++)
{
elements.push_back(T(offset+j,index+n,N(j,n)));
}
}
}
}
And here is an example of code that works very well with those directives (and is faster because of it)
Eigen::MatrixXd Optimizer::OptimizeLLGeneral()
{
ConstraintsManager manager;
SurfaceConstraint surface(1,true);
PlanarizationConstraint planarization(1,true,3^Nv,Nf);
manager.addConstraint(&surface);
manager.addConstraint(&planarization);
double mu = mu0;
for(int k = 0; k < iterations; k++)
{
#pragma omp parallel for schedule (static)
for(int j = 0; j < VFIc.outerSize(); j++)
{
manager.calcVariableMatrix(*this,j);
}
#pragma omp parallel for schedule (static)
for(int i = 0; i < FVIc.outerSize(); i++)
{
Eigen::MatrixXd A = Eigen::Matrix<double, 3, 3>::Zero();
Eigen::MatrixXd b = Eigen::Matrix<double, 1, 3>::Zero();
manager.addLocalMatrixComponent(*this,i,A,b,mu);
Eigen::VectorXd temp = b.transpose();
Q.row(i) = A.colPivHouseholderQr().solve(temp);
}
mu = r*mu;
}
return Q;
}
My question is: what makes one function work so well with the omp directive while the other crashes? What is the difference that makes the omp directive behave differently?
Before using OpenMP, you pushed data to the vector elements one item at a time. With OpenMP, several threads run the body of the for loop in parallel. When more than one thread calls push_back on elements at the same time, and nothing ensures that one thread finishes before another starts, the vector's internal state gets corrupted. That's why your code crashes.
To solve this problem, you could use local buffer vectors. Each thread first pushes data to its own private buffer vector; afterwards you concatenate these buffers into a single vector.
Note that this method does not preserve the original order of the items in elements. If you need that order, you could calculate the expected index of each item and assign it to the right position directly.
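A minimal sketch of the local-buffer idea, using a made-up workload (squares of the loop index) in place of the Eigen triplets from the question. Each thread fills a private vector and appends it to the shared one inside a critical section.

```cpp
#include <cassert>
#include <vector>

// Collects i*i for i in [0, n) using one private buffer per thread.
std::vector<long> collectSquares(int n) {
    assert(n >= 0);
    std::vector<long> elements;
    #pragma omp parallel
    {
        std::vector<long> local;  // declared inside the region, so private
        #pragma omp for nowait
        for (int i = 0; i < n; ++i) {
            local.push_back(static_cast<long>(i) * i);  // no sharing, no lock
        }
        // Only one thread at a time appends its buffer to the shared vector.
        #pragma omp critical
        elements.insert(elements.end(), local.begin(), local.end());
    }
    return elements;  // buffers arrive in arbitrary thread order
}
```

The critical section is entered once per thread rather than once per element, so the locking cost stays small. As noted above, the combined vector holds the right items but not necessarily in the serial order.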
Update: OpenMP provides API functions that tell you how many threads are available and which thread is running the current code. See omp_get_max_threads() and omp_get_thread_num() for more info.