OpenMP even/odd decomposition of a nested loop - c++

Part of my code could run in parallel, so I started reading about OpenMP and worked through some introductory examples. Now I am trying to apply it to the following problem, shown schematically here:
Grid.h
class Grid
{
public:
// has a grid member variable
std::vector<std::vector<int>> grid2D;
// modifies the components of grid2D; no push_back() etc. is used that could interfere with OpenMP
update_grid(int,int,int,int);
};
Test.h
class Test
{
public:
Grid grid1;
Grid grid2;
update();
repeat_update();
};
Test.cc
.
.
.
Test::repeat_update() {
for(int i=0;i<100000;i++)
update();
}
Test::update() {
int colIndex = 0;
int rowIndex = 0;
int rowIndexPlusOne = rowIndex + 1;
int colIndexPlusOne = colIndex + 1;
// DIRECTION_X (grid[0].size()), DIRECTION_Y (grid.size) are the size of the grid
for (int i = 0; i < DIRECTION_Y; i++) {
// periodic boundary conditions
if (rowIndexPlusOne > DIRECTION_Y - 1)
rowIndexPlusOne = 0;
// The following could be done parallel!!!
for (int j = 0; j < DIRECTION_X - 1; j++) {
grid1.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
grid2.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
colIndexPlusOne++;
colIndex++;
}
colIndex = 0;
colIndexPlusOne = 1;
rowIndex++;
rowIndexPlusOne++;
}
}
.
.
.
The thing is, the updates done in Test::update() could be done in parallel, since Grid::update_grid(...) only depends on the nearest neighbours on the grid. So, for example, in the inner loop multiple threads could do the work for colIndex = 0, 2, 4, ... independently; that would be the even decomposition. After that, the odd indices colIndex = 1, 3, 5, ... could be updated. Then the outer loop moves one step forward and the updates in the x direction could again be done in parallel. I have 16 cores at my disposal, so parallelizing this could save a lot of time. But I don't see how this could be done, mainly because I don't know how to keep track of colIndex, rowIndex, etc., since #pragma omp parallel for is applied to the i, j indices. I would be grateful if somebody could show me the path out of the darkness.

Without knowing exactly what update_grid(int,int,int,int) does, it's kinda tricky to give a definitive answer. You show an embedded pair of loops of the type
for(int i = 0; i < Y; i++)
{
for(int j = 0; j < X; j++)
{
//...
}
}
and assert that the j loop can be done in parallel. This would be an example of fine grained parallelism. You could alternatively parallelize the i loop, in what would be a more coarse grained parallelization. If the amount of work of each individual thread is roughly equal, the coarse graining method has the advantage of less overhead (assuming that the parallelization of the two loops is equivalent).
There are a few things that you have to be careful of when parallelizing the loops. For starters, you increment colIndexPlusOne and colIndex in the inner loop. If you have multiple threads and a single shared variable for colIndexPlusOne and colIndex, then every thread increments the same variable, which is a race condition. You can avoid that in several ways: give each thread its own copy of the variable, make the increment atomic or critical, or remove the dependency on the variable altogether and calculate its value on the fly for each step of the loop.
I would start with parallelizing the entire update function as such:
Test::update()
{
#pragma omp parallel
{
int colIndex = 0;
int colIndexPlusOne = colIndex + 1;
// DIRECTION_X (grid[0].size()), DIRECTION_Y (grid.size) are the size of the grid
#pragma omp for
for (int i = 0; i < DIRECTION_Y; i++)
{
int rowIndex = i;
int rowIndexPlusOne = rowIndex + 1;
// periodic boundary conditions
if (rowIndexPlusOne > DIRECTION_Y - 1)
rowIndexPlusOne = 0;
// The following could be done parallel!!!
for (int j = 0; j < DIRECTION_X - 1; j++)
{
grid1.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
grid2.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
// The following two can be replaced by j and j+1...
colIndexPlusOne++;
colIndex++;
}
colIndex = 0;
colIndexPlusOne = 1;
// No longer needed:
// rowIndex++;
// rowIndexPlusOne++;
}
}
}
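Because colIndex and colIndexPlusOne are declared inside the #pragma omp parallel region, each thread gets its own copy; rowIndex and rowIndexPlusOne are declared inside the loop body and are therefore private as well. At the beginning of the i loop I assigned rowIndex = i, since that holds in the code shown. The same could be done for the j loop and colIndex. Note that this distributes whole rows across threads, so it is only safe if update_grid calls for different rows never write to the same cells.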
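If update_grid does write to the neighbouring row, the even/odd decomposition from the question can be applied to the rows instead of the columns. The following is only a sketch under assumptions: the Grid struct and the body of update_grid are hypothetical stand-ins, each call is assumed to touch only rows row and rowPlusOne, and the number of rows is assumed to be even so that the periodic wrap-around never pairs two rows of the same parity. The column indices are computed from j instead of being incremented. Be aware that the update order differs from the serial sweep, which may or may not matter for your algorithm.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the Grid class in the question.
struct Grid {
    std::vector<std::vector<int>> cells;
    Grid(int rows, int cols) : cells(rows, std::vector<int>(cols, 0)) {}

    // Assumed behaviour: touches only rows `row` and `rowPlusOne`.
    void update_grid(int row, int col, int rowPlusOne, int colPlusOne) {
        cells[row][col] += 1;
        cells[rowPlusOne][colPlusOne] += 1;
    }
};

// One sweep over the grid, split into an even-row pass and an odd-row pass.
// Requires an even number of rows (assumption, see above).
void update_even_odd(Grid& grid) {
    const int rows = static_cast<int>(grid.cells.size());
    const int cols = static_cast<int>(grid.cells[0].size());
    assert(rows % 2 == 0);

    for (int parity = 0; parity < 2; ++parity) {
        // The implicit barrier at the end of each parallel for
        // separates the even pass from the odd pass.
        #pragma omp parallel for
        for (int i = parity; i < rows; i += 2) {
            const int rowPlusOne = (i + 1) % rows;  // periodic boundary
            for (int j = 0; j < cols - 1; ++j) {
                grid.update_grid(i, j, rowPlusOne, j + 1);
            }
        }
    }
}
```

Within one pass, rows i and i + 2 touch the disjoint row pairs (i, i + 1) and (i + 2, i + 3), so no two threads write to the same row.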

Proper openmp directives for nested for loop operating on 1D array

I need to iterate over an array and assign each element according to a calculation that requires some iteration itself. Removing all unnecessary details the program boils down to something like this.
float output[n];
const float input[n] = ...;
for (int i = 0; i < n; ++i) {
output[i] = 0.0f;
for (int j = 0; j < n; ++j) {
output[i] += some_calculation(i, input[j]);
}
}
some_calculation does not alter its arguments, nor does it have any internal state, so it's thread-safe. Looking at the loops, I understand that the outer loop is thread-safe because different iterations write to different memory locations (different output[i]) and the shared elements of input are never altered while the loop runs. The inner loop is not thread-safe, because every iteration modifies output[i], which would be a race condition.
Consequently, I'd like to spawn threads and get them working for different values of i but the whole iteration over input should be local to each thread so as not to introduce a race condition on output[i]. I think the following achieves this.
std::array<float, n> output;
const std::array<float, n> input = ...;
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
output[i] = 0.0f;
for (int j = 0; j < n; ++j) {
output[i] += some_calculation(i, input[j]);
}
}
I'm not sure how this handles the inner loop. Threads working on different values of i should be able to run the loop in parallel, but I don't understand whether I'm allowing them to without another #pragma omp directive. On the other hand, I don't want to accidentally allow threads to run for different values of j over the same i, because that introduces a race condition. I'm also not sure if I need some extra specification on how the two arrays should be handled.
Lastly, if this code is in a function that is going to get called repeatedly, does it need the parallel directive or can that be called once before my main loop begins like so.
void iterative_step(const std::array<float, n> &input, std::array<float, n> &output) {
// Threads have already been spawned
#pragma omp for
for (int i = 0; i < n; ++i) {
output[i] = 0.0f;
for (int j = 0; j < n; ++j) {
output[i] += some_calculation(i, input[j]);
}
}
}
int main() {
...
// spawn threads once, but not for this loop
#pragma omp parallel
while (...) {
iterative_step(input, output);
}
...
}
I looked through various other questions but they were about different problems with different race conditions and I'm confused as to how to generalize the answers.
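You don't want the omp parallel in main. A bare omp for only shares work among threads that already exist, so the simplest approach is to put #pragma omp parallel for on the for (int i loop inside the function; OpenMP runtimes typically keep their worker threads alive between parallel regions, so they are reused rather than recreated on every call. For any particular value of i, the j loop will run entirely on one thread.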
One other thing that would help a little is to compute your output[i] result into a local variable, then store that into output[i] once you're done with the j loop.
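A minimal sketch of the local-accumulator idea, assuming std::vector in place of the question's std::array and a made-up body for some_calculation (both are illustrative choices, not part of the original code):

```cpp
#include <cassert>
#include <vector>

// Hypothetical placeholder for the question's some_calculation.
inline float some_calculation(int i, float x) {
    return static_cast<float>(i) * x;
}

void iterative_step(const std::vector<float>& input, std::vector<float>& output) {
    const int n = static_cast<int>(input.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;  // declared inside the loop body, so private to each thread
        for (int j = 0; j < n; ++j) {
            sum += some_calculation(i, input[j]);
        }
        output[i] = sum;   // one write to shared memory per value of i
    }
}
```

Accumulating into sum keeps the running total in a register and avoids repeated writes to output[i], which can also reduce false sharing between threads that work on neighbouring elements.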

How to optimally parallelize nested loops?

I'm writing a program that should run in both serial and parallel versions. Once I got it to actually do what it is supposed to do, I started trying to parallelize it with OpenMP (compulsory).
The thing is I can't find documentation or references on when to use what #pragma. So I am trying my best at guessing and testing. But testing is not going fine with nested loops.
How would you parallelize a series of nested loops like these:
for(int i = 0; i < 3; ++i){
for(int j = 0; j < HEIGHT; ++j){
for(int k = 0; k < WIDTH; ++k){
switch(i){
case 0:
matrix[j][k].a = matrix[j][k] * someValue1;
break;
case 1:
matrix[j][k].b = matrix[j][k] * someValue2;
break;
case 2:
matrix[j][k].c = matrix[j][k] * someValue3;
break;
}
}
}
}
HEIGHT and WIDTH are usually the same size in the tests I have to run. Some test examples are 32x32 and 4096x4096.
matrix is an array of custom structs with attributes a, b and c
someValue is a double
I know that OpenMP is not always good for nested loops but any help is welcome.
[UPDATE]:
So far I've tried unrolling the loops. It boosts performance, but am I adding unnecessary overhead here? Am I reusing threads? I tried getting the ID of the threads used in each for loop but didn't get it right.
#pragma omp parallel
{
#pragma omp for collapse(2)
for (int j = 0; j < HEIGHT; ++j) {
for (int k = 0; k < WIDTH; ++k) {
//my previous code here
}
}
#pragma omp for collapse(2)
for (int j = 0; j < HEIGHT; ++j) {
for (int k = 0; k < WIDTH; ++k) {
//my previous code here
}
}
#pragma omp for collapse(2)
for (int j = 0; j < HEIGHT; ++j) {
for (int k = 0; k < WIDTH; ++k) {
//my previous code here
}
}
}
[UPDATE 2]
Apart from unrolling the loop, I have tried parallelizing the outer loop (a worse performance boost than unrolling) and collapsing the two inner loops (more or less the same performance boost as unrolling). These are the times I am getting:
Serial: ~130 milliseconds
Loop unrolling: ~49 ms
Collapsing two innermost loops: ~55 ms
Parallel outermost loop: ~83 ms
What do you think is the safest option? I mean, which should be generally the best for most systems, not only my computer?
The problem with OpenMP is that it's very high-level, meaning that you can't access low-level functionality such as spawning a thread and then explicitly reusing it. So let me make clear what you can and can't do:
Assuming you don't need any mutex to protect against race conditions, here are your options:
Parallelize the outermost loop. That will use at most 3 threads, and it's the most peaceful solution you're going to have.
Parallelize the first inner loop (the j loop). You'll get a performance boost only if the overhead of distributing the work to the threads is much smaller than the effort required to perform the innermost loop.
Parallelize the innermost loop. This is the worst solution in the world, because you'll fork and join the threads 3*HEIGHT times. Never do that!
Don't use OpenMP; use something low-level such as std::thread, where you can create your own thread pool and push all the operations you want to do into a queue.
Hope this helps to put things in perspective.
Here's another option, one which recognises that distributing the iterations of the outermost loop when there are only 3 of them might lead to very poor load balancing:
// body for i = 0
#pragma omp parallel for
for(int j = 0; j < HEIGHT; ++j){
    for(int k = 0; k < WIDTH; ++k){
        ...
    }
}
// body for i = 1
#pragma omp parallel for
for(int j = 0; j < HEIGHT; ++j){
    for(int k = 0; k < WIDTH; ++k){
        ...
    }
}
// body for i = 2
#pragma omp parallel for
for(int j = 0; j < HEIGHT; ++j){
    for(int k = 0; k < WIDTH; ++k){
        ...
    }
}
Warning -- check the syntax yourself, this is no more than a sketch of manual loop unrolling.
Try combining this and collapsing the j and k loops.
Oh, and don't complain about code duplication, you've told us you're being scored partly on performance improvements.
You probably want to use parallel for simd on this example so the compiler can vectorize, and collapse the loops, because j and k are used only in the expression matrix[j][k] and there are no dependencies on any other element of the matrix. If nothing modifies someValue1, etc., they should be uniform. Time your loop to be sure these changes really do improve your speed.
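The suggestion above could look roughly like the sketch below. Several things are assumptions made for illustration: the question does not show the struct, so the Cell type and its value field are hypothetical; the matrix is stored as one contiguous row-major buffer; and the three passes (i = 0, 1, 2) are fused into a single pass, which is a different technique from the unrolled versions shown earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical element type; the `value` field is an assumption.
struct Cell {
    double value;
    double a, b, c;
};

void scale_all(std::vector<Cell>& matrix, int height, int width,
               double someValue1, double someValue2, double someValue3) {
    Cell* data = matrix.data();  // contiguous row-major storage
    // collapse(2) turns the j and k loops into one iteration space that is
    // split across threads; simd asks the compiler to vectorize each chunk.
    #pragma omp parallel for simd collapse(2)
    for (int j = 0; j < height; ++j) {
        for (int k = 0; k < width; ++k) {
            Cell& cell = data[static_cast<std::size_t>(j) * width + k];
            cell.a = cell.value * someValue1;
            cell.b = cell.value * someValue2;
            cell.c = cell.value * someValue3;
        }
    }
}
```

Fusing the passes means each element is loaded once instead of three times; whether that beats the separate loops on a given machine is something only timing can tell.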

Applying OpenMP to particular nested loops in C++

I have a problem parallelizing a piece of code with OpenMP. I think there is a conceptual problem with some operations that have to be done sequentially.
else if (PERF_ROWS <= MAX_ROWS && function_switch == true)
{
int array_dist_perf[PERF_ROWS];
int array_dist[MAX_ROWS];
#pragma omp parallel for collapse(2)
for (int i = 0; i < MAX_COLUMNS;
i = i + 1 + (i % PERF_CLMN == 0 ? 1:0))
{
for (int j = 0; j < PERF_ROWS; j++) //truncation perforation
{
array_dist_perf[j] = abs(input[j] - input_matrix[j][i]);
}
float av = mean(PERF_ROWS, array_dist_perf);
float score = score_func(av);
if (score > THRESHOLD_SCORE)
{
for (int k = 0; k < MAX_ROWS; k++)
{
array_dist[k] = abs(input[k] - input_matrix[k][i]);
}
float av_real = mean(MAX_ROWS, array_dist);
float score_real = score_func(av_real);
rank_function(score_real, i);
}
}
}
The error is "collapsed loops are not perfectly nested". I'm using CLion with g++-5. Thanks in advance.
First of all, perfectly nested loops have the following form:
for (init1; cond1; inc1)
{
for (init2; cond2; inc2)
{
...
}
}
Notice that the body of the outer loop consists solely of the inner loop and nothing else. This is definitely not the case with your code - you have plenty of other statements following the inner loop.
Second, your outer loop is not in the canonical form required by OpenMP. Canonical loops are those for which the number of iterations and the iteration step can be determined before the loop starts. Since what you are doing is skipping the iteration that follows each multiple of PERF_CLMN, you can rewrite the loop (for PERF_CLMN > 1) as:
for (int i = 0; i < MAX_COLUMNS; i++)
{
if (i % PERF_CLMN == 1) continue;
...
}
This will create work imbalance depending on whether MAX_COLUMNS is a multiple of the number of threads or not. But there is yet another source of imbalance, namely the conditional evaluation of rank_function(). You should therefore use dynamic scheduling.
Now, apparently both array_dist* arrays are meant to be private, which they are not in your case, and that will result in data races. Either move the definitions of the arrays into the loop body or use the private() clause.
#pragma omp parallel for schedule(dynamic) private(array_dist_perf,array_dist)
for (int i = 0; i < MAX_COLUMNS; i++)
{
if (i % PERF_CLMN == 1) continue;
...
}
Now, for some unsolicited optimisation advice: the two inner loops are redundant, as the first one is basically doing a subset of the work of the second one. You can optimise the computation and save on memory by using a single array only and letting the second loop continue from where the first one ends. The final version of the code should look like:
else if (PERF_ROWS <= MAX_ROWS && function_switch == true)
{
int array_dist[MAX_ROWS];
#pragma omp parallel for schedule(dynamic) private(array_dist)
for (int i = 0; i < MAX_COLUMNS; i++)
{
if (i % PERF_CLMN == 1) continue;
for (int j = 0; j < PERF_ROWS; j++) //truncation perforation
{
array_dist[j] = abs(input[j] - input_matrix[j][i]);
}
float av = mean(PERF_ROWS, array_dist);
float score = score_func(av);
if (score > THRESHOLD_SCORE)
{
for (int k = PERF_ROWS; k < MAX_ROWS; k++)
{
array_dist[k] = abs(input[k] - input_matrix[k][i]);
}
float av_real = mean(MAX_ROWS, array_dist);
float score_real = score_func(av_real);
rank_function(score_real, i);
}
}
}
Another potential optimisation lies in the fact that input_matrix is not accessed in a cache-friendly way. Transposing it will store the data of each column contiguously in memory and improve memory access locality.

OpenMP for loop increment statement handling

for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
XY[2*nind] = i;
XY[2*nind + 1] = j;
nind++;
}
}
}
Here x = 512 and z = 512, nind = 0 initially, and XY is declared as XY[2*x*y].
I want to parallelize these for loops with OpenMP, but the nind variable ties the iterations together serially. Because of the condition, some iterations skip the if body and leave nind unchanged while others increment it, so with OpenMP whichever thread arrives first would increment nind first. Is there any way to remove this dependency? (By 'binding' I mean that it can only be implemented serially.)
A typical cache-friendly solution in that case is to collect the (i,j) pairs in private arrays, then concatenate those private arrays at the end, and finally sort the result if needed:
#pragma omp parallel
{
uint myXY[2*z*x];
uint mynind = 0;
#pragma omp for collapse(2) schedule(dynamic,N)
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
myXY[2*mynind] = i;
myXY[2*mynind + 1] = j;
mynind++;
}
}
}
#pragma omp critical(concat_arrays)
{
memcpy(&XY[2*nind], myXY, 2*mynind*sizeof(uint));
nind += mynind;
}
}
// Sort the pairs if needed
qsort(XY, nind, 2*sizeof(uint), compar);
// Comparison function for qsort: orders pairs by i, then by j
int compar(const void *a, const void *b)
{
    const uint *p1 = (const uint *)a;
    const uint *p2 = (const uint *)b;
    if (p1[0] != p2[0])
        return p1[0] < p2[0] ? -1 : 1;
    if (p1[1] != p2[1])
        return p1[1] < p2[1] ? -1 : 1;
    return 0;
}
You should experiment with different values of N in the schedule(dynamic,N) clause in order to achieve the best trade-off between overhead (for small values of N) and load imbalance (for large values of N). The comparison function compar could probably be written in a more optimal way.
The assumption here is that the overhead from merging and sorting the array is small. Whether that will be the case depends on many factors.
Here is a variation on Hristo Iliev's good answer.
The important parameter to act on here is the index of the pairs rather than the pairs themselves.
We can fill private arrays of the pair indices in parallel for each thread. The arrays for each thread will be sorted (irrespective of the scheduling).
The following function merges two sorted arrays
void merge(const uint *a, const uint *b, uint *c, uint na, uint nb) {
uint i=0, j=0, k=0;
while(i<na && j<nb) c[k++] = a[i] < b[j] ? a[i++] : b[j++];
while(i<na) c[k++] = a[i++];
while(j<nb) c[k++] = b[j++];
}
Here is the remaining code
uint nind = 0;
uint *P = NULL; // must start as NULL so the first free(P) is safe
#pragma omp parallel
{
uint myP[x*z];
uint mynind = 0;
#pragma omp for schedule(dynamic) nowait
for(uint k = 0 ; k < x*z; k++) {
if (inFunc(p, index)) myP[mynind++] = k;
}
#pragma omp critical
{
uint *t = (uint*)malloc(sizeof *P * (nind+mynind));
merge(P, myP, t, nind, mynind);
free(P);
P = t;
nind += mynind;
}
}
Then given an index k in P the pair is (k/z, k%z).
The merging can be improved. Right now it goes at O(omp_get_num_threads()) but it could be done in O(log2(omp_get_num_threads())). I did not bother with this.
Hristo Iliev pointed out that dynamic scheduling does not guarantee that the iterations assigned to each thread increase monotonically. I think in practice they do, but it's not guaranteed in principle.
If you want to be 100% sure that the iterations increase monotonically you can implement dynamic scheduling by hand.
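One way to do dynamic scheduling by hand is to let the threads fetch chunks from a shared counter that is incremented atomically. Because the counter only ever grows, the chunks any single thread receives have increasing start indices, so its private list is sorted by construction. The sketch below is an illustration under assumptions: collect_sorted, pred and chunk are hypothetical names, std::vector and std::merge replace the raw arrays and the merge function above, pred must be safe to call concurrently, and chunk must be greater than zero.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns, in increasing order, every k in [0, total) for which pred(k) holds.
template <typename Pred>
std::vector<unsigned> collect_sorted(unsigned total, unsigned chunk, Pred pred) {
    std::vector<unsigned> result;
    unsigned next = 0;  // shared chunk counter

    #pragma omp parallel
    {
        std::vector<unsigned> mine;  // private list, sorted by construction
        for (;;) {
            unsigned start;
            // Atomically read the counter and advance it by one chunk.
            #pragma omp atomic capture
            { start = next; next += chunk; }
            if (start >= total) break;
            const unsigned stop = (total - start < chunk) ? total : start + chunk;
            for (unsigned k = start; k < stop; ++k) {
                if (pred(k)) mine.push_back(k);
            }
        }
        // Merge this thread's sorted list into the shared sorted result.
        #pragma omp critical
        {
            std::vector<unsigned> merged(result.size() + mine.size());
            std::merge(result.begin(), result.end(),
                       mine.begin(), mine.end(), merged.begin());
            result.swap(merged);
        }
    }
    return result;
}
```

A call such as collect_sorted(x * z, 64, pred) would play the role of the schedule(dynamic) loop above; as with schedule(dynamic,N), the chunk size trades scheduling overhead against load balance.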
The code you provide looks like you are trying to fill the XY data in sequential order. In this case OMP multithreading is probably not the tool for the job as threads (in a best case) should avoid communication as much as possible. You could introduce an atomic counter, but then again, it is probably going to be faster just doing it sequentially.
Also what do you want to achieve by optimizing it? The x and z are not too big, so I doubt that you will get a substantial speed increase even if you reformulate your problem in a parallel fashion.
If you do want parallel execution - map your indexes to the array, e.g. (not tested, but should do)
#pragma omp parallel for shared(XY)
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
uint idx = 2 * (i * z + j); // row stride is z, the extent of the inner loop
XY[idx] = i;
XY[idx + 1] = j;
}
}
}
However, you will then have gaps in your array XY, which may or may not be a problem for you.

#pragma omp parallel for schedule crashes my program

I am building a plugin for Autodesk Maya 2013 in C++. I have to solve a set of optimization problems as fast as I can, and I am using OpenMP for this task. The problem is that I don't have much experience with parallel computing. I tried to use:
#pragma omp parallel for schedule (static)
on my for loops (without enough understanding of how it's supposed to work) and it worked very well for some of my code, but crashed another portion of my code.
Here is an example of a function that crashes because of the omp directive:
void PlanarizationConstraint::fillSparseMatrix(const Optimizer& opt, vector<T>& elements, double mu)
{
int size = 3;
#pragma omp parallel for schedule (static)
for(int i = 0; i < opt.FVIc.outerSize(); i++)
{
int index = 3*i;
Eigen::Matrix<double,3,3> Qxyz = Eigen::Matrix<double,3,3>::Zero();
for(SpMat::InnerIterator it(opt.FVIc,i); it; ++it)
{
int face = it.row();
for(int n = 0; n < size; n++)
{
Qxyz.row(n) += N(face,n)*N.row(face);
elements.push_back(T(index+n,offset+face,(1 - mu)*N(face,n)));
}
}
for(int n = 0; n < size; n++)
{
for(int k = 0; k < size; k++)
{
elements.push_back(T(index+n,index+k,(1-mu)*Qxyz(n,k)));
}
}
}
#pragma omp parallel for schedule (static)
for(int j = 0; j < opt.VFIc.outerSize(); j++)
{
elements.push_back(T(offset+j,offset+j,opt.fvi[j]));
for(SpMat::InnerIterator it(opt.VFIc,j); it; ++it)
{
int index = 3*it.row();
for(int n = 0; n < size; n++)
{
elements.push_back(T(offset+j,index+n,N(j,n)));
}
}
}
}
And here is an example of code that works very well with those directives (and is faster because of it)
Eigen::MatrixXd Optimizer::OptimizeLLGeneral()
{
ConstraintsManager manager;
SurfaceConstraint surface(1,true);
PlanarizationConstraint planarization(1,true,3^Nv,Nf);
manager.addConstraint(&surface);
manager.addConstraint(&planarization);
double mu = mu0;
for(int k = 0; k < iterations; k++)
{
#pragma omp parallel for schedule (static)
for(int j = 0; j < VFIc.outerSize(); j++)
{
manager.calcVariableMatrix(*this,j);
}
#pragma omp parallel for schedule (static)
for(int i = 0; i < FVIc.outerSize(); i++)
{
Eigen::MatrixXd A = Eigen::Matrix<double, 3, 3>::Zero();
Eigen::MatrixXd b = Eigen::Matrix<double, 1, 3>::Zero();
manager.addLocalMatrixComponent(*this,i,A,b,mu);
Eigen::VectorXd temp = b.transpose();
Q.row(i) = A.colPivHouseholderQr().solve(temp);
}
mu = r*mu;
}
return Q;
}
My question is: what makes one function work so well with the omp directive and what makes the other one crash? What is the difference that makes the omp directive act differently?
Before using OpenMP, you pushed data into the vector elements one element at a time. With OpenMP, several threads run the body of the for loop in parallel. std::vector::push_back is not thread-safe, so when more than one thread pushes into elements at the same time, with nothing ensuring that one thread finishes before another starts, you have a data race. That's why your code crashes.
To solve this problem, you could use local buffer vectors. Each thread first pushes data into its own private buffer vector; afterwards you concatenate these buffers into a single vector.
Note that this method does not preserve the original order of the data in the vector elements. If you need that order, you could calculate the expected index of each data element and assign the data to the right position directly.
Update
OpenMP provides APIs to let you know how many threads you use and which thread you are using. See omp_get_max_threads() and omp_get_thread_num() for more info.
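The buffer approach could be sketched as follows. This is not the code from the question: Triplet is a hypothetical stand-in for T, and fill_diagonal is a made-up, simplified loop that only resembles the second loop of fillSparseMatrix.

```cpp
#include <cassert>
#include <vector>

// Hypothetical triplet type standing in for the question's T.
struct Triplet {
    int row, col;
    double value;
};

void fill_diagonal(std::vector<Triplet>& elements, int n, double v) {
    #pragma omp parallel
    {
        std::vector<Triplet> local;  // private buffer, one per thread
        #pragma omp for nowait
        for (int j = 0; j < n; ++j) {
            local.push_back(Triplet{j, j, v});  // no shared state touched here
        }
        // Only one thread at a time appends its buffer to the shared vector.
        #pragma omp critical
        elements.insert(elements.end(), local.begin(), local.end());
    }
}
```

The critical section runs once per thread rather than once per element, so its cost is usually small. The order of the blocks in elements depends on which thread finishes first; for a list of sparse-matrix triplets that is often acceptable, but check it against how the list is consumed.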