OpenMP for loop increment statement handling - c++

for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
XY[2*nind] = i;
XY[2*nind + 1] = j;
nind++;
}
}
}
Here x = 512, z = 512, and nind = 0 initially,
and XY is declared as XY[2*x*y].
I want to parallelize these for loops with OpenMP, but the nind variable ties them to serial execution. The increment sits behind a condition, so some iterations skip it and others perform it. With OpenMP, the threads would increment nind in whatever order they happen to arrive, first come first served. Is there any way to unbind it? (By "binding" I mean that the loop can only be implemented serially.)

A typical cache-friendly solution in that case is to collect the (i,j) pairs in private arrays, then concatenate those private arrays at the end, and finally sort the result if needed:
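#pragma omp parallel
{
    // Private buffer on the heap: 2*z*x uints may not fit on a thread's stack
    uint *myXY = (uint *)malloc(2*z*x*sizeof(uint));
    uint mynind = 0;
    #pragma omp for collapse(2) schedule(dynamic,N)
    for (uint i = 0; i < x; i++) {
        for (uint j = 0; j < z; j++) {
            if (inFunc(p, index)) {
                myXY[2*mynind] = i;
                myXY[2*mynind + 1] = j;
                mynind++;
            }
        }
    }
    // Append this thread's pairs to the shared array, one thread at a time
    #pragma omp critical(concat_arrays)
    {
        memcpy(&XY[2*nind], myXY, 2*mynind*sizeof(uint));
        nind += mynind;
    }
    free(myXY);
}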
// Comparison function for qsort: orders the pairs by i, then by j
int compar(const void *a, const void *b)
{
    const uint *p1 = (const uint *)a;
    const uint *p2 = (const uint *)b;
    if (p1[0] != p2[0])
        return (p1[0] < p2[0]) ? -1 : 1;
    if (p1[1] != p2[1])
        return (p1[1] < p2[1]) ? -1 : 1;
    return 0;
}
// Sort the pairs if needed
qsort(XY, nind, 2*sizeof(uint), compar);
You should experiment with different values of N in the schedule(dynamic,N) clause in order to achieve the best trade-off between overhead (for small values of N) and load imbalance (for large values of N). The comparison function compar could probably be written in a more optimal way.
The assumption here is that the overhead from merging and sorting the array is small. Whether that will be the case depends on many factors.
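The same collect, concatenate, and sort pattern can be sketched in a self-contained form. This is only an illustration under stated assumptions: `keep` is a hypothetical stand-in for the question's `inFunc(p, index)`, `std::vector` replaces the raw buffers, and the chunk size 64 is an arbitrary placeholder for N.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using Pair = std::pair<unsigned, unsigned>;

// Hypothetical predicate standing in for inFunc(p, index).
inline bool keep(unsigned i, unsigned j) {
    return (i * 31u + j * 17u) % 5u == 0u;
}

std::vector<Pair> collect_pairs(unsigned x, unsigned z) {
    std::vector<Pair> all;  // shared result
    #pragma omp parallel
    {
        std::vector<Pair> mine;  // private to each thread, stored on the heap
        #pragma omp for collapse(2) schedule(dynamic, 64) nowait
        for (unsigned i = 0; i < x; i++) {
            for (unsigned j = 0; j < z; j++) {
                if (keep(i, j)) mine.emplace_back(i, j);
            }
        }
        // Concatenate the private results one thread at a time.
        #pragma omp critical(concat_arrays)
        all.insert(all.end(), mine.begin(), mine.end());
    }
    // Restore the serial (i, j) order; std::pair compares lexicographically.
    std::sort(all.begin(), all.end());
    return all;
}
```

Calling `collect_pairs(512, 512)` should return the same pairs, in the same order, as the serial loop from the question. Without OpenMP enabled the pragmas are ignored and the function simply runs serially.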

Here is a variation on Hristo Iliev's good answer.
The important parameter to act on here is the index of the pairs rather than the pairs themselves.
We can fill private arrays of the pair indices in parallel for each thread. The arrays for each thread will be sorted (irrespective of the scheduling).
The following function merges two sorted arrays
// Merges the sorted arrays a (na elements) and b (nb elements) into c
void merge(const uint *a, const uint *b, uint *c, uint na, uint nb) {
    uint i=0, j=0, k=0;
    while(i<na && j<nb) c[k++] = a[i] < b[j] ? a[i++] : b[j++];
    while(i<na) c[k++] = a[i++];
    while(j<nb) c[k++] = b[j++];
}
Here is the remaining code
uint nind = 0;
uint *P = NULL; // must start as NULL: the first merge reads zero elements from it and then frees it
#pragma omp parallel
{
uint myP[x*z];
uint mynind = 0;
#pragma omp for schedule(dynamic) nowait
for(uint k = 0 ; k < x*z; k++) {
if (inFunc(p, index)) myP[mynind++] = k;
}
#pragma omp critical
{
uint *t = (uint*)malloc(sizeof *P * (nind+mynind));
merge(P, myP, t, nind, mynind);
free(P);
P = t;
nind += mynind;
}
}
Then given an index k in P the pair is (k/z, k%z).
The merging can be improved. Right now it takes O(omp_get_num_threads()) merge steps, but it could be done in O(log2(omp_get_num_threads())) steps. I did not bother with this.
Hristo Iliev pointed out that dynamic scheduling does not guarantee that the iterations assigned to a thread increase monotonically. I think in practice they do, but it's not guaranteed in principle.
If you want to be 100% sure that the iterations increase monotonically, you can implement dynamic scheduling by hand.
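One way to implement dynamic scheduling by hand is a shared counter that hands out chunks of consecutive indices. The sketch below is an illustration, not the code this answer had in mind: `keep_index` is a hypothetical stand-in for `inFunc(p, index)`, the counter uses `std::size_t`, and `std::merge` replaces the hand-written merge. Each thread only ever receives increasing chunk starts, so its private list is sorted by construction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical predicate standing in for inFunc(p, index).
inline bool keep_index(std::size_t k) { return k % 3 == 0; }

std::vector<std::size_t> collect_indices(std::size_t total, std::size_t chunk) {
    assert(chunk > 0);
    std::vector<std::size_t> result;  // shared, always kept sorted
    std::size_t next = 0;             // shared: first index not yet handed out
    #pragma omp parallel
    {
        std::vector<std::size_t> mine;  // private, filled in increasing order
        for (;;) {
            std::size_t start;
            // Atomically grab the next chunk of indices.
            #pragma omp atomic capture
            { start = next; next += chunk; }
            if (start >= total) break;
            const std::size_t stop = std::min(start + chunk, total);
            for (std::size_t k = start; k < stop; k++)
                if (keep_index(k)) mine.push_back(k);
        }
        // Merge the sorted private list into the sorted shared list.
        #pragma omp critical
        {
            std::vector<std::size_t> merged;
            merged.reserve(result.size() + mine.size());
            std::merge(result.begin(), result.end(), mine.begin(), mine.end(),
                       std::back_inserter(merged));
            result.swap(merged);
        }
    }
    return result;
}
```

With `total = x*z`, each returned index `k` decodes to the pair `(k / z, k % z)`. The sketch assumes that `total` plus one extra chunk per thread does not overflow `std::size_t`.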

The code you provide looks like you are trying to fill the XY data in sequential order. In this case OMP multithreading is probably not the tool for the job as threads (in a best case) should avoid communication as much as possible. You could introduce an atomic counter, but then again, it is probably going to be faster just doing it sequentially.
Also what do you want to achieve by optimizing it? The x and z are not too big, so I doubt that you will get a substantial speed increase even if you reformulate your problem in a parallel fashion.
If you do want parallel execution - map your indexes to the array, e.g. (not tested, but should do)
#pragma omp parallel for shared(XY)
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
uint idx = 2 * (i * z + j); // row stride is z, the bound of the inner loop
XY[idx] = i;
XY[idx + 1] = j;
}
}
}
However, you will then have gaps in your array XY, which may or may not be a problem for you.

Related

Parallelism on inner selected sort loop skips few numbers

I am trying to write a program where you can set the number of threads you want, and it will parallelize the selection sort algorithm with the given data and number of threads. I know I should just use another algorithm in this case, but it's for educational purposes only. I ran into a problem when parallelizing the inner loop of the selection sort: a few neighbouring numbers are left unsorted. The whole array is sorted apart from those few pairs of numbers, and I can't find out why.
int* selectionSort(int arr[], int size, int numberOfThreads)
{
int i, j;
int me, n, min_idx;
bool canSwap = false;
#pragma omp parallel num_threads(numberOfThreads) private(i,j,me,n)
{
me = omp_get_thread_num();
n = omp_get_num_threads();
printf("Hello from %d/%d\n", me, n);
for (i = 0; i < size - 1; i++) {
min_idx = i;
canSwap = true;
#pragma omp barrier
#pragma omp for
for (j = i + 1; j < size; j++) {
if (arr[j] < arr[min_idx])
min_idx = j;
//printf("I am %d processing %d,%d\n", me, i, j);
}
printf("Min value %d ---- %d \n", arr[min_idx], min_idx);
#pragma omp critical(swap)
if(canSwap)
{
swap(&arr[min_idx], &arr[i]);
canSwap = false;
}
#pragma omp barrier
}
}
return arr;
}
I found out that the problem is that you can't really parallelize this algorithm (at least not the way I'm doing it), because I'm comparing arr[j] with arr[min_idx] while min_idx is shared.
One thread can finish evaluating if (arr[j] < arr[min_idx]) and, right after that, another thread can change min_idx, so the comparison that was just completed may no longer hold when the first thread writes its own value to min_idx.
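For illustration, one common way around this race is to let every thread track its own minimum and combine the candidates under a critical section. The sketch below assumes a `std::vector<int>` input and a hypothetical function name; it is not the asker's code. It removes the race, but starting a parallel region for every outer iteration is unlikely to make selection sort fast.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Selection sort whose inner minimum search is shared among threads.
void selection_sort_parallel_min(std::vector<int> &arr) {
    const int n = static_cast<int>(arr.size());
    for (int i = 0; i < n - 1; i++) {
        int min_idx = i;  // shared, only written inside the critical section
        #pragma omp parallel
        {
            int local_min = i;  // private candidate of this thread
            #pragma omp for nowait
            for (int j = i + 1; j < n; j++) {
                if (arr[j] < arr[local_min]) local_min = j;
            }
            // Combine the per-thread candidates one thread at a time.
            #pragma omp critical
            if (arr[local_min] < arr[min_idx]) min_idx = local_min;
        }
        // The parallel region has ended, so the swap runs on a single thread.
        std::swap(arr[i], arr[min_idx]);
    }
}
```

Because `arr` is only read inside the parallel region and `min_idx` is only updated under `critical`, no comparison can be invalidated by another thread.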

Usage of OpenMP reduction clause with nested loops

I have the current version of a function:
void*
function(const Input_st *Data, Output_st *Image)
{
int i,j,r,Offset;
omp_set_num_threads(24);
#pragma omp parallel for schedule(static) shared(Data,Image),\
private(i,j,r,Offset)
for (i = 0; i < Data->NX; i++)
{
for (j = 0; j < (Data->NZ); j++)
{
for (r = 0; r < Data->NR; r++)
{
Offset = i*Data->NR*Data->NZ + j*Data->NR + r;
Image->pTime[Offset] = function2();
}
}
}
return NULL;
}
It works very well. However, I wanted to remove the calculation of the variable Offset and instead use a pointer to the member Image->pTime that I then increment, which could look like the following:
void*
function(const Input_st *Data, Output_st *Image)
{
int i, j, r;
double *pTime = Image->pTime;
omp_set_num_threads(24);
#pragma omp parallel for schedule(static) shared(Data,Image),\
private(i,j,r)
for (i = 0; i < Data->NX; i++)
{
for (j = 0; j < (Data->NZ); j++)
{
for (r = 0; r < Data->NR; r++)
{
*pTime = function2();
pTime++;
}
}
}
return NULL;
}
I get a segmentation fault. I assume I need to use the reduction clause, like reduction(+:pTime).
First, the purpose here is to speed up the function, and I am wondering whether such a change would speed it up significantly (for example through less cache memory being used).
Second, I tried to benchmark it and failed to do so! I think the problem here can be solved by using a reduction clause, but since the loops are nested the solution is not that straightforward to me.
There's no need for any sort of reduction clause here. However, at the moment, all threads use the same pointer and update the same memory location (with race conditions on the value of pTime, hence the crashes, I suspect).
So you need to make your pointer private (typically by declaring it within the parallel region) and set it individually per thread to a meaningful value. Then it can be incremented the way you want.
Here is what the code could look like once fixed (not tested obviously):
void* function( const Input_st *Data, Output_st *Image ) {
#pragma omp parallel for schedule( static ) num_threads( 24 )
for ( int i = 0; i < Data->NX; i++ ) {
double *pTime = Image->pTime + i * Data->NR * Data->NZ;
for ( int j = 0; j < Data->NZ; j++ ) {
for ( int r = 0; r < Data->NR; r++ ) {
*pTime = function2();
pTime++;
}
}
}
return NULL;
}
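To check the idea in isolation, here is a minimal self-contained sketch of the same per-row pointer technique. It uses assumptions that are not in the question: a plain `std::vector<double>` instead of `Output_st`, and a hypothetical `sample_value` in place of `function2()` so that every element records which indices produced it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for function2(): encodes the indices it was called with.
inline double sample_value(int i, int j, int r) {
    return 10000.0 * i + 100.0 * j + r;
}

std::vector<double> fill_by_pointer(int nx, int nz, int nr) {
    assert(nx >= 0 && nz >= 0 && nr >= 0);
    std::vector<double> image(static_cast<std::size_t>(nx) * nz * nr);
    double *base = image.data();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nx; i++) {
        // Declared inside the loop body, so each thread has its own pointer,
        // starting at the first element of row i.
        double *pTime = base + static_cast<std::size_t>(i) * nz * nr;
        for (int j = 0; j < nz; j++) {
            for (int r = 0; r < nr; r++) {
                *pTime++ = sample_value(i, j, r);
            }
        }
    }
    return image;
}
```

Element `i*nr*nz + j*nr + r` of the result should equal `sample_value(i, j, r)`, the same layout as the original `Offset` formula.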

Applying OpenMP to particular nested loops in C++

I've a problem in parallelizing a piece of code with openmp, I think that there is a conceptual problem with some operations that have to be made sequentially.
else if (PERF_ROWS <= MAX_ROWS && function_switch == true)
{
int array_dist_perf[PERF_ROWS];
int array_dist[MAX_ROWS];
#pragma omp parallel for collapse(2)
for (int i = 0; i < MAX_COLUMNS;
i = i + 1 + (i % PERF_CLMN == 0 ? 1:0))
{
for (int j = 0; j < PERF_ROWS; j++) //truncation perforation
{
array_dist_perf[j] = abs(input[j] - input_matrix[j][i]);
}
float av = mean(PERF_ROWS, array_dist_perf);
float score = score_func(av);
if (score > THRESHOLD_SCORE)
{
for (int k = 0; k < MAX_ROWS; k++)
{
array_dist[k] = abs(input[k] - input_matrix[k][i]);
}
float av_real = mean(MAX_ROWS, array_dist);
float score_real = score_func(av_real);
rank_function(score_real, i);
}
}
}
The error is that "collapsed loops are not perfectly nested". I'm using Clion on g++-5. Thanks in advance
First of all, perfectly nested loops have the following form:
for (init1; cond1; inc1)
{
for (init2; cond2; inc2)
{
...
}
}
Notice that the body of the outer loop consists solely of the inner loop and nothing else. This is definitely not the case with your code - you have plenty of other statements following the inner loop.
Second, your outer loop is not in the canonical form required by OpenMP. Canonical loops are those for which the number of iterations and the iteration step can be easily pre-determined. Since what your loop does is skip the iteration that follows each multiple of PERF_CLMN (i advances by 2 whenever i is a multiple), you can rewrite the loop as:
for (int i = 0; i < MAX_COLUMNS; i++)
{
if (i % PERF_CLMN == 1) continue;
...
}
This will create work imbalance depending on whether MAX_COLUMNS is a multiple of the number of threads or not. But there is yet another source of imbalance, namely the conditional evaluation of rank_function(). You should therefore utilise dynamic scheduling.
Now, apparently both array_dist* arrays are meant to be private, which they are not in your case, and that will result in data races. Either move the definition of the arrays into the loop body or use the private() clause.
#pragma omp parallel for schedule(dynamic) private(array_dist_perf,array_dist)
for (int i = 0; i < MAX_COLUMNS; i++)
{
if (i % PERF_CLMN == 1) continue;
...
}
Now, for some unsolicited optimisation advice: the two inner loops are redundant as the first one is basically doing a subset of the work of the second one. You can optimise the computation and save on memory by using a single array only and let the second loop continue from where the first one ends. The final version of the code should look like:
else if (PERF_ROWS <= MAX_ROWS && function_switch == true)
{
int array_dist[MAX_ROWS];
#pragma omp parallel for schedule(dynamic) private(array_dist)
for (int i = 0; i < MAX_COLUMNS; i++)
{
if (i % PERF_CLMN == 1) continue;
for (int j = 0; j < PERF_ROWS; j++) //truncation perforation
{
array_dist[j] = abs(input[j] - input_matrix[j][i]);
}
float av = mean(PERF_ROWS, array_dist);
float score = score_func(av);
if (score > THRESHOLD_SCORE)
{
for (int k = PERF_ROWS; k < MAX_ROWS; k++)
{
array_dist[k] = abs(input[k] - input_matrix[k][i]);
}
float av_real = mean(MAX_ROWS, array_dist);
float score_real = score_func(av_real);
rank_function(score_real, i);
}
}
}
Another potential for optimisation lies in the fact that input_matrix is not accessed in a cache-friendly way. Transposing it will result in the data of each column being stored contiguously in memory and improve the memory access locality.
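The rewrite of the outer loop into canonical form can be checked with a small sketch that records which values of i each version visits. The function names are hypothetical, and the check assumes PERF_CLMN is at least 2. With PERF_CLMN equal to 1 the two loops differ, because the original then advances by 2 on every step while `i % 1 == 1` is never true.

```cpp
#include <cassert>
#include <vector>

// Values of i visited by the original, non-canonical loop header.
std::vector<int> original_order(int max_columns, int perf_clmn) {
    assert(perf_clmn >= 2);
    std::vector<int> visited;
    for (int i = 0; i < max_columns;
         i = i + 1 + (i % perf_clmn == 0 ? 1 : 0)) {
        visited.push_back(i);
    }
    return visited;
}

// Values of i visited by the canonical loop with an explicit skip.
std::vector<int> canonical_order(int max_columns, int perf_clmn) {
    assert(perf_clmn >= 2);
    std::vector<int> visited;
    for (int i = 0; i < max_columns; i++) {
        if (i % perf_clmn == 1) continue;  // iteration right after a multiple
        visited.push_back(i);
    }
    return visited;
}
```

For example, with `max_columns = 10` and `perf_clmn = 3` both functions visit 0, 2, 3, 5, 6, 8, 9.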

Max Reduction Open MP 2.0 Visual Studio 2013 C/C++

I'm new here and this is my first question on this site.
I am writing a simple program to find the maximum value of a vector c that is a function of two other vectors a and b. I'm doing it in Microsoft Visual Studio 2013, and the problem is that it only supports OpenMP 2.0, so I cannot use a reduction to directly find the max or min value of a vector, because OpenMP 2.0 does not support this operation.
I'm trying to do it without the reduction clause, using the following code:
for (i = 0; i < NUM_THREADS; i++){
cMaxParcial[i] = - FLT_MAX;
}
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for private (i,j,indice)
for (i = 0; i < N; i++){
for (j = 0; j < N; j++){
indice = omp_get_thread_num();
if (c[i*N + j] > cMaxParcial[indice]){
cMaxParcial[indice] = c[i*N + j];
bMaxParcial[indice] = b[j];
aMaxParcial[indice] = a[i];
}
}
}
cMax = -FLT_MAX;
for (i = 0; i < NUM_THREADS; i++){
if (cMaxParcial[i]>cMax){
cMax = cMaxParcial[i];
bMax = bMaxParcial[i];
aMax = aMaxParcial[i];
}
}
I'm getting the error: "The expression must have integral or unscoped enum type"
on the command cMaxParcial[indice] = c[i*N + j];
Can anybody help me with this error?
Normally, the error is caused by one of the indices not being of an integer type. Since you haven't shown the code where i, j, N and indice are declared, my guess is that either N or indice is a float or double, but it would be simpler to answer if you had provided an MCVE. However, the line above it seems to have used the same indices correctly. This leads me to believe that it's an IntelliSense error, and those are often false positives. Try compiling the code and running it.
Now, on to issues that you haven't (yet) asked about (why is my parallel code slower than my serial code?). You're causing false sharing by using (presumably) contiguous arrays to find the a, b, and c values of each thread. Instead of using a single pragma for parallel and for, split it up like so:
cMax = -FLT_MAX;
#pragma omp parallel
{
    // Per-thread running maximum; it must start at the lowest possible value
    float aMaxParcialPerThread;
    float bMaxParcialPerThread;
    float cMaxParcialPerThread = -FLT_MAX;
    #pragma omp for nowait private (i,j)
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            if (c[i*N + j] > cMaxParcialPerThread){
                cMaxParcialPerThread = c[i*N + j];
                bMaxParcialPerThread = b[j];
                aMaxParcialPerThread = a[i];
            } // if
        } // for j
    } // for i
    // Combine the per-thread maxima, one thread at a time
    #pragma omp critical
    {
        if (cMaxParcialPerThread > cMax) {
            cMax = cMaxParcialPerThread;
            bMax = bMaxParcialPerThread;
            aMax = aMaxParcialPerThread;
        }
    }
}
I don't know what is wrong with your compiler since (as far as I can see with only the partial data you gave), the code seems valid. However, it is a bit convoluted and not so good.
What about the following:
#include <omp.h>
#include <float.h>
extern int N, NUM_THREADS;
extern float aMax, bMax, cMax, *a, *b, *c;
void foo() {
cMax = -FLT_MAX;
#pragma omp parallel num_threads( NUM_THREADS )
{
float localAMax, localBMax, localCMax = -FLT_MAX;
#pragma omp for
for ( int i = 0; i < N; i++ ) {
for ( int j = 0; j < N; j++ ) {
float pivot = c[i*N + j];
if ( pivot > localCMax ) {
localAMax = a[i];
localBMax = b[j];
localCMax = pivot;
}
}
}
#pragma omp critical
{
if ( localCMax > cMax ) {
aMax = localAMax;
bMax = localBMax;
cMax = localCMax;
}
}
}
}
It compiles but I haven't tested it...
Anyway, I avoided using the [a-c]MaxParcial arrays since they will generate false sharing between the threads, leading to poor performance. The final reduction is done with a critical section. It is not ideal, but it will perform well as long as you have a moderate number of threads. If you see a hot spot there, or you need to use a large number of threads, it can be optimised later with a proper parallel reduction.

OpenMP even/odd decomposition of a nested loop

I have a part of my code that could be done in parallel, so I started to read about OpenMP and worked through the introductory examples. Now I am trying to apply it to the following problem, presented schematically here:
Grid.h
class Grid
{
public:
// has a grid member variable
std::vector<std::vector<int>> 2Dgrid;
// modifies the components of the 2Dgrid; no push_back() etc. is used that could possibly disturb the use of OpenMP
update_grid(int,int,int,int);
};
Test.h
class Test
{
public:
Grid grid1;
Grid grid2;
update();
repeat_update();
};
Test.cc
.
.
.
Test::repeat_update() {
for(int i=0;i<100000;i++)
update();
}
Test::update() {
int colIndex = 0;
int rowIndex = 0;
int rowIndexPlusOne = rowIndex + 1;
int colIndexPlusOne = colIndex + 1;
// DIRECTION_X (grid[0].size()), DIRECTION_Y (grid.size) are the size of the grid
for (int i = 0; i < DIRECTION_Y; i++) {
// periodic boundary conditions
if (rowIndexPlusOne > DIRECTION_Y - 1)
rowIndexPlusOne = 0;
// The following could be done parallel!!!
for (int j = 0; j < DIRECTION_X - 1; j++) {
grid1.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
grid2.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
colIndexPlusOne++;
colIndex++;
}
colIndex = 0;
colIndexPlusOne = 1;
rowIndex++;
rowIndexPlusOne++;
}
}
.
.
.
The thing is, the updates done in Test::update(...) could be done in parallel, since Grid::update_grid(...) only depends on the nearest neighbours in the grid. So, for example, in the inner loop multiple threads could do the work for colIndex = 0,2,4,... independently; that would be the even decomposition. After that, the odd indices colIndex = 1,3,5,... could be updated. Then the outer loop moves one step forward and the updates in direction x could again be done in parallel. I have 16 cores at my disposal, and the parallelization could be a nice time saver. But I really don't see how this could be done, mainly because I don't know how to keep track of colIndex, rowIndex, etc., since #pragma omp parallel for is applied to the i,j indices. I would be grateful if somebody could show me the path out of the darkness.
Without knowing exactly what update_grid(int,int,int,int) does, it's kinda tricky to give a definitive answer. You show an embedded pair of loops of the type
for(int i = 0; i < Y; i++)
{
for(int j = 0; j < X; j++)
{
//...
}
}
and assert that the j loop can be done in parallel. This would be an example of fine grained parallelism. You could alternatively parallelize the i loop, in what would be a more coarse grained parallelization. If the amount of work of each individual thread is roughly equal, the coarse graining method has the advantage of less overhead (assuming that the parallelization of the two loops is equivalent).
There are a few things that you have to be careful of when parallelizing the loops. For starters, you increment colIndexPlusOne and colIndex in the inner loop. If you have multiple threads and a single variable for colIndexPlusOne and colIndex, then each thread will increment the variable and/or have race conditions. You can bypass that in several manners, either giving each thread a copy of the variable, or making the increment atomic or critical, or by removing the dependency of the variable altogether and calculating what it should be for each step of the loop on the fly.
I would start with parallelizing the entire update function as such:
Test::update()
{
#pragma omp parallel
{
int colIndex = 0;
int colIndexPlusOne = colIndex + 1;
// DIRECTION_X (grid[0].size()), DIRECTION_Y (grid.size) are the size of the grid
#pragma omp for
for (int i = 0; i < DIRECTION_Y; i++)
{
int rowIndex = i;
int rowIndexPlusOne = rowIndex + 1;
// periodic boundary conditions
if (rowIndexPlusOne > DIRECTION_Y - 1)
rowIndexPlusOne = 0;
// The following could be done parallel!!!
for (int j = 0; j < DIRECTION_X - 1; j++)
{
grid1.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
grid2.update_grid(rowIndex,colIndex,rowIndexPlusOne,colIndexPlusOne);
// The following two can be replaced by j and j+1...
colIndexPlusOne++;
colIndex++;
}
colIndex = 0;
colIndexPlusOne = 1;
// No longer needed:
// rowIndex++;
// rowIndexPlusOne++;
}
}
}
By placing #pragma omp parallel at the beginning, all the variables declared inside the parallel region are local to each thread. Also, at the beginning of the i loop, I assigned rowIndex = i, since that is what it equals in the code shown. The same could be done for the j loop and colIndex.
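Taking that one step further, all four indices can be computed from i and j alone, which leaves no counters to keep track of. The sketch below is only an illustration with assumed names: `CellIndices` and `stencil_indices` are hypothetical, and the function records the arguments that would be passed to `update_grid` instead of calling it, since the body of `update_grid` is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The four arguments passed to update_grid for one cell.
struct CellIndices {
    int row, col, rowPlusOne, colPlusOne;
};

std::vector<CellIndices> stencil_indices(int directionY, int directionX) {
    assert(directionY > 0 && directionX > 1);
    const std::size_t perRow = static_cast<std::size_t>(directionX - 1);
    std::vector<CellIndices> out(static_cast<std::size_t>(directionY) * perRow);
    #pragma omp parallel for
    for (int i = 0; i < directionY; i++) {
        // Periodic boundary condition: the last row wraps around to row 0.
        const int rowPlusOne = (i + 1) % directionY;
        for (int j = 0; j < directionX - 1; j++) {
            // Every thread writes only to its own slots, so there is no race.
            out[static_cast<std::size_t>(i) * perRow + j] = {i, j, rowPlusOne, j + 1};
        }
    }
    return out;
}
```

Whether the real `update_grid` calls can run concurrently still depends on which cells each call reads and writes, for example whether rows i and i+1 may be updated at the same time.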