Usage of OpenMP reduction clause with nested loops - c++

I have the current version of a function:
void*
function(const Input_st *Data, Output_st *Image)
{
int i,j,r,Offset;
omp_set_num_threads(24);
#pragma omp parallel for schedule(static) shared(Data,Image),\
private(i,j,r,Offset)
for (i = 0; i < Data->NX; i++)
{
for (j = 0; j < (Data->NZ); j++)
{
for (r = 0; r < Data->NR; r++)
{
Offset = i*Data->NR*Data->NZ + j*Data->NR + r;
Image->pTime[Offset] = function2()
}
}
}
return NULL;
}
It works very well, however I wanted to remove the calculation of the variable Offset and use of a pointer pointing to the member Image->pTimeR and then increment, which can look like following:
void*
function(const Input_st *Data, Output_st *Image)
{
int i, j, r;
double *pTime = Image->pTime;
omp_set_num_threads(24);
#pragma omp parallel for schedule(static) shared(Data,Image),\
private(i,j,r)
for (i = 0; i < Data->NX; i++)
{
for (j = 0; j < (Data->NZ); j++)
{
for (r = 0; r < Data->NR; r++)
{
*pTime = function2()
pTime++;
}
}
}
return NULL;
}
I get Seg Fault. I assume I need to use the reduction clause like reduction(+:pTime).
First, the purpose here is to speed up the function and I am wondering if such change would significantly speed up? (Like less cache memory used?)
Second, well I tried to benchmark it and failed to do so! I think the problem here can be solved by using a reduction clause, but since loops are nested the problem is not that straightforward to me.

There's no need of any sort of reduction clause here. However,at the moment, all threads use the same pointer and update the same memory location (with race conditions in the value assigned to pTime, hence the crashes I suspect).
So you need to define your pointer in a private way (typically by declaring it within the parallel region, and to set it individually per thread to a meaningful value. Then it can be incremented the way you want.
Here is what the code could look like once fixed (not tested obviously):
void* function( const Input_st *Data, Output_st *Image ) {
#pragma omp parallel for schedule( static ) num_threads( 24 )
for ( int i = 0; i < Data->NX; i++ ) {
double *pTime = Image->pTime + i * Data->NR * Data->NZ;
for ( int j = 0; j < Data->NZ; j++ ) {
for ( int r = 0; r < Data->NR; r++ ) {
*pTime = function2();
pTime++;
}
}
}
return NULL;
}

Related

OpenMP: are variables automatically shared?

We already know that local variables are automatically private.
int i,j;
#pragma omp parallel for private(i,j)
for(i = 0; i < n; i++) {
for(j = 0; j < n; j++) {
//do something
}
}
it is equal to
#pragma omp parallel for
for(int i = 0; i < n; i++) {
for(int j = 0; j < n; j++) {
//do something
}
}
How about the other variables ? are they shared by default or do we need to specify shared() like in the example below ?
bool sorted(int *a, int size)
{
bool sorted = true;
#pragma omp parallel default(none) shared(a, size, sorted)
{
#pragma omp for reduction (&:sorted)
for (int i = 0; i < size - 1; i++) sorted &= (a[i] <= a[i + 1]);
}
return sorted;
}
Can we not specify the shared and work directly with our (a, size, sorted) variables ?
They are shared by default in a parallel section unless there is a default clause specified. Note that this is not the case for all OpenMP directives. For example, this is generally not the case for tasks. You can read that from the OpenMP specification:
In a parallel, teams, or task generating construct, the data-sharing attributes of these variables are determined by the default clause, if present (see Section 5.4.1).
In a parallel construct, if no default clause is present, these variables are shared.
For constructs other than task generating constructs, if no default clause is present, these variables reference the variables with the same names that exist in the enclosing context.
In a target construct, variables that are not mapped after applying data-mapping attribute rules (see Section 5.8) are firstprivate.
You can safely rewrite your code as the following:
bool sorted(int *a, int size)
{
bool sorted = true;
#pragma omp parallel for reduction(&:sorted)
for (int i = 0; i < size - 1; i++)
sorted &= a[i] <= a[i + 1];
return sorted;
}

how can I parallel this loop with innerloops using openMP?

I am not good in parallel programming but I need to do this loop more faster. I know how to use the library but I am not sure about if the output is right because the output is very close to be zero, but is not.
for ( int l = 0; l < s; ++l ) {
double denonimator_l = 0;
for (int g = 0; g < G; ++g ) {
double &pi_gl = pi(g, l) = w[g];
for (int i = 0; i < p; ++i )
pi_gl *= P[g](i, Y(l, i) - 1);
denonimator_l += pi_gl;
}
for (int g = 0; g < G; ++g ) {
double &pi_gl = pi(g, l);
pi_gl /= denonimator_l;
}
}
I need to parallel the four loops and not just the first.
#pragma omp parallel for schedule(dynamic)
So, I am not sure if I put the line above in the first loop is enough. Also, I need the correct output and I don't know if I am doing something wrong if I don't use private variables because I am using read/write operations.
The loops are not prefect nested so I can't use collapse.
Note that each cycle has a temporary loop variable.

Applying OpenMP to particular nested loops in C++

I've a problem in parallelizing a piece of code with openmp, I think that there is a conceptual problem with some operations that have to be made sequentially.
else if (PERF_ROWS <= MAX_ROWS && function_switch == true)
{
int array_dist_perf[PERF_ROWS];
int array_dist[MAX_ROWS];
#pragma omp parallel for collapse(2)
for (int i = 0; i < MAX_COLUMNS;
i = i + 1 + (i % PERF_CLMN == 0 ? 1:0))
{
for (int j = 0; j < PERF_ROWS; j++) //truncation perforation
{
array_dist_perf[j] = abs(input[j] - input_matrix[j][i]);
}
float av = mean(PERF_ROWS, array_dist_perf);
float score = score_func(av);
if (score > THRESHOLD_SCORE)
{
for (int k = 0; k < MAX_ROWS; k++)
{
array_dist[k] = abs(input[k] - input_matrix[k][i]);
}
float av_real = mean(MAX_ROWS, array_dist);
float score_real = score_func(av_real);
rank_function(score_real, i);
}
}
}
The error is that "collapsed loops are not perfectly nested". I'm using Clion on g++-5. Thanks in advance
First of all, perfectly nested loops have the following form:
for (init1; cond1; inc1)
{
for (init2; cond2; inc2)
{
...
}
}
Notice that the body of the outer loop consists solely of the inner loop and nothing else. This is definitely not the case with your code - you have plenty of other statements following the inner loop.
Second, your outer loop is not in the canonical form required by OpenMP. Canonical are loops for which the number of iterations and the iteration step can be easily pre-determined. Since what you are doing is skip an iteration each time i is a multiple of PERF_CLMN, you can rewrite the loop as:
for (int i = 0; i < MAX_COLUMNS; i++)
{
if (i % PERF_CLMN == 1) continue;
...
}
This will create work imbalance depending on whether MAX_COLUMNS is a multiple of the number of threads or not. But there is yet another source or imbalance, namely the conditional evaluation of rank_function(). You should therefore utilise dynamic scheduling.
Now, apparently both array_dist* loops are meant to be private, which they are not in your case and that will result in data races. Either move the definition of the arrays within the loop body or use the private() clause.
#pragma omp parallel for schedule(dynamic) private(array_dist_perf,array_dist)
for (int i = 0; i < MAX_COLUMNS; i++)
{
if (i % PERF_CLMN == 1) continue;
...
}
Now, for some unsolicited optimisation advice: the two inner loops are redundant as the first one is basically doing a subset of the work of the second one. You can optimise the computation and save on memory by using a single array only and let the second loop continue from where the first one ends. The final version of the code should look like:
else if (PERF_ROWS <= MAX_ROWS && function_switch == true)
{
int array_dist[MAX_ROWS];
#pragma omp parallel for schedule(dynamic) private(array_dist)
for (int i = 0; i < MAX_COLUMNS; i++)
{
if (i % PERF_CLMN == 1) continue;
for (int j = 0; j < PERF_ROWS; j++) //truncation perforation
{
array_dist[j] = abs(input[j] - input_matrix[j][i]);
}
float av = mean(PERF_ROWS, array_dist);
float score = score_func(av);
if (score > THRESHOLD_SCORE)
{
for (int k = PERF_ROWS; k < MAX_ROWS; k++)
{
array_dist[k] = abs(input[k] - input_matrix[k][i]);
}
float av_real = mean(MAX_ROWS, array_dist);
float score_real = score_func(av_real);
rank_function(score_real, i);
}
}
}
Another potential for optimisation lies in the fact that input_matrix is not accessed in a cache-friendly way. Transposing it will result in columns data being stored continuously in memory and improve the memory access locality.

openMP for loop increment statment handling

for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
XY[2*nind] = i;
XY[2*nind + 1] = j;
nind++;
}
}
}
here x = 512 and z = 512 and nind = 0 initially
and XY[2*x*y].
I want to optimize this for loops with openMP but 'nind' variable is closely binded serially to for loop. I have no clue because I am also checking a condition and so some of the time it will not enter in if and will skip increment or it will enter increment nind. openMP threads will increment nind variable as first come will increment nind firstly. Is there any way to unbind it. ('binding' I mean only can be implemented serially).
A typical cache-friendly solution in that case is to collect the (i,j) pairs in private arrays, then concatenate those private arrays at the end, and finally sort the result if needed:
#pragma omp parallel
{
uint myXY[2*z*x];
uint mynind = 0;
#pragma omp for collapse(2) schedule(dynamic,N)
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
myXY[2*mynind] = i;
myXY[2*mynind + 1] = j;
mynind++;
}
}
}
#pragma omp critical(concat_arrays)
{
memcpy(&XY[2*nind], myXY, 2*mynind*sizeof(uint));
nind += mynind;
}
}
// Sort the pairs if needed
qsort(XY, nind, 2*sizeof(uint), compar);
int compar(const uint *p1, const uint *p2)
{
if (p1[0] < p2[0])
return -1;
else if (p1[0] > p2[0])
return 1;
else
{
if (p1[1] < p2[1])
return -1;
else if (p1[1] > p2[1])
return 1;
}
return 0;
}
You should experiment with different values of N in the schedule(dynamic,N) clause in order to achieve the best trade-off between overhead (for small values of N) and load imbalance (for large values of N). The comparison function compar could probably be written in a more optimal way.
The assumption here is that the overhead from merging and sorting the array is small. Whether that will be the case depends on many factors.
Here is a variation on Hristo Iliev's good answer.
The important parameter to act on here is the index of the pairs rather than the pairs themselves.
We can fill private arrays of the pair indices in parallel for each thread. The arrays for each thread will be sorted (irrespective of the scheduling).
The following function merges two sorted arrays
void merge(int *a, int *b, int*c, int na, int nb) {
int i=0, j=0, k=0;
while(i<na && j<nb) c[k++] = a[i] < b[j] ? a[i++] : b[j++];
while(i<na) c[k++] = a[i++];
while(j<nb) c[k++] = b[j++];
}
Here is the remaining code
uint nind = 0;
uint *P;
#pragma omp parallel
{
uint myP[x*z];
uint mynind = 0;
#pragma omp for schedule(dynamic) nowait
for(uint k = 0 ; k < x*z; k++) {
if (inFunc(p, index)) myP[mynind++] = k;
}
#pragma omp critical
{
uint *t = (uint*)malloc(sizeof *P * (nind+mynind));
merge(P, myP, t, nind, mynind);
free(P);
P = t;
nind += mynind;
}
}
Then given an index k in P the pair is (k/z, k%z).
The merging can be improved. Right now it goes at O(omp_get_num_threads()) but it could be done in O(log2(omp_get_num_threads())). I did not bother with this.
Hristo Iliev's pointed out that dynamic scheduling does not guarantee that the iterations per thread increase monotonically. I think in practice they are but it's not guaranteed in principle.
If you want to be 100% sure that the iterations increase monotonically you can implement dynamic scheduling by hand.
The code you provide looks like you are trying to fill the XY data in sequential order. In this case OMP multithreading is probably not the tool for the job as threads (in a best case) should avoid communication as much as possible. You could introduce an atomic counter, but then again, it is probably going to be faster just doing it sequentially.
Also what do you want to achieve by optimizing it? The x and z are not too big, so I doubt that you will get a substantial speed increase even if you reformulate your problem in a parallel fashion.
If you do want parallel execution - map your indexes to the array, e.g. (not tested, but should do)
#pragma omp parallel for shared(XY)
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
uint idx = (2 * i) * x + 2 * j;
XY[idx] = i;
XY[idx + 1] = j;
}
}
}
However, you will have gaps in your array XY then. Which may or may not be a problem for you.

Max Reduction Open MP 2.0 Visual Studio 2013 C/C++

I'm new here and this is my first question in this site;
I am doing a simple program to find a maximum value of a vector c that is function of two other vectors a and b. I'm doing it on Microsoft Visual Studio 2013 and the problem is that it only support OpenMP 2.0 and I cannot do a Reduction operation to find directy the max or min value of a vector, because OpenMP 2.0 does not supports this operation.
I'm trying to do the without the constructor reduction with the following code:
for (i = 0; i < NUM_THREADS; i++){
cMaxParcial[i] = - FLT_MAX;
}
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for private (i,j,indice)
for (i = 0; i < N; i++){
for (j = 0; j < N; j++){
indice = omp_get_thread_num();
if (c[i*N + j] > cMaxParcial[indice]){
cMaxParcial[indice] = c[i*N + j];
bMaxParcial[indice] = b[j];
aMaxParcial[indice] = a[i];
}
}
}
cMax = -FLT_MAX;
for (i = 0; i < NUM_THREADS; i++){
if (cMaxParcial[i]>cMax){
cMax = cMaxParcial[i];
bMax = bMaxParcial[i];
aMax = aMaxParcial[i];
}
}
I'm getting the error: "The expression must have integral or unscoped enum type"
on the command cMaxParcial[indice] = c[i*N + j];
Can anybody help me with this error?
Normally, the error is caused by one of the indices not being in integer type. Since you haven't shown the code where i, j, N and indice are declared, my guess is that either N or indice is a float or double, but it would be simpler to answer if you had provided a MCVE. However, the line above it seems to have used the same indices correctly. This leads me to believe that it's an IntelliSense error, which often are false positives. Try compiling the code and running it.
Now, on to issues that you haven't (yet) asked about (why is my parallel code slower than my serial code?). You're causing false sharing by using (presumably) contiguous arrays to find the a, b, and c values of each thread. Instead of using a single pragma for parallel and for, split it up like so:
cMax = -FLT_MAX;
#pragma omp parallel
{
float aMaxParcialPerThread;
float bMaxParcialPerThread;
float cMaxParcialPerThread;
#pragma omp for nowait private (i,j)
for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
if (c[i*N + j] > cMaxParcialPerThread){
cMaxParcialPerThread = c[i*N + j];
bMaxParcialPerThread = b[j];
aMaxParcialPerThread = a[i];
} // if
} // for j
} // for i
#pragma omp critical
{
if (cMaxParcialPerThread < cMax) {
cMax = cMaxParcialPerThread;
bMax = bMaxParcialPerThread;
aMax = aMaxParcialPerThread;
}
}
}
I don't know what is wrong with your compiler since (as far as I can see with only the partial data you gave), the code seems valid. However, it is a bit convoluted and not so good.
What about the following:
#include <omp.h>
#include <float.h>
extern int N, NUM_THREADS;
extern float aMax, bMax, cMax, *a, *b, *c;
int foo() {
cMax = -FLT_MAX;
#pragma omp parallel num_threads( NUM_THREADS )
{
float localAMax, localBMax, localCMax = -FLT_MAX;
#pragma omp for
for ( int i = 0; i < N; i++ ) {
for ( int j = 0; j < N; j++ ) {
float pivot = c[i*N + j];
if ( pivot > localCMax ) {
localAMax = a[i];
localBMax = b[j];
localCMax = pivot;
}
}
}
#pragma omp critical
{
if ( localCMax > cMax ) {
aMax = localAMax;
bMax = localBMax;
cMax = localCMax;
}
}
}
}
It compiles but I haven't tested it...
Anyway, I avoided using the [a-c]MaxParcial arrays since they will generate false sharing between the threads, leading to poor performance. The final reduction is done based on critical. It is not ideal, but will perform perfectly as long as you have a "moderated" number of threads. If you see some hot spot there or you need to use a "large" number of threads, it can be optimised better with a proper parallel reduction later.