OpenMP: are variables automatically shared? - c++

We already know that local variables are automatically private.
int i,j;
#pragma omp parallel for private(i,j)
for(i = 0; i < n; i++) {
for(j = 0; j < n; j++) {
//do something
}
}
This is equivalent to:
#pragma omp parallel for
for(int i = 0; i < n; i++) {
for(int j = 0; j < n; j++) {
//do something
}
}
What about the other variables? Are they shared by default, or do we need to specify shared() as in the example below?
bool sorted(int *a, int size)
{
bool sorted = true;
#pragma omp parallel default(none) shared(a, size, sorted)
{
#pragma omp for reduction (&:sorted)
for (int i = 0; i < size - 1; i++) sorted &= (a[i] <= a[i + 1]);
}
return sorted;
}
Can we omit the shared clause and work directly with our (a, size, sorted) variables?

They are shared by default in a parallel region unless a default clause is specified. Note that this is not the case for all OpenMP directives; for example, it is generally not the case for tasks. This follows from the OpenMP specification:
In a parallel, teams, or task generating construct, the data-sharing attributes of these variables are determined by the default clause, if present (see Section 5.4.1).
In a parallel construct, if no default clause is present, these variables are shared.
For constructs other than task generating constructs, if no default clause is present, these variables reference the variables with the same names that exist in the enclosing context.
In a target construct, variables that are not mapped after applying data-mapping attribute rules (see Section 5.8) are firstprivate.
You can safely rewrite your code as the following:
bool sorted(int *a, int size)
{
bool sorted = true;
#pragma omp parallel for reduction(&:sorted)
for (int i = 0; i < size - 1; i++)
sorted &= a[i] <= a[i + 1];
return sorted;
}
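To make the rule concrete, here is a minimal sketch (the function names count_descents and count_descents_explicit are hypothetical, not from the question). The first version relies on the implicit rule that a and size are shared in a parallel construct; the second adds default(none), which then requires every outer variable referenced in the region to be listed explicitly. Both should give the same result.

```cpp
// Counts positions where a[i] > a[i + 1].
// No shared() clause: a and size are shared implicitly,
// while the loop variable i is private.
int count_descents(int *a, int size)
{
    int count = 0;
    #pragma omp parallel for reduction(+:count)
    for (int i = 0; i < size - 1; i++)
        if (a[i] > a[i + 1]) count++;
    return count;
}

// Same computation, but default(none) disables the implicit rule,
// so a and size must be listed (count is covered by the reduction clause).
int count_descents_explicit(int *a, int size)
{
    int count = 0;
    #pragma omp parallel for default(none) shared(a, size) reduction(+:count)
    for (int i = 0; i < size - 1; i++)
        if (a[i] > a[i + 1]) count++;
    return count;
}
```

Using default(none) is optional, but it can help catch variables that were shared by accident.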

Related

Parallelization of dependent nested loops

I am writing a simple N-body program in C++ and I am using OpenMP to speed up the computations. At some point, I have nested loops that look like this:
int N;
double* S = new double[N];
double* Weight = new double[N];
double* Coordinate = new double[N];
...
#pragma omp parallel for
for (int i = 0; i < N; ++i)
{
for (int j = 0; j < i; ++j)
{
double K = Coordinate[i] - Coordinate[j];
S[i] += K*Weight[j];
S[j] -= K*Weight[i];
}
}
The issue is that I do not obtain exactly the same result as when I remove the #pragma. I am guessing it has to do with the fact that the inner loop depends on the integer i, but I don't see how to get past that issue.
The problem is a data race when updating S[i] and S[j]. Different threads may read from and write to the same element of the array at the same time, so each update must be an atomic operation (add #pragma omp atomic) to avoid the data race and to ensure memory consistency:
for (int j = 0; j < i; ++j)
{
double K = Coordinate[i] - Coordinate[j];
#pragma omp atomic
S[i] += K*Weight[j];
#pragma omp atomic
S[j] -= K*Weight[i];
}
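As an alternative to atomics (not part of the answer above), an array-section reduction may also work, assuming a compiler that supports OpenMP 4.5 or later. In this sketch the function name accumulate is hypothetical and S is assumed to be zero-initialized by the caller. Each thread accumulates into its own private copy of S, and the copies are summed at the end, at the cost of extra memory per thread. Because the summation order changes, results may still differ from the serial version in the last few bits for general floating-point input.

```cpp
// Accumulates the pairwise contributions into S[0..N-1].
// reduction(+:S[:N]) gives every thread a private, zero-initialized copy
// of the N elements of S and adds the copies together after the loop.
void accumulate(const double *Coordinate, const double *Weight, double *S, int N)
{
    // The inner loop length grows with i, so a dynamic schedule
    // may balance the work better than the default one.
    #pragma omp parallel for reduction(+:S[:N]) schedule(dynamic)
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < i; ++j)
        {
            double K = Coordinate[i] - Coordinate[j];
            S[i] += K * Weight[j];
            S[j] -= K * Weight[i];
        }
    }
}
```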

openMP: call parallel function from parallel region

I'm trying to parallelize my serial program with OpenMP. Here is the code, where I have a big parallel region with a number of internal "#pragma omp for" sections. In the serial version I have a function fftw_shift() which also has "for" loops inside it.
The question is how to rewrite fftw_shift() properly so that the threads that already exist in the enclosing parallel region split the "for" loops inside it, without creating nested threads.
I'm not sure that my implementation works correctly. I could inline the whole function into the parallel region, but I would like to understand how to handle the situation as described.
int fftw_shift(fftw_complex *pulse, fftw_complex *shift_buf, int array_size)
{
int j = 0; //counter
if ((pulse != nullptr) && (shift_buf != nullptr)){ //both buffers are dereferenced below
if (omp_in_parallel()) {
//shift the array
#pragma omp for private(j) //schedule(dynamic)
for (j = 0; j < array_size / 2; j++) {
//left to right
shift_buf[(array_size / 2) + j][REAL] = pulse[j][REAL]; //real
shift_buf[(array_size / 2) + j][IMAG] = pulse[j][IMAG]; //imaginary
//right to left
shift_buf[j][REAL] = pulse[(array_size / 2) + j][REAL]; //real
shift_buf[j][IMAG] = pulse[(array_size / 2) + j][IMAG]; //imaginary
}
//rewrite the array
#pragma omp for private(j) //schedule(dynamic)
for (j = 0; j < array_size; j++) {
pulse[j][REAL] = shift_buf[j][REAL]; //real
pulse[j][IMAG] = shift_buf[j][IMAG]; //imaginary
}
return 0;
}
}
....
#pragma omp parallel firstprivate(x, phase) if(array_size >= OMP_THREASHOLD)
{
// First half-step
#pragma omp for schedule(dynamic)
for (x = 0; x < array_size; x++) {
..
}
// Forward FTW
fftw_shift(pulse_x, shift_buf, array_size);
#pragma omp master
{
fftw_execute(dft);
}
#pragma omp barrier
fftw_shift(pulse_kx, shift_buf, array_size);
...
}
If you call fftw_shift from inside a parallel region, but not from inside a work-sharing construct (i.e. not from within a parallel for), you can use omp for in the function just as if it were written lexically inside the parallel region. This is called an orphaned directive.
However, your loops only copy data, so they are likely limited by memory bandwidth; depending on your system, don't expect a perfect speedup.
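A minimal sketch of an orphaned directive, with hypothetical function names scale and process that are not taken from the question: the omp for inside scale binds to whatever parallel region is active when the function is called, so the existing threads split the loop and no nested team is created. Every thread of the team has to call the function for this to work.

```cpp
#include <vector>

// Contains an orphaned worksharing loop: there is no parallel construct here.
// When called by all threads of an enclosing parallel region, the iterations
// are divided among those threads. When called serially, it runs as a normal loop.
void scale(std::vector<double> &v, double factor)
{
    const int n = static_cast<int>(v.size());
    #pragma omp for
    for (int i = 0; i < n; i++)
        v[i] *= factor;
    // Implicit barrier here: all threads finish before anyone returns.
}

void process(std::vector<double> &v)
{
    #pragma omp parallel
    {
        scale(v, 2.0); // every thread calls the function
        scale(v, 3.0); // starts only after the first loop is complete
    }
}
```

The omp_in_parallel() check is not required for correctness in this pattern, since an orphaned omp for outside any parallel region is simply executed by the single calling thread.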

Usage of OpenMP reduction clause with nested loops

I have the current version of a function:
void*
function(const Input_st *Data, Output_st *Image)
{
int i,j,r,Offset;
omp_set_num_threads(24);
#pragma omp parallel for schedule(static) shared(Data,Image),\
private(i,j,r,Offset)
for (i = 0; i < Data->NX; i++)
{
for (j = 0; j < (Data->NZ); j++)
{
for (r = 0; r < Data->NR; r++)
{
Offset = i*Data->NR*Data->NZ + j*Data->NR + r;
Image->pTime[Offset] = function2();
}
}
}
return NULL;
}
It works very well. However, I wanted to remove the calculation of the variable Offset and instead use a pointer to the member Image->pTime that I then increment, which could look like the following:
void*
function(const Input_st *Data, Output_st *Image)
{
int i, j, r;
double *pTime = Image->pTime;
omp_set_num_threads(24);
#pragma omp parallel for schedule(static) shared(Data,Image),\
private(i,j,r)
for (i = 0; i < Data->NX; i++)
{
for (j = 0; j < (Data->NZ); j++)
{
for (r = 0; r < Data->NR; r++)
{
*pTime = function2();
pTime++;
}
}
}
return NULL;
}
I get a segmentation fault. I assume I need to use a reduction clause like reduction(+:pTime).
First, the purpose here is to speed up the function, and I am wondering whether such a change would speed it up significantly (for example through less cache usage).
Second, I tried to benchmark it and failed to do so. I think the problem can be solved with a reduction clause, but since the loops are nested it is not that straightforward to me.
There is no need for any sort of reduction clause here. However, at the moment all threads use the same pointer and update the same memory location (with race conditions on the value of pTime itself, hence, I suspect, the crashes).
So you need to make the pointer private (typically by declaring it within the parallel region) and set it per thread to a meaningful value. Then it can be incremented the way you want.
Here is what the code could look like once fixed (obviously not tested):
void* function( const Input_st *Data, Output_st *Image ) {
#pragma omp parallel for schedule( static ) num_threads( 24 )
for ( int i = 0; i < Data->NX; i++ ) {
double *pTime = Image->pTime + i * Data->NR * Data->NZ;
for ( int j = 0; j < Data->NZ; j++ ) {
for ( int r = 0; r < Data->NR; r++ ) {
*pTime = function2();
pTime++;
}
}
}
return NULL;
}

OpenMP for loop increment statement handling

for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
XY[2*nind] = i;
XY[2*nind + 1] = j;
nind++;
}
}
}
Here x = 512 and z = 512, nind = 0 initially, and XY has 2*x*z elements.
I want to parallelize these for loops with OpenMP, but the nind variable is tightly bound to the serial order of the loop. Because of the condition, some iterations skip the if and do not increment nind, while others enter it and do. With OpenMP threads, whichever thread arrives first would increment nind first. Is there any way to unbind it? (By 'bound' I mean that it can only be implemented serially.)
A typical cache-friendly solution in that case is to collect the (i,j) pairs in private arrays, then concatenate those private arrays at the end, and finally sort the result if needed:
#pragma omp parallel
{
uint myXY[2*z*x];
uint mynind = 0;
#pragma omp for collapse(2) schedule(dynamic,N)
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
myXY[2*mynind] = i;
myXY[2*mynind + 1] = j;
mynind++;
}
}
}
#pragma omp critical(concat_arrays)
{
memcpy(&XY[2*nind], myXY, 2*mynind*sizeof(uint));
nind += mynind;
}
}
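// Sort the pairs if needed
qsort(XY, nind, 2*sizeof(uint), compar);
// Comparison function for qsort: orders pairs by i first, then by j.
// qsort requires the (const void *, const void *) signature.
int compar(const void *v1, const void *v2)
{
const uint *p1 = (const uint *)v1;
const uint *p2 = (const uint *)v2;
if (p1[0] < p2[0])
return -1;
else if (p1[0] > p2[0])
return 1;
else
{
if (p1[1] < p2[1])
return -1;
else if (p1[1] > p2[1])
return 1;
}
return 0;
}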
You should experiment with different values of N in the schedule(dynamic,N) clause in order to achieve the best trade-off between overhead (for small values of N) and load imbalance (for large values of N). The comparison function compar could probably be written in a more optimal way.
The assumption here is that the overhead from merging and sorting the array is small. Whether that will be the case depends on many factors.
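The same idea can be sketched in self-contained C++ with std::vector instead of fixed-size stack arrays, which avoids placing a large buffer on each thread's stack. This is only an illustration under assumptions: selected is a hypothetical predicate standing in for inFunc(p, index), collect_pairs is a made-up name, and the chunk size 64 is an arbitrary starting point to tune.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical predicate standing in for inFunc(p, index) from the question.
static bool selected(unsigned i, unsigned j) { return (i + j) % 2 == 0; }

std::vector<std::pair<unsigned, unsigned>> collect_pairs(unsigned x, unsigned z)
{
    std::vector<std::pair<unsigned, unsigned>> all; // shared result

    #pragma omp parallel
    {
        // Declared inside the parallel region, so each thread has its own buffer.
        std::vector<std::pair<unsigned, unsigned>> mine;

        #pragma omp for collapse(2) schedule(dynamic, 64) nowait
        for (unsigned i = 0; i < x; i++)
            for (unsigned j = 0; j < z; j++)
                if (selected(i, j))
                    mine.emplace_back(i, j);

        // Only one thread at a time appends its buffer to the shared vector.
        #pragma omp critical(concat_pairs)
        all.insert(all.end(), mine.begin(), mine.end());
    }

    // Buffers arrive in arbitrary order; sort by (i, j) to restore serial order.
    std::sort(all.begin(), all.end());
    return all;
}
```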
Here is a variation on Hristo Iliev's good answer.
The important parameter to act on here is the index of the pairs rather than the pairs themselves.
We can fill private arrays of the pair indices in parallel for each thread. The arrays for each thread will be sorted (irrespective of the scheduling).
The following function merges two sorted arrays
void merge(uint *a, uint *b, uint *c, uint na, uint nb) {
uint i=0, j=0, k=0;
while(i<na && j<nb) c[k++] = a[i] < b[j] ? a[i++] : b[j++];
while(i<na) c[k++] = a[i++];
while(j<nb) c[k++] = b[j++];
}
Here is the remaining code
uint nind = 0;
uint *P = NULL; // must start as NULL: it is passed to free() on the first merge
#pragma omp parallel
{
uint myP[x*z];
uint mynind = 0;
#pragma omp for schedule(dynamic) nowait
for(uint k = 0 ; k < x*z; k++) {
if (inFunc(p, index)) myP[mynind++] = k;
}
#pragma omp critical
{
uint *t = (uint*)malloc(sizeof *P * (nind+mynind));
merge(P, myP, t, nind, mynind);
free(P);
P = t;
nind += mynind;
}
}
Then given an index k in P the pair is (k/z, k%z).
The merging can be improved. Right now it takes O(omp_get_num_threads()) sequential merge steps, but it could be done in O(log2(omp_get_num_threads())). I did not bother with this.
Hristo Iliev pointed out that dynamic scheduling does not guarantee that the iterations assigned to each thread increase monotonically. I think in practice they do, but it's not guaranteed in principle.
If you want to be 100% sure that the iterations increase monotonically you can implement dynamic scheduling by hand.
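One way to implement dynamic scheduling by hand is a shared counter from which each thread atomically claims the next chunk of iterations. Since the counter only grows, the chunks claimed by any one thread start at increasing indices, so each thread's private list stays sorted. The sketch below is an assumption-laden illustration, not the answer's code: keep is a hypothetical predicate standing in for inFunc, and collect_sorted and chunk are made-up names.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical predicate standing in for inFunc(p, index) from the question.
static bool keep(unsigned k) { return k % 3 == 0; }

std::vector<unsigned> collect_sorted(unsigned total, unsigned chunk)
{
    std::vector<unsigned> result; // shared, kept sorted
    unsigned next = 0;            // shared: start of the next unclaimed chunk

    #pragma omp parallel
    {
        std::vector<unsigned> mine; // private, filled in increasing order
        for (;;)
        {
            unsigned start;
            // Atomically read the counter and advance it by one chunk.
            #pragma omp atomic capture
            { start = next; next += chunk; }

            if (start >= total) break;
            unsigned end = std::min(start + chunk, total);
            for (unsigned k = start; k < end; k++)
                if (keep(k)) mine.push_back(k);
        }

        // Merge this thread's sorted list into the sorted shared result.
        #pragma omp critical
        {
            std::vector<unsigned> merged(result.size() + mine.size());
            std::merge(result.begin(), result.end(),
                       mine.begin(), mine.end(), merged.begin());
            result.swap(merged);
        }
    }
    return result;
}
```

As with schedule(dynamic,N), the chunk size trades scheduling overhead against load balance.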
The code you provide looks like you are trying to fill the XY data in sequential order. In this case OMP multithreading is probably not the tool for the job as threads (in a best case) should avoid communication as much as possible. You could introduce an atomic counter, but then again, it is probably going to be faster just doing it sequentially.
Also what do you want to achieve by optimizing it? The x and z are not too big, so I doubt that you will get a substantial speed increase even if you reformulate your problem in a parallel fashion.
If you do want parallel execution, map your indices directly to positions in the array, e.g. (not tested, but should do):
#pragma omp parallel for shared(XY)
for (uint i = 0; i < x; i++) {
for (uint j = 0; j < z; j++) {
if (inFunc(p, index)) {
uint idx = 2 * (i * z + j); // each row i holds z pairs, so the row stride is z
XY[idx] = i;
XY[idx + 1] = j;
}
}
}
However, you will then have gaps in your array XY, which may or may not be a problem for you.

#pragma omp parallel for schedule crashes my program

I am building a plugin for Autodesk Maya 2013 in C++. I have to solve a set of optimization problems as fast as I can, and I am using OpenMP for this task. The problem is that I don't have much experience with parallel computing. I tried to use:
#pragma omp parallel for schedule (static)
on my for loops (without enough understanding of how it's supposed to work) and it worked very well for some of my code, but crashed another portion of my code.
Here is an example of a function that crashes because of the omp directive:
void PlanarizationConstraint::fillSparseMatrix(const Optimizer& opt, vector<T>& elements, double mu)
{
int size = 3;
#pragma omp parallel for schedule (static)
for(int i = 0; i < opt.FVIc.outerSize(); i++)
{
int index = 3*i;
Eigen::Matrix<double,3,3> Qxyz = Eigen::Matrix<double,3,3>::Zero();
for(SpMat::InnerIterator it(opt.FVIc,i); it; ++it)
{
int face = it.row();
for(int n = 0; n < size; n++)
{
Qxyz.row(n) += N(face,n)*N.row(face);
elements.push_back(T(index+n,offset+face,(1 - mu)*N(face,n)));
}
}
for(int n = 0; n < size; n++)
{
for(int k = 0; k < size; k++)
{
elements.push_back(T(index+n,index+k,(1-mu)*Qxyz(n,k)));
}
}
}
#pragma omp parallel for schedule (static)
for(int j = 0; j < opt.VFIc.outerSize(); j++)
{
elements.push_back(T(offset+j,offset+j,opt.fvi[j]));
for(SpMat::InnerIterator it(opt.VFIc,j); it; ++it)
{
int index = 3*it.row();
for(int n = 0; n < size; n++)
{
elements.push_back(T(offset+j,index+n,N(j,n)));
}
}
}
}
And here is an example of code that works very well with those directives (and is faster because of it)
Eigen::MatrixXd Optimizer::OptimizeLLGeneral()
{
ConstraintsManager manager;
SurfaceConstraint surface(1,true);
PlanarizationConstraint planarization(1,true,3^Nv,Nf);
manager.addConstraint(&surface);
manager.addConstraint(&planarization);
double mu = mu0;
for(int k = 0; k < iterations; k++)
{
#pragma omp parallel for schedule (static)
for(int j = 0; j < VFIc.outerSize(); j++)
{
manager.calcVariableMatrix(*this,j);
}
#pragma omp parallel for schedule (static)
for(int i = 0; i < FVIc.outerSize(); i++)
{
Eigen::MatrixXd A = Eigen::Matrix<double, 3, 3>::Zero();
Eigen::MatrixXd b = Eigen::Matrix<double, 1, 3>::Zero();
manager.addLocalMatrixComponent(*this,i,A,b,mu);
Eigen::VectorXd temp = b.transpose();
Q.row(i) = A.colPivHouseholderQr().solve(temp);
}
mu = r*mu;
}
return Q;
}
My question is: what makes one function work so well with the omp directive and the other crash? What is the difference that makes the omp directive act differently?
Before using OpenMP, you pushed data onto the vector elements one item at a time. With OpenMP, several threads run the loop body in parallel. push_back on a shared vector is not thread-safe, so when more than one thread pushes onto elements at the same time, with nothing ensuring that one push finishes before another starts, the vector's internal state can be corrupted. That is why your code crashes.
To solve this problem, you could use local buffer vectors: each thread first pushes its data to a private buffer vector, and then you concatenate these buffers into a single vector.
Note that this method does not maintain the original order of the elements in the vector. If you need that order, you could compute the expected index of each element and assign the data to the right position directly.
Update: OpenMP provides API functions that tell you how many threads you use and which thread you are in. See omp_get_max_threads() and omp_get_thread_num() for more info.
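A minimal sketch of the per-thread buffer approach, using omp_get_max_threads() and omp_get_thread_num(). Entry is a hypothetical stand-in for the triplet type T in the question, and build_entries is a made-up function. In this variant each thread processes one contiguous block of iterations computed by hand, so concatenating the buffers in thread order also reproduces the serial order; with an ordinary omp for you would have to reason about the schedule instead.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical element type standing in for the triplet type T.
struct Entry { int row; int col; double value; };

std::vector<Entry> build_entries(int n)
{
    int max_threads = 1;
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    // One buffer per possible thread; no two threads touch the same buffer.
    std::vector<std::vector<Entry>> buffers(max_threads);

    #pragma omp parallel
    {
        int tid = 0, nthreads = 1;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        // Contiguous block [begin, end) of iterations for this thread.
        int begin = static_cast<int>(static_cast<long long>(n) * tid / nthreads);
        int end = static_cast<int>(static_cast<long long>(n) * (tid + 1) / nthreads);

        std::vector<Entry> &mine = buffers[tid];
        for (int i = begin; i < end; i++)
        {
            mine.push_back({i, i, 1.0});      // thread-private push_back is safe
            mine.push_back({i, i + 1, -1.0});
        }
    }

    // Serial concatenation in thread order.
    std::vector<Entry> elements;
    for (const auto &b : buffers)
        elements.insert(elements.end(), b.begin(), b.end());
    return elements;
}
```

If the order of the entries does not matter for the consumer, a simpler variant is to declare the buffer inside the parallel region and append it to the shared vector in a critical section.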