Parallelizing a loop in C++ with OpenMP

The result that this parallelized version returns doesn't match the result it should return, i.e. the one the non-parallelized version produces.
double angle = (PI/180) * atoi(extra);
unsigned int xf;
unsigned int yf;
int j;
#pragma omp parallel for private(j, xf, yf)
for (int i = 0; i < width; i++) {
    for (j = 0; j < height; j++) {
        xf = (unsigned int)ceil(((cos(angle)*(j-(((double)height)/2.0)))-(sin(angle)*(i-(((double)width)/2.0))))+(((double)height)/2.0));
        yf = (unsigned int)ceil(((sin(angle)*(j-(((double)height)/2.0)))+(cos(angle)*(i-(((double)width)/2.0))))+(((double)width)/2.0));
        if (xf < (unsigned int)height && xf >= 0 && yf >= 0 && yf < (unsigned int)width) {
            matrixRed2[yf][xf]   = matrixRed[i][j];
            matrixGreen2[yf][xf] = matrixGreen[i][j];
            matrixBlue2[yf][xf]  = matrixBlue[i][j];
        }
    }
}

I don't see why j needs to be declared outside the loop, since it only holds a temporary value before the assignment to the arrays (whose elements are independent). Moreover, the two for loops can be collapsed; a sketch is shown below.
Edit: each (xf, yf) is not guaranteed to be unique for different (i, j) with this formula. So either keep the sequential version, or add a check (e.g. keep the largest (i, j)) if there is a race condition on (xf, yf).
I'm also not sure whether OpenMP works properly in Visual Studio, since the version it supports is very old (2.0, while the latest is 4.5).
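A minimal sketch of the collapsed form (my own, not the asker's code), reusing the question's width, height, angle and matrices; the caveat from the edit still applies, since two different (i, j) pairs can land on the same (xf, yf):
// collapse(2) turns the two loops into one parallel iteration space;
// xf/yf declared inside the loop are private automatically.
#pragma omp parallel for collapse(2)
for (int i = 0; i < width; i++) {
    for (int j = 0; j < height; j++) {
        unsigned int xf = (unsigned int)ceil(cos(angle)*(j - height/2.0) - sin(angle)*(i - width/2.0) + height/2.0);
        unsigned int yf = (unsigned int)ceil(sin(angle)*(j - height/2.0) + cos(angle)*(i - width/2.0) + width/2.0);
        // The xf >= 0 / yf >= 0 tests are dropped: they are always true for unsigned values.
        if (xf < (unsigned int)height && yf < (unsigned int)width) {
            // Caveat from the edit: if two (i, j) map to the same (xf, yf),
            // these writes race and the result is non-deterministic.
            matrixRed2[yf][xf]   = matrixRed[i][j];
            matrixGreen2[yf][xf] = matrixGreen[i][j];
            matrixBlue2[yf][xf]  = matrixBlue[i][j];
        }
    }
}
Note that collapse requires OpenMP 3.0 or newer, so it will not work with MSVC's OpenMP 2.0 support mentioned above.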

Related

OpenMP: Is array reduction always needed for updating an array in parallel?

I am quite new to OpenMP. I have the following simple loop that I want to run in parallel with OpenMP:
double rij[3];
double r;
#ifdef _OPENMP
#pragma omp parallel for private(rij, r)
#endif
for (int i = 0; i < n; ++i)
{
    for (int j = 0; j < n; ++j)
    {
        if (i != j)
        {
            distance(X, rij, r, i, j);
            V[i] += ke * Q[j] / r;
            for (int k = 0; k < 3; ++k)
            {
                F[3*i+k] += ke * Q[j] * rij[k] / pow(r,3);
            }
        }
    }
}
From what I understand, variables are shared by default, which is why I only declared private(rij,r). But according to these questions (first, second, third), I should do an array reduction in this case.
It's clear to me that if many threads need to sum to the same variable, this has to be done with #pragma omp parallel for reduction(+:A[:n]) for summing to array A of size n. This is what I do in another part of my code, and it works as expected.
However, in this case the workers never have to sum into the same variable: every worker performs the sum for its own index i. Is it correct to do what I do here, i.e. not use any array reduction and not use any critical section?
If my implementation is correct, I believe it would avoid the overhead of the critical section while being simpler code. Feel free to give your advice on how this could be better optimized.
Thank you
You don't need a reduction. Reduction is a feature that saves you from writing the same code over and over for a recurring problem (try to think of how you would implement a sum reduction without OpenMP).
What you do right now works on parallel data (V[i]) whose accesses do not overlap between iterations (as you state in the question), because the work is partitioned over i itself. Furthermore, the writes to F[...] don't overlap either, because the index only depends on i and k. A sketch of that layout follows.
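For illustration, a minimal sketch of that layout (my own, reusing the question's n, X, Q, V, F, ke and distance()); declaring rij and r inside the loop makes them private without any clause, and no reduction or critical section is needed:
#pragma omp parallel for
for (int i = 0; i < n; ++i)
{
    double rij[3];                         // declared inside the loop -> private per thread
    double r;
    for (int j = 0; j < n; ++j)
    {
        if (i != j)
        {
            distance(X, rij, r, i, j);     // fills rij and r for the pair (i, j), as in the question
            V[i] += ke * Q[j] / r;         // only the thread owning i ever writes V[i]
            for (int k = 0; k < 3; ++k)
                F[3*i+k] += ke * Q[j] * rij[k] / (r*r*r);   // index depends only on i and k
        }
    }
}
(Writing r*r*r instead of pow(r,3) is just a minor optimization, not required for correctness.)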

Parallelizing nested loops with OpenMP

Perhaps the solution to my problem is obvious to someone with experience with OpenMP, but I don't have any. I want to accelerate the following subroutine using OpenMP:
void Build_ERIS(vector<double> &eris, vector<Atomic_Orbital> &Basis)
{
    int basis_size = Basis.size();
    int m = basis_size*(basis_size+1)/2;
    eris.resize(m*(m+1)/2);
    bool compute;
    std::fill(eris.begin(), eris.end(), 0);
    int i_orbital, j_orbital, k_orbital, l_orbital, i_primitive, j_primitive, k_primitive, l_primitive, ij, kl, ijkl, ijij, klkl;
    #pragma omp parallel
    {
        #pragma omp for ordered
        for(i_orbital=0; i_orbital<basis_size; i_orbital++){
            for(j_orbital=0; j_orbital<i_orbital+1; j_orbital++){
                ij = i_orbital*(i_orbital+1)/2 + j_orbital;
                for(k_orbital=0; k_orbital<basis_size; k_orbital++){
                    for(l_orbital=0; l_orbital<k_orbital+1; l_orbital++){
                        kl = k_orbital*(k_orbital+1)/2 + l_orbital;
                        if (ij >= kl) {
                            ijkl = composite_index(i_orbital,j_orbital,k_orbital,l_orbital);
                            ijij = composite_index(i_orbital,j_orbital,i_orbital,j_orbital);
                            klkl = composite_index(k_orbital,l_orbital,k_orbital,l_orbital);
                            for(i_primitive=0; i_primitive<Basis[i_orbital].contraction.size; i_primitive++)
                                for(j_primitive=0; j_primitive<Basis[j_orbital].contraction.size; j_primitive++)
                                    for(k_primitive=0; k_primitive<Basis[k_orbital].contraction.size; k_primitive++)
                                        for(l_primitive=0; l_primitive<Basis[l_orbital].contraction.size; l_primitive++)
                                            eris[ijkl] +=
                                                normconst(Basis[i_orbital].contraction.exponent[i_primitive], Basis[i_orbital].angular.l, Basis[i_orbital].angular.m, Basis[i_orbital].angular.n)*
                                                normconst(Basis[j_orbital].contraction.exponent[j_primitive], Basis[j_orbital].angular.l, Basis[j_orbital].angular.m, Basis[j_orbital].angular.n)*
                                                normconst(Basis[k_orbital].contraction.exponent[k_primitive], Basis[k_orbital].angular.l, Basis[k_orbital].angular.m, Basis[k_orbital].angular.n)*
                                                normconst(Basis[l_orbital].contraction.exponent[l_primitive], Basis[l_orbital].angular.l, Basis[l_orbital].angular.m, Basis[l_orbital].angular.n)*
                                                Basis[i_orbital].contraction.coef[i_primitive]*
                                                Basis[j_orbital].contraction.coef[j_primitive]*
                                                Basis[k_orbital].contraction.coef[k_primitive]*
                                                Basis[l_orbital].contraction.coef[l_primitive]*
                                                ERI_int(Basis[i_orbital].contraction.center.x, Basis[i_orbital].contraction.center.y, Basis[i_orbital].contraction.center.z, Basis[i_orbital].contraction.exponent[i_primitive], Basis[i_orbital].angular.l, Basis[i_orbital].angular.m, Basis[i_orbital].angular.n,
                                                        Basis[j_orbital].contraction.center.x, Basis[j_orbital].contraction.center.y, Basis[j_orbital].contraction.center.z, Basis[j_orbital].contraction.exponent[j_primitive], Basis[j_orbital].angular.l, Basis[j_orbital].angular.m, Basis[j_orbital].angular.n,
                                                        Basis[k_orbital].contraction.center.x, Basis[k_orbital].contraction.center.y, Basis[k_orbital].contraction.center.z, Basis[k_orbital].contraction.exponent[k_primitive], Basis[k_orbital].angular.l, Basis[k_orbital].angular.m, Basis[k_orbital].angular.n,
                                                        Basis[l_orbital].contraction.center.x, Basis[l_orbital].contraction.center.y, Basis[l_orbital].contraction.center.z, Basis[l_orbital].contraction.exponent[l_primitive], Basis[l_orbital].angular.l, Basis[l_orbital].angular.m, Basis[l_orbital].angular.n);
                        }
                    }
                }
            }
        }
    }
}
My concern is how best to be sure that, after the OpenMP parallelization, the reduction into eris[ijkl] still gives the same values as the serial version of the routine. How can I fuse the loops in a way that is numerically safe?
Several things I see:
1) #pragma omp for ordered means: execute every single one of the iterations of this loop in order. This essentially means that while you're executing "in parallel", all of your work will be done in serial. Remove it.
2) You have not declared any of your variables shared or private. Note that all variables are shared by default, so in your case ij and kl, for instance, will be accessible by any thread working on any iteration. You can no doubt see how this would cause a race condition if, say, iteration 100 changed the variable ij while iteration 1 thought it was still using it.
3) Your variable eris[ijkl], as you rightly noted, must be reduced properly. If ijkl can never take the same value for two different iterations of your i_orbital loop, then you're fine as-is: no two threads will ever change the same element eris[ijkl] at the same time. If it can take the same value, then you have to handle the reduction on the array carefully.
4) Here's what you should work with for starters (a fuller sketch follows the directive below). This assumes that ijkl will never take the same value for two different iterations, and that your functions do not take any non-constant references (which could turn what I assume are input variables into output variables).
#pragma omp parallel for private(i_orbital, j_orbital, ij, k_orbital, l_orbital, kl, ijkl, ijij, klkl, i_primitive, j_primitive, k_primitive, l_primitive)
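For comparison, a minimal sketch (my own, not the answerer's full solution) that applies the same idea but avoids the long private list by declaring the loop variables where they are used; contracted_integral() is a hypothetical helper standing in for the four primitive loops in the question, and the sketch assumes each ijkl is produced by exactly one iteration:
#pragma omp parallel for schedule(dynamic)
for (int i_orbital = 0; i_orbital < basis_size; i_orbital++) {
    for (int j_orbital = 0; j_orbital < i_orbital + 1; j_orbital++) {
        int ij = i_orbital*(i_orbital+1)/2 + j_orbital;
        for (int k_orbital = 0; k_orbital < basis_size; k_orbital++) {
            for (int l_orbital = 0; l_orbital < k_orbital + 1; l_orbital++) {
                int kl = k_orbital*(k_orbital+1)/2 + l_orbital;
                if (ij < kl) continue;
                // Variables declared inside the loop nest are private automatically.
                int ijkl = composite_index(i_orbital, j_orbital, k_orbital, l_orbital);
                // contracted_integral() is a hypothetical helper wrapping the four primitive
                // loops from the question; each distinct ijkl is written by exactly one
                // iteration (the assumption stated above), so no reduction is needed.
                eris[ijkl] += contracted_integral(Basis, i_orbital, j_orbital, k_orbital, l_orbital);
            }
        }
    }
}
The schedule(dynamic) clause is only a suggestion for balancing the triangular loop bounds; drop it if the default schedule works well enough.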

OpenMP double for loop array with stored results

I've spent time going over other posts but I still can't get this simple program to go.
#include <iostream>
#include <cmath>
#include <omp.h>
using namespace std;

int main()
{
    int threadnum = 4;                 // want manual control
    int steps = 100000, cumulative = 0, counter;
    int a, b, c;
    float dum1, dum2, dum3;
    float pos[10000][3] = {0};
    float non = 0;
    //RNG declared
    #pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction(+: non, cumulative) num_threads(threadnum)
    {
        for (int dummy = 0; dummy < (10000/threadnum); dummy++)
        {
            dum1 = 0, dum2 = 0, dum3 = 0;
            a = 0, b = 0, c = 0;
            for (counter = 0; counter < steps; counter++)
            {
                dum1 = somefunct1() + rand();
                dum2 = somefunct2() + rand();
                dum3 = somefunct3(dum1, dum2, ...);
                a += somefunct4(dum1, dum2, dum3, ...);
                b += somefunct5(dum1, dum2, dum3, ...);
                c += somefunct6(dum1, dum2, dum3, ...);
                cumulative++;          // count number of loops executed
            }
            pos[dummy][0] = a;         // saves results of second loop to array
            pos[dummy][1] = b;
            pos[dummy][2] = c;
            non += pos[dummy][0];      // holds the summed a values
        }
    }
}
I've cut the program down to get it to fit here. Many of the changes I've tried make the inner loop simply not execute the correct number of times, and I get cumulative equal to something like 32,532,849 instead of 1 billion. Scaling is about 2x for the code above but should be much higher.
I want the code to simply split the first 10000-iteration for loop so that each thread runs a certain number of iterations in parallel (if this could be dynamic that would be nice) and saves the results of each iteration of the second for loop to the results array. The second for loop has dependencies and cannot be split. Currently the order of the 'dummy' iterations does not matter (pos[345] can be switched with pos[3456] as long as all three indices are switched), but I will have to modify it later so that it does matter.
The numerous variables and initializations in the inner loop are confusing me terribly. There are a lot of random calls and functions/math functions in the inner loop; is there overhead here that is causing a problem? I'm using GCC 4.9.2 on Windows.
Any help would be greatly appreciated.
Edit: finally fixed. Moving the RNG declaration inside the first for loop did the trick. Now I get 3.75x scaling with 4 threads and 5.72x scaling with 8 threads (hyperthreads). Not perfect, but I will take it. I still think there is an issue with thread locking and syncing.
......
float non = 0;
#pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction(+: non, cumulative) num_threads(threadnum)
{
    //RNG declared
    #pragma omp for
    for (int dummy = 0; dummy < (10000/threadnum); dummy++)
    {
        ....
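For completeness, a minimal sketch of the structure the edit arrives at, reusing the question's steps, cumulative, threadnum, pos, non and the placeholder somefunctN() calls; the use of std::mt19937 (from <random>) instead of rand() and the per-thread seeding are assumptions on my part, not part of the original edit:
float non = 0;
#pragma omp parallel reduction(+: non, cumulative) num_threads(threadnum)
{
    // Per-thread generator so the threads don't fight over rand()'s hidden state.
    std::mt19937 rng(1234u + omp_get_thread_num());
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);

    #pragma omp for
    for (int dummy = 0; dummy < 10000; dummy++)   // let the work-sharing loop split all 10000 iterations
    {
        float dum1 = 0, dum2 = 0, dum3 = 0;       // declared inside the region -> private
        int a = 0, b = 0, c = 0;
        for (int counter = 0; counter < steps; counter++)
        {
            dum1 = somefunct1() + uni(rng);
            dum2 = somefunct2() + uni(rng);
            dum3 = somefunct3(dum1, dum2 /* ... */);
            a += somefunct4(dum1, dum2, dum3 /* ... */);
            b += somefunct5(dum1, dum2, dum3 /* ... */);
            c += somefunct6(dum1, dum2, dum3 /* ... */);
            cumulative++;
        }
        pos[dummy][0] = a;    // each dummy index is written by exactly one thread
        pos[dummy][1] = b;
        pos[dummy][2] = c;
        non += pos[dummy][0];
    }
}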

OpenMP parallel code does not give the same output as the serial code

I had to change and extend my algorithm for some signal analysis (using the polyphase filterbank technique) and couldn't reuse my old OpenMP code, but in the new code the results are not as expected (the results at the beginning positions of the array are incorrect compared with a serial run [the serial code shows the expected result]).
So in the first loop tFFTin holds some FFT data, which I'm multiplying with a window function.
The goal is that a thread runs the inner loops for each polyphase factor. To avoid locks I use the reduction pragma (no complex reduction is defined by the standard, so I use my own, where each thread's omp_priv variable gets initialized with omp_orig [so with tFFTin]). The reason I'm using the ordered pragma is that the results should be added to the output vector in an ordered way.
typedef std::complex<float> TComplexType;
typedef std::vector<TComplexType> TFFTContainer;

#pragma omp declare reduction(complexMul : TFFTContainer : \
    transform(omp_in.begin(), omp_in.end(), \
              omp_out.begin(), omp_out.begin(), \
              std::multiplies<TComplexType>())) \
    initializer(omp_priv(omp_orig))

void ConcreteResynthesis::ApplyPolyphase(TFFTContainer& tFFTin, TFFTContainer& tFFTout, TWindowContainer& tWindow, *someparams*) {
    #pragma omp parallel for shared(tWindow) firstprivate(sFFTParams) reduction(complexMul: tFFTin) ordered if(iFFTRawDataLen > cMinParallelSize)
    for (int p = 0; p < uPolyphase; ++p) {
        int iPolyphaseOffset = p * uFFTLength;
        for (int i = 0; i < uFFTLength; ++i) {
            tFFTin[i] *= tWindow[iPolyphaseOffset + i]; ///< get FFT input data from raw data
        }
        #pragma omp ordered
        {
            // using the overlap-and-add method
            for (int i = 0; i < sFFTParams.uFFTLength; ++i) {
                pDataPool->GetFullSignalData(workSignal)[mSignalPos + iPolyphaseOffset + i] += tFFTin[i];
            }
        }
    }
    mSignalPos = mSignalPos + mStep;
}
Is there a race condition or something else that produces the wrong outputs at the beginning, or do I have a logic error?
Another issue is that I don't really like my solution of using the ordered pragma. Is there a better approach? (I also tried to use the reduction model for this, but the compiler doesn't allow me to use a pointer type for that.)
I think your problem is that you have implemented a very cool custom reduction for tFFTin, but this reduction is only applied at the end of the parallel region, which is after you use the data in tFFTin. Another thing, as H. Iliev mentions, is that each iteration of the outer loop relies on data computed in the previous iteration; a classic dependency.
I think you should try parallelizing the inner loops instead; a sketch follows.
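A minimal sketch of that suggestion (my own, reusing the question's tFFTin, tWindow, sFFTParams, pDataPool, workSignal and related members); the outer polyphase loop stays sequential, which preserves the serial update order of tFFTin and makes the ordered clause and the custom reduction unnecessary:
for (int p = 0; p < uPolyphase; ++p) {
    int iPolyphaseOffset = p * uFFTLength;

    // Each i touches a distinct element of tFFTin, so this loop is safe to split.
    #pragma omp parallel for
    for (int i = 0; i < uFFTLength; ++i) {
        tFFTin[i] *= tWindow[iPolyphaseOffset + i];
    }

    // Overlap-and-add: again, each i writes a distinct output sample for this p.
    #pragma omp parallel for
    for (int i = 0; i < sFFTParams.uFFTLength; ++i) {
        pDataPool->GetFullSignalData(workSignal)[mSignalPos + iPolyphaseOffset + i] += tFFTin[i];
    }
}
mSignalPos = mSignalPos + mStep;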

OpenMP parallel version running slower than serial [duplicate]

This question already has an answer here: OpenMP program is slower than sequential one (1 answer). Closed 8 years ago.
I have this code that I parallelized using OpenMP that seems to run slower than the serial version. Here's the relevant fragment of the code:
Out_props ion_out;
#pragma omp parallel for firstprivate(Egx,Egy,vi_inlet,dt,xmin,xmax,ymin,ymax,qmi,dy,Nx) private(ion_out)
for (int i = 0; i < Np; i++)
{
    ion_out = ApplyReflectionBC(dt,Nx,xmin,xmax,ymin,ymax,qmi,dy,vi_inlet,Egx,Egy,xi_i[2*i],xi_i[1+2*i],vi_i[2*i],vi_i[1+2*i]);
    xi_o[1-1+2*i] = ion_out.xout;
    xi_o[2-1+2*i] = ion_out.yout;
    vi_o[1-1+2*i] = ion_out.vxout;
    vi_o[2-1+2*i] = ion_out.vyout;
}
Here Out_props is just a structure with 4 members of double type. The ApplyReflectionBC function (given below) just applies some operations for each i. All these operations are completely independent of each other. Egx and Egy are 60x60 matrices defined prior to entering this loop, and vi_inlet is a 2x1 vector. I've tried making ion_out an array of size Np to further increase independence, but that seems to make no difference. Everything else inside firstprivate is of double type and defined prior to entering this loop.
I'd really appreciate any insight into why this might be running many times slower than the serial version. Thank you!
Out_props ApplyReflectionBC(double dt, int Nx, double xmin, double xmax, double ymin, double ymax,
                            double qmp, double dy, double *vp_inlet, double *Egx, double *Egy,
                            double xpx, double xpy, double vpx, double vpy)
{
    Out_props part_out;
    double Lgy = ymax - ymin;
    double xp_inp[2] = {xpx, xpy};
    double vp_inp[2] = {vpx, vpy};
    double xp_out[2];
    double vp_out[2];
    struct vector
    {
        double x;
        double y;
    } vnmf, Ep, xnmf;

    if ((xp_inp[1-1] > xmin) && (xp_inp[1-1] < xmax) && (xp_inp[2-1] < ymin)) //ONLY below lower wall
    {
        xp_out[1-1] = xp_inp[1-1];
        xp_out[2-1] = ymin;
        vp_out[1-1] = vp_inp[1-1];
        vp_out[2-1] = -vp_inp[2-1];
    }
    else if ((xp_inp[1-1] < xmin) || (xp_inp[1-1] > xmax) || (xp_inp[2-1] > ymax))
    {   //Simple Boris Push
        xnmf.x = xmin;
        xnmf.y = ymin + Lgy*rand()/RAND_MAX;
        vnmf.x = vp_inlet[0];
        vnmf.y = vp_inlet[1];
        //Find E field at x,y
        double yjp = ymin + dy*floor((xnmf.y - ymin)/(1.0*dy));
        double yjp1p = yjp + dy;
        int kp = (yjp - ymin)/dy;
        int kpp1 = kp + 1;
        double ylg = xnmf.y - yjp;
        double wjk = 1.0*(dy - ylg)/(1.0*dy);
        double wjkp1 = 1.0*ylg/(1.0*dy);
        Ep.x = wjk*Egx[Nx*kp] + wjkp1*Egx[Nx*kpp1];
        Ep.y = wjk*Egy[Nx*kp] + wjkp1*Egy[Nx*kpp1];
        do
        {
            double f = 1.0*rand()/RAND_MAX;
            xp_out[1-1] = xnmf.x + f*dt*(vnmf.x + qmp*Ep.x*f*dt/2.0);
            xp_out[2-1] = xnmf.y + f*dt*(vnmf.y + qmp*Ep.y*f*dt/2.0);
            vp_out[1-1] = vnmf.x + qmp*Ep.x*(f - 0.5)*dt;
            vp_out[2-1] = vnmf.y + qmp*Ep.y*(f - 0.5)*dt;
        } while ((xp_out[1-1] < xmin) || (xp_out[1-1] > xmax) || (xp_out[2-1] < ymin) || (xp_out[2-1] > ymax));
    }
    else
    {
        xp_out[1-1] = xp_inp[1-1];
        xp_out[2-1] = xp_inp[2-1];
        vp_out[1-1] = vp_inp[1-1];
        vp_out[2-1] = vp_inp[2-1];
    }
    part_out.xout = xp_out[0];
    part_out.yout = xp_out[1];
    part_out.vxout = vp_out[0];
    part_out.vyout = vp_out[1];
    return part_out;
}
Some points:
First, the firstprivate clause creates a copy of the listed variables on each thread's stack, and that takes some time. Since these variables won't be changed (i.e. they are read-only), you may declare them as shared instead.
Second, with less impact, the ApplyReflectionBC function takes everything by value, so it creates local copies of each argument. Use references (double &dt, for example).
Edit:
As Hristo pointed out, rand() is the source of your problems. You must replace it with some other random number generator function. For both better random numbers and thread-safety, you may use this Mersenne Twister class (if the LGPL 2.1 isn't a problem): http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/VERSIONS/C-LANG/MersenneTwister.h . Just declare it private to your threads, like:
MTRand rng;
#pragma omp parallel for private(rng, ...)
for (...)
{
    ApplyReflectionBC(..., rng);
}

Out_props ApplyReflectionBC(..., MTRand &rng)
{
    // .... Code ....
    xnmf.y = ymin + Lgy*rng.rand(); // MTRand::rand will return a number in the range [0, 1]
    // ........
}