I have this code that I parallelized using OpenMP, and it seems to run slower than the serial version. Here's the relevant fragment of the code:
Out_props ion_out;
#pragma omp parallel for firstprivate(Egx,Egy,vi_inlet,dt,xmin,xmax,ymin,ymax,qmi,dy,Nx) private(ion_out)
for (int i = 0; i < Np; i++)
{
    ion_out = ApplyReflectionBC(dt, Nx, xmin, xmax, ymin, ymax, qmi, dy, vi_inlet, Egx, Egy,
                                xi_i[2*i], xi_i[1+2*i], vi_i[2*i], vi_i[1+2*i]);
    xi_o[1-1+2*i] = ion_out.xout;
    xi_o[2-1+2*i] = ion_out.yout;
    vi_o[1-1+2*i] = ion_out.vxout;
    vi_o[2-1+2*i] = ion_out.vyout;
}
Here Out_props is just a struct with four members of type double. The ApplyReflectionBC function (given below) just applies some operations for each i, and all of these operations are completely independent of each other. Egx and Egy are 60x60 matrices defined prior to entering this loop, and vi_inlet is a 2x1 vector. I've tried making ion_out an array of size Np to further increase independence, but that seems to make no difference. Everything else inside firstprivate is a double defined prior to entering this loop.
I'd really appreciate any insights into why this might be running many times slower than the serial version. Thank you!
Out_props ApplyReflectionBC(double dt, int Nx, double xmin, double xmax, double ymin, double ymax,
                            double qmp, double dy, double *vp_inlet, double *Egx, double *Egy,
                            double xpx, double xpy, double vpx, double vpy)
{
    Out_props part_out;
    double Lgy = ymax - ymin;
    double xp_inp[2] = {xpx, xpy};
    double vp_inp[2] = {vpx, vpy};
    double xp_out[2];
    double vp_out[2];
    struct vector
    {
        double x;
        double y;
    } vnmf, Ep, xnmf;

    if ((xp_inp[1-1] > xmin) && (xp_inp[1-1] < xmax) && (xp_inp[2-1] < ymin)) //ONLY below lower wall
    {
        xp_out[1-1] = xp_inp[1-1];
        xp_out[2-1] = ymin;
        vp_out[1-1] = vp_inp[1-1];
        vp_out[2-1] = -vp_inp[2-1];
    }
    else if ((xp_inp[1-1] < xmin) || (xp_inp[1-1] > xmax) || (xp_inp[2-1] > ymax))
    {   //Simple Boris Push
        xnmf.x = xmin;
        xnmf.y = ymin + Lgy*rand()/RAND_MAX;
        vnmf.x = vp_inlet[0];
        vnmf.y = vp_inlet[1];
        //Find E field at x,y
        double yjp = ymin + dy*floor((xnmf.y - ymin)/(1.0*dy));
        double yjp1p = yjp + dy;
        int kp = (yjp - ymin)/dy;
        int kpp1 = kp + 1;
        double ylg = xnmf.y - yjp;
        double wjk = 1.0*(dy - ylg)/(1.0*dy);
        double wjkp1 = 1.0*ylg/(1.0*dy);
        Ep.x = wjk*Egx[Nx*kp] + wjkp1*Egx[Nx*kpp1];
        Ep.y = wjk*Egy[Nx*kp] + wjkp1*Egy[Nx*kpp1];
        do
        {
            double f = 1.0*rand()/RAND_MAX;
            xp_out[1-1] = xnmf.x + f*dt*(vnmf.x + qmp*Ep.x*f*dt/2.0);
            xp_out[2-1] = xnmf.y + f*dt*(vnmf.y + qmp*Ep.y*f*dt/2.0);
            vp_out[1-1] = vnmf.x + qmp*Ep.x*(f - 0.5)*dt;
            vp_out[2-1] = vnmf.y + qmp*Ep.y*(f - 0.5)*dt;
        } while ((xp_out[1-1] < xmin) || (xp_out[1-1] > xmax) || (xp_out[2-1] < ymin) || (xp_out[2-1] > ymax));
    }
    else
    {
        xp_out[1-1] = xp_inp[1-1];
        xp_out[2-1] = xp_inp[2-1];
        vp_out[1-1] = vp_inp[1-1];
        vp_out[2-1] = vp_inp[2-1];
    }
    part_out.xout = xp_out[0];
    part_out.yout = xp_out[1];
    part_out.vxout = vp_out[0];
    part_out.vyout = vp_out[1];
    return part_out;
}
Some points:
First, the firstprivate clause copies the listed variables onto each thread's stack, which takes time. Since these variables are never modified (i.e. they are read-only), you can declare them as shared instead.
Second, and with less impact, the ApplyReflectionBC function takes everything by value, so it creates a local copy of each argument. Use references (double &dt, for example).
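For illustration, the loop from the question could then be written roughly like this (a sketch: variables not listed in a clause are shared by default, and declaring ion_out inside the loop body makes it private automatically):

Out_props example:

#pragma omp parallel for
for (int i = 0; i < Np; i++)
{
    // Declared inside the loop, so each thread gets its own copy.
    Out_props ion_out = ApplyReflectionBC(dt, Nx, xmin, xmax, ymin, ymax, qmi, dy,
                                          vi_inlet, Egx, Egy,
                                          xi_i[2*i], xi_i[1+2*i], vi_i[2*i], vi_i[1+2*i]);
    xi_o[2*i]     = ion_out.xout;
    xi_o[1 + 2*i] = ion_out.yout;
    vi_o[2*i]     = ion_out.vxout;
    vi_o[1 + 2*i] = ion_out.vyout;
}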
Edit:
As Hristo pointed out, rand() is the source of your problems: it uses hidden shared state, so concurrent calls from several threads either serialize behind an internal lock or race with each other. You must replace it with some other random number generator. For both better random numbers and thread-safety, you may use this Mersenne Twister class (if the LGPL 2.1 isn't a problem): http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/VERSIONS/C-LANG/MersenneTwister.h . Just declare it private to your threads, like:
MTRand rng;
#pragma omp parallel for private(rng, ...)
for (..)
{
    ApplyReflectionBC(..., rng);
}

Out_props ApplyReflectionBC(..., MTRand &rng)
{
    // .... Code ....
    xnmf.y = ymin + Lgy*rng.rand(); // MTRand::rand() returns a number in the range [0; 1]
    // ........
}
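If the LGPL header isn't an option and C++11 is available, a similar sketch using the standard <random> facilities works the same way: give each thread its own, independently seeded engine (the function name and seeds below are just for illustration):

#include <omp.h>
#include <random>

// Sketch (assumes C++11): one independently seeded engine per thread,
// so there is no hidden shared state the way there is with rand().
void run(int Np)
{
    #pragma omp parallel
    {
        std::mt19937 rng(12345u + omp_get_thread_num());           // arbitrary per-thread seed
        std::uniform_real_distribution<double> uniform01(0.0, 1.0);

        #pragma omp for
        for (int i = 0; i < Np; i++)
        {
            double f = uniform01(rng);   // use this wherever rand()/RAND_MAX appeared
            (void)f;                     // ... rest of the loop body as in the question
        }
    }
}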
Related
I have parallel code that does some computation and then adds a double to a double variable declared outside the loop. I tried using std::atomic, but it has no support for arithmetic operations on std::atomic<double> variables.
double dResCross = 0.0;
std::atomic<double> dResCrossAT{0.0};
Concurrency::parallel_for(0, iExperimentalVectorLength, [&](size_t m)
{
    double value;
    //some computation of the double value
    atomic_fetch_add(&dResCrossAT, value); // no fetch_add overload for atomic<double> before C++20
});
dResCross += dResCrossAT;
Simply writing
dResCross += value;
does obviously output nonsense. My question is: how can I solve this problem without making the code serial?
A typical way to atomically perform arithmetic operations on a floating-point type is with a compare-and-swap (CAS) loop.
double value;
//some computation of the double value
double expected = atomic_load(&dResCrossAT);
// On failure, compare_exchange_weak reloads `expected`, and the loop condition
// recomputes expected + value with the fresh value before retrying.
while (!atomic_compare_exchange_weak(&dResCrossAT, &expected, expected + value));
A detailed explanation can be found in Jeff Preshing's article about this class of operation.
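The same idea as a self-contained helper (a minimal sketch; the function name is just for illustration):

#include <atomic>

// Atomically add `value` to an atomic<double> using a CAS loop.
// compare_exchange_weak reloads `expected` on failure, so the sum is
// recomputed with the latest stored value on every retry.
void atomic_add(std::atomic<double>& target, double value)
{
    double expected = target.load();
    while (!target.compare_exchange_weak(expected, expected + value))
    {
        // retry with the refreshed `expected`
    }
}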
I believe that preventing a partial (torn) write to a non-atomic variable requires a mutex. I'm not certain that this is the only way to ensure there is no write conflict, but it can be done like this:
#include <mutex>
#include <thread>

std::mutex mtx;

void threadFunction(double* d) {
    while (true) {
        std::lock_guard<std::mutex> lock(mtx); // protect both the read and the write
        if (*d >= 100) break;
        *d += 1.0;
    }
}

int main() {
    double* d = new double(0);
    std::thread thread(threadFunction, d);
    while (true) {
        std::lock_guard<std::mutex> lock(mtx); // the main thread must also lock before reading *d
        if (*d == 100) break;
    }
    thread.join();
    delete d;
}
This will add 1.0 to *d 100 times in a thread-safe way: the mutex ensures that only one thread accesses *d at a time. However, it is significantly slower than an atomic equivalent because locking and unlocking are expensive. The cost varies with the operating system, the processor, and what is being locked, but an uncontended lock/unlock pair is roughly in the neighborhood of 50 clock cycles for an example like this, while a contended one that requires a system call is more like 2000 clock cycles.
Moral: use with caution.
If your vector has many elements per thread, you should consider implementing a reduction rather than using an atomic operation for every element. Atomic operations are much more expensive than normal stores.
double global_value{0.0};
std::vector<double> private_values(num_threads, 0.0);

// Phase 1: each thread accumulates into its own slot (my_thread is this thread's index).
parallel_for(size_t k=0; k<n; ++k) {
    private_values[my_thread] += ...;
}
// Phase 2: once all threads are done (i.e. after a barrier), one thread combines the partial sums.
if (my_thread==0) {
    for (int t=0; t<num_threads; ++t) {
        global_value += private_values[t];
    }
}
This algorithm requires no atomic operations and will be faster in many cases. You can replace the second phase with a tree or atomics if the thread count is very high (e.g. on a GPU).
Concurrency libraries like TBB and Kokkos both provide parallel reduce templates that do the right thing internally.
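In OpenMP specifically, this two-phase pattern is built into the reduction clause; a minimal sketch (the sum function here is just for illustration):

#include <vector>

// Each thread accumulates into a private copy of global_value; the copies
// are combined automatically when the loop finishes. No atomics needed.
double sum(const std::vector<double>& data)
{
    double global_value = 0.0;
    #pragma omp parallel for reduction(+:global_value)
    for (long k = 0; k < (long)data.size(); ++k)
        global_value += data[k];
    return global_value;
}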
The result that this parallelized version returns doesn't match the result it should return, i.e. the one produced by the non-parallelized version.
double angle = (PI/180)*atoi(extra);
unsigned int xf;
unsigned int yf;
int j;
#pragma omp parallel for private(j,xf,yf)
for(int i=0; i<width; i++){
    for(j=0; j<height; j++){
        xf = (unsigned int)ceil(((cos(angle)*(j-(((double)height)/2.0)))-(sin(angle)*(i-(((double)width)/2.0))))+(((double)height)/2.0));
        yf = (unsigned int)ceil(((sin(angle)*(j-(((double)height)/2.0)))+(cos(angle)*(i-(((double)width)/2.0))))+(((double)width)/2.0));
        if(xf<(unsigned int)height && xf>=0 && yf>=0 && yf<(unsigned int)width){
            matrixRed2[yf][xf]   = matrixRed[i][j];
            matrixGreen2[yf][xf] = matrixGreen[i][j];
            matrixBlue2[yf][xf]  = matrixBlue[i][j];
        }
    }
}
I don't see why j needs to be declared outside the loop; it only holds a temporary value before it is assigned to the arrays (whose elements are independent). Moreover, the two for loops can be collapsed with collapse(2) (a sketch of the collapsed form is shown below).
Edit: each (xf, yf) computed from (i, j) is not guaranteed to be unique, so different iterations can write to the same destination element. Either keep the sequential version, or add a check (for example, keep the write from the largest (i, j)) if the race on (xf, yf) matters.
I'm also not sure whether OpenMP works well in Visual Studio, since the version it supports is very old (2.0, while the latest standard is 4.5).
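For reference, this is roughly what the collapsed form would look like, reusing the question's variables (a sketch only; the write conflict on (xf, yf) described in the edit still has to be resolved before it is safe):

#pragma omp parallel for collapse(2)
for (int i = 0; i < width; i++) {
    for (int j = 0; j < height; j++) {
        // Declared here, xf and yf are private to each thread automatically;
        // the redundant >= 0 checks on unsigned values are dropped.
        unsigned int xf = (unsigned int)ceil((cos(angle)*(j - height/2.0)) - (sin(angle)*(i - width/2.0)) + height/2.0);
        unsigned int yf = (unsigned int)ceil((sin(angle)*(j - height/2.0)) + (cos(angle)*(i - width/2.0)) + width/2.0);
        if (xf < (unsigned int)height && yf < (unsigned int)width) {
            matrixRed2[yf][xf]   = matrixRed[i][j];
            matrixGreen2[yf][xf] = matrixGreen[i][j];
            matrixBlue2[yf][xf]  = matrixBlue[i][j];
        }
    }
}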
Perhaps the solution to my problem is obvious to someone with OpenMP experience, but I don't have it. I want to accelerate the following subroutine using OpenMP:
void Build_ERIS(vector<double> &eris, vector<Atomic_Orbital> &Basis)
{
  int basis_size = Basis.size();
  int m = basis_size*(basis_size+1)/2;
  eris.resize(m*(m+1)/2);
  bool compute;
  std::fill(eris.begin(), eris.end(), 0);

  int i_orbital, j_orbital, k_orbital, l_orbital, i_primitive, j_primitive, k_primitive, l_primitive, ij, kl, ijkl, ijij, klkl;

  #pragma omp parallel
  {
    #pragma omp for ordered
    for(i_orbital=0; i_orbital<basis_size; i_orbital++){
      for(j_orbital=0; j_orbital<i_orbital+1; j_orbital++){
        ij = i_orbital*(i_orbital+1)/2 + j_orbital;
        for(k_orbital=0; k_orbital<basis_size; k_orbital++){
          for(l_orbital=0; l_orbital<k_orbital+1; l_orbital++){
            kl = k_orbital*(k_orbital+1)/2 + l_orbital;
            if (ij >= kl) {
              ijkl = composite_index(i_orbital,j_orbital,k_orbital,l_orbital);
              ijij = composite_index(i_orbital,j_orbital,i_orbital,j_orbital);
              klkl = composite_index(k_orbital,l_orbital,k_orbital,l_orbital);
              for(i_primitive=0; i_primitive<Basis[i_orbital].contraction.size; i_primitive++)
                for(j_primitive=0; j_primitive<Basis[j_orbital].contraction.size; j_primitive++)
                  for(k_primitive=0; k_primitive<Basis[k_orbital].contraction.size; k_primitive++)
                    for(l_primitive=0; l_primitive<Basis[l_orbital].contraction.size; l_primitive++)
                      eris[ijkl] +=
                        normconst(Basis[i_orbital].contraction.exponent[i_primitive], Basis[i_orbital].angular.l, Basis[i_orbital].angular.m, Basis[i_orbital].angular.n)*
                        normconst(Basis[j_orbital].contraction.exponent[j_primitive], Basis[j_orbital].angular.l, Basis[j_orbital].angular.m, Basis[j_orbital].angular.n)*
                        normconst(Basis[k_orbital].contraction.exponent[k_primitive], Basis[k_orbital].angular.l, Basis[k_orbital].angular.m, Basis[k_orbital].angular.n)*
                        normconst(Basis[l_orbital].contraction.exponent[l_primitive], Basis[l_orbital].angular.l, Basis[l_orbital].angular.m, Basis[l_orbital].angular.n)*
                        Basis[i_orbital].contraction.coef[i_primitive]*
                        Basis[j_orbital].contraction.coef[j_primitive]*
                        Basis[k_orbital].contraction.coef[k_primitive]*
                        Basis[l_orbital].contraction.coef[l_primitive]*
                        ERI_int(Basis[i_orbital].contraction.center.x, Basis[i_orbital].contraction.center.y, Basis[i_orbital].contraction.center.z, Basis[i_orbital].contraction.exponent[i_primitive], Basis[i_orbital].angular.l, Basis[i_orbital].angular.m, Basis[i_orbital].angular.n,
                                Basis[j_orbital].contraction.center.x, Basis[j_orbital].contraction.center.y, Basis[j_orbital].contraction.center.z, Basis[j_orbital].contraction.exponent[j_primitive], Basis[j_orbital].angular.l, Basis[j_orbital].angular.m, Basis[j_orbital].angular.n,
                                Basis[k_orbital].contraction.center.x, Basis[k_orbital].contraction.center.y, Basis[k_orbital].contraction.center.z, Basis[k_orbital].contraction.exponent[k_primitive], Basis[k_orbital].angular.l, Basis[k_orbital].angular.m, Basis[k_orbital].angular.n,
                                Basis[l_orbital].contraction.center.x, Basis[l_orbital].contraction.center.y, Basis[l_orbital].contraction.center.z, Basis[l_orbital].contraction.exponent[l_primitive], Basis[l_orbital].angular.l, Basis[l_orbital].angular.m, Basis[l_orbital].angular.n);
            }
          }
        }
      }
    }
  }
}
My concern is how best to make sure that, after the OpenMP parallelization, the reductions into eris[ijkl] still give the same values as the serial version of the routine. How can I fuse the loops in a way that is numerically safe?
Several things I see.
1) #pragma omp for ordered means: execute every single one of the iterations of this loop in order. This essentially means that while you're executing "in parallel," all of your work will be done in serial. Remove it.
2) You have not declared any of your variables shared or private. Note that all variables by default will be shared, so in your case ij and kl for instance will be accessible by any thread working on any iteration. You can no doubt see how this would cause a race condition if, say, iteration 100 changed variable ij while iteration 1 thought it was using it.
3) Your variable eris[ijkl] as you rightly noted must be reduced properly. If ijkl can never be the same value for two different iterations in your i_orbital loop, then you're fine as-is; no two threads will ever be changing the same variable eris[ijkl] potentially at the same time. If it can be the same value, then you have to carefully handle reduction on the array.
4) Here's what you should work with for starters. This assumes that ijkl will never be the same value for two different iterations, and that your functions do not take any non-constant references (which could turn what I'm assuming are input variables into output variables).
#pragma omp parallel for private(i_orbital, j_orbital, ij, k_orbital, l_orbital, kl, ijkl, ijij, klkl, i_primitive, j_primitive, k_primitive, l_primitive)
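Alternatively, instead of the long private list, the loop variables can simply be declared where they are used, which makes them private automatically. A skeleton of that version (a sketch only; the primitive loops and the accumulation body stay exactly as in the question, and the assumption from point 3 that ijkl never repeats across iterations still has to hold):

// Skeleton: declare loop variables inside the loops so they are private,
// and drop the ordered clause.
#pragma omp parallel for
for (int i_orbital = 0; i_orbital < basis_size; i_orbital++) {
    for (int j_orbital = 0; j_orbital < i_orbital + 1; j_orbital++) {
        int ij = i_orbital*(i_orbital+1)/2 + j_orbital;
        for (int k_orbital = 0; k_orbital < basis_size; k_orbital++) {
            for (int l_orbital = 0; l_orbital < k_orbital + 1; l_orbital++) {
                int kl = k_orbital*(k_orbital+1)/2 + l_orbital;
                if (ij >= kl) {
                    int ijkl = composite_index(i_orbital, j_orbital, k_orbital, l_orbital);
                    // ... primitive loops accumulating into eris[ijkl], as in the question ...
                }
            }
        }
    }
}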
I've spent time going over other posts, but I still can't get this simple program to work.
#include <iostream>
#include <cmath>
#include <omp.h>
using namespace std;

int main()
{
    int threadnum = 4; //want manual control
    int steps = 100000, cumulative = 0, counter;
    int a, b, c;
    float dum1, dum2, dum3;
    float pos[10000][3] = {0};
    float non = 0;
    //RNG declared

    #pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction(+: non, cumulative) num_threads(threadnum)
    {
        for (int dummy=0; dummy<(10000/threadnum); dummy++)
        {
            dum1=0, dum2=0, dum3=0;
            a=0, b=0, c=0;
            for (counter=0; counter<steps; counter++)
            {
                dum1 = somefunct1() + rand();
                dum2 = somefunct2() + rand();
                dum3 = somefunct3(dum1, dum2, ...);
                a += somefunct4(dum1, dum2, dum3, ...);
                b += somefunct5(dum1, dum2, dum3, ...);
                c += somefunct6(dum1, dum2, dum3, ...);
                cumulative++; //count number of loops executed
            }
            pos[dummy][0] = a; //saves results of second loop to array
            pos[dummy][1] = b;
            pos[dummy][2] = c;
            non += pos[dummy][0]; //holds the summed a values
        }
    }
}
I've cut down the program to get it to fit here. A lot of the time, if I make changes (and I've tried a lot of them), the inner loop simply does not execute the correct number of times and I get cumulative equal to something like 32,532,849 instead of 1 billion. Scaling is about 2x for the code above, but it should be much higher.
I want the code to simply split the first for loop (the 10000 iterations) so that each thread runs a share of the iterations in parallel (dynamic scheduling would be nice), and saves the results of each iteration of the second for loop to the results array. The second for loop has dependencies between iterations and cannot be split. Currently the order of the 'dummy' iterations does not matter (pos[345] can be swapped with pos[3456] as long as all three indices are swapped), but I will have to modify it later so that it does matter.
The numerous variables and initializations in the inner loop are confusing me terribly. There are a lot of calls to rand() and other functions/math functions in the inner loop - is there overhead here that is causing a problem? I'm using GCC 4.9.2 on Windows.
Any help would be greatly appreciated.
Edit: finally fixed. I moved the RNG declaration inside the parallel region, just before the first for loop (see below). Now I get 3.75x scaling going to 4 threads and 5.72x scaling on 8 threads (hyperthreads). Not perfect, but I'll take it. I still think there is an issue with thread locking and syncing.
......
float non = 0;

#pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction(+: non, cumulative) num_threads(threadnum)
{
    //RNG declared
    #pragma omp for
    for (int dummy=0; dummy<(10000/threadnum); dummy++)
    {
        ....
I have a simple program which generates (using Boost) some initial velocities and positions, and calculates the time it takes to propagate a certain distance. Based on the transverse distances (x, y), the final axial (z) velocity is added to a vector. Here is the simple program:
#include <iostream>
#include <vector>
#include <cmath>
#include <ctime>
#include <boost/random.hpp>
#include <boost/random/normal_distribution.hpp>
using namespace std;

int main()
{
    boost::mt19937 engine(static_cast<unsigned int>(time(0)));
    boost::normal_distribution<double> nd(0.0, 1.0);
    boost::variate_generator< boost::mt19937, boost::normal_distribution<double> > normal_std_one(engine, nd);

    double coordX, coordY, coordZ, time;
    double velX, velY, velZ;
    const double factor = 0.01;
    const double distance = 15.0;
    vector<double> cont;

    int i;
    for(i=0; i<1000000000; i++)
    {
        coordX = factor*normal_std_one();
        coordY = factor*normal_std_one();
        coordZ = 0.0;
        velX = normal_std_one();
        velY = normal_std_one();
        velZ = 20.0*normal_std_one() + 300;
        time = distance/velZ;
        coordX += velX*time;
        coordY += velY*time;
        if(sqrt(coordX*coordX + coordY*coordY) < 0.02)
        {
            cont.push_back(velZ);
        }
    }
    cout << cont.size() << endl;
    return 0;
}
I thought a nice addition would be to parallelize the for loop using OpenMP. I do this by adding the following line just before the loop:
#pragma omp parallel for
In addition, I have added -fopenmp to the compiler options and -fopenmp to the linker settings. My program compiles and links without errors, but when I execute the file I get the message:
Process terminated with status -1073741819 (0 minutes, 2 seconds)
It is not clear to me what I have done wrong here. I am using Windows and g++ (through Code::Blocks IDE).
I'm posting this as an answer rather than a comment, to collect the results and avoid a long list of comments. It works with parallel_for from Microsoft's PPL if you handle the std::vector's size properly to avoid an out-of-range exception. But the problem is that when i exceeds ~20000, boost::variate_generator cannot handle the concurrent requests, and the program crashes with an APPLICATION_FAULT_INVALID_POINTER_READ error.
Update: when used without boost::variate_generator (simply assigning a value to the vector at index i), it runs without errors on a dual-core notebook, but the result is the opposite of what was expected - the sequential code runs faster than the multithreaded version with parallel_for.
You can't call cont.push_back unsynchronised from multiple threads; it's not thread-safe. You will need to use a different container, or use some kind of mutex lock around the access. You may also need to do something to preserve the order in which elements go into the container, if that matters.
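One pattern that avoids both the data race and per-element locking is to give each thread its own vector and merge them afterwards. A minimal sketch (the function name and the velZ/accepted placeholders stand in for the body of the question's loop):

#include <omp.h>
#include <vector>

// Per-thread vectors, merged after the parallel region, so the hot loop
// needs no locking at all.
std::vector<double> collect(long iterations)
{
    std::vector< std::vector<double> > per_thread;

    #pragma omp parallel
    {
        #pragma omp single
        per_thread.resize(omp_get_num_threads());   // implicit barrier after single

        std::vector<double>& mine = per_thread[omp_get_thread_num()];

        #pragma omp for
        for (long i = 0; i < iterations; i++)
        {
            double velZ = 0.0;      // placeholder for the computation in the question
            bool accepted = true;   // placeholder for the sqrt(...) < 0.02 test
            if (accepted)
                mine.push_back(velZ);
        }
    }

    // Merge the per-thread results into one container.
    std::vector<double> cont;
    for (int t = 0; t < (int)per_thread.size(); ++t)
        cont.insert(cont.end(), per_thread[t].begin(), per_thread[t].end());
    return cont;
}

A simpler but slower alternative is to keep a single cont and wrap the push_back in a #pragma omp critical section.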