I have a simple program which generates (using Boost) some initial velocities and positions, and calculates the time it takes to propagate a certain distance. Based on the transverse distances (x, y), the final axial (z) velocity is added to a vector. Here is the simple program:
#include <cmath>
#include <ctime>
#include <iostream>
#include <vector>
#include <boost/random.hpp>
#include <boost/random/normal_distribution.hpp>

using namespace std;

int main()
{
    boost::mt19937 engine(static_cast<unsigned int>(time(0)));
    boost::normal_distribution<double> nd(0.0, 1.0);
    boost::variate_generator< boost::mt19937, boost::normal_distribution<double> > normal_std_one(engine, nd);

    double coordX, coordY, coordZ, time;
    double velX, velY, velZ;
    const double factor = 0.01;
    const double distance = 15.0;
    vector<double> cont;
    int i;
    for (i = 0; i < 1000000000; i++)
    {
        coordX = factor * normal_std_one();
        coordY = factor * normal_std_one();
        coordZ = 0.0;
        velX = normal_std_one();
        velY = normal_std_one();
        velZ = 20.0 * normal_std_one() + 300;
        time = distance / velZ;
        coordX += velX * time;
        coordY += velY * time;
        if (sqrt(coordX * coordX + coordY * coordY) < 0.02)
        {
            cont.push_back(velZ);
        }
    }
    cout << cont.size() << endl;
    return 0;
}
I thought a nice addition would be to parallelize the for loop using OpenMP. I do this by adding the following line just before the loop:
#pragma omp parallel for
In addition, I have added -fopenmp to the compiler options and -fopenmp to the linker settings. My program compiles and links without errors, but when I execute the file I get the message:
Process terminated with status -1073741819 (0 minutes, 2 seconds)
It is not clear to me what I have done wrong here. I am using Windows and g++ (through Code::Blocks IDE).
I post this as an answer rather than a comment, just to accumulate the results and avoid a long list of comments. It works with parallel_for from Microsoft's PPL if you size the std::vector properly up front to avoid out-of-range exceptions. But the problem is that once i exceeds roughly 20000, the boost::variate_generator cannot handle the concurrent requests, and the program crashes with an APPLICATION_FAULT_INVALID_POINTER_READ error.
Update: when used without boost::variate_generator (simply assigning a value to the vector at each index) on a dual-core notebook, it runs without errors, but the result is the opposite of what you would expect: the sequential code runs faster than the multithreaded parallel_for version.
You can't call cont.push_back unsynchronised from multiple threads; it isn't thread safe. You will need to use a different container, or some kind of mutex lock around the access. You may also need to do something to preserve the order in which elements go into the container, if that matters. A sketch of one approach follows.
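As a minimal sketch of one way to restructure the original program (per-thread generators, because a variate_generator is not safe to share across threads either, plus per-thread result buffers merged under a lock at the end; the seeding scheme is only illustrative, and insertion order is not preserved):

#include <cmath>
#include <ctime>
#include <iostream>
#include <vector>
#include <omp.h>
#include <boost/random.hpp>
#include <boost/random/normal_distribution.hpp>

int main()
{
    const double factor = 0.01;
    const double distance = 15.0;
    std::vector<double> cont;

    #pragma omp parallel
    {
        // One generator per thread: sharing a variate_generator across
        // threads is what crashes. Offsetting the seed by the thread id
        // is a simple (if naive) way to decorrelate the streams.
        boost::mt19937 engine(static_cast<unsigned int>(std::time(0))
                              + omp_get_thread_num());
        boost::normal_distribution<double> nd(0.0, 1.0);
        boost::variate_generator<boost::mt19937,
                                 boost::normal_distribution<double> >
            normal_std_one(engine, nd);

        std::vector<double> local; // per-thread buffer, no locking needed

        #pragma omp for
        for (int i = 0; i < 1000000000; i++)
        {
            double coordX = factor * normal_std_one();
            double coordY = factor * normal_std_one();
            double velX = normal_std_one();
            double velY = normal_std_one();
            double velZ = 20.0 * normal_std_one() + 300;
            double t = distance / velZ;
            coordX += velX * t;
            coordY += velY * t;
            if (std::sqrt(coordX * coordX + coordY * coordY) < 0.02)
                local.push_back(velZ);
        }

        // Merge the per-thread buffers, one thread at a time.
        #pragma omp critical
        cont.insert(cont.end(), local.begin(), local.end());
    }

    std::cout << cont.size() << std::endl;
    return 0;
}

The merge happens once per thread rather than once per element, so the critical section costs almost nothing.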
I am receiving an array of Eigen::MatrixXf and an array of Eigen::Matrix4f in real time. Both arrays have an equal number of elements. All I am trying to do is multiply the elements of both arrays together and store the result in another array at the same index.
Please see the code snippet below:
#define COUNT 4

while (all_ok())
{
    Eigen::Matrix4f trans[COUNT];
    Eigen::MatrixXf in_data[COUNT];
    Eigen::MatrixXf out_data[COUNT];

    // at each iteration, new data is filled
    // in 'trans' and 'in_data' variables

    #pragma omp parallel num_threads(COUNT)
    {
        #pragma omp for
        for (int i = 0; i < COUNT; i++)
            out_data[i] = trans[i] * in_data[i];
    }
}
Please note that COUNT is a constant. The sizes of trans and in_data are (4 x 4) and (4 x n) respectively, where n is approximately 500,000. In order to parallelize the for loop, I gave OpenMP a try as shown above. However, I don't see any significant improvement in the elapsed time of the for loop.
Any suggestions? Any alternatives to perform the same operation, please?
Edit: My idea is to define 4 (= COUNT) threads, each taking care of one multiplication. That way we don't need to create threads every time, I guess!
Works for me with the following self-contained example; that is, I get a 4x speed-up when enabling OpenMP:
#include <iostream>
#include <bench/BenchTimer.h>
using namespace Eigen;

const int COUNT = 4;

EIGEN_DONT_INLINE
void foo(const Matrix4f *trans, const MatrixXf *in_data, MatrixXf *out_data)
{
    #pragma omp parallel for num_threads(COUNT)
    for (int i = 0; i < COUNT; i++)
        out_data[i] = trans[i] * in_data[i];
}

int main()
{
    Eigen::Matrix4f trans[COUNT];
    Eigen::MatrixXf in_data[COUNT];
    Eigen::MatrixXf out_data[COUNT];
    int n = 500000;
    for (int i = 0; i < COUNT; i++)
    {
        trans[i].setRandom();
        in_data[i].setRandom(4, n);
        out_data[i].setRandom(4, n);
    }
    int tries = 3;
    int rep = 1;
    BenchTimer t;
    BENCH(t, tries, rep, foo(trans, in_data, out_data));
    std::cout << " " << t.best(Eigen::REAL_TIMER)
              << " (" << double(n)*4.*4.*4.*2.e-9/t.best() << " GFlops)\n";
    return 0;
}
So 1) make sure you measure the wall-clock time and not the CPU time, and 2) make sure that the products are the bottleneck and not the filling of in_data.
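On the first point, note that on many platforms clock() reports CPU time summed across all threads, so a parallel region can look no faster (or even slower) than the serial one. A minimal wall-clock sketch using OpenMP's own timer, dropped into a main like the one above:

    // omp_get_wtime() returns elapsed wall-clock seconds
    double t0 = omp_get_wtime();
    foo(trans, in_data, out_data); // the code under test
    double t1 = omp_get_wtime();
    std::cout << "wall time: " << (t1 - t0) << " s\n";

(omp_get_wtime is declared in <omp.h>.)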
Finally, for maximal performance don't forget to enable AVX/FMA (e.g., with -march=native), and of course make sure to benchmark with compiler optimizations ON.
For the record, on my computer the above example takes 0.25s without openmp, and 0.065s with.
You need to specify -fopenmp during compilation and linking. But you will quickly hit the limit where RAM access stops any further speed-up. You really should have a look at vector intrinsics. Depending on your CPU, you can process as many elements per operation as fit in a register: the register width divided by the size of your variable (4 bytes for float). So if your processor supports, say, AVX, you'd be dealing with 8 floats at a time. If you need some inspiration, you're welcome to steal code from my medical image reconstruction library here:
https://github.com/kvahed/codeare/blob/master/src/matrix/SIMDTraits.hpp
The code does the whole shebang for float/double real and complex.
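To make the register-width argument concrete, here is a hedged AVX sketch of an element-wise float multiply, 8 floats per iteration (this illustrates the idea rather than the 4x4-times-4xn product above, which Eigen already vectorizes internally):

#include <immintrin.h> // AVX intrinsics

// Assumes n is a multiple of 8 and the pointers are 32-byte aligned;
// a production version needs unaligned loads or a scalar tail loop.
void multiply_avx(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i < n; i += 8)
    {
        __m256 va = _mm256_load_ps(a + i); // load 8 floats at once
        __m256 vb = _mm256_load_ps(b + i);
        _mm256_store_ps(out + i, _mm256_mul_ps(va, vb)); // multiply, store
    }
}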
In a function that updates all particles I have the following code:
for (int i = 0; i < _maxParticles; i++)
{
    // check if active
    if (_particles[i].lifeTime > 0.0f)
    {
        _particles[i].lifeTime -= _decayRate * deltaTime;
    }
}
This decreases the lifetime of the particle based on the time that passed.
It gets recalculated every iteration, so if I have 10,000 particles that isn't very efficient, because it doesn't need to be recalculated (the value doesn't change anyway).
So I came up with this:
float lifeMin = _decayRate * deltaTime;
for (int i = 0; i < _maxParticles; i++)
{
    // check if active
    if (_particles[i].lifeTime > 0.0f)
    {
        _particles[i].lifeTime -= lifeMin;
    }
}
This calculates it once and stores it in a variable that is read every iteration, so the CPU doesn't have to recalculate it each time, which would theoretically improve performance.
Would it run faster than the old code? Or does the release compiler do optimizations like this?
I wrote a program that compares both methods:
#include <time.h>
#include <iostream>

const unsigned int MAX = 1000000000;

int main()
{
    float deltaTime = 20;
    float decayRate = 200;
    float foo = 2041.234f;

    unsigned int start = clock();
    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= decayRate * deltaTime;
    }
    std::cout << "Method 1 took " << clock() - start << "ms\n";

    start = clock();
    float calced = decayRate * deltaTime;
    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= calced;
    }
    std::cout << "Method 2 took " << clock() - start << "ms\n";

    int n;
    std::cin >> n;
    return 0;
}
Result in debug mode:
Method 1 took 2470ms
Method 2 took 2410ms
Result in release mode:
Method 1 took 0ms
Method 2 took 0ms
But that test isn't conclusive. I know it doesn't do exactly the same thing, but it gives an idea.
In debug mode, they take roughly the same time. Sometimes Method 1 is faster than Method 2 (especially with fewer iterations), sometimes Method 2 is faster.
In release mode, both take 0 ms. A little weird.
I tried measuring it in the game itself, but there aren't enough particles to get a clear result.
EDIT
I tried disabling optimizations and letting the variables be user inputs via std::cin.
Here are the results:
Method 1 took 2430ms
Method 2 took 2410ms
It will almost certainly make no difference whatsoever, at least if you compile with optimization (and of course, if you're concerned with performance, you are compiling with optimization). The optimization in question is called loop-invariant code motion, and it has been universally implemented for about 40 years.
On the other hand, it may make sense to use the separate variable anyway, to make the code clearer. This depends on the application, but in many cases, giving a name to the result of an expression can make code clearer. (In other cases, of course, throwing in a lot of extra variables can make it less clear. It all depends on the application.)
In any case, for such things, write the code as clearly as possible first, and then, if (and only if) there is a performance problem, profile to see where it is, and fix that.
EDIT:
Just to be perfectly clear: I'm talking about this sort of code optimization in general. In the exact case you show, since you don't use foo, the compiler will probably remove it (and the loops) completely.
In theory, yes. But your loop is extremely simple and thus likely to be heavily optimized.
Try the -O0 option to disable all compiler optimizations.
The 0 ms release-mode timing is most likely caused by the compiler computing the result statically.
I am pretty confident that any decent compiler will effectively replace your loops with the following code (strictly speaking that rewrite needs fast-math-style flags, since repeated float subtraction isn't exactly one multiply; and since foo is never used, the loops may simply be removed outright):
foo -= MAX * decayRate * deltaTime;
and
foo -= MAX * calced;
You can make MAX depend on some kind of input (e.g., a command-line parameter) to avoid that; see the sketch below.
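For example, a hedged sketch of the second method with the iteration count taken from the command line and the result printed, so the compiler can neither fold the loop at compile time nor discard it as dead code:

#include <cstdlib>
#include <iostream>

int main(int argc, char* argv[])
{
    // the iteration count comes from the command line, so it is
    // unknown at compile time
    unsigned int max = (argc > 1) ? std::atoi(argv[1]) : 1000000u;
    float deltaTime = 20;
    float decayRate = 200;
    float foo = 2041.234f;
    float calced = decayRate * deltaTime;
    for (unsigned int i = 0; i < max; i++)
    {
        foo -= calced;
    }
    std::cout << foo << '\n'; // using foo keeps the loop from being removed
    return 0;
}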
I'm writing a program that uses brute-force to solve an equation. Unfortunately, I seem to have an error in my code somewhere, as my program stops at search = 0.19999. Here is the code:
#include <iostream>
#include <cmath>
#include <vector>

#define min -4.0
#define max 6.5

using namespace std;

double fx(double x)
{
    long double result = cos(2*x) - 0.4*x;
    double scale = 0.00001;
    double value = (int)(result / scale) * scale;
    return value;
}

int sign(double a)
{
    if (a < 0) return -1;
    if (a == 0) return 0;
    else return 1;
}

int main()
{
    vector<double> results;
    double step, interval, start, end, search;
    interval = (fabs(min) + fabs(max)) / 50;
    step = 0.00001;
    start = min;
    end = min + interval;
    search = start;
    while (end <= max)
    {
        if (sign(start) != sign(end))
        {
            search = start;
            while (search < end)
            {
                if (fx(search) == 0) results.push_back(search);
                search = search + step;
            }
        }
        start = end;
        end = start + interval;
    }
    for (int i = 0; i < results.size(); i++)
    {
        cout << results[i] << endl;
    }
}
I've been looking at it for quite some time now and I still can't find the error in the code.
The program should check if there is a root in each given interval and, if yes, check every possibility in that interval. If it finds a root, it should push it into the results vector.
I know you already found the answer, but I spotted another problem while trying to find the bug. In the inner loop you make the following comparison:
if(fx(search) == 0)
Since your fx function returns a double, it's generally not advisable to test with the equality operator when dealing with double-precision floating-point numbers. The result will probably never be exactly 0, so this test will never return true. You should instead compare against a maximum error margin, like this:
double maximum_error = 0.005;
if(fabs(fx(search)) < maximum_error)
I think that would do the trick in your case.
Even if it works right now, tiny changes in your input numbers, CPU architecture or even compiler flags may break your program. It's highly dangerous to compare doubles for exact equality in C++ like that, even though it's legal to do so.
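For completeness, a hedged sketch of a more robust comparison helper (absolute tolerance near zero, relative tolerance elsewhere; the epsilon values are illustrative and should be tuned to the problem):

#include <algorithm>
#include <cmath>

bool nearly_equal(double a, double b,
                  double abs_eps = 1e-12, double rel_eps = 1e-9)
{
    double diff = std::fabs(a - b);
    if (diff <= abs_eps)               // handles values near zero
        return true;
    // otherwise scale the tolerance with the magnitude of the operands
    return diff <= rel_eps * std::max(std::fabs(a), std::fabs(b));
}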
I've just made another run through the code and found the error.
if(sign(start) != sign(end))
was the culprit. There will be a root if the values of f(x) at start and end have different signs; instead, I wrote that there will be a root if the signs of start and end themselves are different. Sorry for the fuss.
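In code, the fix is to compare the signs of the function values at the interval endpoints:

if(sign(fx(start)) != sign(fx(end)))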
This question was closed as a duplicate of: OpenMP program is slower than sequential one.
I have this code that I parallelized using OpenMP that seems to run slower than the serial version. Here's the relevant fragment of the code:
Out_props ion_out;

#pragma omp parallel for firstprivate(Egx,Egy,vi_inlet,dt,xmin,xmax,ymin,ymax,qmi,dy,Nx) private(ion_out)
for (int i = 0; i < Np; i++)
{
    ion_out = ApplyReflectionBC(dt, Nx, xmin, xmax, ymin, ymax, qmi, dy, vi_inlet,
                                Egx, Egy, xi_i[2*i], xi_i[1+2*i], vi_i[2*i], vi_i[1+2*i]);
    xi_o[1-1+2*i] = ion_out.xout;
    xi_o[2-1+2*i] = ion_out.yout;
    vi_o[1-1+2*i] = ion_out.vxout;
    vi_o[2-1+2*i] = ion_out.vyout;
}
Here Out_props is just a structure with 4 members of type double. The ApplyReflectionBC function (given below) just applies some operations for each i. All these operations are completely independent of each other. Egx and Egy are 60x60 matrices defined prior to entering this loop, and vi_inlet is a 2x1 vector. I've tried making ion_out an array of size Np to further increase independence, but that seems to make no difference. Everything else inside firstprivate is of type double, defined prior to entering this loop.
I'd really appreciate any insights into why this might be running many times slower than the serial version. Thank you!
Out_props ApplyReflectionBC(double dt, int Nx, double xmin, double xmax,
                            double ymin, double ymax, double qmp, double dy,
                            double *vp_inlet, double *Egx, double *Egy,
                            double xpx, double xpy, double vpx, double vpy)
{
    Out_props part_out;
    double Lgy = ymax - ymin;
    double xp_inp[2] = {xpx, xpy};
    double vp_inp[2] = {vpx, vpy};
    double xp_out[2];
    double vp_out[2];
    struct vector
    {
        double x;
        double y;
    } vnmf, Ep, xnmf;

    if ((xp_inp[1-1] > xmin) && (xp_inp[1-1] < xmax) && (xp_inp[2-1] < ymin)) // ONLY below lower wall
    {
        xp_out[1-1] = xp_inp[1-1];
        xp_out[2-1] = ymin;
        vp_out[1-1] = vp_inp[1-1];
        vp_out[2-1] = -vp_inp[2-1];
    }
    else if ((xp_inp[1-1] < xmin) || (xp_inp[1-1] > xmax) || (xp_inp[2-1] > ymax))
    { // Simple Boris push
        xnmf.x = xmin;
        xnmf.y = ymin + Lgy*rand()/RAND_MAX;
        vnmf.x = vp_inlet[0];
        vnmf.y = vp_inlet[1];
        // Find E field at x, y
        double yjp = ymin + dy*floor((xnmf.y - ymin)/(1.0*dy));
        double yjp1p = yjp + dy;
        int kp = (yjp - ymin)/dy;
        int kpp1 = kp + 1;
        double ylg = xnmf.y - yjp;
        double wjk = 1.0*(dy - ylg)/(1.0*dy);
        double wjkp1 = 1.0*ylg/(1.0*dy);
        Ep.x = wjk*Egx[Nx*kp] + wjkp1*Egx[Nx*kpp1];
        Ep.y = wjk*Egy[Nx*kp] + wjkp1*Egy[Nx*kpp1];
        do
        {
            double f = 1.0*rand()/RAND_MAX;
            xp_out[1-1] = xnmf.x + f*dt*(vnmf.x + qmp*Ep.x*f*dt/2.0);
            xp_out[2-1] = xnmf.y + f*dt*(vnmf.y + qmp*Ep.y*f*dt/2.0);
            vp_out[1-1] = vnmf.x + qmp*Ep.x*(f - 0.5)*dt;
            vp_out[2-1] = vnmf.y + qmp*Ep.y*(f - 0.5)*dt;
        } while ((xp_out[1-1] < xmin) || (xp_out[1-1] > xmax) || (xp_out[2-1] < ymin) || (xp_out[2-1] > ymax));
    }
    else
    {
        xp_out[1-1] = xp_inp[1-1];
        xp_out[2-1] = xp_inp[2-1];
        vp_out[1-1] = vp_inp[1-1];
        vp_out[2-1] = vp_inp[2-1];
    }

    part_out.xout = xp_out[0];
    part_out.yout = xp_out[1];
    part_out.vxout = vp_out[0];
    part_out.vyout = vp_out[1];
    return part_out;
}
Some points:
First, the firstprivate clause creates a copy of the listed variables on each thread's stack, and that takes some time. Since these variables won't be changed (i.e., they are read-only), you can declare them as shared instead; see the sketch after these points.
Second, with less impact, the ApplyReflectionBC function takes everything by value, so it creates a local copy of each argument. Use references instead (double &dt, for example).
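Putting both points together, a hedged sketch of the revised loop (names taken from the question; treat the clause choices as a starting point rather than a tuned solution):

// Read-only inputs stay shared -- the default for variables declared
// outside the parallel region -- so no per-thread copies are made;
// only the per-iteration output remains private.
Out_props ion_out;
#pragma omp parallel for shared(Egx, Egy, vi_inlet) private(ion_out)
for (int i = 0; i < Np; i++)
{
    ion_out = ApplyReflectionBC(dt, Nx, xmin, xmax, ymin, ymax, qmi, dy,
                                vi_inlet, Egx, Egy,
                                xi_i[2*i], xi_i[1+2*i],
                                vi_i[2*i], vi_i[1+2*i]);
    xi_o[2*i]   = ion_out.xout;
    xi_o[1+2*i] = ion_out.yout;
    vi_o[2*i]   = ion_out.vxout;
    vi_o[1+2*i] = ion_out.vyout;
}

None of this will help much, though, until the rand() issue below is fixed.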
Edit:
As Hristo pointed out, rand() is the source of your problems. You must replace it with some other random number generator. For both better random numbers and thread safety, you may use this Mersenne Twister class (if the LGPL 2.1 isn't a problem): http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/VERSIONS/C-LANG/MersenneTwister.h . Just declare it private to your threads, like:
MTRand rng;

#pragma omp parallel for private(rng, ...)
for (..)
{
    ApplyReflectionBC(..., rng);
}

Out_props ApplyReflectionBC(..., MTRand &rng)
{
    // .... Code ....
    xnmf.y = ymin + Lgy*rng.rand(); // MTRand::rand() returns a number in the range [0, 1]
    // ........
}
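If C++11 is available, the standard <random> header gives you the same pattern without an external dependency; a minimal sketch (the simulate function and the seeding scheme are only illustrative):

#include <omp.h>
#include <random>

void simulate(int Np)
{
    #pragma omp parallel
    {
        // one engine per thread, seeded differently per thread so the
        // streams don't coincide
        std::mt19937 rng(12345u + omp_get_thread_num());
        std::uniform_real_distribution<double> uni(0.0, 1.0);

        #pragma omp for
        for (int i = 0; i < Np; i++)
        {
            double f = uni(rng); // thread-safe stand-in for rand()/RAND_MAX
            // ... per-particle work using f ...
            (void)f; // placeholder: silences the unused warning in this sketch
        }
    }
}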
I'm trying to write a simple application using OpenMP. Unfortunately I have a problem with the speedup.
The application has one while loop. The body of this loop consists of some instructions which should be done sequentially, plus one for loop. I use #pragma omp parallel for to make this for loop parallel. The loop doesn't have much work, but it is called very often.
I prepared two versions of the for loop, and ran the application on 1, 2 and 4 cores.
version 1 (4 iterations in for loop): 22sec, 23sec, 26sec.
version 2 (100000 iterations in for loop): 20sec, 10sec, 6sec.
As you can see, when the for loop doesn't have much work, the time on 2 and 4 cores is higher than on 1 core.
I guess the reason is that #pragma omp parallel for creates new threads in each iteration of the while loop. So I would like to ask: is there any way to create the threads once (before the while loop) while still ensuring that some of the work in the while loop is done sequentially?
#include <omp.h>
#include <iostream>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char* argv[])
{
    double sum = 0;
    while (true)
    {
        // ...
        // some work which should be done sequentially
        // ...
        #pragma omp parallel for num_threads(atoi(argv[1])) reduction(+:sum)
        for (int j = 0; j < 4; ++j) // version 2: for(int j = 0; j < 100000; ++j)
        {
            double x = pow(j, 3.0);
            x = sqrt(x);
            x = sin(x);
            x = cos(x);
            x = tan(x);
            sum += x;

            double y = pow(j, 3.0);
            y = sqrt(y);
            y = sin(y);
            y = cos(y);
            y = tan(y);
            sum += y;

            double z = pow(j, 3.0);
            z = sqrt(z);
            z = sin(z);
            z = cos(z);
            z = tan(z);
            sum += z;
        }
        if (sum > 100000000)
        {
            break;
        }
    }
    return 0;
}
Most OpenMP implementations create a number of threads on program startup and keep them for the duration of the program. That is, most implementations don't dynamically create and destroy threads during execution; to do so would hit performance with severe thread management costs. This approach to thread management is consistent with, and appropriate for, the usual use cases for OpenMP.
It is far more likely that the slowdown you see when you increase the number of OpenMP threads is down to imposing a parallel overhead on a loop with a tiny number of iterations. Hristo's answer covers this.
You could move the parallel region outside the while (true) loop and use the single directive to make the serial part of the code execute in one thread only (see the sketch at the end of this answer). This will remove the overhead of the fork/join model. Also, OpenMP is not really useful on tight loops with a very small number of iterations (like your version 1); there you are basically measuring OpenMP overhead, since the work inside the loop is done very quickly. Even 100000 iterations with transcendental functions take less than a second on a current-generation CPU (at 2 GHz and roughly 100 cycles per FP instruction other than addition, it would take ~100 ms).
That's why OpenMP provides the if(condition) clause that can be used to selectively turn off the parallelisation for small loops:
#pragma omp parallel for ... if(loopcnt > 10000)
for (i = 0; i < loopcnt; i++)
    ...
It is also advisable to use schedule(static) for regular loops (that is, for loops in which every iteration takes about the same time to compute).
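For completeness, here is a hedged sketch of the restructuring described above, applied to the question's code: the team of threads is forked once, the sequential part runs under single, and the worksharing loop uses schedule(static) plus a reduction. The termination logic relies on the implicit barriers, so treat it as a starting point rather than a drop-in replacement:

#include <omp.h>
#include <math.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    double sum = 0;

    // the team of threads is created once, outside the while loop
    #pragma omp parallel num_threads(atoi(argv[1]))
    {
        while (true)
        {
            #pragma omp single
            {
                // work which should be done sequentially goes here;
                // the implicit barrier at the end of 'single' keeps
                // the other threads in step
            }

            #pragma omp for schedule(static) reduction(+:sum)
            for (int j = 0; j < 100000; ++j)
            {
                sum += tan(cos(sin(sqrt(pow(j, 3.0)))));
            }
            // implicit barrier here: the reduced 'sum' is visible to all

            if (sum > 100000000)
                break; // every thread sees the same sum, so all exit together
        }
    }
    return 0;
}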