openmp private/shared data in a MC simulation - c++

I'm simulating a stochastic differential equation with a Monte Carlo method, which in principle is perfectly suited for OpenMP, as different realizations do not depend on each other. Unfortunately I'm facing some problems with my code, which produces wrong results as soon as I turn on OpenMP. Without it, it works perfectly fine. My 'critical' loop looks like this:
double price = 0.0;
#pragma omp parallel for private(VOld, VNew)
for (long i = 0; i < NSim; ++i){
    VOld = S_0;
    for (long index = 0; index < Nt; ++index){
        VNew = VOld + (dt * r * VOld) + (sqrdt * sig * VOld * dW());
        VOld = VNew;
    }
    double tmp = myOption.PayOff(VNew);
    price += (tmp)/double(NSim);
}
I would truly appreciate any help. Thank you in advance :-)

A common mistake is forgetting that each thread must have its own random number generator. If that's not the case, every call to dW ends up mutating the internal state of a shared (rather than private) random number generator.
I hope this helps.

Well, one problem I see is that you have a race condition on the variable price. You should be doing a reduction:
#pragma omp parallel for private(VOld, VNew) reduction(+:price)
The same goes for your variable OptionPrice.
Also, it looks to me like rng is still shared, not private. You should either define it inside the parallel block, which automatically makes it private, or declare it private explicitly (for private variables I prefer the former).

OK, so based on #jmbr's and #raxman's answers I moved the inner loop to a separate function and made sure that rng is now really private. Also note the seeding trick, which turns out to be vital. On top of that I introduced a reduction on OptionPrice. The code below works fine.
double SimulateStockPrice(const double InitialPrice, const double dt, const long Nt, const double r, const double sig){
    // Each call builds and seeds its own generator, so no RNG state is shared between threads.
    // The static counter combined with time(NULL) decorrelates the seeds of successive calls.
    static unsigned long seed = 0;
    boost::mt19937 rng;
    rng.seed((++seed) + time(NULL));
    boost::normal_distribution<> nd(0.0, 1.0);
    boost::variate_generator< boost::mt19937, boost::normal_distribution<> > dW(rng, nd);
    double sqrdt = sqrt(dt);
    double PriceNew(0.0), PriceOld(InitialPrice);
    for (long index = 0; index < Nt; ++index){
        PriceNew = PriceOld + (dt * r * PriceOld) + (sqrdt * sig * PriceOld * dW());
        PriceOld = PriceNew;
    }
    return PriceNew;
}
Then in the big loop I go with:
#pragma omp parallel for default(none) shared(dt, NSim, Nt, S_0, myOption) reduction(+:OptionPrice)
for (long i = 0; i < NSim; ++i){
    double StockPrice = SimulateStockPrice(S_0, dt, Nt, myOption.r, myOption.sig);
    double PayOff = myOption.myPayOffFunction(StockPrice);
    OptionPrice += PayOff;
}
And off you go :-)

Related

C++ boost library to generate negative binomial random variables

I'm new to C++ and I'm using the boost library to generate random variables. I want to generate random variables from a negative binomial distribution.
The first parameter of boost::random::negative_binomial_distribution<int> freq_nb(r, p); has to be an integer. I want to extend that to a real value, so I would like to use a Poisson-gamma mixture, but I am failing to get it to work.
Here's an excerpt from my code:
int nr_sim = 1000000;
double mean = 2.0;
double variance = 15.0;
double r = mean * mean / (variance - mean);
double p = mean / variance;
double beta = (1 - p) / p;
typedef boost::mt19937 RNGType;
RNGType rng(5);
boost::random::gamma_distribution<double> my_gamma(r, beta);
boost::random::poisson_distribution<int> my_poi(my_gamma(rng));
int simulated_mean = 0;
for (int i = 0; i < nr_sim; i++) {
    simulated_mean += my_poi(rng);
}
double my_result = (double)simulated_mean / (double)nr_sim;
With my_result == 0.5 something is definitely wrong (it should be close to the target mean of 2.0). Is the problem my_poi(my_gamma(rng))? If so, what is the correct way to solve it?
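For reference, a minimal self-contained sketch of one way the mixture could be sampled, re-drawing the gamma-distributed mean on every iteration instead of fixing it once at construction time (untested; the expected output of roughly 2.0 is simply the target mean from above):
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/gamma_distribution.hpp>
#include <boost/random/poisson_distribution.hpp>
#include <iostream>

int main() {
    const int nr_sim = 1000000;
    const double mean = 2.0;
    const double variance = 15.0;
    const double r = mean * mean / (variance - mean);
    const double p = mean / variance;
    const double beta = (1 - p) / p;

    boost::mt19937 rng(5);
    boost::random::gamma_distribution<double> my_gamma(r, beta);

    long long simulated_sum = 0;
    for (int i = 0; i < nr_sim; i++) {
        // Draw a fresh Poisson mean from the gamma distribution for every sample.
        boost::random::poisson_distribution<int> my_poi(my_gamma(rng));
        simulated_sum += my_poi(rng);
    }
    std::cout << (double)simulated_sum / nr_sim << std::endl;  // should land near 2.0
    return 0;
}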

Are floating point operations faster than int operations?

So I'm doing a little benchmark measuring operations per second for different operator/type combinations in C++, and now I'm stuck. My tests for +/int and +/float look like:
int int_plus(){
    int x = 1;
    for (int i = 0; i < NUM_ITERS; ++i){
        x += i;
    }
    return x;
}
float float_plus(){
    float x = 1.0;
    for (int i = 0; i < NUM_ITERS; ++i){
        x += i;
    }
    return x;
}
And the time measurement looks like this:
//same for float
start = chrono::high_resolution_clock::now();
int res_int = int_plus();
end = chrono::high_resolution_clock::now();
diff = end - start;
ops_per_sec = NUM_ITERS / (diff.count() / 1000);
When I run the tests I get:
3.65606e+08 ops per second for int_plus
3.98838e+08 ops per second for float_plus
But as I understand it, float operations are always slower than int operations, yet my tests show a higher value for the float type.
So the question is: am I wrong, is there something wrong with my code, or is it something else?
There are a few things that could be going on. Optimization can be part of it; using a #define constant could be letting the compiler do who-knows-what.
Note that the loop code is also being counted. Now, that's a constant for both loops, but it's part of your time, and that means you're doing a lot of int operations, not just 1 * NUM_ITERS.
If NUM_ITERS is relatively small, then the execution time is going to be very low, and that means the overhead of a method call probably dwarfs the cost of the operations inside the method.
Optimization level will also matter.
I'm not sure what else.
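To illustrate those points, here is a self-contained sketch in the spirit of the question's code; the NUM_ITERS and REPEATS values, the sink variables and the printing are my own additions. Repeating the calls amortizes the function-call overhead, and using the results in the output keeps the compiler from discarding the loops outright (although with optimization enabled it may still fold them, which is part of the point above):
#include <chrono>
#include <cstdio>

static const int NUM_ITERS = 50000;  // small enough that the int sum does not overflow
static const int REPEATS = 2000;     // repeat whole calls so the time is not dominated by call overhead

int int_plus(){
    int x = 1;
    for (int i = 0; i < NUM_ITERS; ++i){
        x += i;
    }
    return x;
}

float float_plus(){
    float x = 1.0;
    for (int i = 0; i < NUM_ITERS; ++i){
        x += i;
    }
    return x;
}

int main(){
    using clock = std::chrono::high_resolution_clock;

    long long sink_int = 0;
    auto start = clock::now();
    for (int rep = 0; rep < REPEATS; ++rep){
        sink_int += int_plus();
    }
    double secs_int = std::chrono::duration<double>(clock::now() - start).count();

    double sink_float = 0.0;
    start = clock::now();
    for (int rep = 0; rep < REPEATS; ++rep){
        sink_float += float_plus();
    }
    double secs_float = std::chrono::duration<double>(clock::now() - start).count();

    // Printing the accumulated results prevents dead-code elimination of the loops.
    std::printf("int:   sink %lld, %.3e ops/s\n", sink_int, (double)NUM_ITERS * REPEATS / secs_int);
    std::printf("float: sink %f,  %.3e ops/s\n", sink_float, (double)NUM_ITERS * REPEATS / secs_float);
    return 0;
}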

Parallelization for three loops of a C++ code?

How can I parallelize this code using OpenMP?
xp, yp, zp, gpx, gpy, and gpz are known 1D vectors.
for (ies = 0; ies < 1000000; ies++){
    for (jes = ies+1; jes < 1000000; jes++){
        double dxp = xp[ies] - xp[jes];
        double dyp = yp[ies] - yp[jes];
        double dzp = zp[ies] - zp[jes];
        double distance = sqrt( dxp * dxp + dyp * dyp + dzp * dzp );
        double gpspec = gpx[ies] * gpx[jes] + gpy[ies] * gpy[jes] + gpz[ies] * gpz[jes];
        #pragma omp parallel for
        for (kes = 1; kes <= 100; kes++){
            double distan = kes * distance;
            E1[kes] = E1[kes] + gpspec * sin(distan) / distan;
        }
    }
}
Here is a possibility (not tested)
#pragma omp parallel for reduction(+: E1) private(jes, kes) schedule(dynamic)
for (ies = 0; ies < 1000000; ies++){
    for (jes = ies+1; jes < 1000000; jes++){
        double dxp = xp[ies] - xp[jes];
        double dyp = yp[ies] - yp[jes];
        double dzp = zp[ies] - zp[jes];
        double distance = sqrt( dxp * dxp + dyp * dyp + dzp * dzp );
        double gpspec = gpx[ies] * gpx[jes] + gpy[ies] * gpy[jes] + gpz[ies] * gpz[jes];
        for (kes = 1; kes <= 100; kes++){
            double distan = kes * distance;
            E1[kes] = E1[kes] + gpspec * sin(distan) / distan;
        }
    }
}
I've put a schedule(dynamic) to try to compensate for the workload imbalance between threads introduced by the triangular shape of the (ies, jes) index domain that the loops cover.
Also, depending on how E1 is defined, the array reduction may or may not be accepted by your compiler. In any case, if reduction(+: E1) isn't accepted, you can always do the reduction by hand with a critical construct, as sketched below.
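For completeness, a sketch of that by-hand reduction (untested; xp, yp, zp, gpx, gpy, gpz and E1 are the arrays from the question, E1 is assumed to be a plain array of at least 101 doubles indexed 1..100, and the loop variables are declared locally here so that they are private):
#pragma omp parallel
{
    double E1_local[101] = {0.0};  // per-thread partial sums

    #pragma omp for schedule(dynamic) nowait
    for (long ies = 0; ies < 1000000; ies++){
        for (long jes = ies+1; jes < 1000000; jes++){
            double dxp = xp[ies] - xp[jes];
            double dyp = yp[ies] - yp[jes];
            double dzp = zp[ies] - zp[jes];
            double distance = sqrt( dxp * dxp + dyp * dyp + dzp * dzp );
            double gpspec = gpx[ies] * gpx[jes] + gpy[ies] * gpy[jes] + gpz[ies] * gpz[jes];
            for (int kes = 1; kes <= 100; kes++){
                double distan = kes * distance;
                E1_local[kes] += gpspec * sin(distan) / distan;
            }
        }
    }

    // Merge each thread's partial sums into the shared E1, one thread at a time.
    #pragma omp critical
    for (int kes = 1; kes <= 100; kes++){
        E1[kes] += E1_local[kes];
    }
}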
You already have an omp parallel for pragma on the innermost loop. For it to have any effect, you probably need to enable OpenMP support in your compiler by setting a compiler flag (for example, with the GCC compiler suite, that would be the -fopenmp flag). You may also need to #include the omp.h header.
But with that said, I doubt you're going to gain much from this parallelization, because one run of the loop you are parallelizing just doesn't do much work. There is runtime overhead associated with parallelization that offsets the gains from running multiple loop iterations at the same time, so I don't think you're going to net very much.

C++ with OpenMP thread safe random numbers

I am trying to draw some random points and then calculate something with them. I am using a few threads, but my random numbers are not as random as they are supposed to be... I mean, when I use rand() I get the correct answer, but it is very slow (because of rand()'s shared static state), so I am using rand_r with a seed, but then the answer of my program is always weird.
double randomNumber(unsigned int seed, double a, double b) {
    return a + ((float)rand_r(&seed))/(float)(RAND_MAX) * (b-a);
}
my program:
#pragma omp parallel
for(int i = 0; i < points; i++){
    seedX = (i+1) * time(NULL);
    seedY = (points - i) * time(NULL);
    punkt.x = randomNumber(seedX, minX, maxX);
    punkt.y = randomNumber(seedY, minY, maxY);
    ...
}
I found some solutions in other topics (some mt19937 generators etc.), but I can't get any of them to compile.
I am compiling with g++ -fopenmp (g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2).
edit:
seed = rand();
#pragma omp parallel
for(int i = 0; i < points; i++){
    punkt.x = randomNumber(seed, minX, maxX);
    punkt.y = randomNumber(seed, minY, maxY);
    ...
}
Re-seeding your generators within each iteration of the for loop is going to ruin their statistical properties.
Also, it's likely that you'll introduce correlation between your x and y values if you extract them using two linear congruential generators.
Keep it simple; use one generator, and one seed.
Going forward, I'd recommend you use mt19937, as it has better properties still. Linear congruential generators can fail a chi-squared test for autocorrelation, which is particularly important if you are using them for an x, y plot.
I believe what the others are trying to say is: seed it once, in a constructor, with srand(some number), and then do not seed it again, e.g. wrapped in something like:
class someRandomNumber
{
    // seed the engine once here (in the constructor), then only draw from it afterwards
};
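For the mt19937 route recommended above, a minimal self-contained sketch of one generator per thread, each seeded exactly once (the seeding scheme, the sums and all names here are just an illustration, not the question's punkt code; with g++ 4.8 it needs -std=c++11 in addition to -fopenmp):
#include <omp.h>
#include <cstdio>
#include <random>

int main(){
    const int points = 1000000;
    const double minX = 0.0, maxX = 1.0, minY = 0.0, maxY = 1.0;
    double sumX = 0.0, sumY = 0.0;

    #pragma omp parallel reduction(+:sumX, sumY)
    {
        // One engine per thread, seeded once; mixing in the thread id keeps the streams distinct.
        std::mt19937 rng(12345u + omp_get_thread_num());
        std::uniform_real_distribution<double> distX(minX, maxX);
        std::uniform_real_distribution<double> distY(minY, maxY);

        #pragma omp for
        for (int i = 0; i < points; i++){
            sumX += distX(rng);
            sumY += distY(rng);
        }
    }

    // The averages should sit near the middle of each range.
    std::printf("mean x = %f, mean y = %f\n", sumX / points, sumY / points);
    return 0;
}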

Parallel for with omp

I am trying to optimise the following loop with OpenMP:
#pragma omp parallel for private(diff)
for (int j = 0; j < x.d; ++j) {
    diff = x(example,j) - x(chosen_pts[ndx - 1],j);
    #pragma omp atomic
    d2 += diff * diff;
}
But it actually runs 4x slower than without the #pragma.
EDIT
As Piotr S., coincoin and erenon pointed out, in my case x.d is so small that parallelism makes my code run slower. I am posting the outer loop too, in case there is some possibility for multithreading (x.n is over 100 million):
float sum_distribution = 0.0;
// look for the point that is furthest from any center
float max_dist = 0.0;
for (int i = 0; i < x.n; ++i) {
    int example = dist2[i].second;
    float d2 = 0.0, diff;
    //#pragma omp parallel for private(diff) reduction(+:d2)
    for (int j = 0; j < x.d; ++j) {
        diff = x(example,j) - x(chosen_pts[ndx - 1],j);
        d2 += diff * diff;
    }
    if (d2 < dist2[i].first) {
        dist2[i].first = d2;
    }
    if (dist2[i].first > max_dist) {
        max_dist = dist2[i].first;
    }
    sum_distribution += dist2[i].first;
}
If someone is interested, here is the whole function: https://github.com/ghamerly/baylorml/blob/master/fast_kmeans/general_functions.cpp#L169. As I measured it, 85% of the elapsed time comes from this loop.
Yes, the outer loop, as posted, can be parallelized with OpenMP.
All variables modified in the loop are either local to an iteration or are used for aggregation over the loop. And I assume that calls to x() in the calculation of diff have no side effects.
To do aggregation in parallel correctly and efficiently, you need to use an OpenMP loop with reduction clause. For sum_distribution the reduction operation is +, and for max_dist it's max. So, adding the following pragma in front of the outer loop should do the job:
#pragma omp parallel for reduction(+:sum_distribution) reduction(max:max_dist)
Note that max as a reduction operation has only been available since OpenMP 3.1. That is not particularly new, so most OpenMP-enabled compilers already support it, but not all do, or you might be using an older version. So it makes sense to consult the documentation for your compiler.
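For concreteness, the outer loop from the question with that pragma attached would look roughly like this (a sketch reusing the question's own names, so x, dist2, chosen_pts and ndx are assumed to be defined as above):
#pragma omp parallel for reduction(+:sum_distribution) reduction(max:max_dist)
for (int i = 0; i < x.n; ++i) {
    int example = dist2[i].second;
    float d2 = 0.0, diff;  // declared inside the loop body, hence private to each iteration
    for (int j = 0; j < x.d; ++j) {
        diff = x(example,j) - x(chosen_pts[ndx - 1],j);
        d2 += diff * diff;
    }
    if (d2 < dist2[i].first) {
        dist2[i].first = d2;  // each iteration only touches its own dist2[i]
    }
    if (dist2[i].first > max_dist) {
        max_dist = dist2[i].first;
    }
    sum_distribution += dist2[i].first;
}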