Generating random numbers in parallel with identical engines fails - c++

I am using the RNG provided by C++11 and I am also toying around with OpenMP. I have assigned an engine to each thread, and as a test I give the same seed to each engine. This means that I would expect both threads to yield exactly the same sequence of randomly generated numbers. Here is an MWE:
#include <iostream>
#include <random>
#include <vector> // needed for vector<double>
using namespace std;

uniform_real_distribution<double> uni(0, 1);
normal_distribution<double> nor(0, 1);

int main()
{
    #pragma omp parallel
    {
        mt19937 eng(0); //GIVE EACH THREAD ITS OWN ENGINE
        vector<double> vec;
        #pragma omp for
        for(int i=0; i<5; i++)
        {
            nor(eng);
            vec.push_back(uni(eng));
        }
        #pragma omp critical
        cout << vec[0] << endl;
    }
    return 0;
}
Most often I get the output 0.857946 0.857946, but a few times I get 0.857946 0.592845. How is the latter result possible, when the two threads have identical, uncorrelated engines?!

You have to put nor and uni inside the omp parallel region too. Like this:
#pragma omp parallel
{
    uniform_real_distribution<double> uni(0, 1);
    normal_distribution<double> nor(0, 1);
    mt19937 eng(0); //GIVE EACH THREAD ITS OWN ENGINE
    vector<double> vec;
Otherwise there is only one copy of each distribution, shared by all threads, when in fact every thread needs its own copy. Distributions are stateful objects (normal_distribution, for instance, typically caches every second generated value), so calling a shared one from several threads is a data race and can desynchronize the sequences.
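For reference, here is a complete corrected version of the MWE (a sketch; with identical seeds, each thread's first drawn value should now be the same):

#include <iostream>
#include <random>
#include <vector>
using namespace std;

int main()
{
    #pragma omp parallel
    {
        // Every stateful object is now thread-local.
        uniform_real_distribution<double> uni(0, 1);
        normal_distribution<double> nor(0, 1);
        mt19937 eng(0); // same seed in every thread, so identical sequences
        vector<double> vec;
        #pragma omp for
        for(int i=0; i<5; i++)
        {
            nor(eng);
            vec.push_back(uni(eng));
        }
        #pragma omp critical
        if(!vec.empty())          // guards against more threads than iterations
            cout << vec[0] << endl;
    }
    return 0;
}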
Updated to add: I now see that exactly the same problem is discussed in this stackoverflow thread.

Related

Thread safety of a static random number generator

I have a bunch of threads, each of which needs a thread-safe random number. Since in my real program threads are spawned and joined repeatedly, and I wouldn't like to create a random_device and mt19937 each time I enter a new parallel region that calls the same function, I made them static:
#include <iostream>
#include <random>
#include <omp.h>

void test(void) {
    static std::random_device rd;
    static std::mt19937 rng(rd());
    static std::uniform_int_distribution<int> uni(1, 1000);
    int x = uni(rng);
    #pragma omp critical
    std::cout << "thread " << omp_get_thread_num() << " | x = " << x << std::endl;
}

int main() {
    #pragma omp parallel num_threads(4)
    test();
}
I cannot declare them threadprivate because of Error C3057: dynamic initialization of 'threadprivate' symbols is not currently supported. Some sources say random_device and mt19937 are thread-safe, but I haven't managed to find any docs that would prove it.
Is this randomization thread-safe?
If not, which of the static objects can be left as static while preserving thread safety?
Here is a different approach. I keep a global seeding value so that the random_device is only used once. Since using it can be very slow, it is prudent to use it as rarely as possible.
Instead, we increment the seeding value per thread and also per use. That way we avoid the birthday paradox and minimize the thread-local state to a single integer.
#include <omp.h>

#include <algorithm>
#include <array>
#include <functional> // for std::ref
#include <random>

using seed_type = std::array<std::mt19937::result_type, std::mt19937::state_size>;

namespace {

seed_type init_seed()
{
    seed_type rtrn;
    std::random_device rdev;
    std::generate(rtrn.begin(), rtrn.end(), std::ref(rdev));
    return rtrn;
}

} // namespace

/**
 * Provides a process-global random seeding value
 *
 * Thread-safe (assuming the C++ compiler is standard-conforming).
 * The seed is initialized on first call.
 */
seed_type global_seed()
{
    static seed_type rtrn = init_seed();
    return rtrn;
}

/**
 * Creates a new random number generator
 *
 * Operation is thread-safe. Each thread will get its own RNG with a different
 * seed. Repeated calls within a thread will create different RNGs, too.
 */
std::mt19937 make_rng()
{
    static std::mt19937::result_type sequence_number = 0;
    #pragma omp threadprivate(sequence_number)
    seed_type seed = global_seed();
    static_assert(seed.size() >= 3);
    seed[0] += sequence_number++;
    seed[1] += static_cast<std::mt19937::result_type>(omp_get_thread_num());
    seed[2] += static_cast<std::mt19937::result_type>(omp_get_level());
    std::seed_seq sseq(seed.begin(), seed.end());
    return std::mt19937(sseq);
}
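A hypothetical usage sketch, rewriting the question's test() on top of make_rng() (it assumes the definitions above are in scope; the distribution and its range are placeholders mirroring the question):

#include <iostream>

void test(void) {
    // Each call yields a freshly seeded engine; no state is shared between threads.
    std::mt19937 rng = make_rng();
    std::uniform_int_distribution<int> uni(1, 1000);
    int x = uni(rng);
    #pragma omp critical
    std::cout << "thread " << omp_get_thread_num() << " | x = " << x << std::endl;
}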
See also this: How to make this code thread safe with openMP? Monte Carlo two-dimensional integration
For the approach of just incrementing the seeding value, see this:
https://www.johndcook.com/blog/2016/01/29/random-number-generator-seed-mistakes/
I still think threadprivate is the right approach, and you can obviate the dynamic-initialization problem by doing the assignment in a parallel region later.
#include <iostream>
#include <random>
#include <sstream>
#include <omp.h>
using namespace std;

static random_device rd;
static mt19937 rng;
#pragma omp threadprivate(rd)
#pragma omp threadprivate(rng)

int main() {
    #pragma omp parallel
    rng = mt19937(rd());

    #pragma omp parallel
    {
        stringstream res;
        uniform_int_distribution<int> uni(1, 100);
        res << "Thread " << omp_get_thread_num() << ": " << uni(rng) << "\n";
        cout << res.str();
    }
    return 0;
}
Btw, note the stringstream: with several threads writing to cout, output lines tend to get interleaved at the << operators, so each thread first builds its whole line and then writes it with a single call.

Is a race condition for parallel OpenMP threads reading the same shared data possible?

There is a piece of code:
#include <iostream>
#include <array>
#include <limits> // for numeric_limits
#include <random>
#include <omp.h>

class DBase
{
public:
    DBase()
    {
        delta=(xmax-xmin)/n;
        for(int i=0; i<n+1; ++i) x.at(i)=xmin+i*delta;
        y={1.0, 3.0, 9.0, 15.0, 20.0, 17.0, 13.0, 9.0, 5.0, 4.0, 1.0};
    }
    double GetXmax(){return xmax;}
    double interpolate(double xx)
    {
        int bin=xx/delta;
        if(bin<0 || bin>n-1) return 0.0;
        double res=y.at(bin)+(y.at(bin+1)-y.at(bin)) * (xx-x.at(bin))
                   / (x.at(bin+1) - x.at(bin));
        return res;
    }
private:
    static constexpr int n=10;
    double xmin=0.0;
    double xmax=10.0;
    double delta;
    std::array<double, n+1> x;
    std::array<double, n+1> y;
};

int main(int argc, char *argv[])
{
    DBase dbase;
    const int N=10000;
    std::array<double, N> rnd{0.0};
    std::array<double, N> src{0.0};
    std::array<double, N> res{0.0};
    unsigned seed = 1;
    std::default_random_engine generator(seed);
    for(int i=0; i<N; ++i) rnd.at(i)=
        std::generate_canonical<double,std::numeric_limits<double>::digits>(generator);
    #pragma omp parallel for
    for(int i=0; i<N; ++i)
    {
        src.at(i)=rnd.at(i) * dbase.GetXmax();
        res.at(i)=dbase.interpolate(rnd.at(i) * dbase.GetXmax());
    }
    for(int i=0; i<N; ++i) std::cout<<"("<<src.at(i)<<" , "<<res.at(i)
                                    <<") "<<std::endl;
    return 0;
}
It seems to work properly either with #pragma omp parallel for or without it (I checked the output). But I can't understand the following things:
1) Different parallel threads access the same arrays x and y of the object dbase of class DBase (I understand that OpenMP implicitly makes the dbase object shared, i.e. #pragma omp parallel for shared(dbase)). The threads do not write to these arrays, they only read them. But when they read, can there be a race condition on x and y or not? If not, how is it arranged that at any moment only one thread reads from x and y in interpolate(), so that the threads do not disturb each other? Or does each OpenMP thread hold a local copy of the dbase object and its x and y arrays (which would be equivalent to #pragma omp parallel for private(dbase))?
2) Should I write #pragma omp parallel for shared(dbase) in such code, or is #pragma omp parallel for enough?
3) I think that if I placed a single random number generator inside the for-loop, then to make it work properly (i.e. not to expose its internal state to a race condition) I should write:
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
    src.at(i)=rnd.at(i) * dbase.GetXmax();
    #pragma omp atomic
    std::generate_canonical<double,std::numeric_limits<double>::digits>(generator);
    res.at(i)=dbase.interpolate(rnd.at(i) * dbase.GetXmax());
}
The #pragma omp atomic would destroy the performance gain from #pragma omp parallel for, because it would make the threads wait for each other. So the only correct ways to use random numbers inside a parallel region are to give each thread its own generator (or seed), or to prepare all the needed random numbers before the for-loop. Is that correct?
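For illustration, here is a sketch of the "own generator for each thread" option as a drop-in replacement for the parallel loop in main() above (the per-thread seeding by thread number is simplistic and only for illustration):

#pragma omp parallel
{
    // One engine per thread; +1 avoids handing the engine a zero seed.
    std::default_random_engine gen(omp_get_thread_num() + 1);
    #pragma omp for
    for(int i=0; i<N; ++i)
    {
        double r = std::generate_canonical<double,
                       std::numeric_limits<double>::digits>(gen);
        src.at(i) = r * dbase.GetXmax();
        res.at(i) = dbase.interpolate(src.at(i));
    }
}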

How do pragma and omp make a difference in these two codes producing the same output?

Initially the value of ab is 10; then, after some delay created by a for loop, ab is set to 55 and printed in this code:
#include <iostream>
using namespace std;

int main()
{
    long j, i;
    int ab=10;
    for(i=0; i<1000000000; i++) ;
    ab=55;
    cout << "\n----------------\n";
    for(j=0; j<100; j++)
        cout << endl << ab;
    return 0;
}
The purpose of this code is the same, but I expected the value of ab to become 55 only after some delay, so that the second pragma block would first print 10 and only later 55 (multithreading). Instead, the second pragma block prints only after the delay created by the first for loop, and then prints only 55.
#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    long j, i;
    int ab=10;
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        #pragma omp single
        {
            for(i=0; i<1000000000; i++) ;
            ab=55;
        }
        #pragma omp barrier
        cout << "\n----------------\n";
        #pragma omp single
        {
            for(j=0; j<100; j++)
                cout << endl << ab;
        }
    }
    return 0;
}
So you want to "observe race conditions" by changing the value of a variable in a first region and printing the value from the second region.
There are a couple of things that prevent you from achieving this.
The first (and explicitly stated) one is the #pragma omp barrier. This OpenMP construct makes every thread running the #pragma omp parallel region wait until all threads in the team have arrived. This barrier forces the two threads to meet at that point, so by then ab will already have the value 55.
The second (and implicitly stated) one is the #pragma omp single: a single region ends with an implicit barrier unless a nowait clause is given, so the team of threads running the parallel region will wait until the single region has finished. Again, this means that ab will have the value 55 once the first region has finished.
To try to achieve what you want (and note the "try", because the outcome will vary from run to run, depending on several factors [OS thread scheduling, OpenMP thread scheduling, available HW resources...]), you can give this alternative version of yours a try:
#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    long j, i;
    int ab=10;
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            for(i=0; i<1000000000; i++) ;
            ab=55;
        }
        cout << "\n----------------\n";
        #pragma omp single
        {
            for(j=0; j<100; j++)
                cout << endl << ab;
        }
    }
    return 0;
}
BTW, rather than iterating for a long trip-count in your loops, you could use calls such as sleep/usleep.
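For example, a minimal sketch of the portable C++11 variant (std::this_thread::sleep_for) replacing the busy loop:

#include <chrono>
#include <thread>

int main()
{
    // Replaces the billion-iteration busy loop with a real two-second delay.
    std::this_thread::sleep_for(std::chrono::seconds(2));
    return 0;
}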

OpenMP vs C++11 threads

In the following example the C++11 threads take about 50 seconds to execute, but the OMP threads only 5 seconds. Any ideas why? (I can assure you it still holds true if you are doing real work instead of doNothing, or if you do it in a different order, etc.) I'm on a 16-core machine, too.
#include <iostream>
#include <omp.h>
#include <chrono>
#include <vector>
#include <thread>
using namespace std;

void doNothing() {}

double run(int algorithmToRun)
{
    auto startTime = std::chrono::system_clock::now();
    for(int j=1; j<100000; ++j)
    {
        if(algorithmToRun == 1)
        {
            vector<thread> threads;
            for(int i=0; i<16; i++)
            {
                threads.push_back(thread(doNothing));
            }
            for(auto& thread : threads) thread.join();
        }
        else if(algorithmToRun == 2)
        {
            #pragma omp parallel for num_threads(16)
            for(unsigned i=0; i<16; i++)
            {
                doNothing();
            }
        }
    }
    auto endTime = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = endTime - startTime;
    return elapsed_seconds.count(); // seconds as double (int would truncate)
}

int main()
{
    double cppt = run(1);
    double ompt = run(2);
    cout<<cppt<<endl;
    cout<<ompt<<endl;
    return 0;
}
OpenMP uses thread pools for its pragmas (see also here and here). Spinning up and tearing down threads is expensive. OpenMP avoids this overhead, so all it is doing is the actual work plus the minimal shared-memory shuttling of the execution state. In your std::thread code you are spinning up and tearing down a new set of 16 threads on every iteration.
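To see what that means in the question's own terms, here is a sketch (an illustration of the thread-pool idea, not the asker's code) in which the parallel region is entered once, so the 16 threads are created a single time and then reused:

#include <omp.h>

void doNothing() {}

int main()
{
    // The thread team is spawned once for the whole region...
    #pragma omp parallel num_threads(16)
    for (int j = 1; j < 100000; ++j)
    {
        // ...and this worksharing loop merely hands the 16 iterations
        // to the already-running threads on every pass.
        #pragma omp for
        for (int i = 0; i < 16; i++)
            doNothing();
    }
    return 0;
}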
I tried the code with a 100-iteration loop from
Choosing the right threading framework, and it took:
OpenMP 0.0727, Intel TBB 0.6759, and the C++ thread library 0.5962 milliseconds.
I also applied what AruisDante suggested:
void nested_loop(int max_i, int band)
{
    for (int i = 0; i < max_i; i++)
    {
        doNothing(band);
    }
}
...
else if (algorithmToRun == 5)
{
    thread bristle(nested_loop, max_i, band);
    bristle.join();
}
This code seems to take less time than your original C++11 thread section.

Duplicate values generated by mt19937

I am working with C++11's random library, and I have a small program that generates a coordinate pair x, y on a circle with unit radius. Here is the simple multithreaded program:
#include <iostream>
#include <fstream>
#include <cmath>  // for sqrt, cos, sin
#include <random>
#include <vector> // for vector<double>
using namespace std;

int main()
{
    const double PI = 3.1415;
    double angle, radius, X, Y;
    int i;
    vector<double> finalPositionX, finalPositionY;
    #pragma omp parallel
    {
        vector<double> positionX, positionY;
        mt19937 engine(0);
        uniform_real_distribution<> uniform(0, 1);
        normal_distribution<double> normal(0, 1);
        #pragma omp for private(angle, radius, X, Y)
        for(i=0; i<1000000; ++i)
        {
            angle = uniform(engine)*2.0*PI;
            radius = sqrt(uniform(engine));
            X = radius*cos(angle);
            Y = radius*sin(angle);
            positionX.push_back(X);
            positionY.push_back(Y);
        }
        #pragma omp barrier
        #pragma omp critical
        {
            // braces needed: without them only the first insert is protected
            finalPositionX.insert(finalPositionX.end(), positionX.begin(), positionX.end());
            finalPositionY.insert(finalPositionY.end(), positionY.begin(), positionY.end());
        }
    }
    ofstream output_data("positions.txt", ios::out);
    output_data.precision(9);
    for(unsigned long long temp_var=0; temp_var<(unsigned long long)finalPositionX.size(); temp_var++)
    {
        output_data << finalPositionX[temp_var]
                    << "\t\t\t\t"
                    << finalPositionY[temp_var]
                    << "\n";
    }
    output_data.close();
    return 0;
}
Question: Many of the x-coordinates appear twice (same with y-coordinates). I don't understand this, since the period of the mt19937 is much longer than 1.000.000. Does anyone have an idea of what is wrong here?
Note: I get the same behavior when I don't multithread the application, so the problem is not related to wrong multithreading.
EDIT As pointed out in one of the answers, I shouldn't use the same seed for both threads - but that is an error I made when formulating this question; in my real program I seed the threads differently.
Using the core part of your code, I wrote this imperfect test, but from what I can see the distribution is pretty uniform:
#include <iostream>
#include <fstream>
#include <cmath>  // for std::round
#include <random>
#include <map>
#include <vector>
#include <iomanip>
using namespace std;

int main()
{
    int i;
    vector<double> finalPositionX, finalPositionY;
    std::map<int, int> hist;
    vector<double> positionX, positionY;
    mt19937 engine(0);
    uniform_real_distribution<> uniform(0, 1);
    //normal_distribution<double> normal(0, 1);
    for(i=0; i<1000000; ++i)
    {
        double rnum = uniform(engine);
        ++hist[std::round(1000*rnum)];
    }
    for (auto p : hist) {
        std::cout << std::fixed << std::setprecision(1) << std::setw(2)
                  << p.first << ' ' << std::string(p.second/200, '*') << '\n';
    }
    return 0;
}
and, as others already said, it is not unexpected to see some values repeated. For the normal distribution, I used the following modification to rnum and hist to test that, and it looks good too:
double rnum = normal(engine);
++hist[std::round(10*rnum)];
As described in this article (and a later article by a Stack Overflow contributor), true randomness doesn't distribute perfectly.
[Images contrasting "good randomness" and "bad randomness" omitted.]
I really recommend reading the article, but to summarize it: an RNG has to be unpredictable, which implies that calling it 100 times must not perfectly fill a 10x10 grid.
First of all - just because you get the same number twice doesn't mean it isn't random. If you throw a die six times, would you expect six different results? See the birthday paradox. That being said - you are right that you shouldn't see too much repetition in this particular case.
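To put a rough number on it (an estimate, assuming output_data.precision(9) effectively quantizes the printed coordinates to about d = 10^9 distinct values): among n = 10^6 draws, the expected number of colliding pairs is roughly n(n-1)/(2d) ≈ 500, so hundreds of repeated coordinates are exactly what the birthday paradox predicts here.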
I'm not familiar with "#pragma omp parallel", but my guess is you are spawning multiple threads that all seed the mt19937 with the same seed (0). You should use different seeds for all threads - e.g. the thread id.
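A minimal sketch of that suggestion (the thread-id seed is simplistic and for illustration only; mixing in a per-run value such as the time would be more robust):

#include <omp.h>
#include <random>

int main()
{
    #pragma omp parallel
    {
        // One engine per thread, seeded with its own thread number,
        // so each thread produces a different sequence.
        std::mt19937 engine(omp_get_thread_num());
        std::uniform_real_distribution<> uniform(0, 1);
        double r = uniform(engine); // thread-local value, no shared state
        (void)r;                    // silence unused-variable warnings
    }
    return 0;
}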