Using same random number generator across multiple functions

Using same random number generator across multiple functions - c++

I am given to believe that random number generators (RNGs) should only be seeded once to ensure that the distribution of results is as intended.
I am writing a Monte Carlo simulation in C++ which consists of a main function ("A") calling another function ("B") several times, where a large quantity of random numbers is generated in B.
Currently, I am doing the following in B:
void B(){
std::array<int, std::mt19937::state_size> seed_data;
std::random_device r;
std::generate(seed_data.begin(), seed_data.end(), std::ref(r));
std::seed_seq seq(std::begin(seed_data), std::end(seed_data)); //perform warmup
std::mt19937 eng(seq);
std::uniform_real_distribution<> randU(0,1);
double myRandNum = randU(eng);
//do stuff with my random number
}
As you can see, I am creating a new random number generator each time I call the function B. This, as far as I can see, is a waste of time - the RNG can still generate a lot more random numbers!
I have experimented with making "eng" extern but this generates an error using g++:
error: ‘eng’ has both ‘extern’ and initializer extern std::mt19937 eng(seq);
How can I make the random number generator "global" so that I can use it many times?

Be careful with one-size-fits-all rules. 'Globals are evil' is one of them. A RNG should be a global object. (Caveat: each thread should get its own RNG!) I tend to wrap mine in a singleton map, but simply seeding and warming one up at the beginning of main() suffices:
std::mt19937 rng;
int main()
{
// (seed global object 'rng' here)
rng.dispose(10000); // warm it up
For your usage scenario (generating multiple RNs per call), you shouldn't have any problem creating a local distribution for each function call.
One other thing: std::random_device is not your friend -- it can throw at any time for all kinds of stupid reasons. Make sure to wrap it up in a try..catch block. Or, and I recommend this, use a platform specific way to get a true random number. (On Windows, use the Crypto API. On everything else, use /dev/urandom/.)
Hope this helps.

You shouldn't need to pass anything or declare anything, as the interaction between mt19937 and uniform_real_distribution is through globals.
std::array<int, std::mt19937::state_size> seed_data;
std::random_device r;
std::generate(seed_data.begin(), seed_data.end(), std::ref(r));
std::seed_seq seq(std::begin(seed_data), std::end(seed_data)); //perform warmup
std::mt19937 eng(seq);
B()
...
void B()
{
std::uniform_real_distribution<> randU(0,1);
...

Related

how is this for random number generation

so in my custom C++ library i wanted to create 2 reliable and simple functions, i for quickly generating a random number, and the other for doing the same thing but within a specified range, i came up with the following but would like to know if they are good:
int random()
{
std::random_device rd;
std::uniform_int_distribution<int>dist;
return dist(rd);
}
int randomWithinRange(int range)
{
std::random_device rd;
std::uniform_int_distribution<int>dist;
return (dist(rd) % range);
}
the functions work and compile alright, i just want to know how good they would be for long term inclusion

How to Understand C++11 random number generator

Those three lines of generating random number looks a bit tricky. It is hard to always remember those lines. Could someone please shed some light on it to make it easier to understand?
#include <random>
#include <iostream>
int main()
{
std::random_device rd; //1st line: Will be used to obtain a seed for the random number engine
std::mt19937 gen(rd()); //2nd line: Standard mersenne_twister_engine seeded with rd()
std::uniform_int_distribution<> dis(1, 6);
for (int n=0; n<10; ++n)
std::cout << dis(gen) << ' '; //3rd line: Use dis to transform the random unsigned int generated by gen into an int in [1, 6]
std::cout << '\n';
}
Here are some questions I can think of:
1st line of code:
random_device is a class as described by the documentation random_device, so this line means declaring a object rd? If yes, why in 2nd line we pass rd() to construct mt19937 instead of using the object rd (without parentheses)?
3rd line of code:
Why do call class uniform_int_distribution<> object dis()? Is dis() a function? Why shall we pass in gen object into dis()?

random_device is slow but genuinely random, it's used to generate the 'seed' for the random number sequence.
mt19937 is fast but only 'pseudo random'. It needs a 'seed' to start generating a sequence of numbers. That seed can be random (as in your example) so you get a different sequence of random numbers each time. But it could be a constant, so you get the same sequence of numbers each time.
uniform_int_distribution is a way of mapping random numbers (which could have any values) to the numbers you're actually interested in, in this case a uniform distribution of integers from 1 to 6.
As is often the case with OO programming, this code is about division of responsibilities. Each class contributes a small piece to the overall requirement (the generation of dice rolls). If you wanted to do something different it's easy because you've got all the pieces in front of you.
If this is too much then all you need to do is write a function to capture the overall effect, for instance
int dice_roll()
{
static std::random_device rd;
static std::mt19937 gen(rd());
static std::uniform_int_distribution<> dis(1, 6);
return dis(gen);
}
dis is an example of a function object or functor. It's an object which overloads operator() so it can be called as if it was a function.

std::random_device rd; // create access to truly random numbers
std::mt19937 gen{rd()}; // create pseudo random generator.
// initialize its seed to truly random number.
std::uniform_int_distribution<> dis{1, 6}; // define distribution
...
auto x = dis(gen); // generate pseudo random number form `gen`
// and transform its result to desired distribution `dis`.

Why is this random number generator generating same numbers?

The first one works, but the second one always returns the same value. Why would this happen and how am I supposed to fix this?
int main() {
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_real_distribution<> dis(0, 1);
for(int i = 0; i < 10; i++) {
std::cout << dis(gen) << std::endl;
}return 0;
}
The one dosen't work:
double generateRandomNumber() {
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_real_distribution<> dis(0, 1);
return dis(gen);
}
int main() {
for(int i = 0; i < 10; i++) {
std::cout << generateRandomNumber() << std::endl;
}return 0;
}

What platform are you working on? std::random_device is allowed to be a pseudo-RNG if hardware or OS functionality to generate random numbers doesn't exist. It might initialize using the current time, in which case the intervals at which you're calling it might be too close apart for the 'current time' to take on another value.
Nevertheless, as mentioned in the comments, it is not meant to be used this way. A simple fix will be to declare rd and gen as static. A proper fix would be to move the initialization of the RNG out of the function that requires the random numbers, so it can also be used by other functions that require random numbers.

The first one uses the same generator for all the numbers, the second creates a new generator for each number.

Let's compare the differences between your two cases and see why this happening.
Case 1:
int main() {
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_real_distribution<> dis(0, 1);
for(int i = 0; i < 10; i++) {
std::cout << dis(gen) << std::endl;
}return 0;
}
In your first case the program executes the main function and the first thing that happens here is that you are creating an instance of a std::random_device, std::mt19337 and a std::uniform_real_distribution<> on the stack that belong to main()'s scope. Your mersenne twister gen is initialized once with the result from your random device rd. You have also initialized your distribution dis to have the range of values from 0 to 1. These only exist once per each run of your application.
Now you create a for loop that starts at index 0 and increments to 9 and on each iteration you are displaying the resulting value to cout by using the distribution dis's operator()() passing to it your already seeded generation gen. Each time on this loop dis(gen) is going to produce a different value because gen was already seeded only once.
Case 2:
double generateRandomNumber() {
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_real_distribution<> dis(0, 1);
return dis(gen);
}
int main() {
for(int i = 0; i < 10; i++) {
std::cout << generateRandomNumber() << std::endl;
}return 0;
}
In this version of the code let's see what's similar and what's different. Here the program executes and enters the main() function. This time the first thing it encounters is a for loop from 0 to 9 similar as in the main above however this loop is the first thing on main's stack. Then there is a call to cout to display results from a user defined function named generateRandomNumber(). This function is called a total of 10 times and each time you iterate through the for loop this function has its own stack memory that will be wound and unwound or created and destroyed.
Now let's jump execution into this user defined function named generateRandomNumber().
The code looks almost exactly the same as it did before when it was in main() directly but these variables live in generateRandomNumber()'s stack and have the life time of its scope instead. These variables will be created and destroyed each time this function goes in and out of scope. The other difference here is that this function also returns dis(gen).
Note: I'm not 100% sure if this will return a copy or not or if the compiler will end up doing some kind of optimizations, but returning by value usually results in a copy.
Finally when then function generateRandomNumber() returns and just before it goes completely out of scope where std::uniform_real_distribrution<>'s operator()() is being called and it goes into it's own stack and scope before returning back to main generateRandomNumber() ever so briefly and then back to main.
-Visualizing The Differences-
As you can see these two programs are quite different, very different to be exact. If you want more visual proof of them being different you can use any available online compiler to enter each program to where it shows you that program in assembly and compare the two assembly versions to see their ultimate differences.
Another way to visualize the difference between these two programs is not only to see their assembly equivalents but to step through each program line by line with a debugger and keep an eye on the stack calls and the winding and unwinding of them and keep an eye of all values as they become initialized, returned and destroyed.
-Assessment and Reasoning-
The reason the first one works as expected is because your random device, your generator and your distribution all have the life time of main and your generator is seeded only once with your random device and you only have one distribution that you are using each time in the for loop.
In your second version main doesn't know anything about any of that and all it knows is that it is going through a for loop and sending returned data from a user function to cout. Now each time it goes through the for loop this function is being called and it's stack as I said is being created and destroyed each time so all if its variables are being created and destroyed. So in this instance you are creating and destroying 10: rd, gen(rd()), and dis(0,1)s instances.
-Conclusion-
There is more to this than what I have described above and the other part that pertains to the behavior of your random number generators is what was mentioned by user Kane in his statement to you from his comment to your question:
From en.cppreference.com/w/cpp/numeric/random/random_device:
"std::random_device may be implemented in terms of an
implementation-defined pseudo-random number engine [...].
In this case each std::random_device object may generate
the same number sequence."
Each time you create and destroy you are seeding the generator over and over again with a new random_device however if your particular machine or OS doesn't have support for using random_device it can either end up using some arbitrary value as its seed value or it could end up using the system clock to generate a seed value.
So let's say it does end up using the system clock, the execution of main()'s for loop happens so fast that all of the work that is being done by the 10 calls to generateRandomNumber() is already executed before a few milliseconds have passed. So here the delta time is minimally small and negligible that it is generating the same seed value on each pass as well as it is generating the same values from the distributions.

Note that std::mt19937 gen(rd()) is very problematic. See this question, which says:
rd() returns a single unsigned int. This has at least 16 bits and probably 32. That's not enough to seed [this generator's huge state].
Using std::mt19937 gen(rd());gen() (seeding with 32 bits and looking at the first output) doesn't give a good output distribution. 7 and 13 can never be the first output. Two seeds produce 0. Twelve seeds produce 1226181350. (Link)
std::random_device can be, and sometimes is, implemented as a simple PRNG with a fixed seed. It might therefore produce the same sequence on every run. (Link)
Furthermore, random_device's approach to generating "nondeterministic" random numbers is "implementation-defined", and random_device allows the implementation to "employ a random number engine" if it can't generate "nondeterministic" random numbers due to "implementation limitations" ([rand.device]). (For example, under the C++ standard, an implementation might implement random_device using timestamps from the system clock, or using fast-moving cycle counters, since both are nondeterministic.)
An application should not blindly call random_device's generator (rd()) without also, at a minimum, calling the entropy() method, which gives an estimate of the implementation's entropy in bits.

Mersenne twister warm up vs. reproducibility

In my current C++11 project I need to perform M simulations. For each simulation m = 1, ..., M, I randomly generate a data set by using a std::mt19937 object, constructed as follows:
std::mt19937 generator(m);
DatasetFactory dsf(generator);
According to https://stackoverflow.com/a/15509942/1849221 and https://stackoverflow.com/a/14924350/1849221, the Mersenne Twister PRNG benefits from a warm up phase, which is currently absent in my code. I report for convenience the proposed snippet of code:
#include <random>
std::mt19937 get_prng() {
std::uint_least32_t seed_data[std::mt19937::state_size];
std::random_device r;
std::generate_n(seed_data, std::mt19937::state_size, std::ref(r));
std::seed_seq q(std::begin(seed_data), std::end(seed_data));
return std::mt19937{q};
}
The problem in my case is that I need reproducibility of results, i.e., among different executions, for each simulation, the data set has to be the same. That's the reason why in my current solution I use the current simulation to seed the Mersenne Twister PRNG. It seems to me that the usage of std::random_device prevents data from being the same (AFAIK, this is the exact purpose of std::random_device).
EDIT: by different executions I mean re-launching the executable.
How can I introduce the afore-mentioned warm up phase in my code without affecting reproducibility? Thanks.
Possible solution #1
Here's a tentative implementation based on the second proposal by #SteveJessop
#include <random>
std::mt19937 get_generator(unsigned int seed) {
std::minstd_rand0 lc_generator(seed);
std::uint_least32_t seed_data[std::mt19937::state_size];
std::generate_n(seed_data, std::mt19937::state_size, std::ref(lc_generator));
std::seed_seq q(std::begin(seed_data), std::end(seed_data));
return std::mt19937{q};
}
Possible solution #2
Here's a tentative implementation based on the joint contribution by #SteveJassop and #AndréNeve. The sha256 function is adapted from https://stackoverflow.com/a/10632725/1849221
#include <openssl/sha.h>
#include <sstream>
#include <iomanip>
#include <random>
std::string sha256(const std::string str) {
unsigned char hash[SHA256_DIGEST_LENGTH];
SHA256_CTX sha256;
SHA256_Init(&sha256);
SHA256_Update(&sha256, str.c_str(), str.size());
SHA256_Final(hash, &sha256);
std::stringstream ss;
for(int i = 0; i < SHA256_DIGEST_LENGTH; i++)
ss << std::hex << std::setw(2) << std::setfill('0') << (int)hash[i];
return ss.str();
}
std::mt19937 get_generator(unsigned int seed) {
std::string seed_str = sha256(std::to_string(seed));
std::seed_seq q(seed_str.begin(), seed_str.end());
return std::mt19937{q};
}
Compile with: -I/opt/ssl/include/ -L/opt/ssl/lib/ -lcrypto

Two options:
Follow the proposal you have, but instead of using std::random_device r; to generate your seed sequence for MT, use a different PRNG seeded with m. Choose one that doesn't suffer like MT does from needing a warmup when used with small seed data: I suspect an LCG will probably do. For massive overkill, you could even use a PRNG based on a secure hash. This is a lot like "key stretching" in cryptography, if you've heard of that. You could in fact use a standard key stretching algorithm, but you're using it to generate a long seed sequence rather than large key material.
Continue using m to seed your MT, but discard a large constant amount of data before starting the simulation. That is to say, ignore the advice to use a strong seed and instead run the MT long enough for it to reach a decent internal state. I don't know off-hand how much data you need to discard, but I expect the internet does.

I think that you only need to store the initial seed (in your case the std::uint_least32_t seed_data[std::mt19937::state_size] array) and the number n of warmup steps you made (eg. using discard(n) as mentioned) for each run/simulation you wish to reproduce.
With this information, you can always create a new MT instance, seed it with the previous seed_data and run it for the same n warmup steps. This will generate the same sequence of values onwards since the MT instance will have the same inner state when the warmup ends.
When you mention the std::random_device affecting reproducibility, I believe that in your code it is simply being used to generate the seed data. If you were using it as the source of random numbers itself, then you would not be able to have reproducible results. Since you are using it only to generate the seed there shouldn't be any problem. You just can't generate a new seed every time if you want to reproduce values!
From the definition of std::random_device:
"std::random_device is a uniformly-distributed integer random number generator that produces non-deterministic random numbers."
So if it's not deterministic you cannot reproduce the sequence of values produced by it. That being said, use it simply to generate good random seeds only to store them afterwards for the re-runs.
Hope this helps
EDIT :
After discussing with #SteveJessop, we arrived at the conclusion that a simple hash of the dataset (or part of it) would be sufficient to be used as a decent seed for the purpose you need. This allows for a deterministic way of generating the same seeds every time you run your simulations. As mentioned by #Steve, you will have to guarantee that the size of the hash isn't too small compared with std::mt19937::state_size. If it is too small, then you can concatenate the hashes of m, m+M, m+2M, ... until you have enough data, as he suggested.
I am posting the updated answer here as the idea of using a hash was mine, but I will upvote #SteveJessop's answer because he contributed to it.

A comment on one of the answers you link to indicates:
Coincidentally, the default C++11 seed_seq is the Mersenne Twister warmup sequence (although the existing implementations, libc++'s mt19937 for example, use a simpler warmup when a single-value seed is provided)
So you may be able to use your current fixed seeds with std::seed_seq to do the warm-up for you.
std::mt19937 get_prng(int seed) {
std::seed_seq q{seed, maybe, some, extra, fixed, values};
return std::mt19937{q};
}

Does std::mt19937 require warmup?

I've read that many pseudo-random number generators require many samples in ordered to be "warmed up". Is that the case when using std::random_device to seed std::mt19937, or can we expect that it's ready after construction? The code in question:
#include <random>
std::random_device rd;
std::mt19937 gen(rd());

Mersenne Twister is a shift-register based pRNG (pseudo-random number generator) and is therefore subject to bad seeds with long runs of 0s or 1s that lead to relatively predictable results until the internal state is mixed up enough.
However the constructor which takes a single value uses a complicated function on that seed value which is designed to minimize the likelihood of producing such 'bad' states. There's a second way to initialize mt19937 where you directly set the internal state, via an object conforming to the SeedSequence concept. It's this second method of initialization where you may need to be concerned about choosing a 'good' state or doing warmup.
The standard includes an object conforming to the SeedSequence concept, called seed_seq. seed_seq takes an arbitrary number of input seed values, and then performs certain operations on these values in order to produce a sequence of different values suitable for directly setting the internal state of a pRNG.
Here's an example of loading up a seed sequence with enough random data to fill the entire std::mt19937 state:
std::array<int, 624> seed_data;
std::random_device r;
std::generate_n(seed_data.data(), seed_data.size(), std::ref(r));
std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
std::mt19937 eng(seq);
This ensures that the entire state is randomized. Also, each engine specifies how much data it reads from the seed_sequence so you may want to read the docs to find that info for whatever engine you use.
Although here I load up the seed_seq entirely from std::random_device, seed_seq is specified such that just a few numbers that aren't particularly random should work well. For example:
std::seed_seq seq{1, 2, 3, 4, 5};
std::mt19937 eng(seq);
In the comments below Cubbi indicates that seed_seq works by performing a warmup sequence for you.
Here's what should be your 'default' for seeding:
std::random_device r;
std::seed_seq seed{r(), r(), r(), r(), r(), r(), r(), r()};
std::mt19937 rng(seed);

If you seed with just one 32-bit value, all you will ever get is one of the same 2^32 trajectories through state-space. If you use a PRNG with KiBs of state, then you should probably seed all of it. As described in the comments to #bames63' answer, using std::seed_seq is probably not a good idea if you want to init the whole state with random numbers. Sadly, std::random_device does not conform to the SeedSequence concept, but you can write a wrapper that does:
#include <random>
#include <iostream>
#include <algorithm>
#include <functional>
class random_device_wrapper {
std::random_device *m_dev;
public:
using result_type = std::random_device::result_type;
explicit random_device_wrapper(std::random_device &dev) : m_dev(&dev) {}
template <typename RandomAccessIterator>
void generate(RandomAccessIterator first, RandomAccessIterator last) {
std::generate(first, last, std::ref(*m_dev));
}
};
int main() {
auto rd = std::random_device{};
auto seedseq = random_device_wrapper{rd};
auto mt = std::mt19937{seedseq};
for (auto i = 100; i; --i)
std::cout << mt() << std::endl;
}
This works at least until you enable concepts. Depending on whether your compiler knows about SeedSequence as a C++20 concept, it may fail to work because we're supplying only the missing generate() method, nothing else. In duck-typed template programming, that code is sufficient, though, because the PRNG does not store the seed sequence object.

I believe there are situations where MT can be seeded "poorly" which results in non-optimal sequences. If I remember correctly, seeding with all zeroes is one such case. I would recommend you try to use the WELL generators if this is a serious issue for you. I believe they are more flexible - the quality of the seed does not matter as much. (Perhaps to answer your question more directly: it's probably more efficient to focus on seeding well as opposed to seeding poorly then trying to generate a bunch of samples to get the generator to an optimal state.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js