In my current C++11 project I need to perform M simulations. For each simulation m = 1, ..., M, I randomly generate a data set by using a std::mt19937 object, constructed as follows:
std::mt19937 generator(m);
DatasetFactory dsf(generator);
According to https://stackoverflow.com/a/15509942/1849221 and https://stackoverflow.com/a/14924350/1849221, the Mersenne Twister PRNG benefits from a warm-up phase, which is currently absent in my code. For convenience, I report the proposed snippet of code:
#include <algorithm>
#include <cstdint>
#include <functional>
#include <iterator>
#include <random>

std::mt19937 get_prng() {
    std::uint_least32_t seed_data[std::mt19937::state_size];
    std::random_device r;
    std::generate_n(seed_data, std::mt19937::state_size, std::ref(r));
    std::seed_seq q(std::begin(seed_data), std::end(seed_data));
    return std::mt19937{q};
}
The problem in my case is that I need reproducibility of results, i.e., among different executions, for each simulation, the data set has to be the same. That's the reason why, in my current solution, I use the current simulation index m to seed the Mersenne Twister PRNG. It seems to me that the usage of std::random_device prevents data from being the same (AFAIK, this is the exact purpose of std::random_device).
EDIT: by different executions I mean re-launching the executable.
How can I introduce the afore-mentioned warm up phase in my code without affecting reproducibility? Thanks.
Possible solution #1
Here's a tentative implementation based on the first proposal by @SteveJessop (use an LCG seeded with m to generate the seed sequence):
#include <algorithm>
#include <cstdint>
#include <functional>
#include <iterator>
#include <random>

std::mt19937 get_generator(unsigned int seed) {
    std::minstd_rand0 lc_generator(seed);
    std::uint_least32_t seed_data[std::mt19937::state_size];
    std::generate_n(seed_data, std::mt19937::state_size, std::ref(lc_generator));
    std::seed_seq q(std::begin(seed_data), std::end(seed_data));
    return std::mt19937{q};
}
Possible solution #2
Here's a tentative implementation based on the joint contribution by @SteveJessop and @AndréNeve. The sha256 function is adapted from https://stackoverflow.com/a/10632725/1849221
#include <openssl/sha.h>
#include <sstream>
#include <iomanip>
#include <random>
#include <string>

std::string sha256(const std::string& str) {
    unsigned char hash[SHA256_DIGEST_LENGTH];
    SHA256_CTX sha256;
    SHA256_Init(&sha256);
    SHA256_Update(&sha256, str.c_str(), str.size());
    SHA256_Final(hash, &sha256);
    std::stringstream ss;
    for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
        ss << std::hex << std::setw(2) << std::setfill('0') << (int)hash[i];
    return ss.str();
}

std::mt19937 get_generator(unsigned int seed) {
    std::string seed_str = sha256(std::to_string(seed));
    std::seed_seq q(seed_str.begin(), seed_str.end());
    return std::mt19937{q};
}
Compile with: -I/opt/ssl/include/ -L/opt/ssl/lib/ -lcrypto
Two options:
Follow the proposal you have, but instead of using std::random_device r; to generate your seed sequence for MT, use a different PRNG seeded with m. Choose one that doesn't suffer like MT does from needing a warmup when used with small seed data: I suspect an LCG will probably do. For massive overkill, you could even use a PRNG based on a secure hash. This is a lot like "key stretching" in cryptography, if you've heard of that. You could in fact use a standard key stretching algorithm, but you're using it to generate a long seed sequence rather than large key material.
Continue using m to seed your MT, but discard a large constant amount of data before starting the simulation. That is to say, ignore the advice to use a strong seed and instead run the MT long enough for it to reach a decent internal state. I don't know off-hand how much data you need to discard, but I expect the internet does.
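A minimal sketch of option 2, assuming you pick the warm-up length yourself (the 10000 below is an arbitrary placeholder, not a recommendation):

#include <random>

std::mt19937 get_generator_discarding(unsigned int seed) {
    std::mt19937 generator(seed);   // seed directly with the simulation index m
    generator.discard(10000);       // skip a fixed, constant number of draws
    return generator;               // reproducible: same m -> same state
}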
I think that you only need to store the initial seed (in your case the std::uint_least32_t seed_data[std::mt19937::state_size] array) and the number n of warm-up steps you made (e.g., using discard(n) as mentioned) for each run/simulation you wish to reproduce.
With this information, you can always create a new MT instance, seed it with the previous seed_data and run it for the same n warmup steps. This will generate the same sequence of values onwards since the MT instance will have the same inner state when the warmup ends.
When you mention the std::random_device affecting reproducibility, I believe that in your code it is simply being used to generate the seed data. If you were using it as the source of random numbers itself, then you would not be able to have reproducible results. Since you are using it only to generate the seed there shouldn't be any problem. You just can't generate a new seed every time if you want to reproduce values!
From the definition of std::random_device:
"std::random_device is a uniformly-distributed integer random number generator that produces non-deterministic random numbers."
So if it's not deterministic, you cannot reproduce the sequence of values it produces. That being said, use it simply to generate good random seeds, and store them afterwards for the re-runs.
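A compact sketch of that idea (the function and variable names are mine): persist the seed values and the warm-up count n, then rebuild an identical engine on a later run.

#include <cstdint>
#include <random>
#include <vector>

std::mt19937 rebuild_engine(const std::vector<std::uint32_t>& stored_seed, unsigned long long n) {
    std::seed_seq seq(stored_seed.begin(), stored_seed.end());
    std::mt19937 eng(seq);
    eng.discard(n);   // same seed + same warm-up -> same internal state -> same sequence
    return eng;
}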
Hope this helps
EDIT:
After discussing with @SteveJessop, we arrived at the conclusion that a simple hash of the dataset (or part of it) would be sufficient to be used as a decent seed for the purpose you need. This allows for a deterministic way of generating the same seeds every time you run your simulations. As mentioned by @Steve, you will have to guarantee that the size of the hash isn't too small compared with std::mt19937::state_size. If it is too small, then you can concatenate the hashes of m, m+M, m+2M, ... until you have enough data, as he suggested.
I am posting the updated answer here as the idea of using a hash was mine, but I will upvote @SteveJessop's answer because he contributed to it.
A comment on one of the answers you link to indicates:
Coincidentally, the default C++11 seed_seq is the Mersenne Twister warmup sequence (although the existing implementations, libc++'s mt19937 for example, use a simpler warmup when a single-value seed is provided)
So you may be able to use your current fixed seeds with std::seed_seq to do the warm-up for you.
std::mt19937 get_prng(int seed) {
    // 'maybe, some, extra, fixed, values' stand in for whatever constants you choose
    std::seed_seq q{seed, maybe, some, extra, fixed, values};
    return std::mt19937{q};
}
Related
So in c++ I'm using mt19937 engine and the uniform_int_distribution in my random number generator like so:
#include <random>
#include <time.h>

int get_random(int lwr_lm, int upper_lm) {
    std::mt19937 mt(time(nullptr));
    std::uniform_int_distribution<int> dist(lwr_lm, upper_lm);
    return dist(mt);
}
What I need is to alter the above generator such that there is a cache that contains a number of integers I need to be excluded when I use the above generator over and over again.
How do I alter the above such that I can achieve this?
There are many ways to do it. A simple way would be to maintain your "excluded numbers" in a std::set and after each generation of a random number, check whether it is in the set and if it is then generate a new random number - repeat until you get a number that was not in the set, then return that.
Btw, while distributions are cheap to construct, engines are not. You don't want to re-construct your mt19937 every time the function is called, but instead create it once and then re-use it. You probably also want to use a better seed than the current time in seconds.
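A rough sketch combining both points (names and the seeding choice are mine): construct the engine once, keep the excluded numbers in a std::set, and redraw until the value is not excluded.

#include <random>
#include <set>

int get_random_excluding(int lwr_lm, int upper_lm, const std::set<int>& excluded) {
    static std::mt19937 mt(std::random_device{}());   // constructed once, reused
    std::uniform_int_distribution<int> dist(lwr_lm, upper_lm);
    int r = dist(mt);
    while (excluded.count(r) != 0)                     // redraw while excluded
        r = dist(mt);                                  // note: loops forever if every value is excluded
    return r;
}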
Are you 1) attempting to sample without replacement in the discrete interval? Or is it 2) a patchy distribution over the interval that stays fairly constant?
If 1) you could use std::shuffle as per the answer here How to sample without replacement using c++ uniform_int_distribution
If 2) you could use std::discrete_distribution (element 0 corresponding to lwr_lm) and weight zero the numbers you don't want. Obviously the memory requirements are linear in upper_lm-lwr_lm, so this might not be practical if the range is large. A small sketch follows below.
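A small sketch of option 2 (the range and the excluded values 3 and 7 are arbitrary examples): give unwanted values zero weight so they are never drawn.

#include <iostream>
#include <random>
#include <vector>

int main() {
    int lwr_lm = 0, upper_lm = 10;
    std::vector<double> weights(upper_lm - lwr_lm + 1, 1.0); // one weight per value
    weights[3 - lwr_lm] = 0.0;  // exclude 3
    weights[7 - lwr_lm] = 0.0;  // exclude 7
    std::mt19937 mt(std::random_device{}());
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    for (int i = 0; i < 5; ++i)
        std::cout << lwr_lm + dist(mt) << '\n';  // never prints 3 or 7
}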
I would propose two similar solutions for the problem. They are based upon probabilistic structures, and provide you with the answer "potentially in cache" or "definitely not in cache". There are false positives but no false negatives.
Perfect hash function. There are many implementations, including one from GNU. Basically, run it on the set of cached values, and use the generated perfect hash function to reject sampled values. You don't even need to maintain a hash table, just a function mapping a random value to an integer index. If the index is in the hash range, reject the number. Being perfect means you need only one call to check, and the result will tell you whether the number is in the set. There are potential collisions, so false positives are possible.
Bloom filter. Same idea: build the filter with however many bits per cached item you're willing to spare, and a quick check will give you either a "possibly in the cache" answer or a clear negative. You can trade answer precision for memory and vice versa. False positives are possible.
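A toy Bloom-filter sketch under assumed parameters (2^16 bits, two hashes derived from std::hash with different salts); it answers only "possibly present" or "definitely not present":

#include <bitset>
#include <cstddef>
#include <functional>

class BloomFilter {
    static constexpr std::size_t kBits = 1u << 16;   // assumed filter size
    std::bitset<kBits> bits_;
    static std::size_t h(int x, std::size_t salt) {
        // two cheap hashes obtained by salting the same input differently
        auto v = static_cast<unsigned long long>(static_cast<unsigned int>(x));
        return std::hash<unsigned long long>{}((v << 16) ^ salt) % kBits;
    }
public:
    void insert(int x) {
        bits_.set(h(x, 0x9e3779b9));
        bits_.set(h(x, 0x85ebca6b));
    }
    bool maybe_contains(int x) const {               // false positives possible
        return bits_.test(h(x, 0x9e3779b9)) && bits_.test(h(x, 0x85ebca6b));
    }
};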
As mentioned by @virgesmith in his answer, there might be a better solution depending on your problem.
The method that keeps a cache and uses it to filter future draws is inefficient for a large range (wiki).
Here I write a naive example with a different method, but you will be limited by your memory. You pick a random number from a buffer and remove it for the next iteration.
#include <random>
#include <time.h>
#include <iostream>
#include <numeric>
#include <vector>

int get_random(int lwr_lm, int upper_lm, std::vector<int> &buff, std::mt19937 &mt) {
    if (buff.size() > 0) {
        std::uniform_int_distribution<int> dist(0, buff.size() - 1);
        int tmp_index = dist(mt);
        int tmp_value = buff[tmp_index];
        buff.erase(buff.begin() + tmp_index);
        return tmp_value;
    } else {
        return 0;
    }
}

int main() {
    // lower and upper limit for random distribution
    int lower = 0;
    int upper = 10;
    // Random generator
    std::mt19937 mt(time(nullptr));
    // Buffer to filter and avoid duplication; it contains every integer between the lower and upper limit
    std::vector<int> my_buffer(upper - lower);
    std::iota(my_buffer.begin(), my_buffer.end(), lower);
    for (int i = 0; i < 20; ++i) {
        std::cout << get_random(lower, upper, my_buffer, mt) << std::endl;
    }
    return 0;
}
Edit: a cleaner solution here
It might not be the prettiest solution, but what's stopping you from maintaining that cache and checking existence before returning? It will slow down for large caches though.
#include <random>
#include <time.h>
#include <set>

std::set<int> cache;

int get_random(int lwr_lm, int upper_lm) {
    std::mt19937 mt(time(nullptr));
    std::uniform_int_distribution<int> dist(lwr_lm, upper_lm);
    auto r = dist(mt);
    while (cache.find(r) != cache.end())
        r = dist(mt);
    return r;
}
I am given to believe that random number generators (RNGs) should only be seeded once to ensure that the distribution of results is as intended.
I am writing a Monte Carlo simulation in C++ which consists of a main function ("A") calling another function ("B") several times, where a large quantity of random numbers is generated in B.
Currently, I am doing the following in B:
void B() {
    std::array<int, std::mt19937::state_size> seed_data;
    std::random_device r;
    std::generate(seed_data.begin(), seed_data.end(), std::ref(r));
    std::seed_seq seq(std::begin(seed_data), std::end(seed_data)); // perform warmup
    std::mt19937 eng(seq);
    std::uniform_real_distribution<> randU(0, 1);
    double myRandNum = randU(eng);
    // do stuff with my random number
}
As you can see, I am creating a new random number generator each time I call the function B. This, as far as I can see, is a waste of time - the RNG can still generate a lot more random numbers!
I have experimented with making "eng" extern but this generates an error using g++:
error: ‘eng’ has both ‘extern’ and initializer
    extern std::mt19937 eng(seq);
How can I make the random number generator "global" so that I can use it many times?
Be careful with one-size-fits-all rules. 'Globals are evil' is one of them. A RNG should be a global object. (Caveat: each thread should get its own RNG!) I tend to wrap mine in a singleton map, but simply seeding and warming one up at the beginning of main() suffices:
std::mt19937 rng;

int main()
{
    // (seed global object 'rng' here)
    rng.discard(10000); // warm it up
    // ...
For your usage scenario (generating multiple RNs per call), you shouldn't have any problem creating a local distribution for each function call.
One other thing: std::random_device is not your friend -- it can throw at any time for all kinds of stupid reasons. Make sure to wrap it up in a try..catch block. Or, and I recommend this, use a platform specific way to get a true random number. (On Windows, use the Crypto API. On everything else, use /dev/urandom.)
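For the POSIX case, a minimal sketch of pulling a seed from /dev/urandom (error handling deliberately thin):

#include <cstdint>
#include <fstream>

std::uint32_t urandom_seed() {
    std::uint32_t s = 0;
    std::ifstream f("/dev/urandom", std::ios::binary);
    f.read(reinterpret_cast<char*>(&s), sizeof s);   // leaves s == 0 if the read fails
    return s;
}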
Hope this helps.
You shouldn't need to pass anything or declare anything extern: define the engine (and its seeding) once at namespace scope and let B use that global object directly.
std::mt19937 eng; // global engine, usable from B()

void B()
{
    std::uniform_real_distribution<> randU(0, 1);
    double myRandNum = randU(eng);
    // ...
}

int main()
{
    std::array<int, std::mt19937::state_size> seed_data;
    std::random_device r;
    std::generate(seed_data.begin(), seed_data.end(), std::ref(r));
    std::seed_seq seq(std::begin(seed_data), std::end(seed_data)); // perform warmup
    eng.seed(seq);
    B();
    // ...
}
C++11 introduced the header <random> with declarations for random number engines and random distributions. That's great - time to replace those uses of rand() which is often problematic in various ways. However, it seems far from obvious how to replace
srand(n);
// ...
int r = rand();
Based on the declarations it seems a uniform distribution can be built something like this:
std::default_random_engine engine;
engine.seed(n);
std::uniform_int_distribution<> distribution;
auto rand = [&](){ return distribution(engine); };
This approach seems rather involved and is surely something I won't remember unlike the use of srand() and rand(). I'm aware of N4531 but even that still seems to be quite involved.
Is there a reasonably simple way to replace srand() and rand()?
Is there a reasonably simple way to replace srand() and rand()?
Full disclosure: I don't like rand(). It's bad, and it's very easily abused.
The C++11 random library fills in a void that has been lacking for a long, long time. The problem with high quality random libraries is that they're oftentimes hard to use. The C++11 <random> library represents a huge step forward in this regard. A few lines of code and I have a very nice generator that behaves very nicely and that easily generates random variates from many different distributions.
Given the above, my answer to you is a bit heretical. If rand() is good enough for your needs, use it. As bad as rand() is (and it is bad), removing it would represent a huge break with the C language. Just make sure that the badness of rand() truly is good enough for your needs.
C++14 didn't deprecate rand(); it only deprecated functions in the C++ library that use rand(). While C++17 might deprecate rand(), it won't delete it. That means you have several more years before rand() disappears. The odds are high that you will have retired or switched to a different language by the time the C++ committee finally does delete rand() from the C++ standard library.
I'm creating random inputs to benchmark different implementations of std::sort() using something along the lines of std::vector<int> v(size); std::generate(v.begin(), v.end(), std::rand);
You don't need a cryptographically secure PRNG for that. You don't even need Mersenne Twister. In this particular case, rand() probably is good enough for your needs.
Update
There is a nice simple replacement for rand() and srand() in the C++11 random library: std::minstd_rand.
#include <random>
#include <iostream>

int main()
{
    std::minstd_rand simple_rand;
    // Use simple_rand.seed() instead of srand():
    simple_rand.seed(42);
    // Use simple_rand() instead of rand():
    for (int ii = 0; ii < 10; ++ii)
    {
        std::cout << simple_rand() << '\n';
    }
}
The function std::minstd_rand::operator()() returns a std::uint_fast32_t. However, the algorithm restricts the result to between 1 and 2^31 - 2, inclusive. This means the result will always convert safely to a std::int_fast32_t (or to an int if int is at least 32 bits long).
How about randutils by Melissa O'Neill of pcg-random.org?
From the introductory blog post:
randutils::mt19937_rng rng;

std::cout << "Greetings from Office #" << rng.uniform(1,17)
          << " (where we think PI = " << rng.uniform(3.1,3.2) << ")\n\n"
          << "Our office morale is " << rng.uniform('A','D') << " grade\n";
Assuming you want the behavior of the C-style rand and srand functions, including their quirkiness, but with good random, this is the closest I could get.
#include <random>
#include <cstdlib>  // RAND_MAX (might be removed soon?)
#include <climits>  // INT_MAX (use as replacement?)

namespace replacement
{

  constexpr int rand_max {
#ifdef RAND_MAX
    RAND_MAX
#else
    INT_MAX
#endif
  };

  namespace detail
  {

    inline std::default_random_engine&
    get_engine() noexcept
    {
      // Seeding with 1 is silly, but required behavior
      static thread_local auto rndeng = std::default_random_engine(1);
      return rndeng;
    }

    inline std::uniform_int_distribution<int>&
    get_distribution() noexcept
    {
      static thread_local auto rnddst = std::uniform_int_distribution<int> {0, rand_max};
      return rnddst;
    }

  }  // namespace detail

  inline int
  rand() noexcept
  {
    return detail::get_distribution()(detail::get_engine());
  }

  inline void
  srand(const unsigned seed) noexcept
  {
    detail::get_engine().seed(seed);
    detail::get_distribution().reset();
  }

  inline void
  srand()
  {
    std::random_device rnddev {};
    srand(rnddev());
  }

}  // namespace replacement
The replacement::* functions can be used exactly like their std::* counterparts from <cstdlib>. I have added a srand overload that takes no arguments and seeds the engine with a “real” random number obtained from a std::random_device. How “real” that randomness will be is of course implementation defined.
The engine and the distribution are held as thread_local static instances so they carry state across multiple calls but still allow different threads to observe predictable sequences. (It's also a performance gain because you don't need to re-construct the engine or use locks and potentially trash other people's caches.)
I've used std::default_random_engine because you did but I don't like it very much. The Mersenne Twister engines (std::mt19937 and std::mt19937_64) produce much better “randomness” and, surprisingly, have also been observed to be faster. I don't think that any compliant program must rely on std::rand being implemented using any specific kind of pseudo random engine. (And even if it did, implementations are free to define std::default_random_engine to whatever they like so you'd have to use something like std::minstd_rand to be sure.)
Abusing the fact that engines return values directly
All engines defined in <random> have an operator()() that can be used to retrieve the next generated value, as well as advancing the internal state of the engine.
std::mt19937 rand (seed); // or an engine of your choosing

for (int i = 0; i < 10; ++i) {
    unsigned int x = rand ();
    std::cout << x << std::endl;
}
It shall however be noted that all engines return a value of some unsigned integral type, meaning that the value may not fit in a signed integral type (converting an out-of-range unsigned value to a signed type is implementation-defined and can yield surprising results).
If you are fine with using unsigned values everywhere you retrieve a new value, the above is an easy way to replace usage of std::srand + std::rand.
Note: Using what has been described above might lead to some values having a higher chance of being returned than others, because the engine's result_type does not have a max value that is an even multiple of the highest value that can be stored in the destination type.
If you have not worried about this in the past — when using something like rand()%low+high — you should not worry about it now.
Note: You will need to make sure that the std::engine-type::result_type is at least as large as your desired range of values (std::mt19937::result_type is uint_fast32_t).
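Regarding the bias mentioned two notes above: if it does matter, the usual fix is to run the engine's output through a distribution instead of taking a modulus. A minimal sketch:

#include <iostream>
#include <random>

int main() {
    std::mt19937 eng(42);
    std::uniform_int_distribution<int> dist(0, 99); // unbiased over [0, 99]
    std::cout << dist(eng) << '\n';                 // instead of eng() % 100
}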
If you only need to seed the engine once
There is no need to first default-construct a std::default_random_engine (which is just a typedef for some engine chosen by the implementation), and later assigning a seed to it; this could be done all at once by using the appropriate constructor of the random-engine.
std::random-engine-type engine (seed);
If you however need to re-seed the engine, using std::random-engine::seed is the way to do it.
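For instance, with std::mt19937 (the seed values here are arbitrary):

#include <random>

int main() {
    std::mt19937 engine(12345);   // construct and seed in one step
    engine.seed(67890);           // re-seed later only if you really need to
}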
If all else fails: create a helper function
Even if the code you have posted looks slightly complicated, you are only meant to write it once.
If you find yourself tempted to just copy+paste what you have written to several places in your code, it is recommended, as always when copy+pasting, to introduce a helper function instead.
Intentionally left blank, see other posts for example implementations.
You can create a simple function like this:
#include <random>
#include <iostream>

int modernRand(int n) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, n);
    return dis(gen);
}
And later use it like this:
int myRandValue = modernRand(n);
As mentioned here
I've read that many pseudo-random number generators require many samples in ordered to be "warmed up". Is that the case when using std::random_device to seed std::mt19937, or can we expect that it's ready after construction? The code in question:
#include <random>
std::random_device rd;
std::mt19937 gen(rd());
Mersenne Twister is a shift-register based pRNG (pseudo-random number generator) and is therefore subject to bad seeds with long runs of 0s or 1s that lead to relatively predictable results until the internal state is mixed up enough.
However the constructor which takes a single value uses a complicated function on that seed value which is designed to minimize the likelihood of producing such 'bad' states. There's a second way to initialize mt19937 where you directly set the internal state, via an object conforming to the SeedSequence concept. It's this second method of initialization where you may need to be concerned about choosing a 'good' state or doing warmup.
The standard includes an object conforming to the SeedSequence concept, called seed_seq. seed_seq takes an arbitrary number of input seed values, and then performs certain operations on these values in order to produce a sequence of different values suitable for directly setting the internal state of a pRNG.
Here's an example of loading up a seed sequence with enough random data to fill the entire std::mt19937 state:
std::array<int, 624> seed_data;
std::random_device r;
std::generate_n(seed_data.data(), seed_data.size(), std::ref(r));
std::seed_seq seq(std::begin(seed_data), std::end(seed_data));
std::mt19937 eng(seq);
This ensures that the entire state is randomized. Also, each engine specifies how much data it reads from the seed_sequence so you may want to read the docs to find that info for whatever engine you use.
Although here I load up the seed_seq entirely from std::random_device, seed_seq is specified such that just a few numbers that aren't particularly random should work well. For example:
std::seed_seq seq{1, 2, 3, 4, 5};
std::mt19937 eng(seq);
In the comments below Cubbi indicates that seed_seq works by performing a warmup sequence for you.
Here's what should be your 'default' for seeding:
std::random_device r;
std::seed_seq seed{r(), r(), r(), r(), r(), r(), r(), r()};
std::mt19937 rng(seed);
If you seed with just one 32-bit value, all you will ever get is one of the same 2^32 trajectories through state-space. If you use a PRNG with KiBs of state, then you should probably seed all of it. As described in the comments to @bames53's answer, using std::seed_seq is probably not a good idea if you want to init the whole state with random numbers. Sadly, std::random_device does not conform to the SeedSequence concept, but you can write a wrapper that does:
#include <random>
#include <iostream>
#include <algorithm>
#include <functional>
class random_device_wrapper {
    std::random_device *m_dev;

public:
    using result_type = std::random_device::result_type;

    explicit random_device_wrapper(std::random_device &dev) : m_dev(&dev) {}

    template <typename RandomAccessIterator>
    void generate(RandomAccessIterator first, RandomAccessIterator last) {
        std::generate(first, last, std::ref(*m_dev));
    }
};

int main() {
    auto rd = std::random_device{};
    auto seedseq = random_device_wrapper{rd};
    auto mt = std::mt19937{seedseq};
    for (auto i = 100; i; --i)
        std::cout << mt() << std::endl;
}
This works at least until you enable concepts. Depending on whether your compiler knows about SeedSequence as a C++20 concept, it may fail to work because we're supplying only the missing generate() method, nothing else. In duck-typed template programming, that code is sufficient, though, because the PRNG does not store the seed sequence object.
I believe there are situations where MT can be seeded "poorly" which results in non-optimal sequences. If I remember correctly, seeding with all zeroes is one such case. I would recommend you try to use the WELL generators if this is a serious issue for you. I believe they are more flexible - the quality of the seed does not matter as much. (Perhaps to answer your question more directly: it's probably more efficient to focus on seeding well as opposed to seeding poorly then trying to generate a bunch of samples to get the generator to an optimal state.)
I'd like to be able to do something like this (obviously not valid C++):
rng1 = srand(x)
rng2 = srand(y)
//rng1 and rng2 give me two separate sequences of random numbers
//based on the srand seed
rng1.rand()
rng2.rand()
Is there any way to do something like this in C++? For example in Java I can create two java.util.Random objects with the seeds I want. It seems there is only a single global random number generator in C++. I'm sure there are libraries that provide this functionality, but anyway to do it with just C++?
Use rand_r.
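rand_r is POSIX rather than standard C++; it keeps its state in a caller-supplied unsigned int, so each "generator" is just a separate seed variable. A minimal sketch, assuming a POSIX system:

#include <stdlib.h>   // rand_r (POSIX, not standard C++)
#include <iostream>

int main() {
    unsigned int state1 = 1234, state2 = 5678;  // one seed variable per stream
    std::cout << rand_r(&state1) << ' ' << rand_r(&state2) << '\n';
}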
In TR1 (and C++0x), you could use the tr1/random header. It should be built-in for modern C++ compilers (at least for g++ and MSVC).
#include <tr1/random>
// use #include <random> on MSVC
#include <iostream>

int main() {
    std::tr1::mt19937 m1 (1234); // <-- seed x
    std::tr1::mt19937 m2 (5678); // <-- seed y
    std::tr1::uniform_int<int> distr(0, 100);
    for (int i = 0; i < 20; ++ i) {
        std::cout << distr(m1) << "," << distr(m2) << std::endl;
    }
    return 0;
}
You could also use Boost.Random.
More technical documentation here.
I just want to point out that using different seeds may not give you statistically independent random sequences. mt19937 is an exception. Two mt19937 objects initialized with different seeds will give you more or less (depending on whom you ask) statistically independent sequences with very high probability (there is a small chance that the sequences will overlap). Java's standard RNG is notoriously bad. There are plenty of implementations of mt19937 for Java, which should be preferred over the stock RNG.
As @James McNellis said, I can't imagine why you would do that or what you would gain. Describe what effect you would like to achieve.
For whatever reason, the following generators interfere with each other. I need two independent generators for a task and need to reconstruct the streams. I haven't dug into the code but the std::tr1 and C++11 generators seem to share states in common. Adding m2 below changes what m1 will deliver.
std::tr1::mt19937 m1 (1234); // <-- seed x
std::tr1::mt19937 m2 (5678); // <-- seed y
I had to build my own to ensure independence.