There seems to be some mythology around the use of mt19937, specifically that once it is seeded, 'some' number of bits produced by the generator should be ignored so that the output is as close as possible to pseudo-random.
Examples of code I've seen are as follows:
boost::mt19937::result_type seed = 1234567; //taken from some entropy pool etc
boost::mt19937 prng(seed);
boost::uniform_int<unsigned int> dist(0,1000);
boost::variate_generator<boost::mt19937&,boost::uniform_int<unsigned int> > generator(prng,dist);
unsigned int skip = 10000;
while (skip--)
{
generator();
}
//now begin using for real.
....
My questions are:
Is this myth or is there some truth to it all?
If it's something viable, how many bits should be ignored? The numbers I've seen seem to be arbitrary.
The paper referenced in the first comment, Mersenne Twister with improved initialization, isn't by just some guy; its author is one of the two co-authors of the paper upon which the Boost implementation is based.
The problem with using a single 32-bit integer (4 bytes) as a seed for this generator is that the internal state of the generator is 2496 bytes, according to the Boost documentation. It's not too surprising that such a small seed takes a while to propagate into the rest of the internal state of the generator, particularly since the Twister isn't meant to be cryptographically secure.
To address your concern about needing to run the generator for a while to get started, you want the alternate (and explicit) constructor.
template<typename SeedSeq> explicit mersenne_twister_engine(SeedSeq &);
This is the spirit of the third comment, where you initialize with something longer than a single integer. The sequence provided comes from some generator. To use an entropy pool, write a generator as an adapter from an entropy pool, returning values from the pool as needed.
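A minimal sketch of seeding the whole state rather than a single integer. It uses the C++11 standard-library counterparts (std::mt19937, std::seed_seq) instead of the Boost types discussed above, and std::random_device stands in for the entropy pool. The helper names engine_from_words and gather_entropy are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: build an engine from a block of entropy words.
// The seed sequence must be a named object, because the engine's
// constructor takes it by non-const reference.
std::mt19937 engine_from_words(const std::vector<std::uint32_t>& words) {
    std::seed_seq seq(words.begin(), words.end());
    return std::mt19937(seq);
}

// Hypothetical adapter: pull n 32-bit words from the platform entropy source.
// By default it gathers one word per element of the engine state (624).
std::vector<std::uint32_t> gather_entropy(std::size_t n = std::mt19937::state_size) {
    std::random_device rd;
    std::vector<std::uint32_t> words(n);
    for (auto& w : words) {
        w = static_cast<std::uint32_t>(rd());
    }
    return words;
}
```

Usage would be along the lines of `std::mt19937 prng = engine_from_words(gather_entropy());`. With the state filled this way, there is no obvious reason to discard initial outputs.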
Related
I'm using std::mt19937 to produce deterministic random numbers. I'd like to pass it to functions so I can control their source of randomness. I could do int foo(std::mt19937& rng);, but I want to call foo and bar in parallel, so that won't work. Even if I put the generation function behind a mutex (so each call to operator() did std::lock_guard lock(mutex); return rng();), calling foo and bar in parallel wouldn't be deterministic due to the race on the mutex.
I feel like conceptually I should be able to do this:
auto fooRNG = std::mt19937(rng()); // Seed a RNG with the output of `rng`.
auto barRNG = std::mt19937(rng());
parallel_invoke([&] { fooResult = foo(fooRNG); },
[&] { barResult = bar(barRNG); });
where I "fork" rng into two new ones with different seeds. Since fooRNG and barRNG are seeded deterministically, they should be random and independent.
Is this general gist viable?
Is this particular implementation sufficient (I doubt it)?
Extended question: Suppose I want to call baz(int n, std::mt19937&) massively in parallel over a range of indexed values, something like
auto seed = rng();
parallel_for(range(0, 1 << 20),
[&](int i) {
auto thisRNG = std::mt19937(seed ^ i); // Deterministically set up RNGs in parallel?
baz(i, thisRNG);
});
something like that should work, right? That is, provided we give it enough bits of state?
Update:
Looking into std::seed_seq, it looks(?) like it's designed to turn not-so-random seeds into high-quality seeds: How to properly initialize a C++11 std::seed_seq
So maybe what I want is something like
std::mt19937 fork(std::mt19937& rng) {
return std::mt19937(std::seed_seq({rng()}));
}
or more generally:
//! Represents a state that can be used to generate multiple
//! distinct deterministic child std::mt19937 instances.
class rng_fork {
std::mt19937::result_type m_seed;
public:
rng_fork(std::mt19937& rng) : m_seed(rng()) {}
// Copy is explicit b/c I think it's a correctness footgun:
explicit rng_fork(const rng_fork&) = default;
//! Make the ith fork: a deterministic but well-seeded
//! RNG based off the internal seed and the given index:
std::mt19937 ith_fork(std::mt19937::result_type index) const {
return std::mt19937(std::seed_seq({m_seed, index}));
}
};
then the initial examples would become
auto fooRNG = fork(rng);
auto barRNG = fork(rng);
parallel_invoke([&] { fooResult = foo(fooRNG); },
[&] { barResult = bar(barRNG); });
and
auto fork_point = rng_fork{rng};
parallel_for(range(0, 1 << 20),
[&](int i) {
auto thisRNG = fork_point.ith_fork(i); // Deterministically set up a RNG in parallel.
baz(i, thisRNG);
});
Is that correct usage of std::seed_seq?
I am aware of 3 ways to seed multiple parallel pseudo random number generators (PRNGs):
First option
Given a seed, initialize the first instance of the PRNG with seed, the second with seed+1, etc. The thing to be aware of here is that the state of the PRNGs will be initially very close in case the seed is not hashed. Some PRNGs will take a long time to diverge. See e.g. this blog post for more information.
For std::mt19937 specifically, however, this was never an issue in my tests because the initial seed is not taken as is but instead gets "mangled/hashed" (compare the documentation of the result_type constructor). So it seems to be a viable option in practice.
However, notice that there are some potential pitfalls when seeding a Mersenne Twister (which has an internal state of 624 32-bit integers) with a single 32 bit integer. For example, the first number can never be 7 or 13. See this blog post for more information. But if you do not rely on the randomness of only the first few drawn numbers but draw a more reasonable number of numbers from each PRNG, it is probably fine.
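A minimal sketch of the first option, assuming a 32-bit base seed. The helper name engines_from_consecutive_seeds is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: engine i is seeded with base_seed + i.
// std::mt19937 mixes the single seed value into its 624-word state,
// so neighbouring seeds do not yield neighbouring states.
std::vector<std::mt19937> engines_from_consecutive_seeds(std::uint32_t base_seed,
                                                         std::size_t count) {
    std::vector<std::mt19937> engines;
    engines.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        // Unsigned addition wraps around on overflow, which is harmless for a seed.
        const std::uint32_t seed = base_seed + static_cast<std::uint32_t>(i);
        engines.push_back(std::mt19937(seed));
    }
    return engines;
}
```

Each worker then takes its own element of the returned vector, so no locking is needed and the result does not depend on scheduling.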
Second option
Without std::seed_seq:
Seed one "parent" PRNG. Then, to initialize N parallel PRNGs, draw N random numbers and use them as seeds. This is your initial idea where you draw 2 random numbers rng() and initialize the two std::mt19937:
std::mt19937 & rng = ...;
auto fooRNG = std::mt19937(rng()); // Seed a RNG with the output of `rng`.
auto barRNG = std::mt19937(rng());
The major issue to look out for here is the birthday problem. It essentially states that drawing the same number twice is more likely than you'd intuitively think. Given a type of PRNG that has a value range of b (i.e. b different values can appear in its output), the probability p(t) of drawing the same number twice when drawing t numbers can be estimated as:
p(t) ~ t^2 / (2b) for t^2 << b
(compare this post). If we stretch the estimate "a bit", just to show the basic issue:
For a PRNG producing a 16 bit integer, we have b=2^16. Drawing 256 numbers results in a 50% chance to draw the same number twice according to that formula. For a 32 bit PRNG (such as std::mt19937) we need to draw 65536 numbers, and for a 64 bit integer PRNG we need to draw ~4e9 numbers to reach the 50%. Of course, this is all an estimate, so you want to draw several orders of magnitude fewer numbers than that. Also see this blog post for more information.
In case of seeding the parallel std::mt19937 instances with this method (32 bit output and input!), that means you probably do not want to draw more than a hundred or so random numbers. Otherwise, you have a realistic chance of drawing the same number twice. Of course, you could ensure that you do not draw the same seed twice by keeping a list of already used seeds. Or use std::mt19937_64.
Additionally, there are still the potential pitfalls mentioned above regarding the seeding of a Mersenne Twister with 32 bit numbers.
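The birthday estimate above can be evaluated directly. This is a small sketch with hypothetical function names. The second function uses the common exponential refinement 1 - exp(-t^2 / (2b)), which is an assumption added here and not part of the formula quoted above.

```cpp
#include <cmath>

// First-order estimate from the text: p(t) ~ t^2 / (2b), with b = 2^bits.
// Only meaningful while t^2 is much smaller than b.
double collision_estimate(double draws, int bits) {
    const double b = std::ldexp(1.0, bits);  // 2^bits as a double
    return draws * draws / (2.0 * b);
}

// Assumption (not from the text): the usual exponential form of the
// birthday bound, which stays below 1 for large t.
double collision_estimate_exp(double draws, int bits) {
    return 1.0 - std::exp(-collision_estimate(draws, bits));
}
```

For example, collision_estimate(100, 32) is on the order of 1e-6, which is why about a hundred 32-bit seeds is a comfortable number, while collision_estimate(65536, 32) already reaches 0.5.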
With seed sequence:
The idea of std::seed_seq is to take some numbers, "mix them" and then provide them as input to the PRNG so that it can initialize its state. Since the 32 bit Mersenne Twister has a state of 624 32-bit integers, you should provide that many numbers to the seed sequence for theoretically optimal results. That way you get b=2^(624*32), meaning that you avoid the birthday problem for all practical purposes.
But in your example
std::mt19937 fork(std::mt19937& rng) {
return std::mt19937(std::seed_seq({rng()}));
}
you provide only a single 32 bit integer. This effectively means that you hash that 32 bit number before putting it into std::mt19937. So you do not gain anything regarding the birthday problem. And the additional hashing is unnecessary because std::mt19937 already does something like this.
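A hedged sketch of a fork that feeds the seed sequence one word per element of the child's state, as suggested above. The name fork_full_state is hypothetical. Note that the seed sequence is a named variable: the engine constructor takes it by non-const reference, so passing a temporary std::seed_seq as in the snippet quoted above should not compile.

```cpp
#include <array>
#include <cstdint>
#include <random>

// Hypothetical helper: derive a child engine from 624 outputs of the parent.
// The parent is advanced by 624 draws per fork.
std::mt19937 fork_full_state(std::mt19937& parent) {
    std::array<std::uint32_t, std::mt19937::state_size> words;
    for (auto& w : words) {
        w = static_cast<std::uint32_t>(parent());
    }
    std::seed_seq seq(words.begin(), words.end());  // named lvalue, not a temporary
    return std::mt19937(seq);
}
```

Because the forks are taken sequentially from the parent before any parallel work starts, the children are deterministic for a given parent seed. Whether streams derived this way are independent enough for a given application is not something this sketch establishes.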
std::seed_seq itself is somewhat flawed, see this blog post. But I guess for practical purposes it does not really matter. A supposedly better alternative exists, but I have no experience with it.
Third option
Some PRNG algorithms such as PCG or xoshiro256++ allow jumping over a large number of random numbers quickly. For example, xoshiro256++ has a period of (2^256)-1 before it repeats itself. It allows jumping ahead by 2^128 (or alternatively 2^192) numbers. So the idea would be that the first PRNG is seeded, then you create a copy of it and jump ahead by 2^128 numbers, then create a copy of that second one and jump ahead again by 2^128, etc. So each instance works in a slice of length 2^128 from the total range of 2^256. The slices are stochastically independent. This elegantly bypasses the problems with the above methods.
The standard PRNGs do have a discard(z) method to jump z values ahead. However, it is not guaranteed that the jumping will be fast. I don't know whether std::mt19937 implements fast jumping in all standard library implementations. (As far as I know, the Mersenne Twister algorithm itself does allow this in principle.)
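A minimal sketch of the same slicing idea using discard(z) on std::mt19937. The helper name split_by_discard is hypothetical, and the sketch assumes that the stride is small enough for a possibly linear-time discard to be acceptable.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: stream i starts i * stride outputs into the
// sequence of `base`. Streams do not overlap as long as each consumer
// draws fewer than `stride` values.
std::vector<std::mt19937> split_by_discard(std::mt19937 base,
                                           std::size_t streams,
                                           unsigned long long stride) {
    std::vector<std::mt19937> result;
    result.reserve(streams);
    for (std::size_t i = 0; i < streams; ++i) {
        result.push_back(base);  // copy of the current position
        base.discard(stride);    // may cost O(stride) in a given implementation
    }
    return result;
}
```

For example, `split_by_discard(std::mt19937(seed), 8, 1000000)` gives eight engines, each with a million values to itself.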
Additional note
I found PRNGs to be surprisingly difficult to use "right". It really depends on the use case how careful you need to be and what method to choose. Think about the worst thing that could happen in your case if something goes wrong, and invest a corresponding amount of time in researching the topic.
For ordinary scientific simulations where you require only a few dozen or so parallel instances of std::mt19937, I'd guess that the first and second option (without seed sequence) are both viable. But if you need several hundred or even more, you should think more carefully.
Inspired by this and similar questions, I want to learn how the mt19937 pseudo-random number generator in C++11 behaves when it is seeded with the same input on two separate machines.
In other words, say we have the following code;
std::mt19937 gen{ourSeed};
std::uniform_int_distribution<int> dist{0, 10000};
int randNumber = dist(gen);
If we try this code on different machines at different times, will we get the same sequence of randNumber values or a different sequence each time?
And in either case, why is this the case?
A further question:
Regardless of the seed, will this code generate random numbers indefinitely? I mean, for example, if we use this block of code in a program that runs for months without stopping, will there be any problem with the generation of the numbers or with their uniformity?
The generator will generate the same values.
The distributions may not, at least with different compilers or library versions. The standard did not specify their behaviour to that level of detail. If you want stability between compilers and library versions, you have to roll your own distribution.
Barring library/compiler changes, that will return the same values in the same sequence. But if you care, write your own distribution.
...
All PRNGs have patterns and periods. mt19937 is named after its period of 2^19937-1, which is unlikely to be a problem. But other patterns can develop. MT PRNGs are robust against many statistical tests, but they are not cryptographically secure PRNGs.
So it being a problem if you run for months will depend on specific details of what you'd find to be a problem. However, mt19937 is going to be a better PRNG than anything you are likely to write yourself. But assume attackers can predict its future behaviour from past evidence.
Regardless of the seed, will this code generate random numbers indefinitely? I mean, for example, if we use this block of code in a program that runs for months without stopping, will there be any problem with the generation of the numbers or with their uniformity?
The RNGs we deal with in standard C++ are called pseudo-random RNGs. By definition, such an RNG is a purely computational device with a multi-bit state (you could think of the state as a large bit vector) and three functions:
state seed2state(seed);
state next_state(state);
uint(32|64)_t state2output(state);
and that is it. Obviously, the state has finite size, 19937 bits in the case of MT19937, so the total number of states is 2^19937, and thus the MT19937 next_state() function is periodic, with a period of no more than 2^19937. This number is really HUGE, and most likely more than enough for a typical simulation.
But the output is at most 64 bits, so the output space is 2^64. It means that during a long run any particular output appears quite a few times. What matters is when not only some 64-bit number appears again, but also the number after that, and the one after that, and so on: this is when you know the RNG period has been reached.
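The three-function model above can be made concrete with a toy generator. This is an illustrative sketch only: a 4-bit linear congruential state with a 2-bit output, chosen so that the period can be counted by hand. All names and constants are hypothetical and have nothing to do with MT19937 itself.

```cpp
#include <cstdint>

// Toy state: only the low 4 bits are used, so there are 16 possible states.
struct toy_state {
    std::uint8_t bits;
};

toy_state seed2state(std::uint8_t seed) {
    return toy_state{static_cast<std::uint8_t>(seed & 0x0F)};
}

// Transition x -> (5x + 3) mod 16, which visits all 16 states before repeating.
toy_state next_state(toy_state s) {
    return toy_state{static_cast<std::uint8_t>((5 * s.bits + 3) & 0x0F)};
}

// Output is only the top 2 bits of the state, so just 4 distinct outputs exist.
std::uint8_t state2output(toy_state s) {
    return static_cast<std::uint8_t>(s.bits >> 2);
}

// Number of transitions until the starting state comes back.
int state_period(std::uint8_t seed) {
    const toy_state start = seed2state(seed);
    toy_state s = next_state(start);
    int steps = 1;
    while (s.bits != start.bits) {
        s = next_state(s);
        ++steps;
    }
    return steps;
}
```

Here each of the 4 output values shows up 4 times per cycle, yet the sequence as a whole only repeats after 16 steps, which is the distinction between a repeated output and a completed period.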
If we try this code on different machines at different times, will we get the same sequence of randNumber values or a different sequence each time?
Generators are defined rather strictly, and you'll get the same bit stream. For example for MT19937 from C++ standard (https://timsong-cpp.github.io/cppwp/rand)
class mersenne_twister_engine {
...
static constexpr result_type default_seed = 5489u;
...
and the function seed2state is described as (https://timsong-cpp.github.io/cppwp/rand#eng.mers-6)
Effects: Constructs a mersenne_twister_engine object. Sets X_{-n} to value mod 2^w. Then, iteratively for i = 1-n, ..., -1, sets X_i to ...
The function next_state is described as well, together with a test value at the 10000th invocation. The standard says (https://timsong-cpp.github.io/cppwp/rand#predef-3)
using mt19937 = mersenne_twister_engine<uint_fast32_t,32,624,397,31,0x9908b0df,11,0xffffffff,7,0x9d2c5680,15,0xefc60000,18,1812433253>;
Required behavior: The 10000th consecutive invocation of a default-constructed object of type mt19937 shall produce the value 4123659995.
The big four compilers (GCC, Clang, VC++, Intel C++) I used all produced the same MT19937 output.
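The required behavior quoted above is easy to check on any given toolchain. A minimal sketch, with a hypothetical function name:

```cpp
#include <random>

// Returns the 10000th output of a default-constructed std::mt19937.
std::mt19937::result_type ten_thousandth_default_output() {
    std::mt19937 gen;    // default-constructed, i.e. seeded with default_seed (5489)
    gen.discard(9999);   // skip the first 9999 outputs
    return gen();        // the 10000th invocation
}
```

A conforming implementation should return 4123659995 here, so comparing against that value makes a cheap portability check.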
Distributions, on the other hand, are not specified that well, and therefore vary between compilers and libraries. If you need portable distributions, you either roll your own or use something from Boost or similar libraries.
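As a sketch of what "rolling your own" could look like for a uniform integer, here is one possible approach based on rejection sampling over the raw 32-bit outputs. The function name portable_uniform is hypothetical, and this is not the algorithm used by any particular standard library.

```cpp
#include <cstdint>
#include <random>

// Uniform integer in [lo, hi] (requires lo <= hi), built only from the
// engine's raw outputs, which the standard pins down exactly.
std::uint32_t portable_uniform(std::mt19937& gen, std::uint32_t lo, std::uint32_t hi) {
    const std::uint64_t range = static_cast<std::uint64_t>(hi) - lo + 1;  // 1 .. 2^32
    const std::uint64_t total = std::uint64_t{1} << 32;                   // number of raw outputs
    const std::uint64_t limit = total - (total % range);  // largest multiple of range <= 2^32
    std::uint64_t x;
    do {
        x = static_cast<std::uint32_t>(gen());  // raw 32-bit draw
    } while (x >= limit);                       // reject the biased tail
    return lo + static_cast<std::uint32_t>(x % range);
}
```

Since it depends only on the engine's output and on fixed-width integer arithmetic, the same seed should give the same values across compilers, at the cost of an occasional extra draw.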
Any pseudo RNG which takes a seed will give you the same sequence for the same seed every time, on every machine. This happens since the generator is just a (complex) mathematical function, and has nothing actually random about it. Most times when you want to randomize, you take the seed from the system clock, which constantly changes so each run will be different.
It is useful to have the same sequence in computer games for example when you have a randomly generated world and want to generate the exact same one, or to avoid people cheating using save games in a game with random chances.
I saw a C++ program accepting a seed and a state to setup a std::default_random_engine, which is a typedef to std::linear_congruential_engine (on my system at least).
The seed() method is used to set the initial seed and operator>> for the state.
I'm aware of the principle of seeding random number generators (RNGs), but I have used the term "seed" interchangeably with "state".
The seed is the value used to initialise the generator, the state is the current state of the generator after each call to generate a random number. For very simple random number generators, such as linear congruential ones, the seed and the state are the same thing (or at least, are stored in the same variable), but they certainly don't have to be.
If you (re-)seed a PRNG, thus (re-)initializing it, you replace its current state with a new one which is a (possibly trivial) function of the seed. This initial function is often more complex to distribute the entropy over all the state, in an effort to alleviate patterns in the input.
Restoring the internal state with operator>> uses such a trivial mapping.
Whichever (re-)seeding is done last is effective, the rest just being wasted effort.
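To illustrate the difference between seeding and restoring a state, here is a small sketch that saves and restores an engine through the standard stream operators. The helper names save_state and restore_state are hypothetical.

```cpp
#include <random>
#include <sstream>
#include <string>

// Serialize the full internal state (not the seed) as text.
std::string save_state(const std::mt19937& gen) {
    std::ostringstream os;
    os << gen;
    return os.str();
}

// Rebuild an engine from a saved state. The default seeding done by the
// constructor is simply overwritten by operator>>.
std::mt19937 restore_state(const std::string& text) {
    std::istringstream is(text);
    std::mt19937 gen;
    is >> gen;
    return gen;
}
```

A restored engine continues exactly where the saved one left off, which is something the original seed alone cannot express once numbers have been drawn.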
Disclaimer: I am not an expert regarding the theoretical aspect of random number generators, most of the content below actually comes from the C++ standard itself.
The state X_i of a generator is the actual internal state of the generator: the (minimal) information you need to generate the next state and the next value of the generator:
you transition from one state to another using a "transition" function, X_{i+1} = TA(X_i);
you generate a value by using a "generation" function, GA(X_i).
The seed S is a "value" used to generate the first state X_0.
The state of a generator is inherent to it (see below): two implementations of the same generator will have the same state (in a theoretical sense), but may require completely different seeds.
In C++, you seed a Mersenne Twister engine using a single value, while its state is a sequence of integers... I could choose to implement a Mersenne Twister engine that could only be seeded by a sequence of integers (that would become the first state).
Some examples from the standard random library to better understand:
A linear congruential generator is simply a generator that follows a recursive relation of the form:
X_{i+1} = TA(X_i) = (a · X_i + c) mod m
Where a, c and m are parameters of the generator. In this case:
the state of the engine is simply the current value of X_i;
the seed is the value of X_0;
...when you seed the generator with a value k, you actually simply set X_0 = k.
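The recurrence above can be written out in a few lines. This sketch uses the parameters of std::minstd_rand (a = 48271, c = 0, m = 2147483647) so that it can be compared against the standard engine. The struct name tiny_lcg is hypothetical, and the sketch assumes a seed that is not a multiple of m, since the standard engine special-cases that situation.

```cpp
#include <cstdint>
#include <random>

// Hand-written linear congruential generator: the state is just X_i.
struct tiny_lcg {
    static constexpr std::uint64_t a = 48271;
    static constexpr std::uint64_t c = 0;
    static constexpr std::uint64_t m = 2147483647;

    std::uint64_t state;  // current X_i

    // Seeding sets X_0 = seed mod m; the seed and the state coincide.
    explicit tiny_lcg(std::uint64_t seed) : state(seed % m) {}

    // Transition and generation in one step: X_{i+1} = (a * X_i + c) mod m.
    std::uint64_t next() {
        state = (a * state + c) % m;
        return state;
    }
};
```

Seeded with the same nonzero value, tiny_lcg and std::minstd_rand should produce the same numbers, which shows that for this family the seed really is the initial state.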
But there are other generators, e.g. in the C++11 standard random library, you will find Mersenne Twister (MT) generators.
I am certainly not an expert on the Mersenne Twister generator, but from the C++ draft, you can see that:
the state of a MT generator is actually a sequence of integers X_i = (X_{i,0}, X_{i,1}, ...);
the seed for a MT (in the standard) is actually a single value;
...when you seed the generator with a value k, you actually perform "complex" operations [1] to compute X_0 from k, but X_0 is certainly not k.
[1] I will not go into details for these operations, because I do not really know them, but you can look at the standard to see how the state (sequence) is generated from the seed.
Typically, a seed value is used to generate random numbers. Often for performance reasons, the seed value is used to generate a set of numbers, which are then added to an array or memory table. Further calls to fetch random numbers simply read pre-generated values from the table. The current index into the table and the current set of values in the table can be considered the state. When working with complex systems that make a lot of calls for random numbers (a game engine, for example), debugging (reproducing bugs) becomes very difficult unless you can reproduce "random" events. If you save and then restore the "state" and seed value, you can ensure that when the code runs it will choose the same "random" numbers each time. I'm sure that is just one of many possible answers, but I hope that helps.
I can use:
boost::mt19937 gen(43);
this works just fine, but what if I want more than 32-bits of seed before using the random number generator? Is there an easy way to put 64-bits or 128-bits of seed into the Mersenne Twister?
I found a few examples of loading multiple values before generating results, but none of the code works.
There are a couple of problems with this code:
std::vector<unsigned int> seedv(1000, 11);
std::vector<unsigned int>::iterator i=seedv.begin();
boost::mt19937 gen2(i, seedv.end());
First, calling gen2() always returns the same value. I don't know how I screwed that up.
Second, I don't want 1,000 seeds, but when I lower it to 600 it "throws an instance of std::invalid_argument" with the message "Not enough elements in call to seed".
Can this method be shortened to a handful of seeds?
Here is another code example that looks easy:
std::string seedv("thisistheseed");
std::seed_seq q(seedv.begin(),seedv.end());
boost::mt19937 gen2(q);
but it won't compile. I finally figured out that std::seed_seq is only available in c++11. I am stuck with gcc 4.7 until the libraries I depend on are stable.
I suppose I can just stick with a 32-bit seed, but I wanted a little bit more.
I did read this article:
Boost Mersenne Twister: how to seed with more than one value?
I like the idea of initializing the whole vector from:
mersenne_twister(seed1) ^ mersenne_twister(seed2)
but I don't see a way to do that without modifying Mersenne_Twister.hpp
Any suggestions?
UPDATE: one more way not to do it!
unsigned long seedv[4];
seedv[0]=1;
seedv[1]=2;
seedv[2]=3;
seedv[3]=4;
boost::mt19937 gen2(seedv,4);
With the right casting, this should work, but every cast I have tried still won't get past the compiler. I can cast anything in C, but C++ still stumps me at times...
Use boost::seed_seq instead of std::seed_seq.
Update: Use
boost::random_device rd;
boost::mt19937 eng(rd);
boost::mt19937 allows you to seed it with either a single value up to 32 bits in width (the w parameter of the underlying boost::mersenne_twister_engine), or with a sequence of 624 values (the n parameter of the underlying template). 624 is the number of elements in mt19937's internal state.
If you read the documentation you'll see that these two mechanisms seed the state of the engine differently.
seeding with a single value sets every element of the engine's state using a complicated function of that single value.
seeding with a sequence of 624 values sets each element of the engine state to exactly the corresponding seed value.
The point is that boost::mt19937 does not itself include a mechanism to map an arbitrary number of seed values to the fixed number of elements in its internal state. It allows you to set the state directly (using 624 values), and for convenience it offers a built-in mapping from single 32-bit values to complete, 624 element states.
If you want to use an arbitrary number of input seed values then you will need to implement a mapping from arbitrarily sized sequences to 624 element states.
Keep in mind that the mapping should be designed such that the resulting internal state is a 'good' state for the mersenne twister algorithm. This algorithm is shift-register based and can be subject to internal states which produce relatively predictable output. The built-in mapping for single values was designed to minimize this issue and whatever you implement should be as well.
Probably the best way to implement such a mapping, rather than doing the mathematical analysis yourself, is to simply use the standard mersenne twister warmup algorithm. According to a comment on this answer the C++11 std::seed_seq is specified to perform such a warmup. Boost includes boost::seed_seq, which presumably does the same thing.
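For comparison, this is roughly what the seed-sequence route looks like with the C++11 standard library, for a 128-bit seed given as two 64-bit halves. It is a sketch with a hypothetical helper name, and it uses std::seed_seq and std::mt19937 rather than the Boost types, so it does not apply directly to a compiler without C++11 support.

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper: spread 128 bits of seed material over the
// 624-word engine state via std::seed_seq.
std::mt19937 engine_from_128_bits(std::uint64_t high, std::uint64_t low) {
    const std::uint32_t words[4] = {
        static_cast<std::uint32_t>(high >> 32), static_cast<std::uint32_t>(high),
        static_cast<std::uint32_t>(low >> 32),  static_cast<std::uint32_t>(low)};
    std::seed_seq seq(words, words + 4);  // pointers used as iterators
    return std::mt19937(seq);             // seq must be a named lvalue
}
```

The seed sequence performs the mapping from a handful of words to a complete state, so no hand-written mapping is needed.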
Update: Instead of using some arbitrary number of values to compute a sequence of 624 values you can simply use exactly 624 random values. If the values are unbiased and evenly distributed over the range of 32-bit values then you won't need any warm-up (well, unless you're astronomically unlucky).
The boost <random> library supports this very directly:
boost::random_device rd;
boost::mt19937 eng(rd);
Note that the C++11 <random> library does not support seeding this way.
A seed-sequence generator helper exists for this reason:
The class seed_seq stores a sequence of 32-bit words for seeding a pseudo-random number generator. These words will be combined to fill the entire state of the generator. http://www.boost.org/doc/libs/1_57_0/doc/html/boost/random/seed_seq.html
#include <boost/random.hpp>
#include <boost/random/seed_seq.hpp>
int main()
{
boost::random::seed_seq ss({1ul, 2ul, 3ul, 4ul});
boost::mt19937 gen2(ss);
}
You can also pass a pair of iterators to an existing range.
The standard Mersenne Twister is the following typedef:
typedef mersenne_twister_engine<uint32_t,32,624,397,31,0x9908b0df,
11,0xffffffff,7,0x9d2c5680,15,0xefc60000,18,1812433253> mt19937;
The first template type is called UIntType and is used as the argument for the seed method:
void seed(UIntType value);
You could therefore use the pre-defined boost::mt19937_64 to have a 64-bit seed. You can also create your own mersenne_twister_engine if you want to customize it more.
It's not pretty, but it should be possible to do:
std::vector<unsigned int> seedv(1000);
seedv[0] = 1;
seedv[1] = 2;
seedv[2] = 3;
seedv[3] = 4;
// etc.
boost::mt19937 gen2(seedv.begin(), seedv.end());
That is, still pass in 1000 seeds, even though most of them are 0's. However, as @bames53 mentions in a comment, this isn't a good idea. It's legal, and compiles, but doesn't make for a good seed.
By the way, the array approach should work (at least, it should compile):
unsigned long seedv[4];
seedv[0]=1;
seedv[1]=2;
seedv[2]=3;
seedv[3]=4;
boost::mt19937 gen2(seedv, seedv + 4);
This is an example of using pointers as iterators. 4 isn't an iterator (generally), but seedv + 4 is (it's the address of the element just after the end of seedv).