According to this post, intuitive seeding with std::random_device may not produce the expected results. In particular, if the Mersenne Twister engine is used, not all of the initialization states can be reached. Using seed_seq doesn't help either, since it is not a bijection.
All of this, as far as I understand, means that std::uniform_int_distribution will not really be uniform, because not all seed values are possible.
I'd like to simply generate a couple of random numbers. While this is a really interesting topic to which I will certainly devote some of my free time, many people may not have that option.
So the question is: how should I properly seed the std::default_random_engine so that it simply does what I expect?
A uniform_int_distribution will still be uniform however you seed it. But better seeding can reduce the chances of getting the same sequence of uniformly distributed values.
I think for most purposes using a std::seed_seq with about eight random 32-bit ints from std::random_device should be sufficient. It is not perfect, for the reasons given in the post you linked, but if you need really secure numbers for cryptographic purposes you shouldn't be using a pseudorandom number generator anyway:
#include <algorithm>
#include <array>
#include <cstdint>
#include <functional>
#include <random>

constexpr std::size_t SEED_LENGTH = 8;

std::array<std::uint_fast32_t, SEED_LENGTH> generateSeedData() {
    std::array<std::uint_fast32_t, SEED_LENGTH> random_data;
    std::random_device random_source;
    std::generate(random_data.begin(), random_data.end(), std::ref(random_source));
    return random_data;
}

std::mt19937 createEngine() {
    auto random_data = generateSeedData();
    std::seed_seq seed_seq(random_data.begin(), random_data.end());
    return std::mt19937{ seed_seq };
}
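The engine returned by createEngine() can then be handed to any distribution, for example:
auto engine = createEngine();
std::uniform_int_distribution<int> die(1, 6);   // range chosen arbitrarily
int roll = die(engine);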
Related
Are these pieces of code equivalent in terms of "randomness"?
1)
std::vector<int> counts(20);
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> dis(0, 19);
for (int i = 0; i < 10000; ++i) {
    ++counts[dis(gen)];
}
2)
std::vector<int> counts(20);
std::random_device rd;
std::mt19937 gen(rd());
for (int i = 0; i < 10000; ++i) {
    std::uniform_int_distribution<> dis(0, 19);
    ++counts[dis(gen)];
}
3)
std::vector<int> counts(20);
std::random_device rd;
for (int i = 0; i < 10000; ++i) {
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, 19);
    ++counts[dis(gen)];
}
4)
std::vector<int> counts(20);
for (int i = 0; i < 10000; ++i) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, 19);
    ++counts[dis(gen)];
}
The documentation of std::random_device says that multiple std::random_device objects may generate the same number sequence, so snippet 4 is bad, isn't it?
And what about the other snippets?
If I need to generate random values for multiple unrelated things, do I need to create different generators, or can I keep using the same one?
1)
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> disInt(0, 10);
std::uniform_real_distribution<float> disFloat(0.0f, 1.0f);
// Use for one stuff
disInt(gen);
// Use same gen for another unrelated stuff
disFloat(gen);
2)
std::random_device rd1, rd2;
std::mt19937 gen1(rd1()), gen2(rd2());
std::uniform_int_distribution<> disInt(0, 10);
std::uniform_real_distribution<float> disFloat(0.0f, 1.0f);
// Use for one stuff
disInt(gen1);
// Use another gen for another unrelated stuff
disFloat(gen2);
The point of the random generator is to hold the state of the algorithm, in order to produce repeatable pseudorandom sequences of numbers based on a specific random seed.
The point of the random device is to provide a random seed for the random generator.
If you try seeding a new generator for every random value, you are no longer exercising the randomness provided by the random generator's algorithm. Instead, you are biasing the generator to rely on the randomness of the random device itself.
For this reason, examples #3 and #4 are not advisable.
The correct way to generate a random sequence is example #1:
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> dis(0, 19);
for (int i = 0; i < 10000; ++i) {
    int foo = dis(gen);
}
Example #2 is also correct, but it's kinda pointless to construct the uniform_int_distribution inside the loop. Of course, with compiler optimizations it doesn't really hurt, and there may be times when it's preferable to keep the distribution near where it is used, for the sake of clarity.
If I need to generate random values for multiple unrelated things, do I need to create different generators, or can I keep using the same one?
You are welcome to use multiple generators for unrelated random sequences if you want to -- that is actually one of their major draw-cards. You retain the randomness guarantees of the pseudorandom algorithm for a particular sequence if its generator is not used when generating other sequences (most notably when extractions of numbers from the sequences are interleaved).
This is also useful for reproducibility: For example, when you actually have a specific seed value (instead of pulling it from a random device), using that seed for one particular sequence gives repeatable results, regardless of any other sequences being used at the same time.
One other major benefit is that by using separate generators you get the same thread-safety guarantees that apply to other objects. In other words, that means if you want to generate multiple pseudorandom sequences concurrently, you can do so without locks provided each thread operates on a separate generator.
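As a sketch of that per-thread case (names are illustrative; each thread gets its own engine, so no locking is needed):
#include <random>
#include <thread>

void worker() {
    // One engine per thread: nothing is shared, hence no synchronization required.
    thread_local std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<> dis(0, 19);
    int value = dis(gen);
    (void)value;
}
// Usage: std::thread t1(worker), t2(worker); t1.join(); t2.join();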
As you correctly mentioned, std::random_device may always return the same sequence. This happens in particular on MinGW, where the behaviour is completely deterministic across multiple runs of any program using std::random_device.
The behaviour of std::mt19937 and std::uniform_int_distribution is deterministic given their inputs. As a result, on MinGW the randomness of all four snippets is equally bad: each of them will always return the same sequence (albeit probably a different sequence for each snippet).
If you're worried about that, use std::chrono::high_resolution_clock to initialise std::mt19937, either in lieu of, or in conjunction with std::random_device.
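For instance, a minimal sketch (one of several reasonable ways to mix the two sources) using a std::seed_seq:
#include <chrono>
#include <random>

std::random_device rd;
// Mix a clock reading with std::random_device so the seed still varies
// between runs even if the random_device is deterministic (as on MinGW).
auto ticks = static_cast<unsigned>(
    std::chrono::high_resolution_clock::now().time_since_epoch().count());
std::seed_seq seq{rd(), ticks};
std::mt19937 gen(seq);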
random_device usage
The first two loops are completely equivalent because the uniform_int_distribution type is stateless (as are all such distributions). The second might be slightly slower: I see one superfluous stack store with GCC or Clang at -O3.
For common random_device implementations based on something like /dev/random, the latter two loops are also equivalent: the random_device is merely a handle and accesses the same entropy pool regardless of intervening destruction and reinitialization. An implementation could, however, retain bits not needed for one seed and use them for another. This is of course harder to test, since the state is deliberately non-reproducible. (majk is correct that a deterministic implementation of random_device makes the last loop produce only a single value.)
Repeated random_device usage
The former two are generally to be preferred: generating many random numbers from a single seed is the point of a PRNG, and forsaking that is likely to quickly drain the entropy pool in common installations. Depending on the implementation, your process might block (extensively) waiting for more entropy or might fall back on an OS-supplied PRNG. In either case, you deprive other processes of true entropy.
Using the mt19937 here doesn’t add much anyway: the random_device can already be used directly with the uniform_int_distribution with the minor disadvantage of (rarely) polling the random_device multiple times to obtain a uniform distribution over 20 values (since that’s not a power of 2).
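For illustration, a minimal sketch of that direct use (assuming <random> is included, as in the snippets above):
std::random_device rd;
std::uniform_int_distribution<> dis(0, 19);
// random_device satisfies the uniform random bit generator requirements,
// so the distribution can draw from it directly: slower, but no engine to seed.
int value = dis(rd);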
Different streams
It’s perfectly reasonable to use one generator for various distributions (interleaved or no). There are cases where you would want to use separate ones, usually with multiple threads or where you want to control seeding. As an example of the latter, you might define some sort of procedural content generation in terms of a PRNG with a particular seed. If other random numbers are (or in a future version come to be) needed during the generation, using a separate generator for them allows the content generator to function identically for any such additional random number usage.
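A hedged sketch of that procedural-generation case (the seed value and names here are purely illustrative):
std::mt19937 map_gen(42u);                        // fixed, documented seed: the generated map is reproducible
std::mt19937 misc_gen(std::random_device{}());    // everything else draws from a separate engine
std::uniform_int_distribution<> tile(0, 3);
int first_tile = tile(map_gen);                   // identical on every run with seed 42
int particle   = tile(misc_gen);                  // varies, but never perturbs the map sequence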
I have a bash script that starts many client processes. These are AI game players that I'm using to test a game with many players, on the order of 400 connections.
The problem I'm having is that the AI player uses
srand( time(nullptr) );
But if all the players start at approximately the same time, they will frequently receive the same time() value, which will mean that they are all on the same rand() sequence.
Part of the testing process is to ensure that if lots of clients try to connect at approximately the same time, the server can handle it.
I had considered using something like
srand( (int) this );
Or similar, banking on the idea that each instance has a unique memory address.
Is there another better way?
Use a random seed for a pseudorandom generator.
std::random_device is expensive random data (expensive as in slow).
You use it to seed a PRNG algorithm. mt19937 is the last PRNG algorithm you will ever need.
You can optionally follow that up by feeding it through a distribution if your needs require it. i.e. if you need values in a certain range other than what the generator provides.
std::random_device rd;
std::mt19937 generator(rd());
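The optional distribution step mentioned above could then look like this (the range is just an example):
std::uniform_int_distribution<int> dist(1, 6);  // e.g. uniform values in [1, 6]
int value = dist(generator);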
These days rand() and srand() are obsolete.
The generally accepted method is to seed a pseudo random number generator from the std::random_device. On platforms that provide non-deterministic random sources the std::random_device is required to use them to provide high quality random numbers.
However it can be slow or even block while gathering enough entropy. For this reason it is generally only used to provide the seed.
A high-quality yet efficient random engine is the Mersenne Twister provided by the standard library:
#include <random>
#include <type_traits>

inline
std::mt19937& random_generator()
{
    thread_local static std::mt19937 mt{std::random_device{}()};
    return mt;
}

template<typename Number>
Number random_number(Number from, Number to)
{
    static_assert(std::is_integral<Number>::value || std::is_floating_point<Number>::value,
        "Parameters must be integer or floating point numbers");

    using Distribution = typename std::conditional
    <
        std::is_integral<Number>::value,
        std::uniform_int_distribution<Number>,
        std::uniform_real_distribution<Number>
    >::type;

    thread_local static Distribution dist;

    return dist(random_generator(), typename Distribution::param_type{from, to});
}
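A brief usage sketch of the helper above (the ranges are just examples):
int    die_roll = random_number(1, 6);      // uniform int in [1, 6]
double weight   = random_number(0.0, 1.0);  // uniform double in [0.0, 1.0)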
You use a random number seed if and only if you want reproducible results. This can be handy for things like map generation where you want the map to be randomized, but you want it to be predictably random based on the seed.
For most cases you don't want that, you want actually random numbers, and the best way to do that is through the Standard Library generator functions:
#include <random>
std::random_device rd;
std::uniform_int_distribution<int> dist(0, 5);
int random_die_roll = dist(rd);
No seed is required nor recommended in this case: the "random device" itself supplies the unpredictable results, so there is no PRNG (pseudorandom number generator) to seed at all.
Again, DO NOT use srand(time(NULL)) because it's a very old, very bad method for initializing random numbers and it's highly predictable. Spinning through a million possible seeds to find matching output is trivial on modern computers.
I'm trying to seed the random function with errno:
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

int main(void){
    /* srand() takes an unsigned int, so the pointers are converted first */
    srand((unsigned int)(uintptr_t)&errno);
    srand((unsigned int)(uintptr_t)strerror(0));
    return rand();
}
When writing code that requires multiple independent random number distributions/sequences (example below with two), it seems that there are two typical ways to implement (pseudo-)random number generation. One is simply using a random_device object to generate two random seeds for the two independent engines:
std::random_device rd;
std::mt19937 en(rd());
std::mt19937 en2(rd());
std::uniform_real_distribution<> ureald{min,max};
std::uniform_int_distribution<> uintd{min,max};
The other involves using the random_device object to create a seed_seq object using multiple "sources" of randomness:
// NOTE: keeping this here for history, but a (hopefully) corrected version of
// this implementation is posted below the edit
std::random_device rd;
std::seed_seq seedseq{rd(), rd(), rd()}; // is there an optimal number of rd() to use?
std::vector<uint32_t> seeds(5);
seedseq.generate(seeds.begin(), seeds.end());
std::mt19937 en3(seeds[0]);
std::mt19937 en4(seeds[1]);
std::uniform_real_distribution<> ureald{min,max};
std::uniform_int_distribution<> uintd{min,max};
Out of these two, is there a preferred method? Why? If it is the latter, is there an optimal number of random_device "sources" to use in generating the seed_seq object?
Are there better approaches to random number generation than either of these two implementations I've outlined above?
Thank you!
Edit
(Hopefully) corrected version of seed_seq implementation for multiple distributions:
std::random_device rd;
std::seed_seq seedseq1{rd(), rd(), rd()}; // is there an optimal number of rd() to use?
std::seed_seq seedseq2{rd(), rd(), rd()};
std::mt19937 en3(seedseq1);
std::mt19937 en4(seedseq2);
std::uniform_real_distribution<> ureald{min,max};
std::uniform_int_distribution<> uintd{min,max};
std::seed_seq is generally intended to be used if you don't trust the default implementation to properly initialize the state of the engine you're using.
In many C++11 (and later) implementations, std::default_random_engine is an alias for std::mt19937, which is a specific variant of the Mersenne Twister pseudorandom number generation algorithm. Looking at the specification for std::mt19937, we see that it has a state of 624 unsigned integers, which is enough to hold the 19937 bits of state it is intended to encompass (which is how it gets its name). Traditionally, if you seed it with only a single uint32_t value (which is what you get from calling rd() once, where rd is a std::random_device object), then you're leaving the vast majority of its state uninitialized.
Now, the good news for anyone about to panic about their poorly seeded Mersenne Twister engines is that if you construct a std::mt19937 with a single uint32_t value (like std::default_random_engine engine{rd()};), the implementation is required to initialize the rest of the state by permuting the original seed value, so while a single invocation of rd() yields a limited range of actual differing engine states, it's still sufficient to at least properly initialize the engine. This will yield a "Good Quality" random number generator.
But if you're worried about the engine not being properly seeded, either for cryptographic reasons (though note that std::mt19937 itself is NOT cryptographically secure!) or simply for statistical reasons, you can use a std::seed_seq to manually specify the entire state, using rd() to fill in each value, so that you can guarantee to a relative degree of confidence that the engine is properly seeded.
For casual use, or scenarios where it's not strictly necessary to achieve high quality random numbers, simply initializing with a single call to std::random_device::operator() is fine.
If you want to use a std::seed_seq, make sure you set it up correctly (the example in your original code is definitely not correct, at least for std::mt19937, and would actually yield much worse results than simply using rd()!). This post on CodeReview contains code which has been vetted properly.
Edit:
For the predefined Mersenne Twister instantiations, the state size is always 19968 bits, which is slightly more than what the engine actually needs, but also the smallest value that can fully represent the range using uint32_t values. This works out to 624 words of 32 bits each. So if you plan to use a seed sequence, you would correctly initialize it with 624 invocations of rd():
//Code copied from https://codereview.stackexchange.com/questions/109260/seed-stdmt19937-from-stdrandom-device
std::vector<uint32_t> random_data(624);
std::random_device source;
std::generate(random_data.begin(), random_data.end(), std::ref(source));
std::seed_seq seeds(random_data.begin(), random_data.end());
std::mt19937 engine(seeds);
//Or:
//std::mt19937_64 engine(seeds);
If you're working with a non-standard instantiation of std::mersenne_twister_engine, the state size needed for that specific situation can be queried by multiplying its state_size by its word_size and then dividing by 32.
using mt_engine = std::mersenne_twister_engine</*...*/>;
constexpr size_t state_size = mt_engine::state_size * mt_engine::word_size / 32;
std::vector<uint32_t> random_data(state_size);
std::random_device source;
std::generate(random_data.begin(), random_data.end(), std::ref(source));
std::seed_seq seeds(random_data.begin(), random_data.end());
mt_engine engine (seeds);
For other engine types, you'll need to evaluate them on a case-by-case basis. std::linear_congruential_engine and its predefined variants use a single integer of its word size, so they only require a single invocation of rd() to initialize, and thus Seed Sequences are unnecessary. I'm not sure how std::subtract_with_carry_engine or its associated-by-use std::discard_block_engine work, but it seems like they also only contain a single Word of state.
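For example, a one-line sketch with one of the predefined linear congruential variants:
std::random_device rd;
std::minstd_rand lcg(rd());   // a single 32-bit value fully seeds this engine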
According to the standard, std::random_device works the following way:
result_type operator()();
Returns: A non-deterministic random value, uniformly distributed between min() and max(), inclusive. It is implementation-defined how these values are generated.
And there are a couple of ways you can use it. To seed an engine:
std::mt19937 eng(std::random_device{}());
As an engine in itself:
std::random_device dev;
std::uniform_int_distribution<> uid(1, 10);
std::cout << uid(dev);
Because it is implementation-defined, it doesn't sound as strong as, say, std::seed_seq or srand(time(nullptr)). Should I prefer to use it as a seed, as an engine, or not at all?
Generally speaking, std::random_device should be the source of the most truly random information you can access on your platform. That being said, accessing it is much slower than std::mt19937 or what not.
The correct approach is to use std::random_device to seed something like std::mt19937.
There seems to be some mythology around the use of mt19937, specifically that, once seeded, some number of the generator's initial outputs should be discarded so that the results are as close as possible to proper pseudorandomness.
Examples of code I've seen are as follows:
boost::mt19937::result_type seed = 1234567; //taken from some entropy pool etc
boost::mt19937 prng(seed);
boost::uniform_int<unsigned int> dist(0, 1000);
boost::variate_generator<boost::mt19937&, boost::uniform_int<unsigned int> > generator(prng, dist);

unsigned int skip = 10000;
while (skip--)
{
    generator();
}
//now begin using for real.
....
My questions are:
Is this a myth, or is there some truth to it?
If it is viable, how many values should be discarded? The numbers I've seen seem to be arbitrary.
The paper referenced in the first comment, Mersenne Twister with improved initialization, isn't by just some guy; its author is one of the two co-authors of the paper upon which the Boost implementation is based.
The problem with using a single 32-bit integer (4 bytes) as a seed for this generator is that the internal state of the generator is 2496 bytes, according to the Boost documentation. It's not too surprising that such a small seed takes a while to propagate into the rest of the internal state of the generator, particularly since the Twister isn't meant to be cryptographically secure.
To address your concern about needing to run the generator for a while to get started, you want the alternate (and explicit) constructor.
template<typename SeedSeq> explicit mersenne_twister_engine(SeedSeq &);
This is the spirit of the third comment, where you initialize with something longer than a single integer. The sequence provided comes from some generator. To use an entropy pool, write a generator as an adapter from an entropy pool, returning values from the pool as needed.
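As a rough sketch of that constructor in use, assuming std::seed_seq (which satisfies the seed-sequence requirements) is accepted by Boost's SeedSeq constructor; an adapter over your own entropy pool would be passed the same way:
#include <boost/random/mersenne_twister.hpp>
#include <random>

std::random_device rd;
// Eight words here is arbitrary; to cover the full 2496-byte state you
// would supply 624 32-bit values.
std::seed_seq seq{rd(), rd(), rd(), rd(), rd(), rd(), rd(), rd()};
boost::mt19937 prng(seq);   // SeedSeq constructor: no warm-up discarding needed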