Generating a number in (0,1) using the Mersenne Twister in C++

I'm porting R code to C++ so that it runs faster, but I'm having difficulty using the Mersenne Twister. I only want to generate values in (0,1). Here is the part of my code that pertains to this question.
#include <random>
std::mt19937 generator (123);
std::cout << "Random value: " << generator() << std::endl;
I tried dividing by RAND_MAX, but that did not produce the values that I was looking for.
Thanks in advance.

In C++11 the concepts of "(pseudo) random generator" and "probability distribution" are separated, and for good reasons.
What you want can be achieved with the following lines:
std::mt19937 generator (123);
std::uniform_real_distribution<double> dis(0.0, 1.0);
double randomRealBetweenZeroAndOne = dis(generator);
If you want to understand why this separation is necessary, and why applying plain division or range manipulation to the output of the generator is a bad idea, watch this video.
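Since the question asks for the open interval (0,1), here is a minimal sketch of one way to get it. The helper name `uniform_open01` is my own, and the redraw loop is an assumption about what the asker needs: `std::uniform_real_distribution` is specified to return values in [0, 1), so the loop only rejects an exact 0.0 (and, defensively, anything not below 1.0).

```cpp
#include <random>

// Hypothetical helper (not part of the standard library): draws a double
// strictly inside (0, 1) from any standard engine.
template <class Engine>
double uniform_open01(Engine& engine) {
    // The distribution is specified to return values in [0, 1).
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    double x;
    do {
        x = dist(engine);
    } while (x <= 0.0 || x >= 1.0);  // redraw in the rare boundary case
    return x;
}
```

With the generator from the question this would be used as `std::mt19937 generator(123); double u = uniform_open01(generator);`.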

You may want to consider code like this:
// For pseudo-random number generators and distributions
#include <random>
...
// Use random_device to generate a seed for Mersenne twister engine.
std::random_device rd{};
// Use Mersenne twister engine to generate pseudo-random numbers.
std::mt19937 engine{rd()};
// "Filter" MT engine's output to generate pseudo-random double values,
// **uniformly distributed** on the half-open interval [0, 1).
// (Note that the range is [inclusive, exclusive).)
std::uniform_real_distribution<double> dist{0.0, 1.0};
// Generate pseudo-random number.
double x = dist(engine);
For more details on generating pseudo-random numbers in C++ (including reasons why rand() is not good), see this video by Stephan T. Lavavej (from Going Native 2013):
rand() Considered Harmful

std::mt19937 does not generate values between 0 and RAND_MAX like rand() does, but between 0 and 2^32-1.
And by the way, the class provides min() and max() values!
You need to convert the value to a double, subtract min() and divide by max()-min():
std::uint32_t val = generator();
double doubleval = (static_cast<double>(val) - generator.min()) / (generator.max() - generator.min());
or (less generic)
std::uint32_t val = generator();
double doubleval = static_cast<double>(val) * (1.0 / std::numeric_limits<std::uint32_t>::max());

Related

Cpp random number with rand returns very similar numbers

I'm trying to generate a random number using rand(), but each time I get very similar numbers.
This is my code:
#include <iostream>
#include <cstdlib>
#include <time.h>
using namespace std;
int main()
{
srand(time(0));
cout << rand();
return 0;
}
I ran it 5 times and the numbers I got are:
21767
21806
21836
21862
21888
How can I make the numbers differ more?
From the documentation of rand:
There are no guarantees as to the quality of the random sequence produced. In the past, some implementations of rand() have had serious shortcomings in the randomness, distribution and period of the sequence produced (in one well-known example, the low-order bit simply alternated between 1 and 0 between calls).
rand() is not recommended for serious random-number generation needs. It is recommended to use C++11's random number generation facilities to replace rand().
It (and I) recommend using the newer C++11 random number generators in <random>.
In your specific case it seems you want a std::uniform_int_distribution. An example, as given on the linked page is:
std::random_device rd; //Will be used to obtain a seed for the random number engine
std::mt19937 gen(rd()); //Standard mersenne_twister_engine seeded with rd()
std::uniform_int_distribution<> distrib(1, RAND_MAX);
std::cout << distrib(gen) << '\n';

Proper way to generate random long long?

I am trying to generate random long long numbers using this code in C++:
random_device rd;
default_random_engine gen(rd());
uniform_int_distribution<long long> distribution(1, llround(pow(10, 12)));
long long random_num = distribution(gen);
I just want to verify that this should generate random integers from [1, 10^12] uniformly. Is this the correct way to do it?
EDIT:
random_device rd;
mt19937_64 gen(rd());
uniform_int_distribution<long long> distribution(1, llround(pow(10, 12)));
long long random_num = distribution(gen);
Combining all the comments above,
Prefer std::mt19937 or std::mt19937_64 to std::default_random_engine to get high-quality random numbers. You can use both generators with std::uniform_int_distribution. Distributions know how to manage generators whose output range is smaller than the target range.
Instead of std::llround(std::pow(10, 12)), prefer the compile-time constant 1'000'000'000'000LL. In general, not every integer can be represented exactly in floating point, and you might get unexpected results. For example, std::llround(std::pow(9, 17)) gives 16677181699666568, whereas 9^17 = 16677181699666569.
Seeding a generator with a single number (typically, 32-bit), std::mt19937_64 gen(rd()), is the simplest thing you can do. In some situations you might want to use more bits of entropy, i.e. not just a single number, but a seed sequence. See this question for more details.
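As a rough sketch of the points above: the bound is written as an integer literal, the engine is `std::mt19937_64`, and the seed comes from several `std::random_device` words passed through `std::seed_seq` instead of a single value. The function names and the choice of eight seed words are my own assumptions for illustration, not something the comments prescribe.

```cpp
#include <array>
#include <random>

// Hypothetical helper: builds a 64-bit Mersenne Twister seeded from
// eight random_device words (an arbitrary, illustrative count).
inline std::mt19937_64 make_engine() {
    std::random_device rd;
    std::array<std::random_device::result_type, 8> words;
    for (auto& w : words) w = rd();
    std::seed_seq seq(words.begin(), words.end());
    return std::mt19937_64(seq);
}

// Draws a long long uniformly from [1, 10^12]. The upper bound is an exact
// integer literal, so no floating-point rounding is involved.
inline long long random_up_to_1e12(std::mt19937_64& gen) {
    std::uniform_int_distribution<long long> dist(1LL, 1000000000000LL);
    return dist(gen);
}
```

The engine should be created once and reused, for example `auto gen = make_engine(); long long n = random_up_to_1e12(gen);`.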

How to Understand C++11 random number generator

Those three lines for generating a random number look a bit tricky, and they are hard to remember. Could someone please shed some light on them to make them easier to understand?
#include <random>
#include <iostream>
int main()
{
std::random_device rd; //1st line: Will be used to obtain a seed for the random number engine
std::mt19937 gen(rd()); //2nd line: Standard mersenne_twister_engine seeded with rd()
std::uniform_int_distribution<> dis(1, 6);
for (int n=0; n<10; ++n)
std::cout << dis(gen) << ' '; //3rd line: Use dis to transform the random unsigned int generated by gen into an int in [1, 6]
std::cout << '\n';
}
Here are some questions I can think of:
1st line of code:
random_device is a class, as described in its documentation, so does this line declare an object rd? If so, why do we pass rd() in the 2nd line to construct mt19937 instead of passing the object rd (without parentheses)?
3rd line of code:
Why do we call the uniform_int_distribution<> object as dis()? Is dis() a function? Why do we need to pass the gen object to dis()?
random_device is slow but, where the platform provides an entropy source, genuinely random; it's used to generate the 'seed' for the random number sequence.
mt19937 is fast but only 'pseudo random'. It needs a 'seed' to start generating a sequence of numbers. That seed can be random (as in your example) so you get a different sequence of random numbers each time. But it could be a constant, so you get the same sequence of numbers each time.
uniform_int_distribution is a way of mapping random numbers (which could have any values) to the numbers you're actually interested in, in this case a uniform distribution of integers from 1 to 6.
As is often the case with OO programming, this code is about division of responsibilities. Each class contributes a small piece to the overall requirement (the generation of dice rolls). If you wanted to do something different it's easy because you've got all the pieces in front of you.
If this is too much then all you need to do is write a function to capture the overall effect, for instance
int dice_roll()
{
static std::random_device rd;
static std::mt19937 gen(rd());
static std::uniform_int_distribution<> dis(1, 6);
return dis(gen);
}
dis is an example of a function object or functor. It's an object which overloads operator() so it can be called as if it was a function.
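To make the function-object idea concrete, here is a toy class of my own. `ToyDie` is hypothetical and is not how `std::uniform_int_distribution` is implemented (the modulo mapping is slightly biased); it only illustrates that `dis(gen)` is an ordinary call to `operator()` with the engine passed by reference.

```cpp
#include <random>

// Toy function object: overloading operator() lets an instance be
// "called" like a function. Illustration only; the modulo introduces bias.
struct ToyDie {
    template <class Engine>
    int operator()(Engine& engine) const {
        // engine() yields the next raw pseudo-random integer.
        return static_cast<int>(engine() % 6) + 1;
    }
};
```

Given `ToyDie die; std::mt19937 gen(42);`, the expression `die(gen)` is just shorthand for `die.operator()(gen)`.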
std::random_device rd; // create access to truly random numbers
std::mt19937 gen{rd()}; // create pseudo random generator.
// initialize its seed to truly random number.
std::uniform_int_distribution<> dis{1, 6}; // define distribution
...
auto x = dis(gen); // generate pseudo random number from `gen`
// and transform its result to desired distribution `dis`.
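The remark that a constant seed gives the same sequence every time can be checked with a small sketch. The function name below is my own; it compares the raw outputs of two engines constructed from the same seed.

```cpp
#include <random>

// Returns true if two mt19937 engines built from the same seed produce
// the same first n raw outputs.
inline bool same_seed_same_sequence(unsigned seed, int n) {
    std::mt19937 a(seed);
    std::mt19937 b(seed);
    for (int i = 0; i < n; ++i) {
        if (a() != b()) return false;
    }
    return true;
}
```

The raw engine output is pinned down by the standard as well: the 10000th value produced by a default-constructed `std::mt19937` is required to be 4123659995. The same guarantee does not extend to what a distribution such as `std::uniform_int_distribution` makes of those raw values, which may differ between library implementations.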

How many random numbers does std::uniform_real_distribution use?

I was surprised to see that the output of this program:
#include <iostream>
#include <random>
int main()
{
std::mt19937 rng1;
std::mt19937 rng2;
std::uniform_real_distribution<double> dist;
double random = dist(rng1);
rng2.discard(2);
std::cout << (rng1() - rng2()) << "\n";
return 0;
}
is 0 - i.e. std::uniform_real_distribution uses two random numbers to produce a random double value in the range [0,1). I thought it would just generate one and rescale that. After thinking about it I guess that this is because std::mt19937 produces 32-bit ints and double is twice this size and thus not "random enough".
Question: How do I find out this number generically, i.e. if the random number generator and the floating point type are arbitrary types?
Edit: I just noticed that I could use std::generate_canonical instead, as I am only interested in random numbers of [0,1). Not sure if this makes a difference.
For template<class RealType, size_t bits, class URNG> std::generate_canonical the standard (section 27.5.7.2) explicitly defines the number of calls to the uniform random number generator (URNG) to be
max(1, ceil(b / log_2 R)),
where b is the minimum of the number of bits in the mantissa of the RealType and the number of bits given to generate_canonical as template parameter.
R is the range of numbers the URNG can return (URNG::max()-URNG::min()+1).
However, in your example this will not make any difference, since you need 2 calls to the mt19937 to fill the 53 bits of the mantissa of the double.
For other distributions the standard does not provide a generic way to get any information on how many numbers the URNG has to generate to obtain one number of the distribution.
A reason might be that for some distributions the number of uniform random numbers required to generate a single number of the distribution is not fixed and may vary from call to call. An example is the std::poisson_distribution, which is usually implemented as a loop which draws a uniform random number in each iteration until the product of these numbers has reached a certain threshold (see for example the implementation of the GNU C++ library (line 1523-1528)).
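The call count for std::generate_canonical can be evaluated generically with a short helper. This is only a sketch that computes the formula, rounding the quotient up to a whole number of calls; the name `canonical_calls` is hypothetical, and it does not measure what a particular library actually does.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <random>

// Hypothetical helper: evaluates k = max(1, ceil(b / log2(R))) for a given
// floating-point type, requested bit count, and generator type.
template <class RealType, std::size_t bits, class URNG>
std::size_t canonical_calls() {
    // b: the smaller of the mantissa width and the requested bits.
    const std::size_t digits =
        static_cast<std::size_t>(std::numeric_limits<RealType>::digits);
    const std::size_t b = std::min(digits, bits);
    // R: number of distinct values the generator can return, computed in
    // long double because max() - min() + 1 can overflow the result type.
    const long double R = static_cast<long double>(URNG::max())
                        - static_cast<long double>(URNG::min()) + 1.0L;
    const long double k =
        std::ceil(static_cast<long double>(b) / std::log2(R));
    return std::max<std::size_t>(1, static_cast<std::size_t>(k));
}
```

For the case in the question, `canonical_calls<double, 53, std::mt19937>()` evaluates to 2 (53 bits from a 32-bit generator), while the same request with `std::mt19937_64` evaluates to 1.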

Mersenne twister warm up vs. reproducibility

In my current C++11 project I need to perform M simulations. For each simulation m = 1, ..., M, I randomly generate a data set by using a std::mt19937 object, constructed as follows:
std::mt19937 generator(m);
DatasetFactory dsf(generator);
According to https://stackoverflow.com/a/15509942/1849221 and https://stackoverflow.com/a/14924350/1849221, the Mersenne Twister PRNG benefits from a warm up phase, which is currently absent in my code. I report for convenience the proposed snippet of code:
#include <algorithm>
#include <cstdint>
#include <functional>
#include <iterator>
#include <random>
std::mt19937 get_prng() {
std::uint_least32_t seed_data[std::mt19937::state_size];
std::random_device r;
std::generate_n(seed_data, std::mt19937::state_size, std::ref(r));
std::seed_seq q(std::begin(seed_data), std::end(seed_data));
return std::mt19937{q};
}
The problem in my case is that I need reproducibility of results, i.e., among different executions, for each simulation, the data set has to be the same. That's the reason why in my current solution I use the current simulation to seed the Mersenne Twister PRNG. It seems to me that the usage of std::random_device prevents data from being the same (AFAIK, this is the exact purpose of std::random_device).
EDIT: by different executions I mean re-launching the executable.
How can I introduce the afore-mentioned warm up phase in my code without affecting reproducibility? Thanks.
Possible solution #1
Here's a tentative implementation based on the second proposal by @SteveJessop
#include <algorithm>
#include <cstdint>
#include <functional>
#include <iterator>
#include <random>
std::mt19937 get_generator(unsigned int seed) {
std::minstd_rand0 lc_generator(seed);
std::uint_least32_t seed_data[std::mt19937::state_size];
std::generate_n(seed_data, std::mt19937::state_size, std::ref(lc_generator));
std::seed_seq q(std::begin(seed_data), std::end(seed_data));
return std::mt19937{q};
}
Possible solution #2
Here's a tentative implementation based on the joint contribution by @SteveJessop and @AndréNeve. The sha256 function is adapted from https://stackoverflow.com/a/10632725/1849221
#include <openssl/sha.h>
#include <sstream>
#include <iomanip>
#include <random>
#include <string>
std::string sha256(const std::string& str) {
unsigned char hash[SHA256_DIGEST_LENGTH];
SHA256_CTX sha256;
SHA256_Init(&sha256);
SHA256_Update(&sha256, str.c_str(), str.size());
SHA256_Final(hash, &sha256);
std::stringstream ss;
for(int i = 0; i < SHA256_DIGEST_LENGTH; i++)
ss << std::hex << std::setw(2) << std::setfill('0') << (int)hash[i];
return ss.str();
}
std::mt19937 get_generator(unsigned int seed) {
std::string seed_str = sha256(std::to_string(seed));
std::seed_seq q(seed_str.begin(), seed_str.end());
return std::mt19937{q};
}
Compile with: -I/opt/ssl/include/ -L/opt/ssl/lib/ -lcrypto
Two options:
1. Follow the proposal you have, but instead of using std::random_device r; to generate your seed sequence for MT, use a different PRNG seeded with m. Choose one that doesn't suffer like MT does from needing a warmup when used with small seed data: I suspect an LCG will probably do. For massive overkill, you could even use a PRNG based on a secure hash. This is a lot like "key stretching" in cryptography, if you've heard of that. You could in fact use a standard key stretching algorithm, but you're using it to generate a long seed sequence rather than large key material.
2. Continue using m to seed your MT, but discard a large constant amount of data before starting the simulation. That is to say, ignore the advice to use a strong seed and instead run the MT long enough for it to reach a decent internal state. I don't know off-hand how much data you need to discard, but I expect the internet does.
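The discard-based option (keep seeding with m, then throw away a fixed amount of output) could be sketched as follows. The helper name and the default warm-up length are placeholders of mine; the answer deliberately leaves the right amount to discard open, and so does this sketch.

```cpp
#include <random>

// Hypothetical helper: seed with the simulation index m, then discard a
// fixed number of outputs. The default warmup value is an arbitrary
// placeholder, not a recommended amount.
inline std::mt19937 make_warmed_generator(unsigned m,
                                          unsigned long long warmup = 10000) {
    std::mt19937 gen(m);
    gen.discard(warmup);  // advances the engine state by warmup outputs
    return gen;
}
```

Because both the seed and the discard count are fixed, re-launching the executable rebuilds the same engine state for each m, so `DatasetFactory dsf(generator);` would see the same sequence on every run.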
I think that you only need to store the initial seed (in your case the std::uint_least32_t seed_data[std::mt19937::state_size] array) and the number n of warmup steps you made (e.g. using discard(n) as mentioned) for each run/simulation you wish to reproduce.
With this information, you can always create a new MT instance, seed it with the previous seed_data and run it for the same n warmup steps. This will generate the same sequence of values onwards since the MT instance will have the same inner state when the warmup ends.
When you mention the std::random_device affecting reproducibility, I believe that in your code it is simply being used to generate the seed data. If you were using it as the source of random numbers itself, then you would not be able to have reproducible results. Since you are using it only to generate the seed there shouldn't be any problem. You just can't generate a new seed every time if you want to reproduce values!
From the definition of std::random_device:
"std::random_device is a uniformly-distributed integer random number generator that produces non-deterministic random numbers."
So if it's not deterministic you cannot reproduce the sequence of values produced by it. That being said, use it simply to generate good random seeds only to store them afterwards for the re-runs.
Hope this helps
EDIT:
After discussing with @SteveJessop, we arrived at the conclusion that a simple hash of the dataset (or part of it) would be sufficient to be used as a decent seed for the purpose you need. This allows for a deterministic way of generating the same seeds every time you run your simulations. As mentioned by @Steve, you will have to guarantee that the size of the hash isn't too small compared with std::mt19937::state_size. If it is too small, then you can concatenate the hashes of m, m+M, m+2M, ... until you have enough data, as he suggested.
I am posting the updated answer here as the idea of using a hash was mine, but I will upvote @SteveJessop's answer because he contributed to it.
A comment on one of the answers you link to indicates:
Coincidentally, the default C++11 seed_seq is the Mersenne Twister warmup sequence (although the existing implementations, libc++'s mt19937 for example, use a simpler warmup when a single-value seed is provided)
So you may be able to use your current fixed seeds with std::seed_seq to do the warm-up for you.
std::mt19937 get_prng(int seed) {
std::seed_seq q{seed, maybe, some, extra, fixed, values};
return std::mt19937{q};
}
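A compilable variant of this idea might look like the sketch below. The function name and the three extra constants are arbitrary placeholders that I chose for illustration; they are not values suggested by the comment or by the standard.

```cpp
#include <cstdint>
#include <random>

// Hypothetical variant: the simulation index plus fixed constants form the
// seed sequence. The constants are arbitrary placeholders.
inline std::mt19937 get_prng_fixed(std::uint32_t seed) {
    std::seed_seq q{seed, std::uint32_t{0x9E3779B9u},
                    std::uint32_t{0x85EBCA6Bu}, std::uint32_t{0xC2B2AE35u}};
    return std::mt19937{q};
}
```

Since every input to the seed sequence is fixed for a given simulation index, two calls with the same index yield engines in the same state, which is the reproducibility property the question asks for.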