Random numbers from binomial distribution - c++

I need to quickly generate lots of random numbers from binomial distributions for dramatically different trial sizes (most, however, will be small). I was hoping not to have to code an algorithm by hand (see, e.g., this related discussion from November), because I'm a novice programmer and don't like reinventing wheels. It appears Boost does not supply a generator for binomially distributed variates, but TR1 and GSL do. Is there a good reason to choose one over the other, or is it better that I write something customized to my situation? I don't know if this makes sense, but I'll alternate between generating numbers from uniform distributions and binomial distributions throughout the program, and I'd like for them to share the same seed and to minimize overhead. I'd love some advice or examples for what I should be considering.

Boost 1.43 appears to support binomial distributions. You can use boost::variate_generator to connect your source of randomness to the type of distribution you want to sample from.
So your code might look something like this (Disclaimer: not tested!):
boost::mt19937 rng; // produces randomness out of thin air
                    // see pseudo-random number generators
const int n = 20;
const double p = 0.5;
boost::binomial_distribution<> my_binomial(n, p); // binomial distribution with n = 20, p = 0.5
                                                  // see random number distributions
boost::variate_generator<boost::mt19937&, boost::binomial_distribution<> >
    next_value(rng, my_binomial); // glues randomness with mapping
int x = next_value(); // simulate flipping a fair coin 20 times

You misunderstand the Boost model - you choose a random number generator type, and then a distribution over which to spread the values the RNG produces. There's a very simple example in this answer, which uses a uniform distribution, but other distributions follow the same basic pattern - the generator and the distribution are completely decoupled.
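Since the question mentions alternating between uniform and binomial draws with a shared seed: a minimal sketch of that (untested, assuming the Boost 1.43-era headers) holds a single engine and binds both distributions to it by reference, so both draw from the same seeded stream:

#include <boost/random/mersenne_twister.hpp>
#include <boost/random/binomial_distribution.hpp>
#include <boost/random/uniform_real.hpp>
#include <boost/random/variate_generator.hpp>

boost::mt19937 rng(42); // one seed drives everything

boost::variate_generator<boost::mt19937&, boost::uniform_real<> >
    next_uniform(rng, boost::uniform_real<>(0.0, 1.0));
boost::variate_generator<boost::mt19937&, boost::binomial_distribution<> >
    next_binomial(rng, boost::binomial_distribution<>(20, 0.5));

double u = next_uniform();  // advances the shared engine
int    x = next_binomial(); // continues the same underlying stream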


Independence with boost random

I am trying to use the Mersenne Twister to generate samples from various distributions. I have one generator, and it is used to generate all of them. Something strange (to me at least) happens here. On the one hand, calculating the correlation coefficient of the various samples gives me almost zero, which seems nice. But when I change a parameter of one distribution (which is used nowhere else), it somehow also changes the results I get in others. Specifically:
#include <boost/random.hpp>

using namespace boost; // Boost random library for random generators

mt19937 generator(7687); // Mersenne Twister random number generator, seed = 7687

// returns a sample from a normal distribution with mean mu and standard deviation sigma
double normal_sample(double mu, double sigma)
{
    normal_distribution<> norm_dist;
    variate_generator<mt19937&, normal_distribution<> > norm_rnd(generator, norm_dist);
    return mu + sigma * norm_rnd();
}

// returns the number of points in a realization of a Poisson point process
double poisson_sample(double intensity)
{
    poisson_distribution<> poiss_dist(intensity);
    variate_generator<mt19937&, poisson_distribution<> > poiss_rnd(generator, poiss_dist);
    return poiss_rnd();
}
This is the code...the generator part, then I draw from those two distributions, changing the parameter called intensity. This changes not only the Poisson sample, but the normal one as well...actually, now that I think of it, it kind of makes sense, because my Poisson sample determines a number of points that are also randomly generated using the same generator...so then depending on how many of them there are, I get something else, because the normal sample is generated using different numbers in the sequence. Is that correct?
If so, how would one go about changing that? Should I use multiple generators?
It probably means that, depending on the parameters, fewer or more random samples are extracted from the Mersenne Twister.
This logically implies that all other results are shifted, making all other outcomes different.
[...] it kind of makes sense, because my Poisson sample determines a number of points that are also randomly generated using the same generator...so then depending on how many of them there are, I get something else, because the normal sample is generated using different numbers in the sequence. Is that correct?
Seems to me you got it figured out already, yes.
If you want repeatable PRNG streams, use separate PRNG states, i.e. different Mersenne Twister engines.
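A minimal sketch of that (untested; seeds chosen arbitrarily) gives each distribution its own engine, so however many numbers the Poisson sampling consumes, the normal stream is unaffected:

#include <boost/random.hpp>

boost::mt19937 normal_engine(7687);  // dedicated state for normal samples
boost::mt19937 poisson_engine(1234); // dedicated state for Poisson samples

double normal_sample(double mu, double sigma)
{
    boost::normal_distribution<> norm_dist;
    boost::variate_generator<boost::mt19937&, boost::normal_distribution<> >
        norm_rnd(normal_engine, norm_dist);
    return mu + sigma * norm_rnd();
}

double poisson_sample(double intensity)
{
    boost::poisson_distribution<> poiss_dist(intensity);
    boost::variate_generator<boost::mt19937&, boost::poisson_distribution<> >
        poiss_rnd(poisson_engine, poiss_dist);
    return poiss_rnd();
}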

How to generate a massive amount of high quality Random Numbers?

I'm working on a random walk simulation of particles moving in a lattice. For that reason I must create a massive amount of random numbers, about 10^12 and above. Currently I'm using the possibilities C++11 provides with <random>. When profiling my program, I see that a major amount of time is spent in <random>. The vast majority of those numbers are between 0 and 1, evenly distributed. Here and then I need a number from a binomial distribution, but the focus lies on the 0..1 numbers.
The question is: What can I do to reduce the CPU time needed to generate these numbers and what would the impact be on their quality?
As you can see, I tried different engines, but that had no big effect on CPU time. Further, what is the difference between my uniform01(gen) and generate_canonical<double,numeric_limits<double>::digits>(gen) anyhow?
Edit: Reading through the answers, I conclude that there is not THE ideal solution for my problem. Thus I decided to first make my program multi-threading capable and run multiple RNGs in different threads (seeded with one random_device number plus a thread-individual increment). For the time being this seems to be the most unavoidable step (multi-threading would be required anyhow). As a further step, depending on exact requirements, I consider switching to the suggested Intel RNG or to Thrust. That means my RNG implementation should not be too complex, which it currently is not. But for now I'd like to focus on the physical correctness of my model and not on programming details; those come as soon as the output of my program is physically correct.
Thrust
Concerning Intel RNG
Here is what I do currently:
#include <limits>
#include <random>

class Generator {
public:
    Generator();
    virtual ~Generator();
    double rand01();               // random number in [0,1)
    int binomial(int n, double p); // binomial distribution with n samples with probability p
private:
    std::random_device randev; // seed
    /* Engines */
    std::mt19937_64 gen;
    //std::mt19937 gen;
    //std::default_random_engine gen;
    /* Distributions */
    std::uniform_real_distribution<double> uniform01;
    std::binomial_distribution<> binomialdist;
};

Generator::Generator() : randev(), gen(randev()), uniform01(0., 1.), binomialdist(1, 1.) {
}

Generator::~Generator() { }

double Generator::rand01() {
    //return uniform01(gen);
    return std::generate_canonical<double, std::numeric_limits<double>::digits>(gen);
}

int Generator::binomial(int n, double p) {
    binomialdist.param(std::binomial_distribution<>::param_type(n, p));
    return binomialdist(gen);
}
You can pre-generate random numbers and use them when you need them.
If you need true random numbers, I suggest you use a service like http://www.random.org/, which derives its numbers from environmental noise rather than from an algorithm.
And, speaking about random numbers, you must also check this:
If you need a massive amount of random numbers, and I mean MASSIVE, do a careful search on the internet for IBM's floating point random number generator, published maybe ten years ago. You'll have to buy either a PowerPC machine, or a newer Intel machine with fused multiply-add. They achieved random numbers at a rate of one per cycle per core. So if you bought a new Mac Pro, you could achieve probably 50 billion random numbers per second.
Perhaps instead of using a CPU you could use a GPU to generate many numbers concurrently?
Efficient Random Number Generation and Application Using CUDA
On my i3, the following program runs in about five seconds:
#include <cstdio>
#include <random>

std::mt19937_64 foo;

// Bit trick: keep the exponent bits of 1.0 and fill the mantissa with
// random bits, giving a double in [1, 2); subtracting 1 maps it to [0, 1).
double drand() {
    union {
        double d;
        long long l;
    } x;
    x.d = 1.0;
    x.l |= foo() & ((1LL << 53) - 1);
    return x.d - 1;
}

int main() {
    double d = 0;
    for (int i = 0; i < 1e9; i++)
        d += drand();
    printf("%g\n", d);
}
whereas replacing the drand() call with the following results in a program that runs in about ten seconds:
double drand2() {
    // requires <limits> in addition to <random>
    return std::generate_canonical<double,
        std::numeric_limits<double>::digits>(foo);
}
Using the following instead of drand() also results in a program that runs in about ten seconds:
std::uniform_real_distribution<double> uni;

double drand3() {
    return uni(foo);
}
Perhaps the hacky drand() above suits your purposes better than the standard solutions.
Task Definition
The OP asks for an answer to both:
1. Speed of generation -- assuming a set of 10E+012 random numbers to be "massive", and
2. Quality of the generator -- with a weak assumption that numbers merely distributed evenly over some range of values are also random.
However, there are more cardinal aspects to be addressed and successfully solved for a real system:
A. Define whether your system simulation needs a guarantee of repeatability of the sequence of random numbers for future re-runs of an experiment. If it does not, and re-runs of the simulated experiment may yield principally different results, then the randomizer process (or pre-randomizer and randomized selector) need not worry about re-entrant, stateful operation and gets a much simpler implementation.
B. Define to what level you need to prove the quality of randomness of the generated numbers (or whether the generated sets have to follow some specific law of statistical theory -- some known synthetic distribution, or be truly random with an utmost Kolmogorov complexity of the resulting set). One need not be an NSA expert to state that numerical generation of truly random sequences is a very hard problem and carries computational costs associated with producing high-randomness products.
Hyper-chaotic and truly random sequences are computationally extremely expensive. Using low- or poor-randomness generators is not an option for randomness-quality-sensitive applications (whatever the marketing papers may say, no MIL-STD- or NSA-graded system will ever accept this compromised quality in environments where the results indeed matter, so why settle for less in scientific simulations? Perhaps not a problem if you do not mind missing so many "unvisited" states of the simulated phenomena).
C. Verify how many random numbers your simulation needs to consume per microsecond, and whether this design parameter is constant or may scale up by moving to a multi-threaded, vectorised, or Grid-/Cloud-based distributed computation framework.
D. Decide whether your simulation needs global, per-thread, or per-Grid/CloudNode access management to the pool of randomized numbers in the case of a vectorised or Grid/Cloud-based computational strategy.
Task Solution Approach
The fastest [1] and best [2] solution, with [A] and [B] solved and options for [D], is to pre-generate numbers of the utmost randomness quality into an adequate access pool (and pay an acceptable cost for [C] and [D] in access-policy and access-management controls to re-read from the pool, rather than re-generate).
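As a hypothetical illustration of such a pre-generated access pool (a sketch under the assumption of a single-threaded consumer, not a tuned implementation):

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical pre-generated pool: pay the full generation cost up front,
// then re-read cheaply (and reproducibly) during the simulation runs.
class RandomPool {
public:
    RandomPool(std::size_t size, std::uint64_t seed) : pool(size), next(0)
    {
        std::mt19937_64 gen(seed);
        std::uniform_real_distribution<double> uniform01(0., 1.);
        for (double& v : pool)
            v = uniform01(gen);
    }
    double operator()() // wraps around once the pool is exhausted
    {
        double v = pool[next];
        next = (next + 1) % pool.size();
        return v;
    }
private:
    std::vector<double> pool;
    std::size_t next;
};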

Boost vs. .Net random number generators

I developed the same algorithm (Baum-Welch for estimating parameters of a hidden Markov model) both in F# (.Net) and C++. In both cases I developed the same test that generates random test data with known distribution and then uses the algorithm to estimate the parameters, and makes sure it converges to the known right answer.
The problem is that the test works fine in the F# case, but fails to converge in the C++ implementation. I compared both algorithms on some real-world data and they give the same results, so my guess is that the generation of the test data is broken in the C++ case. Hence my question: What is the random number generator that comes with .Net 4 (I think this is the default version with VS2010)?
In F# I am using:
let random = new Random()
let randomNormal () = //for a standard normal random variable
let u1 = random.NextDouble()
let u2 = random.NextDouble()
let r = sqrt (-2. * (log u1))
let theta = 2. * System.Math.PI * u2
r * (sin theta)
//random.NextDouble() for uniform random variable on [0-1]
In C++ I use the standard Boost classes:
class HmmGenerator
{
public:
    HmmGenerator() :
        rng(37), // the seed does change the result, but it doesn't make it work
        normalGenerator(rng, boost::normal_distribution<>(0.0, 1.0)),
        uniformGenerator(rng, boost::uniform_01<>()) {} // other stuff here as well
private:
    boost::mt19937 rng;
    boost::variate_generator<boost::mt19937&,
                             boost::normal_distribution<> > normalGenerator;
    boost::variate_generator<boost::mt19937&,
                             boost::uniform_01<> > uniformGenerator;
};
Should I expect different results using these two ways of generating random numbers?
EDIT: Also, is the generator used in .Net available in Boost (ideally with the same parameters), so I could run it in C++ and compare the outcomes?
Hence my question: What is the random number generator that comes with .Net 4 (I think this is the default version with VS2010)?
From the documentation on Random
The current implementation of the Random class is based on Donald E. Knuth's subtractive random number generator algorithm. For more information, see D. E. Knuth. "The Art of Computer Programming, volume 2: Seminumerical Algorithms". Addison-Wesley, Reading, MA, second edition, 1981.
Should I expect different results using these two ways of generating random numbers?
The Mersenne-Twister algorithm you're using in C++ is considered very respectable, compared to other off-the-shelf random generators.
I suspect any discrepancy in your code lies elsewhere.
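One way to test that suspicion (a sketch, untested): replicate the F# Box-Muller transform on top of Boost's uniform_01, so both implementations shape their uniforms identically and any remaining difference must come from elsewhere:

#include <boost/random.hpp>
#include <cmath>

boost::mt19937 rng(37);
boost::variate_generator<boost::mt19937&, boost::uniform_01<> >
    uniform(rng, boost::uniform_01<>());

// Mirrors the F# randomNormal(), including its (shared) flaw that
// u1 == 0 would make log(u1) blow up.
double random_normal()
{
    const double pi = 3.14159265358979323846;
    double u1 = uniform();
    double u2 = uniform();
    double r = std::sqrt(-2.0 * std::log(u1));
    double theta = 2.0 * pi * u2;
    return r * std::sin(theta);
}

If the C++ test then converges like the F# one, the original discrepancy was in the test-data generation rather than in the engines.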

Questions with using boost for generating normal random numbers

I was hoping to learn how to generate numbers from a normal distribution in C++ when I saw This Post. It gives a very good example, but still I am not sure what the & in boost::variate_generator<boost::mt19937&, boost::normal_distribution<> > var_nor(rng, nd); means. What effect would it produce if I did not include this & here?
Also, when reading the tutorial on Boost's official website, I found that after generating a distribution object with boost::random::uniform_int_distribution<> dist(1, 6), they were able to directly generate random numbers with it by calling dist(gen) (gen here is the random engine), without invoking the "variate_generator" object. Of course, this is for generating uniform random numbers, but I am curious whether I can do the same with the normal distribution, as an alternative to calling "variate_generator"?
Short background information
One approach to generating random numbers with a specific distribution is to generate uniformly distributed random numbers from the interval [0, 1), for example, and then apply some maths to shape them into the desired distribution. So you have two objects: one generator for random numbers from [0, 1), and one distribution object, which takes uniformly distributed random numbers and spits out random numbers in the desired (e.g. the normal) distribution.
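As a concrete illustration of this two-stage idea (a hypothetical sketch, not from the post above), exponential random numbers can be shaped from uniform ones via the inverse CDF, x = -ln(1 - u)/lambda:

#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_01.hpp>
#include <cmath>

boost::mt19937 gen;           // stage 1: raw uniform randomness in [0, 1)
boost::uniform_01<> uniform01;

double exponential_sample(double lambda)
{
    double u = uniform01(gen); // stage 2: shape it into the target distribution
    return -std::log(1.0 - u) / lambda;
}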
Why passing the generator by reference
The var_nor object in your code couples the generator rng with the normal distribution nd. You have to pass your generator by reference, which is the & in the template argument. This is really essential, because the random number generator has an internal state from which it computes the next (pseudo-)random number. If you did not pass the generator by reference, you would create a copy of it, and this might lead to code which always creates the same random numbers. See this blog post as an example.
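A small demonstration of the difference (a sketch; the printed values depend on the Boost version): with the engine copied, the two variate generators start from identical state and produce identical values, while reference-bound ones advance the shared state:

#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <boost/random/variate_generator.hpp>
#include <iostream>

int main()
{
    boost::mt19937 rng;
    boost::normal_distribution<> nd;

    // Engine passed by value: a and b each hold their own copy of the
    // same initial state, so they yield the exact same sequence.
    boost::variate_generator<boost::mt19937, boost::normal_distribution<> > a(rng, nd);
    boost::variate_generator<boost::mt19937, boost::normal_distribution<> > b(rng, nd);
    std::cout << a() << " == " << b() << std::endl;

    // Engine passed by reference: c and d advance the shared state,
    // so their outputs differ.
    boost::variate_generator<boost::mt19937&, boost::normal_distribution<> > c(rng, nd);
    boost::variate_generator<boost::mt19937&, boost::normal_distribution<> > d(rng, nd);
    std::cout << c() << " != " << d() << std::endl;
}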
Why the variate_generator is necessary
Now to the part about why you cannot use the distribution directly with the generator. If you try the following code
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <iostream>

int main()
{
    boost::mt19937 generator;
    boost::normal_distribution<> distribution(0.0, 1.0);

    // WARNING: THIS DOES NOT WORK AS MIGHT BE EXPECTED!!
    for (int i = 0; i < 100; ++i)
        std::cout << distribution(generator) << std::endl;

    return 0;
}
you will see that it outputs only NaNs (I have tested it with Boost 1.46). The reason is that the Mersenne Twister returns a uniformly distributed integer random number. However, most (probably even all) continuous distributions require floating-point random numbers from the range [0, 1). The example given in the Boost documentation works because uniform_int_distribution is a discrete distribution and thus can deal with integer RNGs.
Note: I have not tried the code with a newer version of Boost. Of course, it would be nice if the compiler threw an error if a discrete RNG is used together with a continuous distribution.
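For completeness, a sketch of a working variant from the same era wraps the engine and the distribution in a variate_generator, which performs the required integer-to-floating-point adaptation:

#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <boost/random/variate_generator.hpp>
#include <iostream>

int main()
{
    boost::mt19937 generator;
    boost::normal_distribution<> distribution(0.0, 1.0);

    // variate_generator adapts the integer-producing engine to the
    // floating-point input the continuous distribution expects.
    boost::variate_generator<boost::mt19937&, boost::normal_distribution<> >
        var_nor(generator, distribution);

    for (int i = 0; i < 100; ++i)
        std::cout << var_nor() << std::endl;

    return 0;
}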

Random numbers from Beta distribution, C++

I've written a simulation in C++ that generates (1,000,000)^2 numbers from a specific probability distribution and then does something with them. So far I've used Exponential, Normal, Gamma, Uniform and Poisson distributions. Here is the code for one of them:
#include <boost/random.hpp>
...main...
srand(time(NULL));
seed = rand();
boost::random::mt19937 igen(seed);
boost::random::variate_generator<boost::random::mt19937, boost::random::normal_distribution<> >
    norm_dist(igen, boost::random::normal_distribution<>(mu, sigma));
Now I need to run it for the Beta distribution. All of the distributions I've done so far took 10-15 hours. The Beta distribution is not in the boost/random package so I had to use the boost/math/distributions package. I found this page on StackOverflow which proposed a solution. Here it is (copy-pasted):
#include <boost/math/distributions.hpp>
using namespace boost::math;
double alpha, beta, randFromUnif;
//parameters and the random value on (0,1) you drew
beta_distribution<> dist(alpha, beta);
double randFromDist = quantile(dist, randFromUnif);
I replicated it and it worked. The run-time estimates of my simulation are linear and accurately predictable, and they say this will run for 25 days. I see two possibilities:
1. the method proposed is inferior to the one I was using previously for other distributions
2. the Beta distribution is just much harder to generate random numbers from
Bear in mind that I have a below-minimal understanding of C++ coding, so the questions I'm asking may be silly. I can't wait a month for this simulation to complete, so is there anything I can do to improve it? Perhaps use the initial method that I was using and modify it to work with the boost/math/distributions package? I don't even know if that's possible.
Another piece of information that may be useful is that the parameters are the same for all (1,000,000)^2 of the numbers that I need to generate. I'm saying this because the Beta distribution does have a nasty PDF and perhaps the knowledge that the parameters are fixed can somehow be used to simplify the process? Just a random guess.
The beta distribution is related to the gamma distribution. Let X be a random number drawn from Gamma(α,1) and Y from Gamma(β,1), where the first argument to the gamma distribution is the shape parameter. Then Z=X/(X+Y) has distribution Beta(α,β). With this transformation, it should only take twice as much time as your gamma distribution test.
Note: The above assumes the most common representation of the gamma distribution, Gamma(shape,scale). Be aware that different implementations of the gamma distribution random generator will vary with the meaning and order of the arguments.
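A sketch of that transformation (untested, and assuming a Boost version, 1.47 or later, whose boost::random distributions can be invoked directly with an engine):

#include <boost/random.hpp>

// Sketch: Z = X / (X + Y) with X ~ Gamma(alpha, 1) and Y ~ Gamma(beta, 1)
// gives Z ~ Beta(alpha, beta).
double beta_sample(boost::random::mt19937& gen, double alpha, double beta)
{
    boost::random::gamma_distribution<> gamma_alpha(alpha); // shape alpha, scale 1
    boost::random::gamma_distribution<> gamma_beta(beta);   // shape beta, scale 1
    double x = gamma_alpha(gen);
    double y = gamma_beta(gen);
    return x / (x + y);
}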
If you want a distribution that is very Beta-like, but has a very simple closed-form inverse CDF, it's worth considering the Kumaraswamy distribution:
http://en.wikipedia.org/wiki/Kumaraswamy_distribution
It's used as an alternative to the Beta distribution when a large number of random samples are required quickly.
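Because the inverse CDF is closed-form, sampling is one uniform draw plus two pow calls; a sketch (untested, using <random> for the uniform source):

#include <cmath>
#include <random>

// Sketch: Kumaraswamy(a, b) via its closed-form inverse CDF,
// x = (1 - (1 - u)^(1/b))^(1/a), applied to a uniform u in [0, 1).
double kumaraswamy_sample(std::mt19937& gen, double a, double b)
{
    std::uniform_real_distribution<double> uniform01(0.0, 1.0);
    double u = uniform01(gen);
    return std::pow(1.0 - std::pow(1.0 - u, 1.0 / b), 1.0 / a);
}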
Try compiling with optimization. Using a flag -O3 will usually speed things up. See this post on optimisation flags or this overview for slightly more detail.