Reinventing The Wheel: Random Number Generator - c++

So I'm new to C++ and am trying to learn some things. As such I am trying to make a Random Number Generator (RNG or PRNG if you will). I have basic knowledge of RNGs, like you have to start with a seed and then send the seed through the algorithm. What I'm stuck at is how people come up with said algorithms.
Here is the code I have to get the seed.
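#include <ctime>

unsigned int getSeed()
{
    // Use the current calendar time as the seed. The cast is explicit
    // because time_t may be wider than unsigned int.
    return static_cast<unsigned int>(std::time(nullptr));
}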
Now I know that there are prebuilt RNGs in C++, but I'm looking to learn, not just copy other people's work and try to figure it out.
So if anyone could lead me to where I could read or show me examples of how to come up with algorithms for this I would be greatly appreciative.

First, just to clarify, any algorithm you come up with will be a pseudo random number generator and not a true random number generator. Since you would be making an algorithm (i.e. writing a function, i.e. making a set of rules), the random number generator would have to eventually repeat itself or do something similar which would be non-random.
Examples of truly random number generators are ones that capture random noise from nature and digitize it. These include:
http://www.fourmilab.ch/hotbits/
http://www.random.org/
You can also buy physical equipment that generates white noise (or uses some other source of randomness) and digitally captures it:
http://www.lavarnd.org/
http://www.idquantique.com/true-random-number-generator/products-overview.html
http://www.araneus.fi/products-alea-eng.html
In terms of pseudo random number generators, the easiest ones to learn (and ones that an average lay person could probably make on their own) are the linear congruential generators. Unfortunately, these are also some of the worst PRNGs there are.
Some guidelines for determining what is a good PRNG include:
Periodicity (how many numbers are generated before the sequence starts to repeat?)
Consecutive numbers (what is the probability that the same number will be repeated twice in a row)
Uniformity (Is it just as likely to pick numbers from a certain sub range as another sub range)
Difficulty in reverse engineering it (If it is close to truly random then someone should not be able to figure out the next number it generates based on the last few numbers it generated)
Speed (how fast can I generate a new number? Does it take 5 or 500 arithmetic operations)
I'm sure there are others I am missing
One of the more popular ones right now that is considered good in most applications (i.e. not cryptography) is the Mersenne Twister. It is a fairly compact algorithm, perhaps only 20 or 30 lines of code. However, trying to come up with those 20 or 30 lines of code from scratch takes a lot of brainpower and study of PRNGs. Usually the most famous algorithms are designed by a professor or industry professional who has studied PRNGs for decades.
I hope you do study PRNGs and try to roll your own (try Knuth's Art of Computer Programming or Numerical Recipes as a starting place), but I wanted to lay this all out so that it is clear that, at the end of the day (unless PRNGs will be your life's work), it's much better to just use something someone else has come up with. Also, along those lines, I'd like to point out that historically compilers, spreadsheets, etc. have not used what most mathematicians consider good PRNGs, so if you need a high quality PRNG, don't use the standard library one in C++, Excel, .NET, Java, etc. until you have researched what they are implementing it with.

A linear congruential generator is commonly used and the Wiki article explains it pretty well.
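As a rough illustration of the idea, here is a minimal sketch of a linear congruential generator using the recurrence x(n+1) = (a * x(n) + c) mod m. The struct name Lcg32 is made up for this example, and the constants (a = 1664525, c = 1013904223, m = 2^32) are the parameter set commonly attributed to Numerical Recipes; other choices exist and this is not meant as a recommendation.

```cpp
#include <cstdint>

// Minimal linear congruential generator: x_{n+1} = (a * x_n + c) mod 2^32.
// The modulus 2^32 comes for free from unsigned 32-bit wraparound.
// Lcg32 is a hypothetical name used only for this sketch.
struct Lcg32 {
    std::uint32_t state;

    explicit Lcg32(std::uint32_t seed) : state(seed) {}

    // Advance the state by one step and return it as the next number.
    std::uint32_t next() {
        state = 1664525u * state + 1013904223u;
        return state;
    }
};
```

Seeding it with the result of getSeed() and calling next() repeatedly gives the sequence. The low-order bits of a power-of-two LCG like this one are much less random than the high-order bits, which is one reason these generators have a poor reputation.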

To quote John von Neumann:
Anyone who considers arithmetical
methods of producing random digits is
of course in a state of sin.
This is taken from Chapter 3 Random Numbers of Knuth's book "The Art of Computer Programming", which must be the most exhaustive overview of the subject available. And once you have read it, you will be exhausted. You will also know why you don't want to write your own random number generator.

The correct solution best fulfills the requirements and the requirements of every situation will be unique. This is probably the simplest way to go about it:
Create a large one dimensional array
populated with "real" random values.
"seed" your pseudo-random generator by
calculating the starting index with
system time.
Iterate through the array and return
the value for each call to your
function.
Wrap around when it reaches the end.
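The steps above could be sketched roughly as follows. The names TableRng and kTable are invented for this example, and the eight table entries are placeholders so that the code compiles; a real table would be much larger and filled with values from a true random source.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// PLACEHOLDER values only. In practice this table would hold numbers
// obtained from a true random source.
const std::array<std::uint32_t, 8> kTable = {{
    0x243F6A88u, 0x85A308D3u, 0x13198A2Eu, 0x03707344u,
    0xA4093822u, 0x299F31D0u, 0x082EFA98u, 0xEC4E6C89u
}};

// Table-driven generator (hypothetical name): the seed picks the starting
// index, each call returns the next entry, and the index wraps at the end.
class TableRng {
public:
    explicit TableRng(std::size_t seed) : index_(seed % kTable.size()) {}

    std::uint32_t next() {
        const std::uint32_t value = kTable[index_];
        index_ = (index_ + 1) % kTable.size();  // wrap around
        return value;
    }

private:
    std::size_t index_;
};
```

To seed from the system time, one could construct it as TableRng rng(static_cast<std::size_t>(std::time(nullptr))) after including <ctime>. Note that the period of such a generator can never exceed the table size.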

Related

How should I choose parameters for a smaller-than-standard std::mersenne_twister_engine?

I need a C++11 random number generator which is "good enough" and which I can save and restore state in. I want the saved state to be significantly smaller than the 6.6kb or so which this code produces
std::mt19937 rng (1);
std::ofstream save ("save.txt");
save << rng;
std::mersenne_twister_engine has a large number of parameters. It's a bit scary.
For my purposes, a period on the order of billions is sufficient. I've heard of TinyMT, that may be appropriate but can't see how to implement it as a template specialization.
How should I choose the parameters? I suspect it will break badly if I merely reduce the "state size" parameter to a few words.
I would consider using a different engine entirely but, apart from tolerating a moderate period, I don't want to sacrifice the quality of statistical randomness. Artefacts such as the below (for linear congruentals) are unacceptable.
If you don't need a lot of numbers, any decent 64-bit RNG will be good. Off the top of my head, a very good generator would be XorShift64*, paper http://arxiv.org/abs/1402.6246, code https://github.com/Iwan-Zotow/xorshift64STAR
Another option to use is PCG, "Quadratisch. Praktisch. Gut.", paper and code at http://www.pcg-random.org/
They are both statistically better than MT, the only disadvantage being small(er) period, but it is ok with you as far as I can see
There are many good generators with a small state: MRG32k3a, LFSR113, Chacha-8, Philox-32x4. Even Mixmax (with N=17) would be small by your standard (state of 17 doubles).
TinyMT is also a possibility, although Vigna has shown that some of the bits are not always good (I'm not sure whether the not-so-great lower bits really matter in practice).
I would be wary of xorshift based rngs, see the paper Again, random numbers fall mainly in the planes: xorshift128+ generators by Matsumoto for example. I am also dubious of PCG, if only for the colored table on frontpage of the website: it dumbs things down too much, does not present all the relevant generators, and is skewed towards PCG of course.

Usefulness of `rand()` - or who should call `srand()`?

Background: I use rand(), std::rand(), std::random_shuffle() and other functions in my code for scientific calculations. To be able to reproduce my results, I always explicitly specify the random seed, and set it via srand(). That worked fine until recently, when I figured out that libxml2 would also call srand() lazily on its first usage - which was after my early srand() call.
I filed a bug report with libxml2 about its srand() call, but I got the answer:
Initialize libxml2 first then.
That's a perfectly legal call to be made from a library. You should
not expect that nobody else calls srand(), and the man page nowhere
states that using srand() multiple time should be avoided
This is actually my question now. If the general policy is that every lib can/should/will call srand(), and I can/might also call it here and there, I don't really see how that can be useful at all. Or how is rand() useful then?
That is why I thought, the general (unwritten) policy is that no lib should ever call srand() and the application should call it only once in the beginning. (Not taking multi-threading into account. I guess in that case, you anyway should use something different.)
I also tried to research a bit which other libraries actually call srand(), but I didn't find any. Are there any?
My current workaround is this ugly code:
{
// On the first call to xmlDictCreate,
// libxml2 will initialize some internal randomize system,
// which calls srand(time(NULL)).
// So, do that first call here now, so that we can use our
// own random seed.
xmlDictPtr p = xmlDictCreate();
xmlDictFree(p);
}
srand(my_own_seed);
Probably the only clean solution would be to not use that at all and only to use my own random generator (maybe via C++11 <random>). But that is not really the question. The question is, who should call srand(), and if everyone does it, how is rand() useful then?
Use the new <random> header instead. It allows for multiple engine instances, using different algorithms and more importantly for you, independent seeds.
[edit]
To answer the "useful" part, rand generates random numbers. That's what it's good for. If you need fine-grained control, including reproducibility, you should not only have a known seed but a known algorithm. srand at best gives you a fixed seed, so that's not a complete solution anyway.
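As a minimal sketch of what this looks like in practice, the class below (SeededSource is a made-up name) owns a std::mt19937 with an explicit seed. Because the engine is a private member, calls to srand() or rand() elsewhere in the program cannot touch its state. The raw engine output is used rather than a distribution, since the standard fixes the output of std::mt19937 for a given seed but leaves the distributions implementation-defined.

```cpp
#include <cstdint>
#include <random>

// A privately owned, explicitly seeded engine. Nothing outside this
// object can reseed or advance it. SeededSource is a hypothetical name.
class SeededSource {
public:
    explicit SeededSource(std::uint32_t seed) : engine_(seed) {}

    // Next raw 32-bit output of the Mersenne Twister.
    std::uint32_t next() {
        return static_cast<std::uint32_t>(engine_());
    }

private:
    std::mt19937 engine_;
};
```

Two SeededSource objects built with the same seed produce the same sequence regardless of what any library does with srand() in between.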
Well, the obvious thing has been stated a few times by others, use the new C++11 generators. I'm restating it for a different reason, though.
You use the output for scientific calculations, and rand usually implements a rather poor generator (nowadays many mainstream implementations use MT19937, which apart from bad state recovery isn't so bad, but you have no guarantee of a particular algorithm, and at least one mainstream compiler still uses a really poor LCG).
Don't do scientific calculations with a poor generator. It doesn't really matter if you have things like hyperplanes in your random numbers if you do some silly game shooting little birds on your mobile phone, but it matters big time for scientific simulations. Don't ever use a bad generator. Don't.
Important note: std::random_shuffle (the version with two parameters) may actually call rand, which is a pitfall to be aware of if you're using that one, even if you otherwise use the new C++11 generators found in <random>.
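One way around that pitfall is std::shuffle, which takes the engine as an explicit argument and therefore never touches rand. A minimal sketch follows; the helper name shuffled_range is invented for this example.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Returns 0..n-1 in an order determined only by `seed`.
// shuffled_range is a hypothetical helper name.
std::vector<int> shuffled_range(int n, std::uint32_t seed) {
    std::vector<int> values(n);
    std::iota(values.begin(), values.end(), 0);  // fill with 0, 1, ..., n-1

    std::mt19937 engine(seed);                   // private, explicitly seeded
    std::shuffle(values.begin(), values.end(), engine);
    return values;
}
```

The same seed gives the same permutation on a given standard library, although the exact permutation may differ between library implementations.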
About the actual issue, calling srand twice (or even more often) is no problem. You can in principle call it as often as you want; all it does is change the seed, and consequently the pseudorandom sequence that follows. I'm wondering why an XML library would want to call it at all, but they're right in their response: it is not illegitimate for them to do it. But it also doesn't matter.
The only important thing to make sure is that either you don't care about getting any particular pseudorandom sequence (that is, any sequence will do, you're not interested in reproducing an exact sequence), or you are the last one to call srand, which will override any prior calls.
That said, implementing your own generator with good statistical properties and a sufficiently long period in 3-5 lines of code isn't all that hard either, with a little care. The main advantage (apart from speed) is that you control exactly where your state is and who modifies it.
It is unlikely that you will ever need periods much longer than 2^128 because of the sheer forbidding time needed to actually consume that many numbers. A 3 GHz computer consuming one number every cycle will run for about 10^21 years on a 2^128 period, so there's not much of an issue for humans with average lifespans. Even assuming that the supercomputer you run your simulation on is a trillion times faster, your great-great-great-grandchildren won't live to see the end of the period.
In that light, periods like 2^19937, which current "state of the art" generators deliver, are really ridiculous. That's trying to improve the generator at the wrong end if you ask me (it's better to make sure they're statistically firm and that they recover quickly from a worst-case state, etc.). But of course, opinions may differ here.
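The estimate for a 2^128 period can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the function name years_to_exhaust is made up, and a year is taken as 365.25 days.

```cpp
#include <cmath>

// Years needed to consume 2^bits numbers at `rate_hz` numbers per second.
// years_to_exhaust is a hypothetical helper name.
double years_to_exhaust(int bits, double rate_hz) {
    const double seconds_per_year = 365.25 * 24.0 * 3600.0;
    const double total_numbers = std::ldexp(1.0, bits);  // 2^bits as a double
    return total_numbers / rate_hz / seconds_per_year;
}
```

Calling years_to_exhaust(128, 3e9) gives a value on the order of 10^21 years, and even at a trillion times that rate the result is still billions of years.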
This site lists a couple of fast generators with implementations. They're xorshift generators combined with an addition or multiplication step and a small (from 2 to 64 machine words) lag, which results in both fast and high quality generators (there's a test suite as well, and the site's author wrote a couple of papers on the subject, too). I'm using a modification of one of these (the 2-word 128-bit version ported to 64-bits, with shift triples modified accordingly) myself.
This problem is being tackled in C++11's random number generation, i.e. you can create an instance of a class:
std::default_random_engine e1;
which lets you fully control the random numbers generated from the object e1 (as opposed to whatever is used inside libxml). The general rule of thumb would then be to use the new constructs, as they let you generate your random numbers independently.
Very good documentation
To address your concerns - I also think that it is bad practice to call srand() in a library like libxml. However, it's more that srand() and rand() are not designed to be used in the context you are trying to use them in - they are enough when you just need some random numbers, as libxml does. When you need reproducibility and want to be sure that you are independent of others, the new <random> header is the way to go. So, to sum up, I don't think it's good practice on the library's side, but it's hard to blame them for doing that. Also, I can't imagine them changing it, as a huge number of other pieces of software probably depend on it.
The real answer here is that if you want to be sure that YOUR random number sequence isn't being altered by someone else's code, you need a random number context that is private to YOUR work. Note that calling srand is only one small part of this. For example, if you call some function in some other library that calls rand, it will also disrupt the sequence of YOUR random numbers.
In other words, if you want predictable behaviour from your code, based on random number generation, it needs to be completely separate from any other code that uses random numbers.
Others have suggested using the C++ 11 random number generation, which is one solution.
On Linux and other systems with a compatible C library, you could also use rand_r, which takes a pointer to an unsigned int seed that is used for that sequence. So if you initialize a seed variable and then use it with all calls to rand_r, it will produce a sequence that belongs only to YOUR code. This is of course still the same old rand generator, just with a separate seed. The main reason I mention this is that you could fairly easily do something like this:
int myrand()
{
static unsigned int myseed = ... some initialization of your choice ...;
return rand_r(&myseed);
}
and simply call myrand instead of std::rand (it should also be doable, with a small wrapper, to work it into the std::random_shuffle overload that takes a random generator parameter).

Random number distributions c++11

I have a program that uses several distribution objects:
Like:
std::normal_distribution
std::exponential_distribution
etc..
Should I use a random number engine for each one, or should make all of them share a same generator?
Usually you want every instantiation of a distribution to represent an uncorrelated stochastic variable, which means that you should use a new engine for each one. Should your stochastic variables be correlated, you should introduce the correlation yourself rather than reusing a random engine, so that you can ensure it is correctly modeled.
Sometimes, you can cheat a tiny bit by seeding only a single (and used only for this) random engine and use that to seed the other random engines.
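A minimal sketch of that seeding trick is shown below. The struct name Streams and its members are invented for this example, and the distribution parameters are arbitrary. A master engine is used only to produce seeds, and each distribution then gets its own engine.

```cpp
#include <cstdint>
#include <random>

// One engine per distribution, all seeded from a single master engine.
// Streams is a hypothetical name used only for this sketch.
struct Streams {
    std::mt19937 normal_engine;
    std::mt19937 exp_engine;
    std::normal_distribution<double> normal{0.0, 1.0};
    std::exponential_distribution<double> expo{1.0};

    explicit Streams(std::uint32_t master_seed) {
        std::mt19937 master(master_seed);  // used only for seeding
        normal_engine.seed(master());
        exp_engine.seed(master());
    }

    double next_normal() { return normal(normal_engine); }
    double next_exp() { return expo(exp_engine); }
};
```

With this layout, drawing more or fewer normal samples does not change the sequence of exponential samples, and a single master seed still reproduces the whole run. Seeding engines from another engine's output is the "cheat" described above, not a guarantee of independent streams.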
Should you not care about ensuring that your stochastic variables are uncorrelated (e.g. you are not doing any scientific work, but programming a game) you may ignore this, since it usually does not matter.
The usual answer is, of course, it depends.
If you're trying to do simulation work and will be pulling lots of random numbers, it's better to use a different engine for each. Otherwise, it doesn't much matter.
There is no reason to use multiple engines. If you are going to draw A LOT of random numbers and you think the results will seem correlated, change the engine to one with bigger cycle length.

How bad rand from stdlib.h is?

I'm making a monte carlo simulation in C++ and I was using Boost for random numbers. I used GSL a bit too. But it turns out random number generation is one of my biggest runtime inefficiencies, so I just started using good old rand() from cstdlib.
How much do I risk getting poor random number properties in my simulation? I use around 10^6 or 10^7 random number samples.
There are two issues: (1) because RAND_MAX is only guaranteed to be at least 32767, there might not be many possible values (not necessarily bad for some applications), and (2) poor implementations.
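To make issue (1) concrete, here is a small sketch of what the limited range means when rand() is mapped to the unit interval. The helper names are invented for this example.

```cpp
#include <cstdlib>

// Number of distinct values rand() can return on this platform.
// With the minimum guaranteed RAND_MAX of 32767 this is only 32768.
double rand_distinct_values() {
    return static_cast<double>(RAND_MAX) + 1.0;
}

// Map rand() to [0, 1). The result can only take rand_distinct_values()
// different values, however many samples are drawn.
double rand_unit() {
    return std::rand() / rand_distinct_values();
}
```

On a platform where RAND_MAX is 32767, 10^6 or 10^7 samples from rand_unit() would hit each possible value hundreds of times on average, so the samples are far coarser than a double could represent.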
If you need what is known as a secure random number generator you will need to look somewhere else. But for many apps, rand() is sufficient.
A blog post that addresses your concerns is http://eternallyconfuzzled.com/arts/jsw_art_rand.aspx.

What is the periodicity of the PRNG in GNU C Library?

Is there any literature available on the periodicity of the random number generator in gcc's g++ (if we don't re-seed the function)? I suppose I could run tests myself, but it would be better to have access to well-verified research.
Thanks in advance for your help.
// EDIT
I just wanted to add that I have searched quite a bit, with multiple engines, but I have not found anything specific. I have only read general comments about the periodicity being limited by the number of bits needed to represent the seed. (So I guess that given the fact that srand is usually called with time, the periodicity can be no more than 10^12 or so. But something more definite would be very helpful before I start implementing my algorithms.)
When searching in the rand(3) man page, I found this:
The versions of rand() and srand() in
the Linux C Library use the same
random number generator as random()
and srandom()
so I had a look at the random(3) man page, and here is your answer:
The period of this random number
generator is very large, approximately
16*((2**31)-1)
This can be quite useful for pedagogic purposes, since you want to develop your own PRNG. However, I would discourage you from using this PRNG when developing an application. You should prefer one of the Boost.Random implementations, as suggested by @Neil Butterworth (MT19937 is a good default PRNG, sufficient for most applications).
Finally, if you intend to learn more about PRNGs, I would suggest reading these two scientific articles, which survey PRNGs well.
Practical distribution of random streams for stochastic High Performance Computing, David RC Hill, in International Conference on High Performance Computing and Simulation (HPCS), 2010
Pseudorandom number generators, Pierre L'Ecuyer, in Encyclopedia of Quantitative Finance, 2008
The srand/rand functions are a bit broken. As you are using C++, I strongly recommend you use the Boost random number library. It's almost entirely header-only (as far as I know, only random_device needs a compiled component), so you usually don't need to build anything. There's an example of how to use it here.