Unpredictable pseudo-RNG - c++

I'm working on a cryptography project in C++ for school, and I'm going to need a way to generate random numbers that can't be regenerated by someone else (who "guessed" the seed).
To be precise, I need either a truly random generator or a way to get a 100% "secure" seed. I've already done some research and thinking, and I've found two ways I could do it. The first is to initialise the seed with the current time, but I worry that a "hacker" might find out the moment the key was generated; they would then have the seed and be able to predict the generated numbers. The second is to ask the user for a seed.
Now, what if I don't want the user to generate the key? Are my worries about the time-based seed founded, or are they just pure paranoia? Is there a chance anyone could find out the moment the code was executed? Or are there maybe other ways of doing it that I've missed?
Sidenote: I'm using std::default_random_engine from <random>.

user1095108 had the right idea, but the comment probably was too short.
Ask the user to type something at random. Users are pretty bad at choosing random characters, so each character is worth only about 1 bit of randomness, yet you'll need about 40-50 bits.
However, users are also pretty bad at typing at an exact rhythm. The timing of each keystroke adds an extra few bits of randomness, depending on how accurately your OS can report that. With millisecond resolution, 10 keystrokes should be enough.

Related

Is there a way to check if std::random_device is in fact random?

Quoting from cppreference:
std::random_device is a non-deterministic random number engine, although implementations are allowed to implement std::random_device using a pseudo-random number engine if there is no support for non-deterministic random number generation.
Is there a way to check whether current implementation uses PRNG instead of RNG (and then say exit with an error) and if not, why not?
Note that a little bit of googling shows that at least MinGW implements std::random_device in this way, and thus this is a real danger if std::random_device is to be used.
---edit---
Also, if the answer is no and someone could give some insight as to why there is no such function/trait/something I would be quite interested.
Is there a way to check whether current implementation uses PRNG instead of RNG (and then say exit with an error) and if not, why not?
There is a way: std::random_device::entropy will return 0.0 if it is implemented in terms of a random number engine (that is, it's deterministic).
From the standard:
double entropy() const noexcept;
Returns: If the implementation employs a random number engine, returns 0.0. Otherwise, returns an entropy estimate for the random numbers returned by operator(), in the range min() to log_2(max() + 1).
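A minimal sketch of that check follows. The helper name looks_deterministic is made up for illustration. Be aware that some standard library implementations are known to report a fixed entropy value regardless of the underlying source, so treat the result as a hint rather than proof.

```cpp
#include <random>

// Hypothetical helper (not part of the standard library): true if the
// implementation reports zero entropy, the value the standard reserves
// for a random_device that is backed by a random number engine.
bool looks_deterministic(const std::random_device& rd)
{
    return rd.entropy() == 0.0;
}
```

A program that insists on a non-deterministic source could construct a std::random_device at start-up and exit with an error when looks_deterministic returns true.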
There is no 100% safe way to determine real randomness. With a black-box approach, the best you can do is to find evidence that the output is not fully random:
First, you could verify that the distribution seems random by generating a lot of random numbers and computing statistics about their distribution (e.g. generate 1 million random numbers between 0 and 1000). If some numbers come out significantly more often than others, then it's obviously not really random.
Next, you could run a program that generates random numbers several times from the same initial seed. If you obtain the same sequence each time, then it's definitely a PRNG and not real randomness. However, if you don't obtain the same sequence, it does not prove anything: the library could use some kind of auto-seed (using clock ticks or something else) to hide or improve the pseudo-randomness.
If your application depends heavily on the quality of the randomness (e.g. cryptographic quality), you should consider some more tests, such as those recommended by NIST SP 800-22.
Xarn stated above:
However, said pessimism also precludes this method from differentiating between RNG and PRNG based implementation, making it rather unhelpful. Also VC++ could be realistic, but to check that would probably require a lot of insider knowledge about Windows.
If you debug into the Windows implementation, then you will find that you end up in RtlGenRandom, which is one of the better sources of cryptographically random bytes. If you debug into the Linux implementation, then you should end up reading from /dev/urandom, which is also OK. The fact that they don't tell us that we're not using something awful, like rand, is annoying.
PS - you don't have to have internal Windows knowledge, you just need to attach the symbols to the debugger.

Usefulness of `rand()` - or who should call `srand()`?

Background: I use rand(), std::rand(), std::random_shuffle() and other functions in my code for scientific calculations. To be able to reproduce my results, I always explicitly specify the random seed, and set it via srand(). That worked fine until recently, when I figured out that libxml2 would also call srand() lazily on its first usage - which was after my early srand() call.
I filed a bug report with libxml2 about its srand() call, but I got the answer:
Initialize libxml2 first then.
That's a perfectly legal call to be made from a library. You should
not expect that nobody else calls srand(), and the man page nowhere
states that using srand() multiple time should be avoided
This is actually my question now. If the general policy is that every lib can/should/will call srand(), and I can/might also call it here and there, I don't really see how that can be useful at all. Or how is rand() useful then?
That is why I thought, the general (unwritten) policy is that no lib should ever call srand() and the application should call it only once in the beginning. (Not taking multi-threading into account. I guess in that case, you anyway should use something different.)
I also tried to research a bit which other libraries actually call srand(), but I didn't find any. Are there any?
My current workaround is this ugly code:
{
// On the first call to xmlDictCreate,
// libxml2 will initialize some internal randomize system,
// which calls srand(time(NULL)).
// So, do that first call here now, so that we can use our
// own random seed.
xmlDictPtr p = xmlDictCreate();
xmlDictFree(p);
}
srand(my_own_seed);
Probably the only clean solution would be to not use that at all and only to use my own random generator (maybe via C++11 <random>). But that is not really the question. The question is, who should call srand(), and if everyone does it, how is rand() useful then?
Use the new <random> header instead. It allows for multiple engine instances, using different algorithms and more importantly for you, independent seeds.
[edit]
To answer the "useful" part, rand generates random numbers. That's what it's good for. If you need fine-grained control, including reproducibility, you should not only have a known seed but a known algorithm. srand at best gives you a fixed seed, so that's not a complete solution anyway.
Well, the obvious thing has been stated a few times by others, use the new C++11 generators. I'm restating it for a different reason, though.
You use the output for scientific calculations, and rand usually implements a rather poor generator (meanwhile, many mainstream implementations use MT19937, which apart from bad state recovery isn't so bad, but you have no guarantee of a particular algorithm, and at least one mainstream compiler still uses a really poor LCG).
Don't do scientific calculations with a poor generator. It doesn't really matter if you have things like hyperplanes in your random numbers if you do some silly game shooting little birds on your mobile phone, but it matters big time for scientific simulations. Don't ever use a bad generator. Don't.
Important note: std::random_shuffle (the version with two parameters) may actually call rand, which is a pitfall to be aware of if you're using that one, even if you otherwise use the new C++11 generators found in <random>.
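For reference, a sketch of the alternative: std::shuffle takes the generator explicitly, so the global rand state is never involved (std::random_shuffle was deprecated in C++14 and removed in C++17). The helper name shuffled_copy and the idea of passing the seed as a parameter are illustrative only.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Shuffles a copy of `items` with a privately seeded engine.
// Nothing here reads or writes the global rand()/srand() state.
std::vector<int> shuffled_copy(std::vector<int> items, unsigned seed)
{
    std::mt19937 engine(seed);
    std::shuffle(items.begin(), items.end(), engine);
    return items;
}
```

With the same seed, the same standard library produces the same permutation on every run, regardless of who called srand in between. The exact permutation may differ between standard library implementations.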
About the actual issue, calling srand twice (or even more often) is no problem. You can in principle call it as often as you want; all it does is change the seed, and consequently the pseudorandom sequence that follows. I'm wondering why an XML library would want to call it at all, but they're right in their response: it is not illegitimate for them to do it. But it also doesn't matter.
The only important thing to make sure is that either you don't care about getting any particular pseudorandom sequence (that is, any sequence will do, you're not interested in reproducing an exact sequence), or you are the last one to call srand, which will override any prior calls.
That said, implementing your own generator with good statistical properties and a sufficiently long period in 3-5 lines of code isn't all that hard either, with a little care. The main advantage (apart from speed) is that you control exactly where your state is and who modifies it.
It is unlikely that you will ever need periods much longer than 2^128 because of the sheer forbidding time it would take to actually consume that many numbers. A 3 GHz computer consuming one number every cycle will run for about 10^21 years on a 2^128 period, so there's not much of an issue for humans with average lifespans. Even assuming that the supercomputer you run your simulation on is a trillion times faster, your great-great-grandchildren won't live to see the end of the period.
In that light, periods like 2^19937, which current "state of the art" generators deliver, are really ridiculous; that's trying to improve the generator at the wrong end if you ask me (it's better to make sure they're statistically firm and that they recover quickly from a worst-case state, etc.). But of course, opinions may differ here.
This site lists a couple of fast generators with implementations. They're xorshift generators combined with an addition or multiplication step and a small (from 2 to 64 machine words) lag, which results in both fast and high quality generators (there's a test suite as well, and the site's author wrote a couple of papers on the subject, too). I'm using a modification of one of these (the 2-word 128-bit version ported to 64-bits, with shift triples modified accordingly) myself.
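To give a feel for how small such a generator can be, here is a sketch of xorshift64*, a widely published xorshift variant with a multiplication step. It is not necessarily one of the generators from the site mentioned above, and the struct name is made up for this example. The state must never be zero, so a zero seed is replaced by an arbitrary non-zero constant.

```cpp
#include <cstdint>

// xorshift64*: 64 bits of state, three shift/xor steps, one multiplication.
// The state lives in the object, so no other code can disturb the sequence.
struct XorShift64Star {
    std::uint64_t state;  // must be non-zero

    explicit XorShift64Star(std::uint64_t seed)
        : state(seed != 0 ? seed : 0x9E3779B97F4A7C15ULL)  // avoid the all-zero state
    {
    }

    std::uint64_t next()
    {
        state ^= state >> 12;
        state ^= state << 25;
        state ^= state >> 27;
        return state * 0x2545F4914F6CDD1DULL;
    }
};
```

Two instances created with the same seed produce the same sequence, which is what makes results reproducible. Like all generators of this family, it is not suitable for cryptography.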
This problem is being tackled in C++11's random number generation, i.e. you can create an instance of a class:
std::default_random_engine e1;
which allows you to fully control the random numbers generated from object e1 (as opposed to whatever would be used in libxml). The general rule of thumb would then be to use the new construct, as you can generate your random numbers independently.
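As a sketch of what that independence looks like in practice, the function below owns its engine, so calls to srand or rand elsewhere (for example inside a library) cannot change its output. The function name roll_dice and the die-roll scenario are invented for illustration, and std::mt19937 is used instead of std::default_random_engine so that the algorithm is fixed.

```cpp
#include <random>
#include <vector>

// Draws `count` die rolls from an engine that is local to this call.
std::vector<int> roll_dice(unsigned seed, int count)
{
    std::mt19937 engine(seed);                     // private state, private seed
    std::uniform_int_distribution<int> die(1, 6);  // values 1..6 inclusive
    std::vector<int> rolls;
    for (int i = 0; i < count; ++i)
        rolls.push_back(die(engine));
    return rolls;
}
```

Calling roll_dice twice with the same seed gives the same rolls on a given standard library, no matter what happens to the global rand state in between.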
Very good documentation
To address your concerns: I also think that it would be bad practice to call srand() in a library like libxml. However, it's more that srand() and rand() are not designed to be used in the context you are trying to use them in. They are enough when you just need some random numbers, as libxml does. However, when you need reproducibility and want to be sure that you are independent of others, the new <random> header is the way to go for you. So, to sum up, I don't think it's good practice on the library's side, but it's hard to blame them for doing that. Also, I could not imagine them changing that, as a billion other pieces of software probably depend on it.
The real answer here is that if you want to be sure that YOUR random number sequence isn't being altered by someone else's code, you need a random number context that is private to YOUR work. Note that calling srand is only one small part of this. For example, if you call some function in some other library that calls rand, it will also disrupt the sequence of YOUR random numbers.
In other words, if you want predictable behaviour from your code, based on random number generation, it needs to be completely separate from any other code that uses random numbers.
Others have suggested using the C++ 11 random number generation, which is one solution.
On Linux and other systems with a compatible C library, you could also use rand_r, which takes a pointer to an unsigned int holding the seed that is used for that sequence. So if you initialize a seed variable and then use it with all calls to rand_r, it will produce a sequence that is unique to YOUR code. This is of course still the same old rand generator, just with a separate seed. The main reason I mention this is that you could fairly easily do something like this:
int myrand()
{
static unsigned int myseed = ... some initialization of your choice ...;
return rand_r(&myseed);
}
and simply call myrand instead of std::rand (it should also be possible to work this into the std::random_shuffle overload that takes a random generator parameter).

How to generate good random seed for a random generator?

I certainly can't use the random generator for that. Currently I'm creating a CRC32 hash from unixtime()+microtime().
Are there any smarter methods than hashing time()+microtime() ?
I am not fully satisfied with the results, though: I expected them to be more random, but I can see strong patterns in the output. The patterns went away when I added more calls to MicroTime(), but that makes it a lot slower, so I'm looking for a better way of doing this.
This silly code generates the best output I could make so far, the calculations were necessary or I could see some patterns in the output:
starthash(crc32);
addtohash(crc32, MicroTime());
addtohash(crc32, time(NULL)); // 64bit
addtohash(crc32, MicroTime()/13.37f);
addtohash(crc32, (10.0f-MicroTime())*1337.0f);
addtohash(crc32, (11130.0f-MicroTime())/1313137.0f);
endhash(crc32);
MicroTime() returns microseconds elapsed from program start. I have overloaded the addtohash() to every possible type.
I would rather take non-library solutions, it's just ~10 lines of code probably anyways, I don't want to install huge library because of something I don't actually need that much, and I'm more interested in the code than just using it from a function call.
If in any doubt, get your seed from CryptGenRandom on Windows, or by reading from /dev/random or /dev/urandom on *NIX systems.
This might be overkill for your purposes, but unless it causes performance problems there's no point messing with low-entropy sources like the time.
It's unlikely to be underkill. And if you're writing code with a real need for high-quality secure random data, and didn't bother mentioning that in the question, well, you get what you deserve ;-)
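On the *NIX side this needs no extra library. A minimal sketch, assuming /dev/urandom exists on the target system; the function name read_urandom_seed is invented for this example:

```cpp
#include <cstdint>
#include <fstream>

// Reads one 32-bit seed from /dev/urandom (Unix-like systems only).
// Returns false if the device could not be opened or fully read.
bool read_urandom_seed(std::uint32_t& seed)
{
    std::ifstream dev("/dev/urandom", std::ios::in | std::ios::binary);
    if (!dev)
        return false;
    dev.read(reinterpret_cast<char*>(&seed), sizeof seed);
    return dev.gcount() == static_cast<std::streamsize>(sizeof seed);
}
```

The caller decides what to do on failure. Falling back to a hash of the time is possible, but it brings back the low-entropy problem described above.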
You can look into LFSRs (linear-feedback shift registers) and pseudorandom generators. An LFSR is usually a hardware solution, but you can easily implement your own in software.
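As a rough illustration of a software LFSR, here is one step of a 16-bit Fibonacci LFSR using a commonly cited maximal-length tap set (bits 16, 14, 13 and 11). The function name is made up. Note that an LFSR is itself deterministic, so it still needs a non-zero seed from somewhere else.

```cpp
#include <cstdint>

// One step of a 16-bit Fibonacci LFSR for the polynomial
// x^16 + x^14 + x^13 + x^11 + 1. The state must be non-zero;
// a zero state would stay zero forever.
std::uint16_t lfsr_step(std::uint16_t state)
{
    const unsigned s = state;
    const unsigned bit = (s ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1u;  // feedback bit
    return static_cast<std::uint16_t>((s >> 1) | (bit << 15));
}
```

Starting from any non-zero state, repeated calls visit all 65535 non-zero 16-bit values before the sequence repeats.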

Reinventing The Wheel: Random Number Generator

So I'm new to C++ and am trying to learn some things. As such I am trying to make a Random Number Generator (RNG or PRNG if you will). I have basic knowledge of RNGs, like you have to start with a seed and then send the seed through the algorithm. What I'm stuck at is how people come up with said algorithms.
Here is the code I have to get the seed.
#include <ctime>

int getSeed()
{
// time_t may be wider than int, so the narrowing is made explicit.
time_t randSeed = time(NULL);
return static_cast<int>(randSeed);
}
Now I know that there are prebuilt RNGs in C++, but I'm looking to learn, not just copy other people's work and try to figure it out.
So if anyone could lead me to where I could read or show me examples of how to come up with algorithms for this I would be greatly appreciative.
First, just to clarify, any algorithm you come up with will be a pseudo random number generator and not a true random number generator. Since you would be making an algorithm (i.e. writing a function, i.e. making a set of rules), the random number generator would have to eventually repeat itself or do something similar which would be non-random.
Examples of truly random number generators are ones that capture random noise from nature and digitize it. These include:
http://www.fourmilab.ch/hotbits/
http://www.random.org/
You can also buy physical equipment that generates white noise (or uses some other source of randomness) and digitally captures it:
http://www.lavarnd.org/
http://www.idquantique.com/true-random-number-generator/products-overview.html
http://www.araneus.fi/products-alea-eng.html
In terms of pseudo random number generators, the easiest ones to learn (and ones that an average lay person could probably make on their own) are the linear congruential generators. Unfortunately, these are also some of the worst PRNGs there are.
Some guidelines for determining what is a good PRNG include:
Period (how many numbers are produced before the sequence repeats?)
Consecutive numbers (what is the probability that the same number will be repeated twice in a row?)
Uniformity (is it just as likely to pick numbers from a certain sub-range as from another sub-range?)
Difficulty in reverse engineering it (if it is close to truly random, then someone should not be able to figure out the next number it generates based on the last few numbers it generated)
Speed (how fast can I generate a new number? Does it take 5 or 500 arithmetic operations?)
I'm sure there are others I am missing.
One of the more popular ones right now that is considered good in most applications (i.e. not cryptography) is the Mersenne Twister. As you can see from the link, it is a simple algorithm, perhaps only 20 or 30 lines of code. However, trying to come up with those 20 or 30 lines of code from scratch takes a lot of brainpower and study of PRNGs. Usually the most famous algorithms are designed by a professor or industry professional who has studied PRNGs for decades.
I hope you do study PRNGs and try to roll your own (try Knuth's Art of Computer Programming or Numerical Recipes as a starting place), but I just wanted to lay this all out so that you know that, at the end of the day (unless PRNGs will be your life's work), it's much better to just use something someone else has come up with. Also, along those lines, I'd like to point out that historically compilers, spreadsheets, etc. haven't used what most mathematicians consider good PRNGs, so if you need a high-quality PRNG, don't use the standard library one in C++, Excel, .NET, Java, etc. until you have researched what they implement it with.
A linear congruential generator is commonly used and the Wiki article explains it pretty well.
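For a first experiment, a linear congruential generator really is only a few lines. The sketch below uses the multiplier and increment commonly quoted from Numerical Recipes with a modulus of 2^32; the struct name is invented, and this is meant for learning, not for serious simulation or anything security-related.

```cpp
#include <cstdint>

// Linear congruential generator: x(n+1) = (a * x(n) + c) mod 2^32,
// with a = 1664525 and c = 1013904223.
struct Lcg32 {
    std::uint32_t state;

    explicit Lcg32(std::uint32_t seed) : state(seed) {}

    std::uint32_t next()
    {
        // Unsigned 32-bit arithmetic wraps, which gives the "mod 2^32" for free.
        state = 1664525u * state + 1013904223u;
        return state;
    }
};
```

Seeding it with the value from getSeed() and printing a few thousand outputs is a good way to start looking for the weaknesses mentioned above; the low-order bits in particular repeat with a very short period.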
To quote John von Neumann:
Anyone who considers arithmetical
methods of producing random digits is
of course in a state of sin.
This is taken from Chapter 3 Random Numbers of Knuth's book "The Art of Computer Programming", which must be the most exhaustive overview of the subject available. And once you have read it, you will be exhausted. You will also know why you don't want to write your own random number generator.
The correct solution best fulfills the requirements and the requirements of every situation will be unique. This is probably the simplest way to go about it:
Create a large one-dimensional array populated with "real" random values.
"Seed" your pseudo-random generator by calculating the starting index with system time.
Iterate through the array and return the value for each call to your function.
Wrap around when it reaches the end.
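The steps above can be sketched as follows. The class name TableRng is invented, and the table contents are up to the caller, for example values downloaded from one of the true random sources listed earlier.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Replays a pre-recorded table of random values, wrapping at the end.
class TableRng {
public:
    // `table` must be non-empty; `start` is reduced modulo its size.
    TableRng(std::vector<unsigned> table, std::size_t start)
        : table_(std::move(table)), index_(start % table_.size())
    {
    }

    unsigned next()
    {
        const unsigned value = table_[index_];
        index_ = (index_ + 1) % table_.size();  // wrap around at the end
        return value;
    }

private:
    std::vector<unsigned> table_;  // declared first, so it is initialized first
    std::size_t index_;
};
```

To follow the recipe literally, pass the current time (for example the result of std::time from <ctime>, cast to std::size_t) as the start argument. The sequence repeats after one pass through the table, so the table size is the period.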

What's the best way to unit test code that generates random output?

Specifically, I've got a method that picks n items from a list in such a way that a% of them meet one criterion, b% meet a second, and so on. A simplified example would be to pick 5 items where 50% have a given property with the value 'true', and 50% 'false'; 50% of the time the method would return 2 true/3 false, and the other 50%, 3 true/2 false.
Statistically speaking, this means that over 100 runs, I should get about 250 true/250 false, but because of the randomness, 240/260 is entirely possible.
What's the best way to unit test this? I'm assuming that even though technically 300/200 is possible, it should probably fail the test if this happens. Is there a generally accepted tolerance for cases like this, and if so, how do you determine what that is?
Edit: In the code I'm working on, I don't have the luxury of using a pseudo-random number generator, or a mechanism of forcing it to balance out over time, as the lists that are picked out are generated on different machines. I need to be able to demonstrate that over time, the average number of items matching each criterion will tend to the required percentage.
Randomness and statistics are not favored in unit tests. Unit tests should always return the same result. Always. Not mostly.
What you could do is try to separate the random generator from the logic you are testing. Then you can mock the random generator and return predefined values.
Additional thoughts:
You could consider changing the implementation to make it more testable. Try to use as few random values as possible. You could, for instance, get only one random value to determine the deviation from the average distribution. This would be easy to test. If the random value is zero, you should get exactly the distribution you expect on average. If the value is, for instance, 1.0, you miss the average by some defined factor, for instance by 10%. You could also implement some Gaussian distribution etc. I know this is not the topic here, but if you are free to implement it as you want, consider testability.
Based on the statistical information you have, determine a range instead of a single particular value as the expected result.
Many probabilistic algorithms in e.g. scientific computing use pseudo-random number generators, instead of a true random number generator. Even though they're not truly random, a carefully chosen pseudo-random number generator will do the job just fine.
One advantage of a pseudo-random number generator is that the random number sequence they produce is fully reproducible. Since the algorithm is deterministic, the same seed would always generate the same sequence. This is often the deciding factor why they're chosen in the first place, because experiments need to be repeatable, results reproducible.
This concept is also applicable for testing. Components can be designed such that you can plug in any source of random numbers. For testing, you can then use generators that are consistently seeded. The result would then be repeatable, which is suitable for testing.
Note that if in fact a true random number is needed, you can still test it this way, as long as the component features a pluggable source of random numbers. You can re-plug in the same sequence (which may be truly random if need be) to the same component for testing.
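A minimal sketch of such a pluggable design follows. The names count_true and ScriptedRng are invented for illustration, and the logic under test is deliberately trivial: it only looks at the lowest bit of each value the generator returns.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Logic under test: counts how many of `n` draws come up "true".
// The generator is a parameter, so a test can substitute its own.
template <class Rng>
int count_true(int n, Rng& rng)
{
    int hits = 0;
    for (int i = 0; i < n; ++i)
        if (rng() & 1u)
            ++hits;
    return hits;
}

// Test double: replays a fixed script of values instead of random ones.
struct ScriptedRng {
    std::vector<unsigned> script;
    std::size_t pos;

    explicit ScriptedRng(std::vector<unsigned> values)
        : script(std::move(values)), pos(0)
    {
    }

    unsigned operator()() { return script[pos++ % script.size()]; }
};
```

Production code would pass a real engine such as std::mt19937, while a unit test passes a ScriptedRng (or a consistently seeded engine) and checks for an exact result.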
It seems to me there are at least three distinct things you want to test here:
1. The correctness of the procedure that generates an output using the random source
2. That the distribution of the random source is what you expect
3. That the distribution of the output is what you expect
1 should be deterministic and you can unit test it by supplying a chosen set of known "random" values and inputs and checking that it produces the known correct outputs. This would be easiest if you structure the code so that the random source is passed as an argument rather than embedded in the code.
2 and 3 cannot be tested absolutely. You can test to some chosen confidence level, but you must be prepared for such tests to fail in some fraction of cases. Probably the thing you really want to look out for is test 3 failing much more often than test 2, since that would suggest that your algorithm is wrong.
The tests to apply will depend on the expected distribution. For 2 you most likely expect the random source to be uniformly distributed. There are various tests for this, depending on how involved you want to be, see for example Tests for pseudo-random number generators on this page.
The expected distribution for 3 will depend very much on exactly what you're producing. The simple 50-50 case in the question is exactly equivalent to testing for a fair coin, but obviously other cases will be more complicated. If you can work out what the distribution should be, a chi-square test against it may help.
That depends on the use you make of your test suite. If you run it every few seconds because you embrace test-driven development and aggressive refactoring, then it is very important that it doesn't fail spuriously, because this causes major disruption and lowers productivity, so you should choose a threshold that is practically impossible to reach for a well-behaved implementation. If you run your tests once a night and have some time to investigate failures you can be much stricter.
Under no circumstances should you deploy something that will lead to frequent uninvestigated failures - this defeats the entire purpose of having a test suite, and dramatically reduces its value to the team.
You should test the distribution of results in a "single" unit test, i.e. that the result is as close to the desired distribution as possible in any individual run. For your example, 2 true / 3 false is OK, 4 true / 1 false is not OK as a result.
Also you could write tests which execute the method e.g. 100 times and check that the average of the distributions is "close enough" to the desired rate. This is a borderline case: running bigger batches may take a significant amount of time, so you might want to run these tests separately from your "regular" unit tests. Also, as Stefan Steinegger points out, such a test is going to fail every now and then if you define "close enough" too strictly, or become meaningless if you define the threshold too loosely. So it is a tricky case...
I think if I had the same problem I would probably construct a confidence interval to detect anomalies, provided you have some statistics about the average, standard deviation and such. So in your case, if the average expected value is 250, then create a 95% confidence interval around the average using a normal distribution. If the results are outside that interval, you fail the test.
Why not re-factor the random number generation code and let the unit test framework and the source code both use it? You are trying to test your algorithm and not the randomized sequence right?
First you have to know what distribution should result from your random number generation process. In your case you are generating a result which is either 0 or 1, each with probability 0.5. Counting the 1s over n such results gives a binomial distribution with p = 0.5.
Given the sample size of n, you can construct (as an earlier poster suggested) a confidence interval around the mean. You can also make various statements about the probability of getting, for instance, 240 or less of either outcome when n=500.
You could use a normal approximation for values of n greater than about 20, as long as p is not very large or very small. The Wikipedia article has more on this.
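A sketch of that check under the normal approximation, assuming each of the n outcomes is an independent draw with probability p. The function name within_tolerance is made up, and z is the half-width of the interval in standard deviations (1.96 for roughly 95%).

```cpp
#include <cmath>

// Normal approximation to a binomial(n, p) count:
// mean = n * p, standard deviation = sqrt(n * p * (1 - p)).
// Returns true if `observed` lies within z standard deviations of the mean.
bool within_tolerance(int observed, int n, double p, double z)
{
    const double mean = n * p;
    const double sd = std::sqrt(n * p * (1.0 - p));
    return std::fabs(observed - mean) <= z * sd;
}
```

For n = 500 and p = 0.5 the mean is 250 and the standard deviation is about 11.2, so z = 1.96 accepts roughly 228 to 272. A correct implementation will still fall outside that range in about 5% of runs, so a test that runs often should use a larger z, trading sensitivity for fewer spurious failures.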