Will this give me proper random numbers based on these probabilities? (C++)

Code:
int random = (rand() % 7 + 1);
if (random == 1) { } // num 1
else if (random == 2) { } // num 2
else if (random == 3 || random == 4) { } // num 3
else if (random == 5 || random == 6) { } // num 4
else if (random == 7) { } // num 5
Basically I want each of these numbers with each of these probabilities:
1: 1/7
2: 1/7
3: 2/7
4: 2/7
5: 1/7
Will this code give me proper results? I.e. if this is run infinite times, will I get the proper frequencies? Is there a less-lengthy way of doing this?

No, it's actually slightly off, due to the way rand() works. In particular, rand returns values in the range [0, RAND_MAX]. Hypothetically, assume RAND_MAX were ten. Then rand() would give 0…10, and they'd be mapped (by modulus) to:
0 → 0
1 → 1
2 → 2
3 → 3
4 → 4
5 → 5
6 → 6
7 → 0
8 → 1
9 → 2
10 → 3
Note how 0–3 are more common than 4–6; this is bias in your random number generation. (You're adding 1 as well, but that just shifts it over).
RAND_MAX of course isn't 10, but it's probably not one less than a multiple of 7 either. Most likely RAND_MAX + 1 is a power of two. So you'll have some bias.
I suggest using the Boost Random Number Library which can give you a random number generator that yields 1–7 without bias. Look also at bames53's answer using C++11, which is the right way to do this if your code only needs to target C++11 platforms.
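For reference, a minimal sketch of what the Boost version might look like (using Boost.Random's mt19937 and uniform_int_distribution; seed it properly in real code):
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_int_distribution.hpp>
#include <iostream>

int main() {
    boost::random::mt19937 gen;                              // seed this properly in real code
    boost::random::uniform_int_distribution<> one_to_seven(1, 7);
    std::cout << one_to_seven(gen) << std::endl;             // unbiased value in 1..7
}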

Just another way:
float probs[5] = {1/7.0f, 1/7.0f, 2/7.0f, 2/7.0f, 1/7.0f};
int rand_M() {
    float sum = 0;
    for (int i = 0; i < 5; i++)
        sum += probs[i];
    // scale rand() into [0, sum]
    float f = (rand() * sum) / RAND_MAX;
    for (int i = 0; i < 5; i++) {
        if (f <= probs[i]) return i;  // returns 0..4; add 1 if you want 1..5
        f -= probs[i];
    }
    return 4; // guard against floating-point rounding
}

Assuming rand() is good then your code will work with only a very small bias toward the lower X numbers, where X is (RAND_MAX + 1) % 7. It's much more likely that you won't get the desired odds due to the quality of the implementation of rand(). If you find that to be the case then you'll want to use an alternative random number generator.
C++11 introduces the header <random> which includes several quality RNGs. Here's an example:
#include <random>
#include <functional>
auto rand = std::bind(std::uniform_int_distribution<int>(1,7),std::mt19937());
Given this, when you call rand() you will get a number from 1 to 7 each with equal probability. (And you can choose different engines for different quality and speed characteristics.) You can then use this to implement the if-else conditions your example currently uses with std::rand(). However <random> allows you to do even better using one of their non-uniform distributions. In this case what you want is discrete_distribution. This distribution allows you to explicitly state the weights for each value from 0 to n.
// the random number generator
auto _rand = std::bind(std::discrete_distribution<int>{1./7.,1./7.,2./7.,2./7.,1./7.},std::mt19937());
// convert results of RNG from the range [0-4] to [1-5]
auto rand = [&_rand]() { return _rand() +1; };
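As a quick sanity check, drawing many samples should show frequencies of roughly 1:1:2:2:1; a small sketch (the seeding here is arbitrary):
#include <functional>
#include <iostream>
#include <map>
#include <random>

int main() {
    auto _rand = std::bind(
        std::discrete_distribution<int>{1./7., 1./7., 2./7., 2./7., 1./7.},
        std::mt19937(std::random_device{}()));
    auto rand = [&_rand]() { return _rand() + 1; };

    std::map<int, int> freq;
    for (int i = 0; i < 700000; ++i)
        ++freq[rand()];                        // count how often each of 1..5 appears
    for (const auto& f : freq)
        std::cout << f.first << ": " << f.second << '\n';
}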

int toohigh = RAND_MAX - RAND_MAX%7;
int random;
do {
random = rand();
} while (random >= toohigh); //should happen ~0.03% of the time
static const int results[7] = {1, 2, 3, 3, 4, 4, 5};
random = results[random%7];
This should give numbers with a distribution as even as rand can manage, and without the big if/else chain.
Note this does have a theoretically possible infinite loop, but the statistical odds of it staying in the loop for long are minuscule. The odds of it staying in the loop twice are quite close to the odds of winning the California Super Lotto Jackpot. Even if every person on the planet got five random numbers, it probably wouldn't stay in the loop three times. (Assuming a perfect RNG.)

rand returns a pseudo-random integral number:
Notice though that this modulo operation does not generate a truly
uniformly distributed random number in the span (since in most cases
lower numbers are slightly more likely), but it is generally a good
approximation for short spans.
Now, regarding the less-lengthy way, you can use a switch-case construction (sketched below), or a series of conditional operators ?: (which will make your code short and unreadable :)).
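For instance, a sketch of the switch form (note it keeps the same small modulo bias discussed above):
int num;
switch (rand() % 7) {
    case 0:         num = 1; break;
    case 1:         num = 2; break;
    case 2: case 3: num = 3; break;
    case 4: case 5: num = 4; break;
    default:        num = 5; break; // residue 6
}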


std::mersenne_twister_engine and random number generation

What is the distribution (uniform, poisson, normal, etc.) that is generated if I did the below? The output appears to indicate a uniform distribution. But then, why do we need std::uniform_int_distribution?
#include <iostream>
#include <map>
#include <random>
#include <string>
int main()
{
std::mt19937_64 generator(134);
std::map<int, int> freq;
const int size = 100000;
for (int i = 0; i < size; ++i) {
int r = generator() % size;
freq[r]++;
}
for (auto f : freq) {
std::cout << std::string(f.second, '*') << std::endl;
}
return 0;
}
Thanks!
Because while generator() is a uniform distribution over [generator.min(), generator.max()], generator() % n is not a uniform distribution over [0, n) (unless generator.max() + 1 is an exact multiple of n, assuming generator.min() == 0).
Let's take an example: min() == 0, max() == 65'535 and n == 7.
gen() will give numbers in the range [0, 65'535] and in this range there are:
9'363 numbers such that gen() % 7 == 0
9'363 numbers such that gen() % 7 == 1
9'362 numbers such that gen() % 7 == 2
9'362 numbers such that gen() % 7 == 3
9'362 numbers such that gen() % 7 == 4
9'362 numbers such that gen() % 7 == 5
9'362 numbers such that gen() % 7 == 6
If you are wondering where I got these numbers, think of it like this: 65'534 is an exact multiple of 7 (65'534 = 7 * 9'362). This means that in [0, 65'533] there are exactly 9'362 numbers that map to each of {0, 1, 2, 3, 4, 5, 6} under gen() % 7. This leaves 65'534, which maps to 0, and 65'535, which maps to 1.
So you see there is a bias towards {0, 1} compared to {2, ..., 6}, i.e.
0 and 1 have a slightly higher chance (9'363 / 65'536 ≈ 14.28680419921875 %) of appearing than
2, 3, 4, 5 and 6 (9'362 / 65'536 ≈ 14.2852783203125 %).
std::uniform_int_distribution doesn't have this problem; it uses some mathematical voodoo (possibly drawing more random numbers from the generator) to achieve a truly uniform distribution.
The random engine std::mt19937_64 outputs a 64-bit number that behaves like a uniformly distributed random number. Each of the C++ random engines (including those of the std::mersenne_twister_engine family) outputs a uniformly-distributed pseudorandom number of a specific size using a specific algorithm.
Specifically, std::mersenne_twister_engine meets the RandomNumberEngine requirement, which in turn meets the UniformRandomBitGenerator requirement; therefore, std::mersenne_twister_engine outputs bits that behave like uniformly-distributed random bits.
On the other hand, std::uniform_int_distribution is useful for transforming numbers from random engines into random integers of a user-defined range (say, from 0 through 10). But note that uniform_int_distribution and other distributions (unlike random number engines) can be implemented differently from one C++ standard library implementation to another.
std::mt19937_64 generates a pseudo-random, mutually independent sequence of 64-bit unsigned numbers. It is supposed to be uniform; I don't know the exact details of the engine, but it is one of the best engines discovered thus far.
By taking % n you get an approximation of a pseudo-random uniform distribution over the integers [0, ..., n-1], but it is inherently inaccurate. Certain numbers have a slightly higher chance to occur and others a slightly lower chance, depending on n. E.g., since 2^64 = 18446744073709551616, with n = 10000 the first 1616 values have a slightly higher chance to occur than the remaining 10000 - 1616 values. std::uniform_int_distribution takes care of the inaccuracy by drawing a new random number in very rare cases: say, if the number is above 18446744073709550000 for n = 10000, take a new number. The concrete details are up to the implementation.
One of the major accomplishments of <random> was the separation of distributions from engines.
I see it as similar to Alexander Stepanov's STL, which separated algorithms from containers through the use of iterators. For random numbers I can do an implementation of the Blum-Blum-Shub single bit generator (engine) and it will still work with all the distributions in <random>. Or, I can do a simple Linear Congruential Generator, x_{n + 1} = a * x_{n} % m, which when correctly seeded can never generate 0. Again, it will work with all the distributions. Likewise, I can write a new distribution and I don't have to worry about the peculiarities of any engine as long as I only use the interface specified by a UniformRandomBitGenerator.
In general, you should always use a distribution. Also, it is time to retire using '%' for generating random numbers.
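As an illustration of that separation, here is a minimal sketch: a toy hand-rolled engine (a MINSTD-style LCG, named ToyLcg here just for the example) that satisfies UniformRandomBitGenerator and therefore works with any standard distribution:
#include <cstdint>
#include <iostream>
#include <random>

// Toy engine: x_{n+1} = a * x_n % m with m = 2^31 - 1 and a = 48271.
// Seeded with a non-zero state, it never produces 0.
struct ToyLcg {
    using result_type = std::uint32_t;
    std::uint32_t state = 1;
    static constexpr result_type min() { return 1; }
    static constexpr result_type max() { return 2147483646; } // m - 1
    result_type operator()() {
        state = static_cast<std::uint64_t>(state) * 48271u % 2147483647u;
        return state;
    }
};

int main() {
    ToyLcg engine;
    std::uniform_int_distribution<int> dist(1, 5); // the same distribution works with any engine
    for (int i = 0; i < 10; ++i)
        std::cout << dist(engine) << ' ';
    std::cout << '\n';
}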

Probability and random numbers

I am just starting with C++ and am creating a simple text-based adventure. I am trying to figure out how to have probability based events. For example a 50% chance that when you open box there will be a sword and a 50% chance it will be a knife. I know how to make a random number generator, but I don't know how to associate that number with something. I created a variation of what I want but it requires the user to input the random number. I am wondering how to base the if statement on whether or not the random number was greater or less than 50, not if the number the user put in was greater or less than 50.
Use the remainder (modulo) operator % with rand.
rand()%2 can give you either 0 or 1.
Lets 0 be a sword and 1 be a knife.
If you also need an axe,then use rand()%3.
It can give you 0,1 or 2.
2 represents an axe and 0 and 1 like above.
The ifs and elses are then obvious.
rand()%n, where n is a big number, has a higher probability of giving you smaller numbers; the probability is not equally distributed. You can check out the random number generators from the standard library or Boost (a sketch using <random> follows below).
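For example, a small sketch using <random> instead of rand() for the 50/50 sword-or-knife case (the messages are just placeholders):
#include <iostream>
#include <random>

int main() {
    std::mt19937 gen(std::random_device{}());
    std::bernoulli_distribution coin(0.5);   // true with probability 0.5

    if (coin(gen))
        std::cout << "You open the box and find a sword!\n";
    else
        std::cout << "You open the box and find a knife!\n";
}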
If you use rand() it generates numbers in the range 0..RAND_MAX. So for a 50% probability you could do something like:
#include <stdlib.h>
if (rand() < RAND_MAX / 2)
{
// sword - 50% probability
}
else
{
// knife - 50% probability
}
You can obviously extend this to more than two different cases with any given probability for each case, simply by defining appropriate thresholds in the range 0..RAND_MAX for each case, e.g.
int r = rand();
if (r < RAND_MAX / 4)
{
// sword - 25% probability
}
else if (r < 3 * RAND_MAX / 4)
{
// knife - 50% probability
}
else
{
// axe - 25% probability
}

C++: What are some general ways to make code more efficient for use with large numbers?

Please when answering this question try to be as general as possible to help the wider community, rather than just specifically helping my issue (although helping my issue would be great too ;) )
I seem to be encountering this problem time and time again with the simple problems on Project Euler. Most common are the problems that require a computation of prime numbers; my programs for these almost always fail to terminate in reasonable time for numbers greater than about 60,000.
My most recent issue is with Problem 12:
The sequence of triangle numbers is generated by adding the natural numbers. So the 7th triangle number would be 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28. The first ten terms would be:
1, 3, 6, 10, 15, 21, 28, 36, 45, 55, ...
Let us list the factors of the first seven triangle numbers:
1: 1
3: 1,3
6: 1,2,3,6
10: 1,2,5,10
15: 1,3,5,15
21: 1,3,7,21
28: 1,2,4,7,14,28
We can see that 28 is the first triangle number to have over five divisors.
What is the value of the first triangle number to have over five hundred divisors?
Here is my code:
#include <iostream>
#include <vector>
#include <cmath>
using namespace std;
int main() {
int numberOfDivisors = 500;
//I begin by looping from 1, with 1 being the 1st triangular number, 2 being the second, and so on.
for (long long int i = 1;; i++) {
long long int triangularNumber = (pow(i, 2) + i)/2;
//Once I have the i-th triangular, I loop from 1 to itself, and add 1 to count each time I encounter a divisor, giving the total number of divisors for each triangular.
int count = 0;
for (long long int j = 1; j <= triangularNumber; j++) {
if (triangularNumber%j == 0) {
count++;
}
}
//If the number of divisors is 500, print out the triangular and break the code.
if (count == numberOfDivisors) {
cout << triangularNumber << endl;
break;
}
}
}
This code gives the correct answers for smaller numbers, and then either fails to terminate or takes an age to do so!
So firstly, what can I do with this specific problem to make my code more efficient?
Secondly, what are some general tips both for myself and other new C++ users for making code more efficient? (I.e. applying what we learn here in the future.)
Thanks!
The key problem is that your end condition is bad. You are supposed to stop when count > 500, but you look for an exact match of count == 500, therefore you are likely to blow past the correct answer without detecting it, and keep going ... maybe forever.
If you fix that, you can post it to code review. They might say something like this:
Break it down into separate functions for finding the next triangle number, and counting the factors of some number.
When you find the next triangle number, you execute pow; a single addition per step would do.
For counting the number of factors of a number, a Google search might help (e.g. http://www.cut-the-knot.org/blue/NumberOfFactors.shtml ). You can build a list of prime numbers as you go, and use that to quickly find a prime factorization, from which you can compute the number of factors without actually counting them. When the numbers get big, that counting loop gets big. (A sketch of the first two ideas follows below.)
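A sketch along those lines, combining the incremental triangle-number update with a divisor count bounded by the square root (this is not the factorization-based method described in the next answer):
#include <iostream>

// Count divisors by trial division up to sqrt(n): each divisor d <= sqrt(n)
// pairs with n / d, so we count two at a time.
long long countDivisors(long long n) {
    long long count = 0;
    for (long long d = 1; d * d <= n; ++d) {
        if (n % d == 0) {
            count += 2;
            if (d * d == n) --count; // perfect square: d == n / d, counted once
        }
    }
    return count;
}

int main() {
    long long triangular = 0;
    for (long long i = 1; ; ++i) {
        triangular += i;                       // next triangle number, one addition
        if (countDivisors(triangular) > 500) { // note: '>', not '=='
            std::cout << triangular << '\n';
            break;
        }
    }
}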
Tldr: 76576500.
About your Euler problem, some math:
Preliminary 1:
Let's call the n-th triangle number T(n).
T(n) = 1 + 2 + 3 + ... + n = (n^2 + n)/2 (sometimes attributed to Gauss, sometimes someone else). It's not hard to figure it out:
1+2+3+4+5+6+7+8+9+10 =
(1+10) + (2+9) + (3+8) + (4+7) + (5+6) =
11 + 11 + 11 + 11 + 11 =
55 =
110 / 2 =
(10*10 + 10)/2
Because of its definition, it's trivial that T(n) + n + 1 = T(n+1), and that with a<b, T(a)<T(b) is true too.
Preliminary 2:
Let's call the divisor count D. D(1)=1, D(4)=3 (because 1 2 4).
For an n with c non-repeating prime factors (not just any divisors, but prime factors, e.g. n = 42 = 2 * 3 * 7 has c = 3), D(n) is 2^c: for each prime factor, there are two possibilities (use it or not). The 8 possible divisors for the example are: 1, 2, 3, 7, 6 (2*3), 14 (2*7), 21 (3*7), 42 (2*3*7).
More generally, with repeated prime factors, D(n) is the product of (power + 1) over all prime factors. Example 126 = 2^1 * 3^2 * 7^1: because it has two 3s, the question is not "use 3 or not", but "use it 1 time, 2 times or not at all" (if one time, whether it's the "first" or "second" 3 doesn't change the result). With the powers 1 2 1, D(126) is 2*3*2 = 12.
Preliminary 3:
A number n and n+1 can't have any common prime factor x other than 1 (technically, 1 isn't a prime, but whatever). Because if both n/x and (n+1)/x are natural numbers, (n+1)/x - n/x has to be too, but that is 1/x.
Back to Gauss: if we know the prime factors for a certain n and n+1 (needed to calculate D(n) and D(n+1)), calculating D(T(n)) is easy. T(n) = (n^2 + n) / 2 = n * (n+1) / 2. As n and n+1 don't have common prime factors, just throwing together all factors and removing one 2 (because of the "/2") is enough. Example: n is 7, factors 7 = 7^1, and n+1 = 8 = 2^3. Together it's 2^3 * 7^1; removing one 2 gives 2^2 * 7^1. The powers are 2 and 1, so D(T(7)) = 3*2 = 6. To check, T(7) = 28 = 2^2 * 7^1, and the 6 possible divisors are 1 2 4 7 14 28.
What the program could do now: Loop through all n from 1 to something, always factorize n and n+1, use this to get the divisor count of the n-th triangle number, and check if it is >500.
There's just the tiny problem that there is no really efficient algorithm for prime factorization. But for somewhat small numbers, today's computers are still fast enough, and keeping all found factorizations from 1 to n also helps in finding the next one (for n+1). Potential problem 2 is numbers too large for long long, but again, that is not an issue here (as can be found out by trying).
With the described process and the program below, I got
the 12375th triangle number is 76576500 and has 576 divisors
#include <iostream>
#include <vector>
#include <cstdint>
using namespace std;
const int limit = 500;
vector<uint64_t> knownPrimes; //2 3 5 7...
//eg. [14] is 1 0 0 1 ... because 14 = 2^1 * 3^0 * 5^0 * 7^1
vector<vector<uint32_t>> knownFactorizations;
void init()
{
knownPrimes.push_back(2);
knownFactorizations.push_back(vector<uint32_t>(1, 0)); //factors for 0 (dummy)
knownFactorizations.push_back(vector<uint32_t>(1, 0)); //factors for 1 (dummy)
knownFactorizations.push_back(vector<uint32_t>(1, 1)); //factors for 2
}
void addAnotherFactorization()
{
uint64_t number = knownFactorizations.size();
size_t len = knownPrimes.size();
for(size_t i = 0; i < len; i++)
{
if(!(number % knownPrimes[i]))
{
//dividing with a prime gets a already factorized number
knownFactorizations.push_back(knownFactorizations[number / knownPrimes[i]]);
knownFactorizations[number][i]++;
return;
}
}
//if this failed, number is a newly found prime
//because a) it has no known prime factors, so it must have others
//and b) if it is not a prime itself, then it's factors should've been
//found already (because they are smaller than the number itself)
knownPrimes.push_back(number);
len = knownFactorizations.size();
for(size_t s = 0; s < len; s++)
{
knownFactorizations[s].push_back(0);
}
knownFactorizations.push_back(knownFactorizations[0]);
knownFactorizations[number][knownPrimes.size() - 1]++;
}
uint64_t calculateDivisorCountOfN(uint64_t number)
{
//factors for number must be known
uint64_t res = 1;
size_t len = knownFactorizations[number].size();
for(size_t s = 0; s < len; s++)
{
if(knownFactorizations[number][s])
{
res *= (knownFactorizations[number][s] + 1);
}
}
return res;
}
uint64_t calculateDivisorCountOfTN(uint64_t number)
{
//factors for number and number+1 must be known
uint64_t res = 1;
size_t len = knownFactorizations[number].size();
vector<uint32_t> tmp(len, 0);
size_t s;
for(s = 0; s < len; s++)
{
tmp[s] = knownFactorizations[number][s]
+ knownFactorizations[number+1][s];
}
//remove /2
tmp[0]--;
for(s = 0; s < len; s++)
{
if(tmp[s])
{
res *= (tmp[s] + 1);
}
}
return res;
}
int main()
{
init();
uint64_t number = knownFactorizations.size() - 2;
uint64_t DTn = 0;
while(DTn <= limit)
{
number++;
addAnotherFactorization();
DTn = calculateDivisorCountOfTN(number);
}
uint64_t tn;
if(number % 2) tn = ((number+1)/2)*number;
else tn = (number/2)*(number+1);
cout << "the " << number << "th triangle number is "
<< tn << " and has " << DTn << " divisors" << endl;
return 0;
}
About your general question about speed:
1) Algorithms.
How to know them? For (relatively) simple problems, either read a book/Wikipedia/etc. or figure it out yourself if you can. For harder stuff, learning more basic things and gaining experience is necessary before it's even possible to understand them, e.g. studying CS and/or maths ... number theory helps a lot for your Euler problem. (It will help less in understanding how an MP3 file is compressed ... there are many areas; it's not possible to know everything.)
2a) Automated compiler optimizations of frequently used code parts / patterns
2b) Manually timing which program parts are the slowest, and (when not replacing them with another algorithm) changing them in a way that e.g. requires less data sent to slow devices (HDD, network...), less RAM access, fewer CPU cycles, works better with the OS scheduler and memory management strategies, uses the CPU pipeline/caches better, etc. ... this is both education and experience (and a big topic).
And because long variables have a limited size, sometimes it is necessary to use custom types that use eg. a byte array to store a single digit in each byte. That way, it's possible to use the whole RAM for a single number if you want to, but the downside is you/someone has to reimplement stuff like addition and so on for this kind of number storage. (Of course, libs for that exist already, without writing everything from scratch).
Btw., pow is a floating-point function and may give you inaccurate results. It's not appropriate to use it in this case.
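A sketch of the integer-only alternative for that line:
// One of i and i + 1 is always even, so the division by 2 is exact.
long long triangleNumber(long long i) {
    return i * (i + 1) / 2;
}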

Why do people say there is modulo bias when using a random number generator?

I have seen this question asked a lot but never seen a true concrete answer to it. So I am going to post one here which will hopefully help people understand why exactly there is "modulo bias" when using a random number generator, like rand() in C++.
So rand() is a pseudo-random number generator which chooses a natural number between 0 and RAND_MAX, which is a constant defined in cstdlib (see this article for a general overview on rand()).
Now what happens if you want to generate a random number between say 0 and 2? For the sake of explanation, let's say RAND_MAX is 10 and I decide to generate a random number between 0 and 2 by calling rand()%3. However, rand()%3 does not produce the numbers between 0 and 2 with equal probability!
When rand() returns 0, 3, 6, or 9, rand()%3 == 0. Therefore, P(0) = 4/11
When rand() returns 1, 4, 7, or 10, rand()%3 == 1. Therefore, P(1) = 4/11
When rand() returns 2, 5, or 8, rand()%3 == 2. Therefore, P(2) = 3/11
This does not generate the numbers between 0 and 2 with equal probability. Of course for small ranges this might not be the biggest issue but for a larger range this could skew the distribution, biasing the smaller numbers.
So when does rand()%n return a range of numbers from 0 to n-1 with equal probability? When RAND_MAX % n == n - 1. In this case, combined with our earlier assumption that rand() returns each number between 0 and RAND_MAX with equal probability, the modulo classes of n are also equally distributed.
So how do we solve this problem? A crude way is to keep generating random numbers until you get a number in your desired range:
int x;
do {
x = rand();
} while (x >= n);
but that's inefficient for low values of n, since you only have an n/(RAND_MAX+1) chance of getting a value in your range, and so you'll need to perform about (RAND_MAX+1)/n calls to rand() on average.
A more efficient approach would be to take some large sub-range with a length divisible by n, like [0, RAND_MAX - RAND_MAX % n), keep generating random numbers until you get one that lies in that range, and then take the modulus:
int x;
do {
x = rand();
} while (x >= (RAND_MAX - RAND_MAX % n));
x %= n;
For small values of n, this will rarely require more than one call to rand().
Works cited and further reading:
CPlusPlus Reference
Eternally Confuzzled
Repeatedly selecting a random number until it falls in the desired range is a good way to remove the bias.
Update
We can keep the code fast by searching for an x in a range whose size is divisible by n.
// Assumptions
// rand() in [0, RAND_MAX]
// n in (0, RAND_MAX]
int x;
// Keep searching for an x in a range divisible by n
do {
x = rand();
} while (x >= RAND_MAX - (RAND_MAX % n));
x %= n;
The above loop should be very fast, say 1 iteration on average.
#user1413793 is correct about the problem. I'm not going to discuss that further, except to make one point: yes, for small values of n and large values of RAND_MAX, the modulo bias can be very small. But using a bias-inducing pattern means that you must consider the bias every time you calculate a random number and choose different patterns for different cases. And if you make the wrong choice, the bugs it introduces are subtle and almost impossible to unit test. Compared to just using the proper tool (such as arc4random_uniform), that's extra work, not less work. Doing more work and getting a worse solution is terrible engineering, especially when doing it right every time is easy on most platforms.
Unfortunately, the implementations of the solution are all incorrect or less efficient than they should be. (Each solution has various comments explaining the problems, but none of the solutions have been fixed to address them.) This is likely to confuse the casual answer-seeker, so I'm providing a known-good implementation here.
Again, the best solution is just to use arc4random_uniform on platforms that provide it, or a similar ranged solution for your platform (such as Random.nextInt on Java). It will do the right thing at no code cost to you. This is almost always the correct call to make.
If you don't have arc4random_uniform, then you can use the power of open source to see exactly how it is implemented on top of a wider-range RNG (arc4random in this case, but a similar approach could also work on top of other RNGs).
Here is the OpenBSD implementation:
/*
* Calculate a uniformly distributed random number less than upper_bound
* avoiding "modulo bias".
*
* Uniformity is achieved by generating new random numbers until the one
* returned is outside the range [0, 2**32 % upper_bound). This
* guarantees the selected random number will be inside
* [2**32 % upper_bound, 2**32) which maps back to [0, upper_bound)
* after reduction modulo upper_bound.
*/
u_int32_t
arc4random_uniform(u_int32_t upper_bound)
{
u_int32_t r, min;
if (upper_bound < 2)
return 0;
/* 2**32 % x == (2**32 - x) % x */
min = -upper_bound % upper_bound;
/*
* This could theoretically loop forever but each retry has
* p > 0.5 (worst case, usually far better) of selecting a
* number inside the range we need, so it should rarely need
* to re-roll.
*/
for (;;) {
r = arc4random();
if (r >= min)
break;
}
return r % upper_bound;
}
It is worth noting the latest commit comment on this code for those who need to implement similar things:
Change arc4random_uniform() to calculate 2**32 % upper_bound as
-upper_bound % upper_bound. Simplifies the code and makes it the
same on both ILP32 and LP64 architectures, and also slightly faster on
LP64 architectures by using a 32-bit remainder instead of a 64-bit
remainder.
Pointed out by Jorden Verwer on tech#
ok deraadt; no objections from djm or otto
The Java implementation is also easily findable (see previous link):
public int nextInt(int n) {
if (n <= 0)
throw new IllegalArgumentException("n must be positive");
if ((n & -n) == n) // i.e., n is a power of 2
return (int)((n * (long)next(31)) >> 31);
int bits, val;
do {
bits = next(31);
val = bits % n;
} while (bits - val + (n-1) < 0);
return val;
}
Definition
Modulo Bias is the inherent bias in using modulo arithmetic to reduce an output set to a subset of the input set. In general, a bias exists whenever the mapping between the input and output set is not equally distributed, as in the case of using modulo arithmetic when the size of the output set is not a divisor of the size of the input set.
This bias is particularly hard to avoid in computing, where numbers are represented as strings of bits: 0s and 1s. Finding truly random sources of randomness is also extremely difficult, but is beyond the scope of this discussion. For the remainder of this answer, assume that there exists an unlimited source of truly random bits.
Problem Example
Let's consider simulating a die roll (0 to 5) using these random bits. There are 6 possibilities, so we need enough bits to represent the number 6, which is 3 bits. Unfortunately, 3 random bits yields 8 possible outcomes:
000 = 0, 001 = 1, 010 = 2, 011 = 3
100 = 4, 101 = 5, 110 = 6, 111 = 7
We can reduce the size of the outcome set to exactly 6 by taking the value modulo 6, however this presents the modulo bias problem: 110 yields a 0, and 111 yields a 1. This die is loaded.
Potential Solutions
Approach 0:
Rather than rely on random bits, in theory one could hire a small army to roll dice all day and record the results in a database, and then use each result only once. This is about as practical as it sounds, and more than likely would not yield truly random results anyway (pun intended).
Approach 1:
Instead of using the modulus, a naive but mathematically correct solution is to discard results that yield 110 and 111 and simply try again with 3 new bits. Unfortunately, this means that there is a 25% chance on each roll that a re-roll will be required, including each of the re-rolls themselves. This is clearly impractical for all but the most trivial of uses.
Approach 2:
Use more bits: instead of 3 bits, use 4. This yields 16 possible outcomes. Of course, re-rolling anytime the result is greater than 5 makes things worse (10/16 = 62.5%) so that alone won't help.
Notice that 2 * 6 = 12 < 16, so we can safely take any outcome less than 12 and reduce that modulo 6 to evenly distribute the outcomes. The other 4 outcomes must be discarded, and then re-rolled as in the previous approach.
Sounds good at first, but let's check the math:
4 discarded results / 16 possibilities = 25%
In this case, 1 extra bit didn't help at all!
That result is unfortunate, but let's try again with 5 bits:
32 % 6 = 2 discarded results; and
2 discarded results / 32 possibilities = 6.25%
A definite improvement, but not good enough in many practical cases. The good news is, adding more bits will never increase the chances of needing to discard and re-roll. This holds not just for dice, but in all cases.
As demonstrated however, adding 1 extra bit may not change anything. In fact if we increase our roll to 6 bits, the probability remains 6.25%.
This raises 2 additional questions:
If we add enough bits, is there a guarantee that the probability of a discard will diminish?
How many bits are enough in the general case?
General Solution
Thankfully the answer to the first question is yes. The problem with 6 is that 2^x mod 6 flips between 2 and 4, which coincidentally differ from each other by a factor of 2, so that for an odd x > 1,
[2^x mod 6] / 2^x == [2^(x+1) mod 6] / 2^(x+1)
Thus 6 is an exception rather than the rule. It is possible to find larger moduli that yield consecutive powers of 2 in the same way, but eventually this must wrap around, and the probability of a discard will be reduced.
Without offering further proof, in general using double the number
of bits required will provide a smaller, usually insignificant,
chance of a discard.
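As a quick check of those discard probabilities for the die example (m = 6), the fraction of b-bit patterns that must be re-rolled is (2^b mod m) / 2^b; a small sketch:
#include <cstdint>
#include <iostream>

int main() {
    const std::uint64_t m = 6; // number of die outcomes
    for (int bits = 3; bits <= 12; ++bits) {
        std::uint64_t total = std::uint64_t(1) << bits;
        std::uint64_t discarded = total % m;   // patterns that force a re-roll
        std::cout << bits << " bits: " << 100.0 * discarded / total << "% discarded\n";
    }
}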
Proof of Concept
Here is an example program that uses OpenSSL's libcrypto to supply random bytes. When compiling, be sure to link to the library with -lcrypto, which most everyone should have available.
#include <iostream>
#include <assert.h>
#include <cstdint>
#include <limits>
#include <openssl/rand.h>
volatile uint32_t dummy;
uint64_t discardCount;
uint32_t uniformRandomUint32(uint32_t upperBound)
{
assert(RAND_status() == 1);
uint64_t discard = (std::numeric_limits<uint64_t>::max() - upperBound) % upperBound;
uint64_t randomPool = 0; // 64-bit pool that RAND_bytes fills
RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));
while(randomPool > (std::numeric_limits<uint64_t>::max() - discard)) {
RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));
++discardCount;
}
return randomPool % upperBound;
}
int main() {
discardCount = 0;
const uint32_t MODULUS = (1ul << 31)-1;
const uint32_t ROLLS = 10000000;
for(uint32_t i = 0; i < ROLLS; ++i) {
dummy = uniformRandomUint32(MODULUS);
}
std::cout << "Discard count = " << discardCount << std::endl;
}
I encourage playing with the MODULUS and ROLLS values to see how many re-rolls actually happen under most conditions. A sceptical person may also wish to save the computed values to file and verify the distribution appears normal.
Mark's Solution (The accepted solution) is Nearly Perfect.
int x;
do {
x = rand();
} while (x >= (RAND_MAX - RAND_MAX % n));
x %= n;
However, it has a caveat which discards 1 valid set of outcomes in any scenario where RAND_MAX (RM) is 1 less than a multiple of N (where N = the number of possible valid outcomes).
i.e., when the count of values discarded (I) is equal to N, then they are actually a valid set (V), not an invalid set (I).
What causes this is that at some point Mark loses sight of the difference between N and RAND_MAX.
N is a set whose valid members are comprised only of positive integers, as it contains a count of responses that would be valid (e.g. Set N = {1, 2, 3, ..., n}).
RAND_MAX, however, is a set which (as defined for our purposes) includes any number of non-negative integers.
In its most generic form, what is defined here as RAND_MAX is the set of all valid outcomes, which could theoretically include negative numbers or non-numeric values.
Therefore RAND_MAX is better defined as the set of "possible responses".
However, N operates against the count of the values within the set of valid responses, so even as defined in our specific case, RAND_MAX will be a value one less than the total number of values it contains.
Using Mark's Solution, values are discarded when: X >= RM - RM % N
EG:
Ran Max Value (RM) = 255
Valid Outcome (N) = 4
When X >= 252, discarded values for X are: 252, 253, 254, 255
So, if the Random Value Selected (X) = {252, 253, 254, 255}
Number of discarded Values (I) = RM % N + 1 == N
IE:
I = RM % N + 1
I = 255 % 4 + 1
I = 3 + 1
I = 4
X >= ( RM - RM % N )
255 >= (255 - 255 % 4)
255 >= (255 - 3)
255 >= (252)
Discard Returns $True
As you can see in the example above, when the value of X (the random number we get from the initial function) is 252, 253, 254, or 255 we would discard it even though these four values comprise a valid set of returned values.
IE: When the count of the values Discarded (I) = N (The number of valid outcomes) then a Valid set of return values will be discarded by the original function.
If we describe the difference between the values N and RM as D, ie:
D = (RM - N)
Then, as the value of D becomes smaller, the percentage of unneeded re-rolls due to this method increases at each natural multiple of N. (When RAND_MAX is NOT equal to a prime number this is of valid concern.)
EG:
RM=255 , N=2 Then: D = 253, Lost percentage = 0.78125%
RM=255 , N=4 Then: D = 251, Lost percentage = 1.5625%
RM=255 , N=8 Then: D = 247, Lost percentage = 3.125%
RM=255 , N=16 Then: D = 239, Lost percentage = 6.25%
RM=255 , N=32 Then: D = 223, Lost percentage = 12.5%
RM=255 , N=64 Then: D = 191, Lost percentage = 25%
RM=255 , N= 128 Then D = 127, Lost percentage = 50%
Since the percentage of re-rolls needed increases the closer N comes to RM, this can be of valid concern at many different values depending on the constraints of the system running the code and the values being looked for.
To negate this we can make a simple amendment, as shown here:
int x;
do {
x = rand();
} while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
x %= n;
This provides a more general version of the formula which accounts for the additional peculiarities of using modulus to define your max values.
Examples using a small value for RAND_MAX where RAND_MAX + 1 is a multiple of N.
Mark's original version:
RAND_MAX = 3, n = 2, Values in RAND_MAX = 0,1,2,3, Valid Sets = 0,1 and 2,3.
When X >= (RAND_MAX - ( RAND_MAX % n ) )
When X >= 2 the value will be discarded, even though the set is valid.
Generalized Version 1:
RAND_MAX = 3, n = 2, Values in RAND_MAX = 0,1,2,3, Valid Sets = 0,1 and 2,3.
When X > (RAND_MAX - ( ( RAND_MAX % n ) + 1 ) % n )
When X > 3 the value would be discarded, but there is no such value in the set, so nothing is discarded.
Additionally, consider the case where N should be the full number of values in RAND_MAX; in this case, you could set N = RAND_MAX + 1, unless RAND_MAX = INT_MAX.
Loop-wise you could just use N = 1, and any value of X will be accepted, and put an IF statement in for your final multiplier. But perhaps you have code that has a valid reason to return a 1 when the function is called with n = 1...
So it may be better to use 0, which would normally produce a Div 0 error, when you wish to have n = RAND_MAX + 1.
Generalized Version 2:
int x;
if (n != 0) {
do {
x = rand();
} while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
x %= n;
} else {
x = rand();
}
Both of these solutions resolve the issue of needlessly discarded valid results, which will occur when RM+1 is a multiple of n.
The second version also covers the edge case scenario when you need n to equal the total possible set of values contained in RAND_MAX.
The modified approach in both is the same and allows for a more general solution to the need of providing valid random numbers and minimizing discarded values.
To reiterate:
The Basic General Solution which extends mark's example:
// Assumes:
// RAND_MAX is a globally defined constant, returned from the environment.
// int n; // User input, or externally defined, number of valid choices.
int x;
do {
x = rand();
} while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
x %= n;
The Extended General Solution which Allows one additional scenario of RAND_MAX+1 = n:
// Assumes:
// RAND_MAX is a globally defined constant, returned from the environment.
// int n; // User input, or externally defined, number of valid choices.
int x;
if (n != 0) {
do {
x = rand();
} while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
x %= n;
} else {
x = rand();
}
In some languages ( particularly interpreted languages ) doing the calculations of the compare-operation outside of the while condition may lead to faster results as this is a one-time calculation no matter how many re-tries are required. YMMV!
// Assumes:
// RAND_MAX is a globally defined constant, returned from the environment.
// int n; // User input, or externally defined, number of valid choices.
int x; // Resulting random number
int y; // One-time calculation of the compare value for x
y = RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n);
if (n != 0) {
do {
x = rand();
} while (x > y);
x %= n;
} else {
x = rand();
}
There are two usual complaints with the use of modulo.
one is valid for all generators. It is easier to see in a limit case. If your generator has a RAND_MAX which is 2 (that isn't compliant with the C standard) and you want only 0 or 1 as values, using modulo will generate 0 twice as often (when the generator generates 0 and 2) as it will generate 1 (when the generator generates 1). Note that this is true as soon as you don't drop values: whatever mapping you use from the generator values to the wanted ones, one will occur twice as often as the other.
some kinds of generators have their least significant bits less random than the others, at least for some of their parameters, but sadly those parameters have other interesting characteristics (such as being able to have RAND_MAX one less than a power of 2). The problem is well known and for a long time library implementations have probably avoided it (for instance the sample rand() implementation in the C standard uses this kind of generator, but drops the 16 least significant bits), but some like to complain about that and you may have bad luck
Using something like
int alea(int n){
assert (0 < n && n <= RAND_MAX);
int partSize =
n == RAND_MAX ? 1 : 1 + (RAND_MAX-n)/(n+1);
int maxUsefull = partSize * n + (partSize-1);
int draw;
do {
draw = rand();
} while (draw > maxUsefull);
return draw/partSize;
}
to generate a random number between 0 and n will avoid both problems (and it avoids overflow with RAND_MAX == INT_MAX)
BTW, C++11 introduced standard ways to do the reduction, and other generators than rand().
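For instance, a sketch of the same interface built on those C++11 facilities (the helper name alea11 is just for illustration); the distribution handles the unbiased reduction internally:
#include <random>

int alea11(int n) {
    // returns a value in [0, n], like alea() above
    static std::mt19937 engine{std::random_device{}()};
    return std::uniform_int_distribution<int>(0, n)(engine);
}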
With a RAND_MAX value of 3 (in reality it should be much higher than that but the bias would still exist) it makes sense from these calculations that there is a bias:
1 % 2 = 1
2 % 2 = 0
3 % 2 = 1
random_between(1, 3) % 2 = more likely a 1
In this case, the % 2 is what you shouldn't do when you want a random number between 0 and 1. You could get a random number between 0 and 2 by doing % 3 though, because in this case: RAND_MAX is a multiple of 3.
Another method
There are much simpler ways, but to add to the other answers, here is my solution to get a random number between 0 and n - 1 (so n different possibilities) without bias.
the number of bits (not bytes) needed to encode the number of possibilities is the number of bits of random data you'll need
encode the number from random bits
if this number is >= n, restart (no modulo).
Really random data is not easy to obtain, so why use more bits than needed.
Below is an example in Smalltalk, using a cache of bits from a pseudo-random number generator. I'm no security expert so use at your own risk.
next: n
| bitSize r from to |
n < 0 ifTrue: [^0 - (self next: 0 - n)].
n = 0 ifTrue: [^nil].
n = 1 ifTrue: [^0].
cache isNil ifTrue: [cache := OrderedCollection new].
cache size < (self randmax highBit) ifTrue: [
Security.DSSRandom default next asByteArray do: [ :byte |
(1 to: 8) do: [ :i | cache add: (byte bitAt: i)]
]
].
r := 0.
bitSize := n highBit.
to := cache size.
from := to - bitSize + 1.
(from to: to) do: [ :i |
r := r bitAt: i - from + 1 put: (cache at: i)
].
cache removeFrom: from to: to.
r >= n ifTrue: [^self next: n].
^r
Modulo reduction is a commonly seen way to make a random integer generator avoid the worst case of running forever.
When the range of possible integers is unknown, however, there is no way in general to "fix" this worst case of running forever without introducing bias. It's not just modulo reduction (rand() % n, discussed in the accepted answer) that will introduce bias this way, but also the "multiply-and-shift" reduction of Daniel Lemire, or if you stop rejecting an outcome after a set number of iterations. (To be clear, this doesn't mean there is no way to fix the bias issues present in pseudorandom generators. For example, even though modulo and other reductions are biased in general, they will have no issues with bias if the range of possible integers is a power of 2 and if the random generator produces unbiased random bits or blocks of them.)
The following answer of mine discusses the relationship between running time and bias in random generators, assuming we have a "true" random generator that can produce unbiased and independent random bits. The answer doesn't even involve the rand() function in C because it has many issues. Perhaps the most serious here is the fact that the C standard does not explicitly specify a particular distribution for the numbers returned by rand(), not even a uniform distribution.
How to generate a random integer in the range [0,n] from a stream of random bits without wasting bits?
As the accepted answer indicates, "modulo bias" has its roots in the low value of RAND_MAX. He uses an extremely small value of RAND_MAX (10) to show that if RAND_MAX were 10 and you tried to generate a number between 0 and 2 using %, the following outcomes would result:
rand() % 3 // if RAND_MAX were only 10, gives
output of rand() | rand()%3
0 | 0
1 | 1
2 | 2
3 | 0
4 | 1
5 | 2
6 | 0
7 | 1
8 | 2
9 | 0
So there are 4 outputs of 0's (4/10 chance) and only 3 outputs of 1 and 2 (3/10 chances each).
So it's biased. The lower numbers have a better chance of coming out.
But that only shows up so obviously when RAND_MAX is small. Or more specifically, when the number you are modding by is large compared to RAND_MAX.
A much better solution than looping (which is insanely inefficient and shouldn't even be suggested) is to use a PRNG with a much larger output range. The Mersenne Twister algorithm has a maximum output of 4,294,967,295. As such doing MersenneTwister::genrand_int32() % 10 for all intents and purposes, will be equally distributed and the modulo bias effect will all but disappear.
I just wrote some code for Von Neumann's Unbiased Coin Flip Method, which should theoretically eliminate any bias in the random number generation process. More info can be found at http://en.wikipedia.org/wiki/Fair_coin
int unbiased_random_bit() {
int x1, x2, prev;
prev = 2;
x1 = rand() % 2;
x2 = rand() % 2;
for (;; x1 = rand() % 2, x2 = rand() % 2)
{
if (x1 ^ x2) // 01 -> 1, or 10 -> 0.
{
return x2;
}
else if (x1 & x2)
{
if (!prev) // 0011
return 1;
else
prev = 1; // 1111 -> continue, bias unresolved
}
else
{
if (prev == 1)// 1100
return 0;
else // 0000 -> continue, bias unresolved
prev = 0;
}
}
}

C++: How can I get a random value from 1 to 12? [duplicate]

This question already has answers here:
Generating a random integer from a range
How can I get a random value from 1 to 12 in C++?
So that I will get, say, 3, or 6, or 11?
Use the following formula:
M + rand() / (RAND_MAX / (N - M + 1) + 1), M = 1, N = 12
and read up on this FAQ.
Edit: Most answers on this question do not take into account the fact that poor PRN generators (typically offered with the library function rand()) are not very random in the low order bits. Hence:
rand() % 12 + 1
is not good enough.
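Wrapped in a helper (the name randInRange is just for illustration), the FAQ formula relies on the high-order bits of rand() rather than the low-order ones:
#include <cstdlib>

int randInRange(int M, int N) {
    // roughly uniform value in [M, N]; seed with srand first
    return M + rand() / (RAND_MAX / (N - M + 1) + 1);
}
// e.g. randInRange(1, 12) yields a value from 1 to 12.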
#include <iomanip>
#include <iostream>
#include <stdlib.h>
#include <time.h>
using namespace std;
int main()
{
// initialize random seed
srand( time(NULL) );
// generate random number
int randomNumber = rand() % 12 + 1;
// output, as you seem to want a leading '0'
cout << setfill ('0') << setw (2) << randomNumber;
}
To address dirkgently's issue, maybe something like this would be better?
// generate random number
int randomNumber = rand() >> 4; // discard the 4 low-order bits
// get the value
randomNumber = randomNumber % 12 + 1;
edit after mre and dirkgently's comments
Is there some significance to the leading zero in this case? Do you intend for it to be octal, so the 12 is really 10 (in base 10)?
Getting a random number within a specified range is fairly straightforward:
int rand_lim(int limit) {
/* return a random number between 0 and limit inclusive.
*/
int divisor = RAND_MAX/(limit+1);
int retval;
do {
retval = rand() / divisor;
} while (retval > limit);
return retval;
}
(The while loop is to prevent skewed results -- some outputs happening more often than others). Skewed results are almost inevitable when/if you use division (or its remainder) directly.
If you want to print it out so even one-digit numbers show two digits (i.e. a leading 0), you can do something like:
std::cout << std::setw(2) << std::setprecision(2) << std::setfill('0') << number;
Edit: As to why this works, and why a while loop (or something similar) is needed, consider a really limited version of a generator that only produces numbers from, say, 0 to 9. Assume further that we want numbers in the range 0 to 2. We can basically arrange the numbers in a table:
0 1 2
3 4 5
6 7 8
9
Depending on our preference we could arrange the numbers in columns instead:
0 3 6
1 4 7
2 5 8
9
Either way, however, we end up with one of the columns having one more number than any of the others. 10 divided by 3 will always have a remainder of 1, so no matter how we divide the numbers up, we're always going to have a remainder that makes one of the outputs more common than the others.
The basic idea of the code above is pretty simple: after getting a number and figuring where in a "table" like one above that number would land, it checks whether the number we've got is the "odd" one. If it is, another iteration of the loop is executed to obtain another number.
There are other ways this could be done. For example, you could start by computing the largest multiple of the range size that's still within the range of the random number generator, and repeatedly generate numbers until you get one smaller than that, then divide the number you receive to get it to the right range. In theory this could even be marginally more efficient (it avoids dividing the random number to get it into the right range until it gets a random number that it's going to use). In reality, the vast majority of the time, the loop will only execute one iteration anyway, so it makes very little difference what we execute inside or outside the loop.
You can do this, for example:
#include <cstdlib>
#include <cstdio>
#include <time.h>
int main(int argc, char **argv)
{
srand(time(0));
printf("Random number between 1 and 12: %d", (rand() % 12) + 1);
}
The srand function will seed the random number generator with the current time (in seconds) - that's the way it's usually done, but there are also more secure solutions.
rand() % 12 will give you a number between 0 and 11 (% is the modulus operator), so we add 1 here to get to your desired range of 1 to 12.
In order to print 01 02 03 and so on instead of 1 2 3, you can format the output:
printf("Random number between 01 and 12: %02d", (rand() % 12) + 1);
(rand() % 12 + 1).ToString("D2")