I need to generate a partly random sequence of numbers such that the sequence overall has a certain entropy level.
E.g. if I fed the generated data into gzip, it should be able to compress it. In fact, that would be the exact application for the code: testing data compressors.
I'm programming this in C++ and the first idea that came to my mind was to initialize a bunch of std::mt19937 PRNGs with random seeds, pick one at random, and generate a pattern of random length with it.
The chosen std::mt19937 is reset each time with the same seed so that it always generates the same pattern:
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::random_device rd;
    std::vector<std::mt19937> rngs;
    std::vector<std::mt19937::result_type> seeds;
    std::uniform_int_distribution<int> patternrg(0, 31);   // which generator to use
    std::uniform_int_distribution<int> lengthrg(1, 64);    // pattern length
    std::uniform_int_distribution<int> valuerg(0, 255);    // byte values
    for (int i = 0; i < 32; ++i) {
        seeds.push_back(rd());
        rngs.emplace_back(seeds.back());
    }
    for (;;) {
        // Choose generator and pattern length randomly.
        auto gen = patternrg(rd);
        auto len = lengthrg(rd);
        // Re-seed so the chosen generator always replays the same pattern.
        rngs[gen].seed(seeds[gen]);
        for (int i = 0; i < len; ++i) {
            std::cout << valuerg(rngs[gen]) << "\n";
        }
    }
}
The above code meets the first requirement of generating compressible randomness, but the second is harder: how do I control the level of entropy/randomness?
Let me write a few sentences which you could find useful. Suppose we want to sample one bit with a given entropy. So it is either 0 or 1, and the entropy you want is equal to e.
H(bit|p) = -p log2(p) - (1 - p) log2(1 - p), where p is the probability of getting a 1. Simple test: for p = 1/2 you get an entropy of 1, the maximum entropy. So you
pick e equal to some value below 1, solve the equation
-p log2(p) - (1 - p) log2(1 - p) = e
to get back p, and then you can start sampling using a Bernoulli distribution. A simple demo is here. And in C++ one could use the standard library routine.
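A minimal sketch of that bit-level scheme (my own illustration, not code from the answer): solve H(p) = e for p by bisection on [0, 1/2], where H is increasing, then sample with std::bernoulli_distribution:

#include <cmath>
#include <iostream>
#include <random>

// Binary entropy H(p) = -p log2(p) - (1-p) log2(1-p).
double binary_entropy(double p) {
    if (p <= 0.0 || p >= 1.0) return 0.0;
    return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
}

// Solve H(p) = e for p on [0, 1/2] by bisection (H is increasing there).
double p_for_entropy(double e) {
    double lo = 0.0, hi = 0.5;
    for (int i = 0; i < 60; ++i) {
        double mid = 0.5 * (lo + hi);
        (binary_entropy(mid) < e ? lo : hi) = mid;
    }
    return 0.5 * (lo + hi);
}

int main() {
    double e = 0.5;                      // desired entropy per bit
    double p = p_for_entropy(e);
    std::mt19937 rng(std::random_device{}());
    std::bernoulli_distribution bit(p);  // P(1) = p
    for (int i = 0; i < 32; ++i) std::cout << bit(rng);
    std::cout << "\n(p = " << p << ")\n";
}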
Ok, suppose you want to sample one byte with a given entropy. A byte has 256 values, and the entropy is
H(byte|p) = -sum_{i=1..256} p_i log2(p_i).
Again, if all combinations are equiprobable (p_i = 1/256), you get -256 * (1/256) * log2(1/256) = 8, which is the maximum entropy. If you now fix the entropy (say, I want it to be 7), there is an infinite number of solutions for the p_i; there is no single unique realization of a given entropy.
You could simplify the problem a bit: let's consider again the one-parameter case, where the probability of finding a 1 is p and the probability of finding a 0 is (1-p). Then, instead of all 256 outcomes, we only need to distinguish 9 of them (by the number of 1 bits): 00000000, 00000001, 00000011, 00000111, 00001111, 00011111, 00111111, 01111111, 11111111.
For each of those cases we can write down its probability, compute the entropy, set it equal to whatever value you want, and solve back to find p.
Sampling would be relatively easy: the first step is to sample one of the 9 combinations via a discrete distribution, and the second step is to shuffle the bits inside the byte using a Fisher-Yates shuffle.
The same approach could be used for, say, 32 bits or 64 bits: you have 33 or 65 cases, construct the entropy, set it to whatever you want, find p, sample one of the cases, and then shuffle the bits inside the sampled value.
No code right now, but I could probably write some code later if there is an interest...
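For illustration only (this sketch is mine, not the answer's code, and it assumes the bits follow the Bernoulli model above, so the weight of the class with k ones is C(8,k) p^k (1-p)^(8-k)), the byte-level scheme could look like this:

#include <algorithm>
#include <array>
#include <cmath>
#include <iostream>
#include <random>

int main() {
    double p = 0.11;   // probability of a 1 bit, solved from the target entropy as above
    std::mt19937 rng(std::random_device{}());

    // Step 0: weights of the 9 classes "k ones in a byte": C(8,k) p^k (1-p)^(8-k).
    const double binom[9] = {1, 8, 28, 56, 70, 56, 28, 8, 1};
    std::array<double, 9> weights{};
    for (int k = 0; k <= 8; ++k)
        weights[k] = binom[k] * std::pow(p, k) * std::pow(1.0 - p, 8 - k);
    std::discrete_distribution<int> klass(weights.begin(), weights.end());

    for (int n = 0; n < 16; ++n) {
        int k = klass(rng);                            // step 1: how many 1 bits
        std::array<int, 8> bits{};
        std::fill(bits.begin(), bits.begin() + k, 1);
        std::shuffle(bits.begin(), bits.end(), rng);   // step 2: shuffle bit positions (Fisher-Yates)
        unsigned byte = 0;
        for (int b : bits) byte = (byte << 1) | b;
        std::cout << byte << "\n";
    }
}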
UPDATE
Keep in mind another peculiar property of fixing the entropy. Even for the simple case of a single bit, if you try to solve
-p log2(p) - (1 - p) log2(1 - p) = e
for a given e, you'll get two answers, and it is easy to understand why: the equation is symmetric with respect to p and 1-p (i.e. with respect to replacing 0s with 1s and 1s with 0s). In other words, for the entropy it is irrelevant whether you transfer information using mostly zeros or mostly ones. That is not true for things like natural text.
The entropy rate (in terms of the output byte values, not the human-readable characters) of your construction has several complications, but (for a number of generators much smaller than 256) it’s a good approximation to say that it’s the entropy of each choice (5 bits to pick the sequence plus 6 for its length) divided by the average length of the subsequences (65/2), or 0.338 bits out of a possible 8 per byte. (This is significantly lower than normal English text.) You can increase the entropy rate by defining more sequences or reducing the typical length of the subsequence drawn from each. (If the subsequence is often just one character, or the sequences number in the hundreds, collisions will necessarily reduce the entropy rate below this estimate and limit it to 8 bits per byte.)
Another easily tunable sequence class involves drawing single bytes from [0, n] with a probability p < 1/(n+1) for 0 and the others equally likely. This gives an entropy rate H = (1-p) ln(n/(1-p)) - p ln p, which lies in [ln n, ln(n+1)), so any desired rate can be selected by choosing n and then p appropriately. (Remember to use lg instead of ln if you want bits of entropy.)
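A small sketch of that construction (mine; the values n = 100 and p = 0.005 are arbitrary, chosen so that p < 1/(n+1)): build the weights, report the entropy rate from the formula, and sample with std::discrete_distribution.

#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const int n = 100;            // values are drawn from [0, n]
    const double p = 0.005;       // probability of 0; must satisfy p < 1/(n+1)

    // Entropy rate in bits per value: H = (1-p) lg(n/(1-p)) - p lg p.
    const double H = (1.0 - p) * std::log2(n / (1.0 - p)) - p * std::log2(p);
    std::cout << "entropy rate: " << H << " bits per value\n";

    // Weights: p for 0, (1-p)/n for each of 1..n.
    std::vector<double> w(n + 1, (1.0 - p) / n);
    w[0] = p;
    std::discrete_distribution<int> dist(w.begin(), w.end());

    std::mt19937 rng(std::random_device{}());
    for (int i = 0; i < 16; ++i)
        std::cout << dist(rng) << " ";
    std::cout << "\n";
}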
Related
I'm using the mt19937 generator to generate normal random numbers as shown below:
normal_distribution<double> normalDistr(0, 1);
mt19937 generator(123);
vector<double> randNums(1000000);
for (size_t i = 0; i != 1000000; ++i)
{
randNums[i] = normalDistr(generator);
}
The above code works; however, since I'm generating more than 100 million normal random numbers in my code, it is very slow.
Is there a faster way to generate normal random numbers?
The following is some background on how the code would be used:
Quality of the random numbers is not that important
Precision of the numbers is not that important, either double or float is OK
The normal distribution always has mean = 0 and sigma = 1
EDIT:
@Dúthomhas, Andrew:
After profiling, the following function is taking up more than 50% of the time:
std::normal_distribution<double>::_Eval<std::mersenne_twister_engine<unsigned int,32,624,397,31,2567483615,11,4294967295,7,2636928640,15,4022730752,18,1812433253> >
Most importantly, do you really need 100,000,000 random numbers simultaneously? The writing to and subsequent reading from RAM of all these data unavoidably requires significant time. If you only need the random numbers one at a time, you should avoid that.
Assuming that you do need all of these numbers in RAM, then you should first
profile your code if you really want to know where the CPU time is spent/lost.
Second, you should avoid unnecessary re-allocation and initialisation of the data. This is most easily done by using std::vector::reserve(final_size) in conjunction with std::vector::push_back().
Third, you could use a faster RNG than std::mt19937. That RNG is recommended when the quality of the numbers is of importance. The online documentation says that the lagged Fibonacci generator (implemented in std::subtract_with_carry_engine) is fast, but it may not have a long enough recurrence period -- you must check this. Alternatively, you may want to use std::minstd_rand (which uses a linear congruential generator):
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

std::vector<double> make_normal_random(std::size_t number,
                                       std::uint_fast32_t seed)
{
    std::normal_distribution<double> normalDistr(0, 1);
    std::minstd_rand generator(seed);   // fast LCG-based engine
    std::vector<double> randNums;
    randNums.reserve(number);           // allocate once, avoid re-allocation
    while (number--)
        randNums.push_back(normalDistr(generator));
    return randNums;
}
You will also want to look into std::vector's reserve rather than resize. It will allow you to get all the memory you will need in one shot. I am assuming you don't need all 100 million doubles at once?
If it really is the generator that is the cause of the performance degradation, then use the ordinary rand function (you need to draw numbers in pairs), transform them to a float or double in (0, 1), and then apply the Box-Muller transformation.
That will be hard to beat in terms of time, but note that the statistical properties are no better than those of rand.
A Numerical Recipes routine, gasdev, does this - you should be able to download a copy.
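A minimal sketch of the basic Box-Muller transform described above (my illustration, not the gasdev routine):

#include <cmath>
#include <cstdlib>
#include <iostream>
#include <utility>

// Basic Box-Muller: two uniforms in (0, 1] -> two independent N(0,1) values.
std::pair<double, double> box_muller()
{
    const double two_pi = 6.283185307179586;
    double u1 = (std::rand() + 1.0) / (RAND_MAX + 1.0);   // keep u1 > 0 for the log
    double u2 = (std::rand() + 1.0) / (RAND_MAX + 1.0);
    double r = std::sqrt(-2.0 * std::log(u1));
    return {r * std::cos(two_pi * u2), r * std::sin(two_pi * u2)};
}

int main()
{
    std::srand(12345);
    for (int i = 0; i < 4; ++i) {
        auto nums = box_muller();
        std::cout << nums.first << " " << nums.second << "\n";
    }
}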
Random question.
I am attempting to create a program which would generate a pseudo-random distribution. I am trying to find the right pseudo-random algorithm for my needs. These are my concerns:
1) I need one input to generate the same output every time it is used.
2) It needs to be random enough that a person who looks at the output from input 1 sees no connection between that and the output from input 2 (etc.), but there is no need for it to be cryptographically secure or truly random.
3) Its output should be a number between 0 and (29^3200)-1, with every possible integer in that range being a possible and equally (or close to equally) likely output.
4) I would like to be able to guarantee that every possible permutation of sequences of 410 outputs is also a potential output of consecutive inputs. In other words, all the possible groupings of 410 integers between 0 and (29^3200)-1 should be potential outputs of sequential inputs.
5) I would like the function to be invertible, so that I could take an integer, or a series of integers, and say which input or series of inputs would produce that result.
The method I have developed so far is to run the input through a simple Halton sequence:
// Base-3 radical inverse (Halton): reverse the base-3 digits of 'input'
// to build the numerator/denominator of a fraction in [0, 1).
boost::multiprecision::mpz_int denominator = 1;
boost::multiprecision::mpz_int numerator = 0;
while (input > 0) {
    denominator *= 3;
    numerator = numerator * 3 + (input % 3);
    input = input / 3;
}
and multiply the result by 29^3200. It meets requirements 1-3, but not 4, and it is invertible only for single integers, not for series (since not all sequences can be produced by it). I am working in C++, using Boost.Multiprecision.
Any advice someone can give me concerning a way to generate a random distribution meeting these requirements, or just a class of algorithms worth researching towards this end, would be greatly appreciated. Thank you in advance for considering my question.
----UPDATE----
Since multiple commenters have focused on the size of the numbers in question, I just wanted to make clear that I recognize the practical problems that working with such sets poses, but in asking this question I'm interested only in the theoretical or conceptual approach to the problem. For example, imagine working with a much smaller set of integers, like 0 to 99, and the permutations of output sequences of length 10. How would you design an algorithm to meet these five conditions: 1) input is deterministic, 2) output appears random (at least to the human eye), 3) every integer in the range is a possible output, 4) not only all values but also all permutations of value sequences are possible outputs, 5) the function is invertible.
---second update---
with many thanks to @Severin Pappadeux I was able to invert an LCG. I thought I'd add a little bit about what I did, to hopefully make it easier for anyone seeing this in the future. First of all, these are excellent sources on inverting modular functions:
https://www.khanacademy.org/computing/computer-science/cryptography/modarithmetic/a/modular-inverses
https://www.khanacademy.org/computer-programming/discrete-reciprocal-mod-m/6253215254052864
If you take the equation next = (a*x + c) % m, running the following code with your values for a and m will print out the Euclidean equations you need to find ainverse, as well as the value of ainverse:
// Extended Euclidean algorithm in continued-fraction form: prints the
// division steps and, finally, the modular inverse of a mod m.
// (qarray must be large enough to hold one entry per division step.)
int qarray[12];
qarray[0] = 0;
qarray[1] = 1;
int i = 2;
int reset = m;   // remember the original modulus
while (m % a > 0) {
    int remainder = m % a;
    int quotient = m / a;
    std::cout << m << " = " << quotient << "*" << a << " + " << remainder << "\n";
    qarray[i] = qarray[i-2] - (qarray[i-1] * quotient);
    m = a;
    a = remainder;
    i++;
}
if (qarray[i-1] < 0) { qarray[i-1] += reset; }   // normalise the inverse into [0, m)
std::cout << qarray[i-1] << "\n";
The other thing it took me a while to figure out is that if you get a negative result, you should add m to it. The same correction applies in the inverted equation:
prev = (ainverse * (next - c)) % m;
if (prev < 0) { prev += m; }
I hope that helps anyone who ventures down this road in the future.
Ok, I'm not sure if there is a general answer, so I would concentrate on a random number generator having, say, a 64-bit internal state/seed, producing 64-bit output and having a 2^64-1 period. In particular, I would look at the linear congruential generator (aka LCG) in the form
next = (a * prev + c) mod m
where a and m are coprime.
So:
1) Check
2) Check
3) Check (well, for 64bit space of course)
4) Check (again, except 0 I believe, but each and every permutation of 64 bits is the output of an LCG started from some seed)
5) Check. The LCG is known to be reversible, i.e. one can get
prev = ((next - c) * a_inv) mod m
where a_inv can be computed from a and m using Euclid's algorithm
Well, if it looks ok to you, you could try to implement an LCG in your 15546-bit space.
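As a concrete toy example of points 4 and 5, here is a minimal 64-bit sketch (mine) working modulo 2^64 instead of a prime modulus; the multiplier/increment are the familiar MMIX constants, and any odd multiplier is invertible mod 2^64:

#include <cstdint>
#include <iostream>

// LCG modulo 2^64: the modulus is implicit in unsigned overflow.
const std::uint64_t a = 6364136223846793005ULL;   // MMIX multiplier (odd, hence invertible)
const std::uint64_t c = 1442695040888963407ULL;   // MMIX increment

std::uint64_t next(std::uint64_t x) { return a * x + c; }

// Modular inverse of an odd number mod 2^64 via Newton's iteration:
// each step doubles the number of correct low bits.
std::uint64_t inverse(std::uint64_t v) {
    std::uint64_t inv = v;                   // already correct to 3 bits
    for (int i = 0; i < 5; ++i) inv *= 2 - v * inv;
    return inv;
}

std::uint64_t prev(std::uint64_t x) { return inverse(a) * (x - c); }

int main() {
    std::uint64_t x = 123456789ULL;
    std::uint64_t y = next(x);
    std::cout << x << " -> " << y << " -> " << prev(y) << "\n";   // round-trips back to x
}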
UPDATE
And a quick search turns up a reversible LCG discussion with code here:
Reversible pseudo-random sequence generator
In your update, "appears random (to the human eye)" is the phrasing you use. The definition of "appears random" is not a well-agreed-upon topic; there are varying degrees of tests for "randomness."
However, if you're just looking to make it appear random to the human eye, you can just use ring multiplication.
Start with the idea of generating N values between 0 and M (N >= 410, M >= 29^3200).
Group these together into one big number: we're going to generate a single number ranging from 0 to M^N. If we can show that the pseudorandom number generator generates every value from 0 to M^N, we guarantee your permutation rule.
Now we need to make it "appear random." To the human eye, linear congruential generators are enough. Pick an LCG with a period greater than or equal to M^N, satisfying the rules that ensure a complete period. The easiest way to ensure fairness is to pick an LCG of the form x' = (ax + c) mod M^N.
That'll do the trick. Now, the hard part is proving that what you did was worth your time. Consider that the period of just a 29^3200-long sequence is outside the realm of physical reality. You'll never actually use it all. Ever. Consider that a supercomputer made of Josephson junctions (10^-12 kg, processing 10^11 bits/s), weighing the mass of the entire universe (3*10^52 kg), can process roughly 10^75 bits/s. A number that can count to 29^3200 is roughly 15545 bits long, so that supercomputer can process roughly 6.5x10^71 numbers/s. This means it will take roughly 10^4600 s to merely count that high, or somewhere around 10^4592 years. Somewhere around 10^12 years from now, the stars are expected to wink out, permanently, so it could be a while.
There are M**N sequences of N numbers between 0 and M-1.
You can imagine writing all of them one after the other in a (pseudorandom) sequence and placing your read pointer randomly in the resulting loop of N*(M**N) numbers between 0 and M-1...
def output(input):
    total_length = N * (M**N)
    index = input % total_length
    # shuffle() is some fixed pseudorandom permutation of 0 .. M**N - 1:
    # it selects which of the M**N length-N sequences we are reading.
    permutation_index = shuffle(index // N, M**N)
    element = index % N                        # position within that sequence
    return (permutation_index // (M**element)) % M
Of course, for every permutation of N elements between 0 and M-1 there is a sequence of N consecutive inputs that produces it (just un-shuffle the permutation index). I'd also say (just from symmetry reasoning) that given any starting input, every possible output of the next N elements is equally probable (each number and each sequence of N numbers is equally represented over the total period).
In at least one implementation of the standard library, the first invocation of a std::uniform_int_distribution<> does not return a random value, but rather the distribution's min value. That is, given the code:
default_random_engine engine( any_seed() );
uniform_int_distribution< int > distribution( smaller, larger );
auto x = distribution( engine );
assert( x == smaller );
...x will in fact equal smaller for any values of any_seed(), smaller, or larger.
To play along at home, you can try a code sample that demonstrates this problem in gcc 4.8.1.
I trust this is not correct behavior? If it is correct behavior, why would a random distribution return this clearly non-random value?
Explanation for the observed behavior
This is how uniform_int_distribution maps the random bits to numbers if the range of possible outcomes is smaller than the range of numbers the rng produces:
const __uctype __uerange = __urange + 1; // __urange can be zero
const __uctype __scaling = __urngrange / __uerange;
const __uctype __past = __uerange * __scaling;
do
__ret = __uctype(__urng()) - __urngmin;
while (__ret >= __past);
__ret /= __scaling;
where __urange is larger - smaller and __urngrange is the difference between the maximum and the minimum value the rng can return. (Code from bits/uniform_int_dist.h in libstdc++ 6.1)
In our case, the rng default_random_engine is a minstd_rand0, which yields __scaling == 195225785 for the range [0,10] you tested with. Thus, if rng() < 195225785, the distribution will return 0.
The first number a minstd_rand0 returns is
(16807 * seed) % 2147483647
(where seed == 0 gets adjusted to 1 btw). We can thus see that the first value produced by a minstd_rand0 seeded with a number smaller than 11615 will yield 0 with the uniform_int_distribution< int > distribution( 0, 10 ); you used. (mod off-by-one-errors on my part. ;) )
You mentioned the problem going away for bigger seeds: As soon as the seeds get big enough to actually make the mod operation do something, we cannot simply assign a whole range of values to the same output by division, so the results will look better.
Does that mean (libstdc++'s impl of) <random> is broken?
No. You introduced significant bias in what is supposed to be a random 32 bit seed by always choosing it small. That bias showing up in the results is not surprising or evil. For random seeds, even your minstd_rand0 will yield a fairly uniformly random first value. (Though the sequence of numbers after that will not be of great statistical quality.)
What can we do about this?
Case 1: You want random number of high statistical quality.
For that, you use a better rng like mt19937 and seed its entire state space. For the Mersenne Twister, that's 624 32-bit integers. (For reference, here is my attempt to do this properly with some helpful suggestions in the answer.)
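For illustration (this is a sketch of mine, not the linked attempt), a common way to do that is to fill a std::seed_seq from std::random_device with as many words as the engine's state:

#include <algorithm>
#include <array>
#include <cstdint>
#include <functional>
#include <random>

// Seed all 624 words of the mt19937 state instead of a single 32-bit value.
std::mt19937 make_seeded_mt19937() {
    std::array<std::uint32_t, std::mt19937::state_size> seed_data;
    std::random_device rd;
    std::generate(seed_data.begin(), seed_data.end(), std::ref(rd));
    std::seed_seq seq(seed_data.begin(), seed_data.end());
    return std::mt19937(seq);
}

(std::seed_seq scrambles the input rather than copying it verbatim into the state, but this is the usual practical approach.)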
Case 2: You really want to use those small seeds only.
We can still get decent results out of this. The problem is that pseudo-random number generators commonly depend "somewhat continuously" on their seed. To get around this, we discard enough numbers to let the initially similar sequences of output diverge. So if your seed must be small, you can initialize your rng like this:
std::mt19937 rng(smallSeed);
rng.discard(700000);
It is vital that you use a good rng like the Mersenne Twister for this. I do not know of any method to get even decent values out of a poorly seeded minstd_rand0, for example see this train-wreck. Even if seeded properly, the statistical properties of a mt19937 are superior by far.
Concerns about the large state space or slow generation you sometimes hear about are usually of no concern outside the embedded world. According to boost and cacert.at, the MT is even way faster than minstd_rand0.
You still need to do the discard trick though, even if your results look good to the naked eye without. It takes less than a millisecond on my system, and you don't seed rngs very often, so there is no reason not to.
Note that I am not able to give you a sharp estimate for the number of discards we need; I took that value from this answer, which links this paper for a rationale. I don't have the time to work through that right now.
The problem statement is the following:
Xorq has invented an encryption algorithm which uses bitwise XOR operations extensively. This encryption algorithm uses a sequence of non-negative integers x1, x2, ... xn as its key. To implement this algorithm efficiently, Xorq needs to find the maximum value of (a xor xj) for given integers a, p and q, such that p <= j <= q. Help Xorq to implement this function.
Input
First line of input contains a single integer T (1<=T<=6). T test cases follow.
First line of each test case contains two integers N and Q separated by a single space (1<= N<=100,000; 1<=Q<= 50,000). Next line contains N integers x1, x2, … xn separated by a single space (0<=xi< 2^15). Each of next Q lines describe a query which consists of three integers ai,pi and qi (0<=ai< 2^15, 1<=pi<=qi<= N).
Output
For each query, print the maximum value for (ai xor xj) such that pi<=j<=qi in a single line.
#include <iostream>

int xArray[100000];

int main()
{
    int t, n, q;
    std::cin >> t;
    for (int j = 0; j < t; j++)
    {
        std::cin >> n >> q;
        //int* xArray = (int*)malloc(n*sizeof(int));
        int i, a, pi, qi;
        for (i = 0; i < n; i++)
        {
            std::cin >> xArray[i];
        }
        for (i = 0; i < q; i++)
        {
            std::cin >> a >> pi >> qi;
            int max = 0;
            // Brute force: scan every key in the query range.
            for (int it = pi - 1; it < qi; it++)
            {
                int v = xArray[it] ^ a;
                if (v > max)
                    max = v;
            }
            std::cout << max << "\n";
        }
    }
}
No other assumptions may be made except for those stated in the text of the problem (numbers are not sorted).
The code is functional but not fast enough; is reading from stdin really that slow or is there anything else I'm missing?
XOR flips bits. The maximum result of an XOR is 0b11111111 (all bits set).
To get the best result:
if 'a' has a 1 in the ith place, then you have to XOR it with a key whose ith bit is 0
if 'a' has a 0 in the ith place, then you have to XOR it with a key whose ith bit is 1
put simply, for bit B you need !B
Another obvious thing is that higher-order bits are more important than lower-order bits.
That is:
if 'a' has B in the highest place and you have found a key whose highest bit is !B
then ALL keys whose highest bit is B are worse than this one
This cuts the number of candidates roughly in half, on average.
How about building a big binary tree from all the keys, ordering them in the tree by their bits, from MSB to LSB? Then, walking A bit-by-bit from MSB to LSB would tell you which left/right branch to take next to get the best result. Of course, that ignores the PI/QI limits, but it would surely give you the best result, since you always pick the best available bit at the i-th level.
Now, if you annotate the tree nodes with the low/high index ranges of their sub-elements (done only once, when building the tree), then later, when querying against a case A-PI-QI, you can use that to filter out branches that do not fall in the index range.
The point is that if you order the tree levels in MSB-to-LSB bit order, then the decisions made at the "upper nodes" guarantee that you are currently in the best possible branch, and that holds even if all the sub-branches turn out to be the worst:
Being at level 3, the result of
0b111?????
can be then expanded into
0b11100000
0b11100001
0b11100010
and so on, but even if the ????? are expanded poorly, the overall result is still greater than
0b11011111
which would be the best possible result had you picked the other branch at the 3rd level.
I have absolutely no idea how much preparing the tree would cost, but querying it for an A-PI-QI over 32-bit values seems to be something like 32 comparisons and jumps, certainly faster than iterating up to 100,000 times and XOR/maxing. And since you have up to 50,000 queries, building such a tree can actually be a good investment, since the tree is built once per key set.
Now, the best part is that you actually don't need the whole tree. You may build it from, e.g., the first two, four, or eight bits only, and use the index ranges stored in the nodes to limit your XOR-max loop to a smaller part. At worst, you'd end up with the same range as Pi..Qi. At best, it'd be down to one element.
But, given the maximum N of 100,000 keys, I think the whole tree might actually fit in the memory pool, and you may get away without any XOR-maxing loop.
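For what it's worth, here is a minimal sketch (mine) of such a bitwise trie for 15-bit keys, with the maximum-XOR walk but without the index-range filtering described above:

#include <array>
#include <iostream>
#include <vector>

struct Trie {
    std::vector<std::array<int, 2>> node;   // node[i][bit] = child index, 0 = absent
    Trie() : node(1) {}                     // start with an empty root

    void insert(int key) {
        int cur = 0;
        for (int b = 14; b >= 0; --b) {     // keys are < 2^15, walk MSB -> LSB
            int bit = (key >> b) & 1;
            if (!node[cur][bit]) {
                node[cur][bit] = (int)node.size();
                node.push_back({0, 0});
            }
            cur = node[cur][bit];
        }
    }

    int max_xor(int a) const {
        int cur = 0, best = 0;
        for (int b = 14; b >= 0; --b) {
            int want = 1 - ((a >> b) & 1);  // prefer the opposite bit of 'a'
            if (node[cur][want]) { best |= 1 << b; cur = node[cur][want]; }
            else                 { cur = node[cur][1 - want]; }
        }
        return best;
    }
};

int main() {
    Trie t;
    for (int x : {3, 10, 25, 17}) t.insert(x);
    std::cout << t.max_xor(6) << "\n";      // prints 31, i.e. 6 ^ 25
}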
I've spent some time googling this problem and it seems that you can find it in the context of various programming competitions. While the brute-force approach is intuitive, it does not really solve the challenge, as it is too slow.
There are a few constraints in the problem which you need to exploit in order to write a faster algorithm:
the input consists of at most 100k numbers, but there are only 32768 (2^15) possible values
for each input array there are Q (at most 50k) test cases; each test case consists of 3 values: a, pi, and qi. Since 0 <= a < 2^15 and there are 50k cases, there is a chance the same value of a will come up again.
I've found 2 ideas for solving the problem: splitting the input into sqrt(N) intervals, and building a segment tree (a nice explanation of these approaches can be found here).
The biggest problem is the fact that each test case can have a different value of a, which would make previous results useless when there are only a few test cases, since you need to compute max(a ^ x[i]) afresh each time. However, when Q is large enough and a value of a repeats, reusing previous results becomes possible.
I will come back with the actual results once I finish implementing both methods.
How can I generate a random number within the range 0 to n, where n can be greater than RAND_MAX, in C/C++?
Thanks.
Split the generation into two phases, then combine the resulting numbers.
Random numbers are a very specialized subject, and unless you are a maths junkie it is very easy to get them wrong. So I would advise against building a random number from multiple sources; you should use a good library.
I would first look at Boost.Random.
If that is not sufficient, try the newsgroup sci.crypt.random-numbers.
Ask the question there; they should be able to help.
Suppose you want to generate a 64-bit random number; you could do this:
uint64_t n = 0;
for (int i = 0; i < 8; ++i) {
    // generate_8bit_random_num() stands for whatever 8-bit source you have
    uint64_t x = generate_8bit_random_num();
    n = (n << 8) | x;   // shift the previous bytes up and append the new one
}
Of course you could do it 16/32 bits at a time too, but this illustrates the concept.
How you generate those 8/16/32-bit random numbers is up to you. It could be as simple as rand() & 0xff, or something better, depending on how much you care about the randomness.
Assuming C++, have you tried looking at a decent random number library, like Boost.Random? Otherwise you may have to combine multiple random numbers.
If you're looking for a uniform distribution (or any distribution, for that matter), you must take care that the statistical properties of the output are sufficient for your needs. If you can't use the output of a random number generator directly, you should be very careful when trying to combine numbers to achieve your needs.
At a bare minimum you should make sure the distribution is appropriate. If you're looking for a uniform distribution of integers from 0 to M, and you have some uniform random number generator g() that produces outputs smaller than M, make sure you do not do one of the following (a bias-free alternative is sketched below):
add k outputs of g() together until they're large enough (the result is nonuniform)
take r = g() + (g() << 16), then compute r % M (if the range of r is not an even multiple of M, it will weight certain values in the range slightly more than others; the shift-left itself is questionable unless g() outputs a range between 0 and a power of 2 minus 1)
Beyond that, there is the potential for cross-correlation between terms of the sequence (random number generators are supposed to produce independent identically-distributed outputs).
Read The Art of Computer Programming vol. 2 (Knuth) and/or Numerical Recipes and ask questions until you feel confident.
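By contrast, a small sketch (mine, using a stand-in rand16() source rather than any particular library) of the usual bias-free way to combine outputs: build a wide word from independent draws, then use rejection to map it uniformly onto [0, M):

#include <cstdint>
#include <iostream>
#include <random>

// Stand-in 16-bit uniform source for this sketch; substitute whatever you have.
uint16_t rand16()
{
    static std::mt19937 rng(12345);
    return (uint16_t)(rng() & 0xFFFFu);
}

// Uniform value in [0, M), M >= 1: build 32 bits from two 16-bit draws,
// then reject the top sliver to remove the modulo bias.
uint32_t uniform_below(uint32_t M)
{
    const uint64_t total = 1ull << 32;            // number of distinct 32-bit draws
    const uint64_t limit = total - (total % M);   // accept only r < limit (a multiple of M)
    uint64_t r;
    do {
        r = ((uint64_t)rand16() << 16) | rand16();
    } while (r >= limit);
    return (uint32_t)(r % M);
}

int main()
{
    for (int i = 0; i < 5; ++i)
        std::cout << uniform_below(1000000007u) << "\n";
}

This assumes rand16() really is uniform over all 16 bits; otherwise the concatenation itself is already biased.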
If your implementation has an integer type large enough to hold the result you need, it's generally easier to get a decent distribution by simply using a generator that produces the required range than to try to combine outputs from the smaller generator.
Of course, in most cases, you can just download code for something like the Mersenne Twister or (if you need a cryptographic quality generator) Blum-Blum-Shub, and forget about writing your own.
Do x random numbers (from 0 to RAND_MAX) and add them together, where
x = n % RAND_MAX
Consider a random variable which can take on values {0, 1} with P(0) = P(1) = 0.5. If you want to generate random values between 0 to 2 by summing two independent draws, you will have P(0) = 0.25, P(1) = 0.5 and P(2) = 0.25.
Therefore, use an appropriate library unless you do not care at all about the PDF of the RNG.
See also Chapter 7 in Numerical Recipes. (This is a link to the older edition but that's the one I studied anyway ;-)
There are many ways to do this.
If you are OK with less granularity (a higher chance of dupes), then something like (in pseudocode) rand() * n / RAND_MAX will work to spread the values across a larger range. The catch is that in your real code you'll need to avoid overflow, either by casting rand() or n to a large-enough type (e.g. a 64-bit int if RAND_MAX is 0xFFFFFFFF) to hold the multiplication result without overflow, or by using a multiply-then-divide API (like GNU's MulDiv64 or Win32's MulDiv) which is optimized for this scenario; see the sketch below.
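For instance, a tiny sketch of that scaling approach (mine), assuming the product fits in 64 bits:

#include <cstdint>
#include <cstdlib>

// Spread rand() over [0, n]; coarse: at most RAND_MAX + 1 distinct values,
// and n must be small enough that rand() * n fits in 64 bits.
uint64_t rand_up_to(uint64_t n)
{
    return (uint64_t)std::rand() * n / RAND_MAX;
}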
If you want granularity down to each integer, you can call rand() multiple times and combine the results. Another answer suggests calling rand() for each 8-bit/16-bit/32-bit chunk, depending on the size of RAND_MAX.
But, IMHO, the above ideas can rapidly get complicated, inaccurate, or both. Generating random numbers is a solved problem in other libraries, and it's probably much easier to borrow existing code (e.g. from Boost) than try to roll your own. Open source random number generation algorithm in C++? has answers with more links if you want something besides Boost.
[ EDIT: revising after having a busy day... meant to get back and clean up my quick answer this morning, but got pulled away and only getting back now. :-) ]