Pick a matrix cell according to its probability - c++

I have a 2D matrix of positive real values, stored as follow:
vector<vector<double>> matrix;
Each cell can have a value equal or greater to 0, and this value represents the possibility of the cell to be chosen. In particular, for example, a cell with a value equals to 3 has three times the probability to be chosen compared to a cell with value 1.
I need to select N cells of the matrix (0 <= N <= total number of cells) randomly, but according to their probability to be selected.
How can I do that?
The algorithm should be as fast as possible.

I describe two methods, A and B.
A works in time approximately N * number of cells, and uses space O(log number of cells). It is good when N is small.
B works in time approximately (number of cells + N) * O(log number of cells), and uses space O(number of cells). So, it is good when N is large (or even, 'medium') but uses a lot more memory, in practice it might be slower in some regimes for that reason.
Method A:
The first thing you need to do is normalize the entries. (It's not clear to me if you assume they are normalized or not.) That means, sum all the entries and divide by the sum. (This part is potentially slow, so it's better if you assume or require that it already happened.)
Then you sample like this:
Choose a random [i,j] entry of the matrix (by choosing i,j each uniformly randomly from the range of integers 0 to n-1).
Choose a uniformly random real number p in the range [0, 1].
Check if matrix[i][j] > p. If so, return the pair [i][j]. If not, go back to step 1.
Why does this work? The probability that we end at step 3 with any particular output, is equal to, the probability that [i][j] was selected (this is the same for each entry), times the probality that the number p was small enough. This is proportional to the value matrix[i][j], so the sampling is choosing each entry with the correct proportions. It's also possible that at step 3 we go back to the start -- does that bias things? Basically, no. The reason is, suppose we arbitrarily choose a number k and then consider the distribution of the algorithm, conditioned on stopping exactly after k rounds. Conditioned on the assumption that we stop at the k'th round, no matter what value k we choose, the distribution we sample has to be exactly right by the above argument. Since if we eliminate the case that p is too small, the other possibilities all have their proportions correct. Since the distribution is perfect for each value of k that we might condition on, and the overall distribution (not conditioned on k) is an average of the distributions for each value of k, the overall distribution is perfect also.
If you want to analyze the number of rounds that typically needed in a rigorous way, you can do it by analyzing the probability that we actually stop at step 3 for any particular round. Since the rounds are independent, this is the same for every round, and statistically, it means that the running time of the algorithm is poisson distributed. That means it is tightly concentrated around its mean, and we can determine the mean by knowing that probability.
The probability that we stop at step 3 can be determined by considering the conditional probability that we stop at step 3, given that we chose any particular entry [i][j]. By the formulas for conditional expectation, you get that
Pr[ stop at step 3 ] = sum_{i,j} ( 1/(n^2) * Matrix[i,j] )
Since we assumed the matrix is normalized, this sum reduces to just 1/n^2. So, the expected number of rounds is about n^2 (that is, n^2 up to a constant factor) no matter what the entries in the matrix are. You can't hope to do a lot better than that I think -- that's about the same amount of time it takes to just read all the entries of the matrix, and it's hard to sample from a distribution that you cannot even read all of.
Note: What I described is a way to correctly sample a single element -- to get N elements from one matrix, you can just repeat it N times.
Method B:
Basically you just want to compute a histogram and sample inversely from it, so that you know you get exactly the right distribution. Computing the histogram is expensive, but once you have it, getting samples is cheap and easy.
In C++ it might look like this:
// Make histogram
typedef unsigned int uint;
typedef std::pair<uint, uint> upair;
typedef std::map<double, upair> histogram_type;
histogram_type histogram;
double cumulative = 0.0f;
for (uint i = 0; i < Matrix.size(); ++i) {
for (uint j = 0; j < Matrix[i].size(); ++j) {
cumulative += Matrix[i][j];
histogram[cumulative] = std::make_pair(i,j);
}
}
std::vector<upair> result;
for (uint k = 0; k < N; ++k) {
// Do a sample (this should never repeat... if it does not find a lower bound you could also assert false quite reasonably since it means something is wrong with rand() implementation)
while(1) {
double p = cumulative * rand(); // Or, for best results use std::mt19937 or boost::mt19937 and sample a real in the range [0,1] here.
histogram_type::iterator it = histogram::lower_bound(p);
if (it != histogram.end()) {
result.push_back(it->second);
break;
}
}
}
return result;
Here the time to make the histogram is something like number of cells * O(log number of cells) since inserting into the map takes time O(log n). You need an ordered data structure in order to get cheap lookup N * O(log number of cells) later when you do repeated sampling. Possibly you could choose a more specialized data structure to go faster, but I think there's only limited room for improvement.
Edit: As #Bob__ points out in comments, in method (B) a written there is potentially going to be some error due to floating point round-off if the matrices are quite large, even using type double, at this line:
cumulative += Matrix[i][j];
The problem is that, if cumulative is much larger than Matrix[i][j] beyond what the floating point precision can handle then these each time this statement is executed you may observe significant errors which accumulate to introduce significant inaccuracy.
As he suggests, if that happens, the most straightforward way to fix it is to sort the values Matrix[i][j] first. You could even do this in the general implementation to be safe -- sorting these guys isn't going to take more time asymptotically than you already have anyways.

Related

efficiently mask-out exactly 30% of array with 1M entries

My question's header is similar to this link, however that one wasn't answered to my expectations.
I have an array of integers (1 000 000 entries), and need to mask exactly 30% of elements.
My approach is to loop over elements and roll a dice for each one. Doing it in a non-interrupted manner is good for cache coherency.
As soon as I notice that exactly 300 000 of elements were indeed masked, I need to stop. However, I might reach the end of an array and have only 200 000 elements masked, forcing me to loop a second time, maybe even a third, etc.
What's the most efficient way to ensure I won't have to loop a second time, and not being biased towards picking some elements?
Edit:
//I need to preserve the order of elements.
//For instance, I might have:
[12, 14, 1, 24, 5, 8]
//Masking away 30% might give me:
[0, 14, 1, 24, 0, 8]
The result of masking must be the original array, with some elements set to zero
Just do a fisher-yates shuffle but stop at only 300000 iterations. The last 300000 elements will be the randomly chosen ones.
std::size_t size = 1000000;
for(std::size_t i = 0; i < 300000; ++i)
{
std::size_t r = std::rand() % size;
std::swap(array[r], array[size-1]);
--size;
}
I'm using std::rand for brevity. Obviously you want to use something better.
The other way is this:
for(std::size_t i = 0; i < 300000;)
{
std::size_t r = rand() % 1000000;
if(array[r] != 0)
{
array[r] = 0;
++i;
}
}
Which has no bias and does not reorder elements, but is inferior to fisher yates, especially for high percentages.
When I see a massive list, my mind always goes first to divide-and-conquer.
I won't be writing out a fully-fleshed algorithm here, just a skeleton. You seem like you have enough of a clue to take decent idea and run with it. I think I only need to point you in the right direction. With that said...
We'd need an RNG that can return a suitably-distributed value for how many masked values could potentially be below a given cut point in the list. I'll use the halfway point of the list for said cut. Some statistician can probably set you up with the right RNG function. (Anyone?) I don't want to assume it's just uniformly random [0..mask_count), but it might be.
Given that, you might do something like this:
// the magic RNG your stats homework will provide
int random_split_sub_count_lo( int count, int sub_count, int split_point );
void mask_random_sublist( int *list, int list_count, int sub_count )
{
if (list_count > SOME_SMALL_THRESHOLD)
{
int list_count_lo = list_count / 2; // arbitrary
int list_count_hi = list_count - list_count_lo;
int sub_count_lo = random_split_sub_count_lo( list_count, mask_count, list_count_lo );
int sub_count_hi = list_count - sub_count_lo;
mask( list, list_count_lo, sub_count_lo );
mask( list + sub_count_lo, list_count_hi, sub_count_hi );
}
else
{
// insert here some simple/obvious/naive implementation that
// would be ludicrous to use on a massive list due to complexity,
// but which works great on very small lists. I'm assuming you
// can do this part yourself.
}
}
Assuming you can find someone more informed on statistical distributions than I to provide you with a lead on the randomizer you need to split the sublist count, this should give you O(n) performance, with 'n' being the number of masked entries. Also, since the recursion is set up to traverse the actual physical array in constantly-ascending-index order, cache usage should be as optimal as it's gonna get.
Caveat: There may be minor distribution issues due to the discrete nature of the list versus the 30% fraction as you recurse down and down to smaller list sizes. In practice, I suspect this may not matter much, but whatever person this solution is meant for may not be satisfied that the random distribution is truly uniform when viewed under the microscope. YMMV, I guess.
Here's one suggestion. One million bits is only 128K which is not an onerous amount.
So create a bit array with all items initialised to zero. Then randomly select 300,000 of them (accounting for duplicates, of course) and mark those bits as one.
Then you can run through the bit array and, any that are set to one (or zero, if your idea of masking means you want to process the other 700,000), do whatever action you wish to the corresponding entry in the original array.
If you want to ensure there's no possibility of duplicates when randomly selecting them, just trade off space for time by using a Fisher-Yates shuffle.
Construct an collection of all the indices and, for each of the 700,000 you want removed (or 300,000 if, as mentioned, masking means you want to process the other ones) you want selected:
pick one at random from the remaining set.
copy the final element over the one selected.
reduce the set size.
This will leave you with a random subset of indices that you can use to process the integers in the main array.
You want reservoir sampling. Sample code courtesy of Wikipedia:
(*
S has items to sample, R will contain the result
*)
ReservoirSample(S[1..n], R[1..k])
// fill the reservoir array
for i = 1 to k
R[i] := S[i]
// replace elements with gradually decreasing probability
for i = k+1 to n
j := random(1, i) // important: inclusive range
if j <= k
R[j] := S[i]

What is the Big-O of code that uses random number generators?

I want to fill the array 'a' with random values from 1 to N (no repeated values). Lets suppose Big-O of randInt(i, j) is O(1) and this function generates random values from i to j.
Examples of the output are:
{1,2,3,4,5} or {2,3,1,4,5} or {5,4,2,1,3} but not {1,2,1,3,4}
#include<set>
using std::set;
set<int> S;// space O(N) ?
int a[N]; // space O(N)
int i = 0; // space O(1)
do {
int val = randInt(1,N); //space O(1), time O(1) variable val is created many times ?
if (S.find(val) != S.end()) { //time O(log N)?
a[i] = val; // time O(1)
i++; // time O(1)
S.insert(val); // time O(log N) <-- we execute N times O(N log N)
}
} while(S.size() < N); // time O(1)
The While Loop will continue until we generate all the values from 1 to N.
My understanding is that Set sorts the values in logarithmic time log(N), and inserts in log(N).
Big-O = O(1) + O(X*log N) + O(N*log N) = O(X*log N)
Where X the more, the high probability to generate a number that is not in the Set.
time O(X log N)
space O(2N+1) => O(N), we reuse the space of val
Where ?? it is very hard to generate all different numbers each time randInt is executed, so at least I expect to execute N times.
Is the variable X created many times ?
What would be the a good value for X?
Suppose that the RNG is ideal. That is, repeated calls to randInt(1,N) generate an i.i.d. (independent and identically distributed) sequence of values uniformly distributed on {1,...,N}.
(Of course, in reality the RNG won't be ideal. But let's go with it since it makes the math easier.)
Average case
In the first iteration, a random value val1 is chosen which of course is not in the set S yet.
In the next iteration, another random value is chosen.
With probability (N-1)/N, it will be distinct from val1 and the inner conditional will be executed. In this case, call the chosen value val2.
Otherwise (with probability 1/N), the chosen value will be equal to val1. Retry.
How many iterations does it take on average until a valid (distinct from val1) val2 is chosen? Well, we have an independent sequence of attempts, each of which succeeds with probability (N-1)/N, and we want to know how many attempts it takes on average until the first success. This is a geometric distribution, and in general a geometric distribution with success probability p has mean 1/p. Thus, it takes N/(N-1) attempts on average to choose val2.
Similarly, it takes N/(N-2) attempts on average to choose val3 distinct from val1 and val2, and so on. Finally, the N-th value takes N/1 = N attempts on average.
In total the do loop will be executed
times on average. The sum is the N-th harmonic number which can be roughly approximated by ln(N). (There's a well-known better approximation which is a bit more complicated and involves the Euler-Mascheroni constant, but ln(N) is good enough for finding asymptotic complexity.)
So to an approximation, the average number of iterations will be N ln N.
What about the rest of the algorithm? Things like inserting N things into a set also take at most O(N log N) time, so can be disregarded. The big remaining thing is that each iteration you have to check if the chosen random value lies in S, which takes logarithmic time in the current size of S. So we have to compute
which, from numerical experiments, appears to be approximately equal to N/2 * (ln N)^2 for large N. (Consider asking for a proof of this on math.SE, perhaps.) EDIT: See this math.SE answer for a short informal proof, and the other answer to that question for a more formal proof.
So in conclusion, the total average complexity is Θ(N (ln N)^2).
Again, this is assuming that the RNG is ideal.
Worst case
Like xaxxon mentioned, it is in principle possible (though unlikely) that the algorithm will not terminate at all. Thus, the worst case complexity would be O(∞).
That's a very bad algorithm for achieving your goal.
Simply fill the array with the numbers 1 through N and then shuffle.
That's O(N)
https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
To shuffle, pick an index between 0 and N-1 and swap it with index 0. Then pick an index between 1 and N-1 and swap it with index 1. All the way until the end of the list.
In terms of your specific question, it depends on the behavior of your random number generator. If it's truly random, it may never complete. If it's pseudorandom, it depends on the period of the generator. If it has a period of 5, then you'll never have any dupes.
It's catastrophically bad code with complex behaviour. Generating the first number is O(1), Then the second involves a binary search, so a log N, plus a rerun of the generator should the number be found. The chance of getting an new number is p = 1- i/N. So the average number of re-runs is the reciprocal, and gives you another factor of N. So O(N^2 log N).
The way to do it is to generate the numbers, then shuffle them. That's O(N).

Random pairs of different bits

I have the following problem. I have a number represented in binary representation. I need a way to randomly select two bits of them that are different (i.e. find a 1 and a 0). Besides this I run other operations on that number (reversing sequences, permute sequences,...) These are the approaches I already used:
Keep track of all the ones and the zeros. When I create the binary representation of the binary number I store the places of the 0's and 1's. So that I can choose an index for one list and one index from the other one. I then have two different bits. To run my other operations I created those from an elementary swap operations which updates the indices of the 1's and 0's when manipulating. Therefore I have a third list that stores the list index for each bit. If a bit is 1 I know where to find in the list with all the indices of the ones (same goes for zeros).
The method above yields some overhead when operations are done that do not require the bits to be different. So another way would be to create the lists whenever different bits are needed.
Does anyone have a better idea to do this? I need these operations to be really fast (I am working with popcount, clz, and other binary operations)
I don't feel as though I have enough information to assess the tradeoffs properly, but perhaps you'll find this idea useful. To find a random 1 in a word (find a 1 over multiple words by popcount and reservoir sampling; find a 0 by complementing), first test the popcount. If the popcount is high, then generate indexes uniformly at random and test them until a one is found. If the popcount is medium, then take bitwise ANDs with uniform random masks (but keep the original if the AND is zero) to reduce the popcount. When the popcount is low, use clz to compile the (small) list of candidates efficiently and then sample uniformly at random.
I think the following might be a rather efficient algorithm to do what you are asking. You only iterate over each bit in the number once, and for each element, you have to generate a random number (not exactly sure how costly that is, but I believe there are some optimized CPU instructions for getting random numbers).
Idea is to iterate over all the bits, and with the right probability, update the index to the current index you are visiting.
Generic pseudocode for getting an element from a stream/array:
p = 1
e = null
for s in stream:
with probability 1/p:
replace e with s
p++
return e
Java version:
int[] getIdx(int n){
int oneIdx = 0;
int zeroIdx = 0;
int ones = 1;
int zeros = 1;
// this loop depends on whether you want to select all the prepended zeros
// in a 32/64 bit representation. Alter to your liking...
for(int i = n, j = 0; i > 0; i = i >>> 1, j++){
if((i & 1) == 1){ // current element is 1
if(Math.random() < 1/(float)ones){
oneIdx = j;
}
ones++;
} else{ // element is 0
if(Math.random() < 1/(float)zeros){
zeroIdx = j;
}
zeros++;
}
}
return new int[]{zeroIdx,oneIdx};
}
An optimization you might look into is to do the probability selection using ints instead of floats, might be slightly faster. Here is a short proof I did some time ago regarding that this works: here . I believe the algorithm is attributed to Knuth but can't remember exactly.

Generating random integers with a difference constraint

I have the following problem:
Generate M uniformly random integers from the range 0-N, where N >> M, and where no pair has a difference less than K. where M >> K.
At the moment the best method I can think of is to maintain a sorted list, then determine the lower bound of the current generated integer and test it with the lower and upper elements, if it's ok to then insert the element in between. This is of complexity O(nlogn).
Would there happen to be a more efficient algorithm?
An example of the problem:
Generate 1000 uniformly random integers between zero and 100million where the difference between any two integers is no less than 1000
A comprehensive way to solve this would be to:
Determine all the combinations of n-choose-m that satisfy the constraint, lets called it set X
Select a uniformly random integer i in the range [0,|X|).
Select the i'th combination from X as the result.
This solution is problematic when the n-choose-m is large, as enumerating and storing all possible combinations will be extremely costly. Hence an efficient online generating solution is sought.
Note: The following is a C++ implementation of the solution provided by pentadecagon
std::vector<int> generate_random(const int n, const int m, const int k)
{
if ((n < m) || (m < k))
return std::vector<int>();
std::random_device source;
std::mt19937 generator(source());
std::uniform_int_distribution<> distribution(0, n - (m - 1) * k);
std::vector<int> result_list;
result_list.reserve(m);
for (int i = 0; i < m; ++i)
{
result_list.push_back(distribution(generator));
}
std::sort(std::begin(result_list),std::end(result_list));
for (int i = 0; i < m; ++i)
{
result_list[i] += (i * k);
}
return result_list;
}
http://ideone.com/KOeR4R
.
EDIT: I adapted the text for the requirement to create ordered sequences, each with the same probability.
Create random numbers a_i for i=0..M-1 without duplicates. Sort them. Then create numbers
b_i=a_i + i*(K-1)
Given the construction, those numbers b_i have the required gaps, because the a_i already have gaps of at least 1. In order to make sure those b values cover exactly the required range [1..N], you must ensure a_i are picked from a range [1..N-(M-1)*(K-1)]. This way you get truly independent numbers. Well, as independent as possible given the required gap. Because of the sorting you get O(M log M) performance again, but this shouldn't be too bad. Sorting is typically very fast. In Python it looks like this:
import random
def random_list( N, M, K ):
s = set()
while len(s) < M:
s.add( random.randint( 1, N-(M-1)*(K-1) ) )
res = sorted( s )
for i in range(M):
res[i] += i * (K-1)
return res
First off: this will be an attempt to show that there's a bijection between the (M+1)- compositions (with the slight modification that we will allow addends to be 0) of the value N - (M-1)*K and the valid solutions to your problem. After that, we only have to pick one of those compositions uniformly at random and apply the bijection.
Bijection:
Let
Then the xi form an M+1-composition (with 0 addends allowed) of the value on the left (notice that the xi do not have to be monotonically increasing!).
From this we get a valid solution
by setting the values mi as follows:
We see that the distance between mi and mi + 1 is at least K, and mM is at most N (compare the choice of the composition we started out with). This means that every (M+1)-composition that fulfills the conditions above defines exactly one valid solution to your problem. (You'll notice that we only use the xM as a way to make the sum turn out right, we don't use it for the construction of the mi.)
To see that this gives a bijection, we need to see that the construction can be reversed; for this purpose, let
be a given solution fulfilling your conditions. To get the composition this is constructed from, define the xi as follows:
Now first, all xi are at least 0, so that's alright. To see that they form a valid composition (again, every xi is allowed to be 0) of the value given above, consider:
The third equality follows since we have this telescoping sum that cancels out almost all mi.
So we've seen that the described construction gives a bijection between the described compositions of N - (M-1)*K and the valid solutions to your problem. All we have to do now is pick one of those compositions uniformly at random and apply the construction to get a solution.
Picking a composition uniformly at random
Each of the described compositions can be uniquely identified in the following way (compare this for illustration): reserve N - (M-1)*K spaces for the unary notation of that value, and another M spaces for M commas. We get an (M+1)- composition of N - (M-1)*K by choosing M of the N - (M-1)*K + M spaces, putting commas there, and filling the rest with |. Then let x0 be the number of | before the first comma, xM+1 the number of | after the last comma, and all other xi the number of | between commas i and i+1. So all we have to do is pick an M-element subset of the integer interval[1; N - (M-1)*K + M] uniformly at random, which we can do for example with the Fisher-Yates shuffle in O(N + M log M) (we need to sort the M delimiters to build the composition) since M*K needs to be in O(N) for any solutions to exist. So if N is bigger than M by at least a logarithmic factor, then this is linear in N.
Note: #DavidEisenstat suggested that there are more space efficient ways of picking the M-element subset of that interval; I'm not aware of any, I'm afraid.
You can get an error-proof algorithm out of this by doing the simple input validation we get from the construction above that N ≥ (M-1) * K and that all three values are at least 1 (or 0, if you define the empty set as a valid solution for that case).
Why not do this:
for (int i = 0; i < M; ++i) {
pick a random number between K and N/M
add this number to (N/M)* i;
Now you have M random numbers, distributed evenly along N, all of which have a difference of at least K. It's in O(n) time. As an added bonus, it's already sorted. :-)
EDIT:
Actually, the "pick a random number" part shouldn't be between K and N/M, but between min(K, [K - (N/M * i - previous value)]). That would ensure that the differences are still at least K, and not exclude values that should not be missed.
Second EDIT:
Well, the first case shouldn't be between K and N/M - it should be between 0 and N/M. Just like you need special casing for when you get close to the N/M*i border, we need special initial casing.
Aside from that, the issue you brought up in your comments was fair representation, and you're right. As my pseudocode is presented, it currently completely misses the excess between N/M*M and N. It's another edge case; simply change the random values of your last range.
Now, in this case, your distribution will be different for the last range. Since you have more numbers, you have slightly less chance for each number than you do for all the other ranges. My understanding is that because you're using ">>", this shouldn't really impact the distribution, i.e. the difference in size in the sample set should be nominal. But if you want to make it more fair, you divide the excess equally among each range. This makes your initial range calculation more complex - you'll have to augment each range based on how much remainder there is divided by M.
There are lots of special cases to look out for, but they're all able to be handled. I kept the pseudocode very basic just to make sure that the general concept came through clearly. If nothing else, it should be a good starting point.
Third and Final EDIT:
For those worried that the distribution has a forced evenness, I still claim that there's nothing saying it can't. The selection is uniformly distributed in each segment. There is a linear way to keep it uneven, but that also has a trade-off: if one value is selected extremely high (which should be unlikely given a very large N), then all the other values are constrained:
int prevValue = 0;
int maxRange;
for (int i = 0; i < M; ++i) {
maxRange = N - (((M - 1) - i) * K) - prevValue;
int nextValue = random(0, maxRange);
prevValue += nextValue;
store previous value;
prevValue += K;
}
This is still linear and random and allows unevenness, but the bigger prevValue gets, the more constrained the other numbers become. Personally, I prefer my second edit answer, but this is an available option that given a large enough N is likely to satisfy all the posted requirements.
Come to think of it, here's one other idea. It requires a lot more data maintenance, but is still O(M) and is probably the most fair distribution:
What you need to do is maintain a vector of your valid data ranges and a vector of probability scales. A valid data range is just the list of high-low values where K is still valid. The idea is you first use the scaled probability to pick a random data range, then you randomly pick a value within that range. You remove the old valid data range and replace it with 0, 1 or 2 new data ranges in the same position, depending on how many are still valid. All of these actions are constant time other than handling the weighted probability, which is O(M), done in a loop M times, so the total should be O(M^2), which should be much better than O(NlogN) because N >> M.
Rather than pseudocode, let me work an example using OP's original example:
0th iteration: valid data ranges are from [0...100Mill], and the weight for this range is 1.0.
1st iteration: Randomly pick one element in the one element vector, then randomly pick one element in that range.
If the element is, e.g. 12345678, then we remove the [0...100Mill] and replace it with [0...12344678] and [12346678...100Mill]
If the element is, e.g. 500, then we remove the [0...100Mill] and replace it with just [1500...100Mill], since [0...500] is no longer a valid range. The only time we will replace it with 0 ranges is in the unlikely event that you have a range with only one number in it and it gets picked. (In that case, you'll have 3 numbers in a row that are exactly K apart from each other.)
The weight for the ranges are their length over the total length, e.g. 12344678/(12344678 + (100Mill - 12346678)) and (100Mill - 12346678)/(12344678 + (100Mill - 12346678))
In the next iterations, you do the same thing: randomly pick a number between 0 and 1 and determine which of the ranges that scale falls into. Then randomly pick a number in that range, and replace your ranges and scales.
By the time it's done, we're no longer acting in O(M), but we're still only dependent on the time of M instead of N. And this actually is both uniform and fair distribution.
Hope one of these ideas works for you!

How to get 2 random (different) elements from a c++ vector

I would like to get 2 random different elements from an std::vector. How can I do this so that:
It is fast (it is done thousands of times in my algorithm)
It is elegant
The elements selection is really uniformly distributed
For elegance and simplicty:
void Choose (const int size, int &first, int &second)
{
// pick a random element
first = rand () * size / MAX_RAND;
// pick a random element from what's left (there is one fewer to choose from)...
second = rand () * (size - 1) / MAX_RAND;
// ...and adjust second choice to take into account the first choice
if (second >= first)
{
++second;
}
}
using first and second to index the vector.
For uniformness, this is very tricky since as size approaches RAND_MAX there will be a bias towards the lower values and if size exceeds RAND_MAX then there will be elements that are never chosen. One solution to overcome this is to use a binary search:
int GetRand (int size)
{
int lower = 0, upper = size;
do
{
int mid = (lower + upper) / 2;
if (rand () > RAND_MAX / 2) // not a great test, perhaps use parity of rand ()?
{
lower = mid;
}
else
{
upper = mid;
}
} while (upper != lower); // this is just to show the idea,
// need to cope with lower == mid and lower != upper
// and all the other edge conditions
return lower;
}
What you need is to generate M uniformly distributed random numbers from [0, N) range, but there is one caveat here.
One needs to note that your statement of the problem is ambiguous. What is meant by the uniformly distributed selection? One thing is to say that each index has to be selected with equal probability (of M/N, of course). Another thing is to say that each two-index combination has to be selected with equal probability. These two are not the same. Which one did you have in mind?
If M is considerably smaller than N, the classic algorithm for selecting M numbers out of [0, N) range is Bob Floyd algorithm that can be found in Bentley's "Programming Peals" book. It looks as follows (a sketch)
for (int j = N - M; i < N; ++j) {
int rand = random(0, j); // generate a random integer in range [0, j]
if (`rand` has not been generated before)
output rand;
else
output j;
}
In order to implement the check of whether rand has already been generated or not for relatively high M some kind of implementation of a set is necessary, but in your case M=2 it is straightforward and easy.
Note that this algorithm distributes the sets of M numbers uniformly. Also, this algorithm requires exactly M iterations (attempts) to generate M random numbers, i.e. it doesn't follow that flawed "trial-and-error" approach often used in various ad-hoc algorithms intended to solve the same problem.
Adapting the above to your specific situation, the correct algorithm will look as follows
first = random(0, N - 2);
second = random(0, N - 1);
if (second == first)
second = N - 1;
(I leave out the internal details of random(a, b) as an implementation detail).
It might not be immediately obvious why the above works correctly and produces a truly uniform distribution, but it really does :)
How about using a std::queue and doing std::random_shuffle on them. Then just pop til your hearts content?
Not elegant, but extreamly simple: just draw a random number in [0, vector.size()[ and check it's not twice the same.
Simplicity is also in some way elegance ;)
What do you call fast ? I guess this can be done thousands of times within a millisecond.
Whenever need something random, you are going to have various questions about the random number properties regarding uniformity, distribution and so on.
Assuming you've found a suitable source of randomness for your application, then the simplest way to generate pairs of uncorrelated entries is just to pick two random indexes and test them to ensure they aren't equal.
Given a vector of N+1 entries, another option is to generate an index i in the range 0..N. element[i] is choice one. Swap elements i and N. Generate an index j in the range 0..(N-1). element[j] is your second choice. This slowly shuffles your vector which may be problematical, but it can be avoided by using a second vector which holds indexes into the first, and shuffling that. This method trades a swap for the index comparison and tends to be more efficient for small vectors (a dozen or fewer elements, typically) as it avoids having to do multiple comparisons as the number of collisions increase.
You might wanna look into the gnu scientific library. There are some pretty nice random number generators in there that are guaranteed to be random down to the bit level.