Implementing Reservoir Sampling using Map Reduce - mapreduce

This link, http://had00b.blogspot.com/2013/07/random-subset-in-mapreduce.html, talks about how one can implement reservoir sampling using the MapReduce framework. I feel their solution is complicated and that the following simpler approach would work.
Problem:
Given a very large number of samples, generate a set of size k such that each sample has equal probability of being present in the set.
Proposed solution:
Map operation: For each input number x, output (i, x) where i is chosen uniformly at random in the range 0 to k-1.
Reduce operation: Among all numbers with same key, pick one randomly.
Claim:
Probability of any number being in set of k size is k/n (where n is total number of samples)
Proof intuition:
Since the map operation randomly assigns each input sample to a bucket number i (0 <= i <= k-1), the expected size of each bucket is n/k.
Now each number is present in only one bucket, say bucket i. The probability that it gets picked in the reduce operation for bucket i is 1/(n/k) = k/n.
I would appreciate any second thoughts on my solution, whether it seems correct or not.

There's a small flaw in your argument. Your algorithm may not return a sample of size k, since it may happen that none of the elements gets mapped to a specific key. In the extreme case (even if it has a small chance), all the input elements may get mapped to a single key, in which case your algorithm returns only one element.
The event of "missing" a specific key has probability ((k-1)/k)^n = (1-1/k)^n which is approximately (using Taylor approximation) e^{-n/k}. This is negligible if k is much smaller than n, but if k is proportional to n, say k=n/2, then this bad event actually happens with constant probability.
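To get a feel for the numbers, here is a small Python sketch. The helper names prob_key_missing and bucket_sample are made up for illustration, and bucket_sample is only a single-machine imitation of the proposed map and reduce steps, not a real MapReduce job. It compares the exact probability with the e^{-n/k} approximation and shows how many elements the scheme actually returns.

```python
import math
import random

def prob_key_missing(n, k):
    """Exact probability that one fixed key receives none of the n items."""
    return (1 - 1 / k) ** n

def bucket_sample(items, k, rng=random):
    """Imitate the proposed scheme: random key per item, one random pick per key."""
    buckets = {}
    for x in items:
        buckets.setdefault(rng.randrange(k), []).append(x)
    # Keys that received no item simply do not appear, so fewer than k
    # elements may come back.
    return [rng.choice(group) for group in buckets.values()]

n = 1000
for k in (10, 500):
    exact = prob_key_missing(n, k)
    approx = math.exp(-n / k)
    returned = len(bucket_sample(range(n), k))
    print(k, exact, approx, returned)
```

With k = n/2 the sample usually comes back noticeably smaller than k, which is the bad event described above.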

Related

Fast generation of random derangements

I am looking to generate derangements uniformly at random. In other words: shuffle a vector so that no element stays in its original place.
Requirements:
uniform sampling (each derangement is generated with equal probability)
a practical implementation is faster than the rejection method (i.e. keep generating random permutations until we find a derangement)
None of the answers I found so far are satisfactory in that they either don't sample uniformly (or fail to prove uniformity) or do not make a practical comparison with the rejection method. About 1/e = 37% of permutations are derangements, which gives a clue about what performance one might expect at best relative to the rejection method.
The only reference I found which makes a practical comparison is in this thesis which benchmarks 7.76 s for their proposed algorithm vs 8.25 s for the rejection method (see page 73). That's a speedup by a factor of only 1.06. I am wondering if something significantly better (> 1.5) is possible.
I could implement and verify various algorithms proposed in papers, and benchmark them. Doing this correctly would take quite a bit of time. I am hoping that someone has done it, and can give me a reference.
Here is an idea for an algorithm that may work for you. Generate the derangement in cycle notation. So (1 2) (3 4 5) represents the derangement 2 1 4 5 3. (That is (1 2) is a cycle and so is (3 4 5).)
Put the first element in the first place (in cycle notation you can always do this) and take a random permutation of the rest. Now we just need to find out where the parentheses go for the cycle lengths.
As https://mathoverflow.net/questions/130457/the-distribution-of-cycle-length-in-random-derangement notes, in a permutation, a random cycle is uniformly distributed in length. Cycle lengths are not uniformly distributed in derangements. But the number of derangements of length m is m!/e rounded up for even m and down for odd m. So what we can do is pick a length uniformly distributed in the range 2..n and accept it with the probability that the remaining elements would, proceeding randomly, be a derangement. This cycle length will be correctly distributed. And then once we have the first cycle length, we repeat for the next until we are done.
The procedure done the way I described is simpler to implement but mathematically equivalent to taking a random derangement (by rejection), and writing down the first cycle only. Then repeating. It is therefore possible to prove that this produces all derangements with equal probability.
With this approach done naively, we will be taking an average of 3 rolls before accepting a length. However we then cut the problem in half on average. So the number of random numbers we need to generate for placing the parentheses is O(log(n)). Compared with the O(n) random numbers for constructing the permutation, this is a rounding error. However it can be optimized by noting that the highest probability for accepting is 0.5. So if we accept with twice the probability of randomly getting a derangement if we proceeded, our ratios will still be correct and we get rid of most of our rejections of cycle lengths.
If most of the time is spent in the random number generator, for large n this should run at approximately 3x the rate of the rejection method. In practice it won't be as good because switching from one representation to another is not actually free. But you should get speedups of the order of magnitude that you wanted.
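A rough Python sketch of the cycle-by-cycle idea above. The function name random_derangement_by_cycles and the tables fact and d are my own naming, and the sketch uses the naive acceptance probability d(m)/m! without the doubling optimization, so treat it as an illustration rather than a tuned implementation.

```python
import random

def random_derangement_by_cycles(n, rng=random):
    """Return perm with perm[i] != i, built one cycle at a time."""
    if n == 1:
        raise ValueError("a single element has no derangement")
    # fact[m] = m!, d[m] = number of derangements of m elements.
    fact = [1] * (n + 2)
    d = [1, 0] + [0] * n
    for m in range(2, n + 1):
        fact[m] = fact[m - 1] * m
        d[m] = (m - 1) * (d[m - 1] + d[m - 2])

    order = list(range(n))
    rng.shuffle(order)          # elements in the order they appear in cycle notation
    perm = [None] * n
    start = 0
    while start < n:
        m = n - start           # elements not yet placed in a cycle
        while True:
            length = rng.randint(2, m)
            rest = m - length
            # Accept with probability d(rest)/rest!, i.e. the chance that a
            # random permutation of the leftover elements is a derangement.
            if rng.randrange(fact[rest]) < d[rest]:
                break
        cycle = order[start:start + length]
        for a, b in zip(cycle, cycle[1:] + cycle[:1]):
            perm[a] = b         # each element maps to its successor in the cycle
        start += length
    return perm
```

The parenthesis placement only costs a handful of extra random draws per cycle; the shuffle still dominates.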
This is just an idea, but I think it can produce uniformly distributed derangements.
You need a helper buffer with a maximum of around N/2 elements, where N is the number of items to be arranged.
First, choose a random(1,N-1) position for value 1 among the positions other than position 1.
Note: positions run from 1 to N instead of 0 to N-1 for simplicity.
Then for value 2, the position will be random(1,N-1) if 1 fell on position 2, and random(1,N-2) otherwise.
The algorithm walks the list and counts only the not-yet-used positions until it reaches the chosen random position for value 2; of course, position 2 is skipped.
For value 3, the algorithm checks whether position 3 is already used. If it is used, pos3 = random(1,N-2); if not, pos3 = random(1,N-3).
Again, the algorithm walks the list and counts only the not-yet-used positions until the count reaches pos3, and then places value 3 there.
This goes on for the next values until all the values have been placed.
And that will generate derangements with uniform probability.
The optimization will focus on how the algorithm can reach pos# fast.
Instead of walking the list to count the not-yet-used positions one by one, the algorithm can use a heap-like search for the positions not yet used, or any other method. This is a separate problem to be solved: how to reach an unused item given its position count in a list of unused items.
I'm curious ... and mathematically uninformed. So I ask innocently, why wouldn't a "simple shuffle" be sufficient?
for i from array_size-1 downto 1: # assume zero-based arrays
    j = random(0,i-1)
    swap_elements(i,j)
Since the random function will never produce a value equal to i it will never leave an element where it started. Every element will be moved "somewhere else."
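For reference, this shuffle is usually known as Sattolo's algorithm. A small Python sketch (the function name sattolo_shuffle is mine) makes it easy to check what it produces: every result is a derangement, but only permutations consisting of a single cycle can come out. For 4 elements that is 6 of the 9 derangements, so something like 1 0 3 2 is never generated and the sampling is not uniform over all derangements.

```python
import random

def sattolo_shuffle(items, rng=random):
    """Shuffle a copy of items so that the result is one single cycle."""
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randrange(i)        # 0 <= j < i, so j is never equal to i
        a[i], a[j] = a[j], a[i]
    return a

# Collect the distinct results for 4 elements.
outcomes = {tuple(sattolo_shuffle(range(4))) for _ in range(2000)}
print(len(outcomes))
```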
Let d(n) be the number of derangements of an array A of length n.
d(n) = (n-1) * (d(n-1) + d(n-2))
The d(n) arrangements are achieved by:
1. First, swapping A[0] with one of the remaining n-1 elements
2. Next, either deranging all n-1 remaining elements, or deranging
   the n-2 remaining elements that exclude the index
   that received A[0] from the initial array.
How can we generate a derangement uniformly at random?
1. Perform the swap of step 1 above.
2. Randomly decide which path we're taking in step 2,
with probability d(n-1)/(d(n-1)+d(n-2)) of deranging all remaining elements.
3. Recurse down to derangements of size 2-3 which are both precomputed.
Wikipedia has d(n) = floor(n!/e + 0.5) (exactly). You can use this to calculate the probability of step 2 exactly in constant time for small n. For larger n the factorial can be slow, but all you need is the ratio. It's approximately (n-1)/n. You can live with the approximation, or precompute and store the ratios up to the max n you're considering.
Note that (n-1)/n converges very quickly.
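Here is how I would sketch the recurrence-based sampler in Python. It is an iterative variant rather than a literal recursion, it works from the last index down instead of from A[0], and the names derangement_counts, random_derangement and closed are my own. It uses exact integer counts d(n) instead of the (n-1)/n approximation, which is fine for moderate n since Python integers do not overflow.

```python
import random

def derangement_counts(n):
    """d[m] = number of derangements of m elements, for m = 0..n."""
    d = [1, 0] + [0] * max(0, n - 1)
    for m in range(2, n + 1):
        d[m] = (m - 1) * (d[m - 1] + d[m - 2])
    return d

def random_derangement(n, rng=random):
    if n == 1:
        raise ValueError("a single element has no derangement")
    d = derangement_counts(n)
    perm = list(range(n))
    closed = [False] * n    # positions already finished as part of a closed cycle
    i = n - 1
    u = n                   # number of positions still in play
    while u >= 2:
        if not closed[i]:
            # Pick a partner uniformly among the open positions below i.
            while True:
                j = rng.randrange(i)
                if not closed[j]:
                    break
            perm[i], perm[j] = perm[j], perm[i]
            # With probability (u-1)*d(u-2)/d(u) the partner is finished too
            # (the "derange the n-2 others" branch); otherwise it stays open.
            if rng.randrange(d[u]) < (u - 1) * d[u - 2]:
                closed[j] = True
                u -= 1
            u -= 1
        i -= 1
    return perm
```

The inner retry loop for j is itself a small rejection step, so the gain over plain rejection depends on how expensive your random number generator is.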

What's the most efficient way to evenly fill an unsorted list of "buckets" of varying sizes

Suppose I have an unsorted list of buckets. (Each bucket has a size property.) Suppose I have a quantity Q that I must distribute across the list of buckets as evenly as possible (i.e. minimize the maximum).
If the buckets were sorted in increasing size, then the solution would be obvious: fully fill each bucket, say buckets[i], until Q/(buckets.length-i) <= buckets[i]->size, and then fill the remaining buckets with that same quantity, Q/(buckets.length-i), where Q is the quantity still left to distribute.
What's the most efficient way to solve this if the buckets aren't sorted?
I can only think of iterating like this (pseudocode):
while Q > 0
    for i in 0..buckets.length-1
        q = Q/(buckets.length-i)
        if q > buckets[i]->size
            q = buckets[i]->size
        buckets[i]->fill(q)
        Q -= q
But I'm not sure if there's a better way, or if sorting the list would be more efficient.
(The actual problem I face has more to it, e.g. this "unsorted" list is actually sorted by a separate property "rank", which determines which buckets would get the extra fills when quantities don't divide evenly, etc. So, for example, to use the sort-then-fill method, I'd sort the list by bucket size and rank. But knowing an answer to this would help me figure out the rest.)
In many cases, where the solution is "so simple" or "so effective" if the data was sorted, yet very complicated or ineffective in case it isn't, the best solution is very often to just sort the data first and then go for the simple, effective solution. Even though this means you will have the overhead of sorting the data first, there are plenty of very good sort algorithms available for pretty much any purpose and in many cases the total overhead of "first sorting the data and then applying a simple, effective algorithm to it" is still lower than "not sorting the data and applying a very complicated, ineffective algorithm to it".
The fact that you need the data sorted by a different key only means, to me, that you need two lists here, each one sorted by a different criterion. Unless we are talking about several thousand buckets, the memory overhead for a second list will most likely not be an issue (after all, both lists only contain pointers to your bucket objects, which means 4 or 8 bytes per pointer, depending on whether you have 32-bit or 64-bit code). One list has the buckets sorted by size, the other has the buckets sorted by "rank"; when adding new items as described in your question, you use the "sorted by size" list, while using the "sorted by rank" list the same way you are using it now.
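As a minimal sketch of the sort-then-fill approach from the question, in Python: the function name fill_sorted is made up, buckets are represented only by their capacities, and fractional fills are allowed, which may not match the real problem. Instead of reordering the buckets, it sorts an index list, which plays the role of the second "sorted by size" list.

```python
def fill_sorted(capacities, Q):
    """Return fill amounts in the original bucket order, minimizing the maximum fill."""
    # Indices of the buckets from smallest to largest capacity.
    order = sorted(range(len(capacities)), key=lambda i: capacities[i])
    fills = [0.0] * len(capacities)
    remaining = float(Q)
    for pos, i in enumerate(order):
        # Even share of what is left among the buckets not yet handled.
        share = remaining / (len(order) - pos)
        amount = min(capacities[i], share)
        fills[i] = amount
        remaining -= amount
    return fills

print(fill_sorted([5, 1, 3, 10], 12))
```

If Q is larger than the total capacity, every bucket ends up full and the excess is simply not placed.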
I think it might be possible in linear time, however I'm stuck at a certain point. Maybe you can solve the problem, maybe it can't be solved this way.
Consider the following algorithm.
Based on binary search, we want to find the smallest bucket which isn't fully filled. Finding such a bucket in a list of bucket is maybe possible in linear time, but as I said, I'm stuck here. Once we found that bucket, the rest becomes trivial, since for all smaller buckets we sum up their sizes, subtract it from the total number of items to be placed, divide this by the number of buckets larger or equal to the bucket we just found.
So the following is an attempt to solve the problem: What is the smallest bucket which isn't fully filled? The algorithm is motivated by QuickSelect.
Pick a pivot bucket. See if it is smaller or larger than the bucket we are looking for. (This step is trivial.)
If it is smaller, sum up the sizes of all buckets smaller than or equal to this one, subtract this sum from the total number of items and continue the search on the set containing all larger buckets.
If it is larger, we would have to do a similar thing, but now subtract the number of items which are placed in all buckets larger than this one. We don't know the number of items to be placed in these buckets. This is the problem... But if we knew, we'd continue the search on the set containing all smaller buckets.
If this algorithm worked, it would run in expected linear time for random pivot elements (see QuickSelect).
If you can determine q, the appropriate level to fill each bucket to such that the total is Q, then the linear solution is clear:
for (bucket b : buckets)
{
    int f = min(b.capacity(), q);
    b.fill(f);
}
So the problem is determining that level q.
You could binary search for q. That is, we know q is an integer between min(b.capacity) and max(b.capacity), i.e.:
1. start with a candidate q' halfway between min(capacity) and max(capacity)
2. make a pass over the buckets, calculating the total amount Q' that results from using q'
3. if (Q' > Q) then repeat with q' moved down by half the previous step
4. if (Q' < Q) then repeat with q' moved up by half the previous step
5. return q = q'
Each pass of step 2 is O(N), and there will be log(L) passes where L = max(capacity) - min(capacity)
This works better than sorting when L << N
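A minimal Python sketch of the binary search for an integer level. The name level_by_bisection is mine, and I start the search at 0 rather than at the smallest capacity so that a Q too small to reach the smallest bucket's capacity is also handled; that is my own adjustment, not part of the description above.

```python
def level_by_bisection(capacities, Q):
    """Largest integer level q such that sum(min(c, q)) <= Q."""
    lo, hi = 0, max(capacities)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        # One O(N) pass: total placed if every bucket is filled up to mid.
        placed = sum(min(c, mid) for c in capacities)
        if placed <= Q:
            lo = mid        # mid is feasible, try higher
        else:
            hi = mid - 1    # mid overshoots, go lower
    return lo

caps = [5, 1, 3, 10]
q = level_by_bisection(caps, 13)
leftover = 13 - sum(min(c, q) for c in caps)
print(q, leftover)
```

Any leftover smaller than the number of non-full buckets still has to be handed out one unit at a time, for example by rank as the question mentions.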
A sufficient statistic is to reduce the buckets to a histogram:
unordered_map<int,int> bucket_capacity;
for (bucket b : buckets)
    bucket_capacity[b.capacity()]++;
This is still linear. In the worst case it doesn't gain us much, because the buckets may all have distinct sizes, but it bounds the work per pass by L, so the efficiency is now O(min(L,N) * log L).
Again, this works well when L << N: the efficiency becomes O(L log L).
I suspect the following is true, but am not 100% sure: In the case where L >> N it can be shown that there is no linear solution. First we assume we have a linear solution. We then use this solution as a tool to do a comparison sort in linear time. It has been shown that comparison sort is impossible in linear time, therefore our assumption must be false, and there is no linear solution.
In one step, you start with n unsorted buckets of finite capacity, k infinite buckets (you store k, not a list of those, and at the first iteration k=0), and an amount of water w. In O(n) time, we are going to reduce the problem to another instance with n', k', w' where n' < c * n for a constant c < 1. Iterating this procedure will solve the problem (once n is a constant, you can solve it in constant time) in linear time: n+c*n+c^2*n+...=O(n).
Among all n finite capacities, pick the median (i.e. pick one such that half of the capacities are higher and half are lower). This can be done in O(n) time (selection algorithm). Compute the sum of 1) the lower capacities and 2) the median capacity multiplied by the number of buckets of higher capacity (including the infinite ones).
If that's less than w, you know you will need to fill the buckets higher, so in particular all the lower capacity buckets will be filled. Remove them, remove the sum of their capacities from w, and you are done for this iteration, n'=n/2.
If on the other hand the sum is larger than w, you know that no bucket will be filled to the median capacity or higher. All buckets of higher capacity can thus be removed and their number added to the number of infinite buckets. w remains unchanged. Again, n'=n/2, and we are done.
A few easy details are skipped (in particular how to handle the case where many buckets have exactly the same capacity) to keep it short. You also need some cleanup at the end, once you know the right level of water, to set it for each "infinite" (i.e. non-full) bucket.
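The reduction above can be sketched in Python roughly as follows. Two things are assumptions on my side: the name water_level, and the use of a random pivot in place of the exact median, which gives expected rather than worst-case linear time but keeps the code short. Buckets with capacity equal to the pivot are counted separately, which is one way to handle the equal-capacity detail.

```python
import random

def water_level(capacities, w, rng=random):
    """Level t with sum(min(c, t)) == w, assuming 0 <= w <= sum(capacities)."""
    finite = list(capacities)
    k = 0                        # number of "infinite" buckets (known not to fill)
    while finite:
        pivot = rng.choice(finite)
        lower = [c for c in finite if c < pivot]
        higher = [c for c in finite if c > pivot]
        equal = len(finite) - len(lower) - len(higher)
        # Water needed to bring every remaining bucket up to the pivot level.
        needed = sum(lower) + pivot * (equal + len(higher) + k)
        if needed <= w:
            # Level is at least pivot: buckets up to the pivot capacity are full.
            w -= sum(lower) + pivot * equal
            finite = higher
        else:
            # Level is below pivot: these buckets will never fill up.
            k += equal + len(higher)
            finite = lower
    if k == 0:
        return float(max(capacities))   # everything is full
    return w / k

level = water_level([5, 1, 3, 10], 12)
print(level, [min(c, level) for c in [5, 1, 3, 10]])
```

The final list comprehension is the cleanup pass: once the level is known, each bucket gets the smaller of its capacity and the level.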
An alternative idea would be as follows. Determine the average number of items per buckets. Then try to fill all buckets with that number (not all buckets can hold that number of items, in general).
Afterwards, you have a number of remaining items to be placed in buckets (because not all have fit in the previous iteration) as well as a list of buckets which can hold more items than they currently contain (calculated in the previous iteration).
Again, calculate the average number of items to be distributed on that remaining buckets based on the remaining number of items to be distributed.
Repeat until you placed all items.
I expect a running time of O(n * log n), but didn't analyze it. It's the same running time as your sort-then-fill method; however, it is expected to be lower if your buckets have only a limited number of different sizes, like: some are small, some are big, some are huge.
Why do you need the list of buckets to be sorted?
Just iterate over the buckets twice.
The first time count up all the sizes. From that you can say, "I want K items in every bucket"
Second time through, fill up the buckets.

How to efficiently *nearly* sort a list?

I have a list of items; I want to sort them, but I want a small element of randomness so they are not strictly in order, only on average ordered.
How can I do this most efficiently?
I don't mind if the quality of the randomness is not especially good, e.g. if it is simply based on the chance ordering of the input, as in an early-terminated incomplete sort.
The context is implementing a nearly-greedy search by introducing a very slight element of inexactness; this is in a tight loop, so the cost of sorting and of calling random() both need to be considered.
My current code is to do a std::sort (this being C++) and then do a very short shuffle just in the early part of the array:
for(int i=0; i<3; i++) // I know I have more than 6 elements
    std::swap(order[i],order[i+rand()%3]);
Use first two passes of JSort. Build heap twice, but do not perform insertion sort. If element of randomness is not small enough, repeat.
There is an approach that (unlike incomplete JSort) allows finer control over the resulting randomness and has time complexity dependent on randomness (the more random result is needed, the less time complexity). Use heapsort with Soft heap. For detailed description of the soft heap, see pdf 1 or pdf 2.
You could use a standard sort algorithm (is a standard library available?) and pass a predicate that "knows", given two elements, which is less than the other, or if they are equal (returning -1, 0 or 1). In the predicate then introduce a rare (configurable) case where the answer is random, by using a random number:
pseudocode:
if random(1000) == 0 then
    return random(2)-1 <-- -1, 0 or 1, randomly chosen
Here we have a 1/1000 chance to "scramble" two elements, but that number strictly depends on the size of the container you sort.
Another thing to add in the 1000 case could be to remove the "right" answer, because returning it would not scramble the result!
Edit:
if random(100 * container_size) == 0 then <-- here I consider the container size
{
    if element_1 < element_2
        return random(1); <-- do not return the "correct" value of -1
    else if element_1 > element_2
        return random(1)-1; <-- do not return the "correct" value of 1
    else
        return random(1)==0 ? -1 : 1; <-- do not return 0
}
in my pseudocode:
random(x) = y where 0 <= y <=x
One possibility that requires a bit more space but would guarantee that existing sort algorithms could be used without modification would be to create a copy of the sort value(s) and then modify those in some fashion prior to sorting (and then use the modified value(s) for the sort).
For example, if the data to be sorted is a simple character field Name[N] then add a field (assuming data is in a structure or class) called NameMod[N]. Fill in the NameMod with a copy of Name but add some randomization. Then 3% of the time (or some appropriate amount) change the first character of the name (e.g., change it by +/- one or two characters). And then 10% of the time change the second character +/- a few characters.
Then run it through whatever sort algorithm you prefer. The benefit is that you could easily change those percentages and randomness. And the sort algorithm will still work (e.g., it would not have problems with the compare function returning inconsistent results).
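A minimal Python sketch of the same idea for numeric sort values, rather than the character fields in the example above. The function name nearly_sorted and the jitter parameter are hypothetical. Each value gets a perturbed copy of its key, and an ordinary sort runs on the perturbed keys.

```python
import random

def nearly_sorted(values, jitter, rng=random):
    """Sort by value plus uniform noise in [-jitter, +jitter]."""
    # Build (perturbed key, original value) pairs once, before sorting,
    # so the comparison results stay consistent during the sort.
    keyed = [(v + rng.uniform(-jitter, jitter), v) for v in values]
    keyed.sort(key=lambda pair: pair[0])
    return [v for _, v in keyed]

print(nearly_sorted(range(20), 1.5))
```

With this kind of noise, two elements can only end up out of order if their values differ by at most 2 * jitter, so the jitter parameter directly controls how far the result can stray from sorted order.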
If you are sure that every element is at most k positions away from where it should be, you can reduce the N log(N) sorting time complexity of quicksort down to N log(k)....
edit
More specifically, you would create k buckets, each containing N/k elements.
You can do quick sort for each bucket, which takes k * log(k) times, and then sort N/k buckets, which takes N/k log(N/k) time. Multiplying these two, you can do sorting in N log(max(N/k,k))
This can be useful because you can run sorting for each bucket in parallel, reducing total running time.
This works if you are sure that any element in the list is at most k indices away from its correct position after sorting.
But I do not think you meant any such restriction.
Split the list into two equally-sized parts. Sort each part separately, using any usual algorithm. Then merge these parts. Perform some merge iterations as usual, comparing merged elements. For other merge iterations, do not compare the elements, but instead select element from the same part, as in the previous step. It is not necessary to use RNG to decide, how to treat each element. Just ignore sorting order for every N-th element.
Other variant of this approach nearly sorts an array nearly in-place. Split the array into two parts with odd/even indexes. Sort them. (It is even possible to use standard C++ algorithm with appropriately modified iterator, like boost::permutation_iterator). Reserve some limited space at the end of the array. Merge parts, starting from the end. If merged part is going to overwrite one of the non-merged elements, just select this element. Otherwise select element in sorted order. Level of randomness is determined by the amount of reserved space.
Assuming you want the array sorted in ascending order, I would do the following:
for M iterations
pick a random index i
pick a random index k
if (i<k)!=(array[i]<array[k]) then swap(array[i],array[k])
M controls the "sortedness" of the array - as M increases the array becomes more and more sorted. I would say a reasonable value for M is n^2 where n is the length of the array. If it is too slow to pick random elements then you can precompute their indices beforehand. If the method is still too slow then you can always decrease M at the cost of getting a poorer sort.
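The loop above translates almost directly into Python. This is only a sketch: the function name partial_sort is mine, and it works on a copy instead of sorting in place.

```python
import random

def partial_sort(array, iterations, rng=random):
    """Apply `iterations` random compare-and-swap steps to a copy of array."""
    a = list(array)
    n = len(a)
    if n == 0:
        return a
    for _ in range(iterations):
        i = rng.randrange(n)
        k = rng.randrange(n)
        # Swap when the order of the indices disagrees with the order of the values.
        if (i < k) != (a[i] < a[k]):
            a[i], a[k] = a[k], a[i]
    return a

data = [random.randrange(1000) for _ in range(30)]
print(partial_sort(data, 30))
print(partial_sort(data, 30 * 30))
```

Each swap removes at least one out-of-order pair and never creates a net increase, so more iterations can only move the array closer to sorted.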
Take a small random subset of the data and sort it. You can use this as a map to provide an estimate of where every element should appear in the final nearly-sorted list. You can scan through the full list now and move/swap elements that are not in a good position.
This is basically O(n), assuming the small initial sorting of the subset doesn't take a long time. Hopefully you can build the map such that the estimate can be extracted quickly.
Bubblesort to the rescue!
For an unsorted array, you could pick a few random elements and bubble them up or down. (maybe by rotation, which is a bit more efficient) It will be hard to control the amount of (dis)order, even if you pick all N elements, you are not sure that the whole array will be sorted, because elements are moved and you cannot ensure that you touched every element only once.
BTW: this kind of problem tends to occur in game playing engines, where the list with candidate moves is kept more-or-less sorted (because of weighted sampling), and sorting after each iteration is too expensive, and only one or a few elements are expected to move.

Fastest way for a random unique subset of C++ tr1 unordered_set

This question is related to
this one, and more precisely to this answer to it.
Here goes: I have a C++/TR1 unordered_set U of unsigned ints (rough cardinality 100-50000, rough value range 0 to 10^6).
Given a cardinality N, I want to as quickly as possible iterate over N random but
unique members of U. There is no typical value for N, but it should
work fast for small N.
In more detail, the notion of "randomness" here is
that two calls should produce somewhat different subsets -- the more different,
the better, but this is not too crucial. I would e.g. be happy with a continuous
(or wrapped-around continuous)
block of N members of U, as long as the start index of the block is random.
Non-continuous at the same cost is better, but the main concern is speed. U changes
mildly, but constantly between calls (ca. 0-10 elements inserted/erased between calls).
How far I've come:
Trivial approach: Pick random index i such that (i+N-1) < |U|.
Get an iterator it to U.begin(), advance it i times using it++, and then start
the actual loop over the subset. Advantage: easy. Disadvantage: waste of ++'es.
The bucket approach (and this I've "newly" derived from above link):
Pick i as above, find the bucket b in which the i-th element is in, get a local_iterator lit
to U.begin(b), advance lit via lit++ until we hit the i-th element of U, and from then on keep incrementing lit for N times. If we hit the end of the bucket,
we continue with lit from the beginning of the next bucket. If I want to make it
more random I can pick i completely random and wrap around the buckets.
My open questions:
For point 2 above, is it really the case that I cannot somehow get an
iterator into U once I've found the i-th element? This would spare me
the bucket boundary control, etc. For me as quite a
beginner, it seems inconceivable that the standard forward iterator should know how to
continue traversing U when at the i-th item, but when I found the i-th item myself,
it should not be possible to traverse U other than through point 2 above.
What else can I do? Do you know anything even much smarter/more random? If possible, I don't want to get involved in manual
control of bucket sizes, hash functions, and the like, as this is a bit over my head.
Depending on what runtime guarantees you want, there's a famous O(n) algorithm for picking k random elements out of a stream of numbers in one pass. To understand the algorithm, let's see it first for the case where we want to pick just one element out of the set, then we'll generalize it to work for picking k elements. The advantage of this approach is that it doesn't require any advance knowledge of the size of the input set and guarantees provably uniform sampling of elements, which is always pretty nice.
Suppose that we want to pick one element out of the set. To do this, we'll make a pass over all of the elements in the set and at each point will maintain a candidate element that we're planning on returning. As we iterate across the list of elements, we'll update our guess with some probability until at the very end we've chosen a single element with uniform probability. At each point, we will maintain the following invariant:
After seeing k elements, the probability that any of the first k elements is currently chosen as the candidate element is 1 / k.
If we maintain this invariant across the entire array, then after seeing all n elements, each of them has a 1 / n chance of being the candidate element. Thus the candidate element has been sampled with uniformly random probability.
To see how the algorithm works, let's think about what it has to do to maintain the invariant. Suppose that we've just seen the very first element. To maintain the above invariant, we have to choose it with probability 1, so we'll set our initial guess of the candidate element to be the first element.
Now, when we come to the second element, we need to hold the invariant that each element is chosen with probability 1/2, since we've seen two elements. So let's suppose that with probability 1/2 we choose the second element. Then we know the following:
The probability that we've picked the second element is 1/2.
The probability that we've picked the first element is the probability that we chose it the first time around (1) times the probability that we didn't just pick the second element (1/2). This comes out to 1/2 as well.
So at this point the invariant is still maintained! Let's see what happens when we come to the third element. At this point, we need to ensure that each element is picked with probability 1/3. Well, suppose that with probability 1/3 we choose the third element. Then we know that
The probability that we've picked the third element is 1/3.
The probability that we've picked either of the first two elements is the probability that it was chosen after the first two steps (1/2) times the probability that we didn't choose the third element (2/3). This works out to 1/3.
So again, the invariant holds!
The general pattern here looks like this: After we've seen k elements, each of the elements has a 1/k chance of being picked. When we see the (k + 1)st element, we choose it with probability 1 / (k + 1). This means that it's chosen with probability 1 / (k + 1), and all of the elements before it are chosen with probability equal to the odds that we picked it before (1 / k) and didn't pick the (k + 1)st element this time (k / (k + 1)), which gives those elements each a probability of 1 / (k + 1) of being chosen. Since this maintains the invariant at each step, we've got ourselves a great algorithm:
Choose the first element as the candidate when you see it.
For each successive element, replace the candidate element with that element with probability 1 / k, where k is the number of elements seen so far.
This runs in O(n) time, requires O(1) space, and gives back a uniformly-random element out of the data stream.
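As a quick illustration, here is the single-element version sketched in Python (the name pick_one is just for this example, and it returns None for an empty stream, which is my own choice).

```python
import random

def pick_one(stream, rng=random):
    """Return one uniformly random element from an iterable of unknown length."""
    candidate = None
    for seen, item in enumerate(stream, start=1):
        # Replace the candidate with probability 1/seen; the first element
        # is always taken because randrange(1) is always 0.
        if rng.randrange(seen) == 0:
            candidate = item
    return candidate

print(pick_one(iter("abcdef")))
```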
Now, let's see how to scale this up to work if we want to pick k elements out of the set, not just one. The idea is extremely similar to the previous algorithm (which actually ends up being a special case of the more general one). Instead of maintaining one candidate, we maintain k different candidates, stored in an array that we number 1, 2, ..., k. At each point, we maintain this invariant:
After seeing m >= k elements, the probability that any of the first m elements is chosen is
k / m.
If we scan across the entire array, this means that when we're done, each element has probability k / n of being chosen. Since we're picking k different elements, this means that we sample the elements out of the array uniformly at random.
The algorithm is similar to before. First, choose the first k elements out of the set with probability 1. This means that when we've seen k elements, the probability that any of them have been picked is 1 = k / k and the invariant holds. Now, assume inductively that the invariant holds after m iterations and consider the (m + 1)st iteration. Choose a random number between 1 and (m + 1), inclusive. If we choose a number between 1 and k (inclusive), replace that candidate element with the next element. Otherwise, do not choose the next element. This means that we pick the next element with probability k / (m + 1) as required. The probability that any of the first m elements are chosen is then the probability that they were chosen before (k / m) times the probability that we didn't choose the slot containing that element (m / (m + 1)), which gives a total probability of being chosen of k / (m + 1) as required. By induction, this proves that the algorithm perfectly uniformly and randomly samples k elements out of the set!
Moreover, the runtime is O(n), which is proportional to the size of the set, which is completely independent of the number of elements you want to choose. It also uses only O(k) memory and makes no assumptions whatsoever about the type of the elements being stored.
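For readers who just want to see the shape of it, here is a short Python sketch of the k-element version (the question is about C++, so read this as pseudocode that happens to run; the name reservoir_sample is made up). Slots are numbered from 0 here, so the random number is drawn from 0..m-1 instead of 1..m.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k uniformly chosen elements from an iterable, in one pass."""
    reservoir = []
    for m, item in enumerate(stream, start=1):
        if m <= k:
            reservoir.append(item)      # the first k elements are always kept
        else:
            slot = rng.randrange(m)     # uniform in 0..m-1
            if slot < k:                # happens with probability k/m
                reservoir[slot] = item
    return reservoir

print(reservoir_sample(range(1000), 5))
```

If the stream has fewer than k elements, the sketch simply returns all of them.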
Since you're trying to do this for C++, as a shameless self-promotion, I have an implementation of this algorithm (written as an STL algorithm) available here on my personal website. Feel free to use it!
Hope this helps!

How to ensure that randomly generated numbers are not being repeated? [duplicate]

Possible Duplicates:
Unique (non-repeating) random numbers in O(1)?
How do you efficiently generate a list of K non-repeating integers between 0 and an upper bound N
I want to generate random numbers in a certain range, and I must be sure that each new number is not a duplicate of a previous one. One solution is to store previously generated numbers in a container and check each new number against the container. If the number is already in the container, we generate again; otherwise we use it and add it to the container. But with each new number this operation becomes slower and slower. Is there any better approach, or any rand function that can work faster and ensure uniqueness of the generated numbers?
EDIT: Yes, there is a limit (for example from 0 to 1.000.000.000). But I want to generate 100.000 unique numbers! (Would be great if the solution will be by using Qt features.)
Is there a range for the random numbers? If you have a limit for random numbers and you keep generating unique random numbers, then you'll end up with a list of all numbers from x..y in random order, where x-y is the valid range of your random numbers. If this is the case, you might improve speed greatly by simply generating the list of all numbers x..y and shuffling it, instead of generating the numbers.
I think there are 3 possible approaches; which one fits depends on the size of the range and the performance pattern you need.
1. Create a random number and see if it is in a (sorted) list. If not, add and return it; otherwise try another.
   - Your list will grow and consume memory with every number you need. If every number is 32 bit, it will grow by at least 32 bits every time.
   - Every new random number increases the hit ratio, and this will make it slower.
   - O(n^2) - I think
2. Create a bit array with one bit for every number in the range. Mark a bit with 1/True if that number was already returned.
   - Every number now only takes 1 bit; this can still be a problem if the range is big.
   - Every new random number increases the hit ratio, and this will make it slower.
   - O(n*2)
3. Pre-populate a list with all the numbers, shuffle it, and return the Nth number.
   - The list will not grow and returning numbers will not get slower, but generating the list might take a long time and a lot of memory.
   - O(1)
Depending on needed speed, you could store all lists in a database. There's no need for them to be in memory except speed.
Fill out a list with the numbers you need, then shuffle the list and pick your numbers from one end.
If you use a simple 32-bit linear congruential RNG (such as the so-called "Minimal Standard"), all you have to do is store the seed value you use and compare each generated number to it. If you ever reach that value again, your sequence is starting to repeat itself and you're out of values. This is O(1), but of course limited to 2^32-1 values (though I suppose you could use a 64-bit version as well).
There is a class of pseudo-random number generators that, I believe, has the properties you want: the Linear congruential generator. If defined properly, it will produce a list of integers from 0 to N-1, with no two numbers repeating until you've used all of the numbers in the list once.
#include <stdint.h>
/*
* Choose these values as follows:
*
* The MODULUS and INCREMENT must be relatively prime.
* The MULTIPLIER-1 must be divisible by all prime factors of the MODULUS.
* The MULTIPLIER-1 must be divisible by 4, if the MODULUS is divisible by 4.
*
* In addition, modulus must be <= 2**32 (0x0000000100000000ULL).
*
* A small example would be 8, 5, 3.
* A larger example would be 256, 129, 251.
* A useful example would be 0x0000000100000000ULL, 1664525, 1013904223.
*/
#define MODULUS (0x0000000100000000ULL)
#define MULTIPLIER (1664525)
#define INCREMENT (1013904223)
static uint64_t seed;
uint32_t lcg( void ) {
    uint64_t temp;
    temp = seed * MULTIPLIER + INCREMENT; // 64-bit intermediate product
    seed = temp % MODULUS; // 32-bit end-result
    return (uint32_t) seed;
}
All you have to do is choose a MODULUS such that it is larger than the number of numbers you'll need in a given run.
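To see the "no repeats until every value has been used" property concretely, here is a small C++ sketch that runs the generator with the small example parameters from the comment above (modulus 8, multiplier 5, increment 3). The helper names `lcg_step` and `lcg_sequence` are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One LCG step: x -> (a * x + c) mod m. Hypothetical helper for illustration.
std::uint64_t lcg_step(std::uint64_t x, std::uint64_t m,
                       std::uint64_t a, std::uint64_t c) {
    // With m, a and c all at most 2^32, a * x + c cannot overflow 64 bits.
    assert(m > 0 && m <= (1ULL << 32) && a < (1ULL << 32) && c < (1ULL << 32));
    return (a * (x % m) + c) % m;
}

// Returns the first `count` outputs produced from `seed`.
std::vector<std::uint64_t> lcg_sequence(std::uint64_t seed, std::uint64_t m,
                                        std::uint64_t a, std::uint64_t c,
                                        std::size_t count) {
    std::vector<std::uint64_t> out;
    out.reserve(count);
    std::uint64_t x = seed;
    for (std::size_t i = 0; i < count; ++i) {
        x = lcg_step(x, m, a, c);
        out.push_back(x);
    }
    return out;
}
```

Calling `lcg_sequence(0, 8, 5, 3, 8)` yields 3, 2, 5, 4, 7, 6, 1, 0: each of the 8 values appears exactly once before the sequence wraps around. Note that the order is fully determined by the parameters and the seed.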
It wouldn't be random if there is such a pattern?
As far as I know you would have to store and filter all unwanted numbers...
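#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

unsigned int N = 1000;
std::vector<unsigned int> vals(N);
std::iota(vals.begin(), vals.end(), 0u); // fill with 0, 1, ..., N-1
// std::random_shuffle was removed in C++17; std::shuffle takes an explicit engine
std::mt19937 gen(std::random_device{}());
std::shuffle(vals.begin(), vals.end(), gen);
unsigned int random_number_1 = vals[0];
unsigned int random_number_2 = vals[1];
unsigned int random_number_3 = vals[2];
//etc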
You could store the numbers in a vector and get them by a random index in 0..n-1. After each draw, remove the indexed number from the vector, then generate the next index in the interval 0..n-2, and so on.
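A possible C++ sketch of this draw-and-remove idea is shown below. Erasing from the middle of a vector costs O(n), so this version swaps the chosen element to the back and pops it instead; that swap is a substitution made for this example, not something the answer specifies, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Removes and returns a uniformly chosen element of `pool`.
// Hypothetical helper; the pool's order is not preserved.
template <typename Rng>
unsigned int draw_without_replacement(std::vector<unsigned int>& pool, Rng& rng) {
    assert(!pool.empty());
    std::uniform_int_distribution<std::size_t> pick(0, pool.size() - 1);
    std::size_t i = pick(rng);
    std::swap(pool[i], pool.back());  // move the chosen element to the end
    unsigned int value = pool.back();
    pool.pop_back();                  // O(1) removal
    return value;
}
```

Each call shrinks the pool by one, so no value can be returned twice.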
If they can't be repeated, they aren't random.
EDIT:
Furthermore..
if they can't be repeated, they don't fit in a finite computer
How many random numbers do you need? Maybe you can apply a shuffle algorithm to a precalculated array of random numbers?
There is no way a random generator will output values depending on previously outputted values, because they wouldn't be random. However, you can improve performance by using different pools of random values each with values combined by a different salt value, which will divide the quantity of numbers to check by the quantity of pools you have.
If the range of the random numbers doesn't matter, you could use a really large range and hope you don't get any collisions. If your range is billions of times larger than the number of elements you expect to create, your chances of a collision are small but still there. If the numbers don't need to have an actual random distribution, you could use a two-part number {counter}{random x digits}, which would ensure a unique number, but it wouldn't be randomly distributed.
There's not going to be a pure functional approach that isn't O(n^2) on the number of results returned so far - every time a number is generated you will need to check against every result so far. Additionally, think about what happens when you're returning e.g. the 1000th number out of 1000 - you will require on average 1000 tries until the random algorithm comes up with the last unused number, with each attempt requiring an average of 499.5 comparisons with the already-generated numbers.
It should be clear from this that your description as posted is not quite exactly what you want. The better approach, as others have said, is to take a list of e.g. 1000 numbers upfront, shuffle it, and then return numbers from that list incrementally. This will guarantee you're not returning any duplicates, and return the numbers in O(1) time after the initial setup.
You can allocate an array of bits with 1 bit for each possible number, and check/set the bit for every generated number. For example, for numbers from 0 to 65535 you will need only 8192 bytes (8 KB) of memory.
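As a rough C++ sketch of the bit-array idea, `std::vector<bool>` can serve as the packed array of bits. The function name and parameters below are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Draws `count` distinct values from [0, range) using one bit per candidate.
// Hypothetical helper for illustration.
template <typename Rng>
std::vector<std::uint32_t> unique_randoms_bitmap(std::uint32_t range,
                                                 std::size_t count, Rng& rng) {
    assert(range > 0 && count <= range);
    std::vector<bool> used(range, false);  // packed: roughly range / 8 bytes
    std::uniform_int_distribution<std::uint32_t> dist(0, range - 1);
    std::vector<std::uint32_t> out;
    out.reserve(count);
    while (out.size() < count) {
        std::uint32_t v = dist(rng);
        if (!used[v]) {       // retry on a collision
            used[v] = true;
            out.push_back(v);
        }
    }
    return out;
}
```

Retries stay rare as long as `count` is small compared with `range`; for the range in the question (0 to 1.000.000.000) the bit array alone would need about 125 MB.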
Here's an interesting solution I came up with:
Assume you have numbers 1 to 1000 - and you don't have enough memory.
You could put all 1000 numbers into an array, and remove them one by one, but you'll get memory overflow error.
You could split the array in two, so you have an array of 1-500 and one empty array
You could then check if the number exists in array 1, or doesn't exist in the second array.
So assuming you have 1000 numbers, you can get a random number from 1-1000. If it's less than 500, check array 1 and remove it if present. If it's NOT in array 2, you can add it.
This halves your memory usage.
If you propagate this using recursion, you can split your 500 array into a 250 and empty array.
Assuming empty arrays use no space, you can decrease your memory usage quite a bit.
Searching will be massively faster too, because if you break it down a lot, you generate a number such as 29. It's less than 500, less than 250, less than 125, less than 62, less than 31, greater than 15, so you do those 6 calculations, then check the array containing an average of 16/2 items - 8 in total.
I should patent this search, although I bet it already exists!
Especially given the desired number of values, you want a Linear Feedback Shift Register.
Why?
No shuffle step, nor a need to keep track of values you've already hit. As long as you go less than the full period, you should be fine.
It turns out that the Wikipedia article has some C++ code examples which are more tested than anything I would give you off the top of my head. Note that you'll want to be pulling values from inside the loops -- the loops just iterate the shift register through. You can see this in the snippet here.
(Yes, I know this was mentioned, briefly in the dupe -- saw it as I was revising. Given it hasn't been brought up here and is the best way to solve the poster's question, I think it should be brought up again.)
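For reference, a 16-bit Galois LFSR step can be sketched in C++ as below. The tap mask 0xB400 and the start state 0xACE1 are the commonly quoted textbook example for a maximal-length 16-bit register; the function name is made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// One step of a 16-bit Galois LFSR with tap mask 0xB400.
// Hypothetical helper; the state must never be zero.
std::uint16_t lfsr16_next(std::uint16_t state) {
    assert(state != 0);           // zero would map to zero forever
    unsigned lsb = state & 1u;    // bit that is shifted out
    state >>= 1;
    if (lsb) state ^= 0xB400u;    // apply the feedback taps
    return state;
}
```

Starting from 0xACE1, repeated calls visit every nonzero 16-bit value once before returning to the start, so any run shorter than 65535 steps contains no duplicates. The 100.000 values in the question would need a wider register with its own maximal-length tap mask, which is not given here.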
Let's say size = 100.000; create an array of that size. Generate random numbers and put them into the array. The question is which index a number should go to: randomNumber % size gives you the index.
When you add the next number, compute its index the same way and check whether that slot is already occupied. If it is empty, store the number; if it is occupied, generate a new number and try again. This is a very fast way to build the list. The disadvantage is that the list can never contain two numbers whose last section (the remainder modulo size) is the same.
For example, take two numbers with the same last section:
1231232444556
3458923444556
You will never have both of these in your list, even though they are different numbers, because their last sections are the same.
First off, there's a huge difference between random and pseudorandom. There's no way to generate perfectly random numbers from a deterministic process (such as a computer) without bringing in some physical process like latency between keystrokes or another entropy source.
The approach of saving all the numbers generated will slow down the computation rather quickly; the more numbers you have, the larger your storage needs, until you've filled up all available memory. A better method would be (as someone's already suggested) using a well-known pseudorandom number generator such as the Linear Congruential Generator; it's super fast, requiring only modular multiplication and addition, and the theory behind it gets a lot of mention in Vol. 2 of Knuth's TAOCP. That way, the theory involved guarantees a rather large period before repetition, and the only storage needed is the parameters and the seed used.
If it is acceptable that each value can be calculated from the previous one, LFSR and LCG are fine. If you don't want one output value to be computable from another, you can use a block cipher in counter mode to generate the output sequence, provided that the cipher's block length is equal to the output length.
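The structural idea (feed a counter through a keyed permutation, so distinct counters give distinct outputs) can be illustrated with a toy 4-round Feistel network on 32-bit blocks. This is only a sketch: the round function and names below are invented for the example and are NOT cryptographically secure; a real use would substitute an actual block cipher.

```cpp
#include <cassert>
#include <cstdint>

// Toy round function (an assumption for this sketch, not a real cipher).
static std::uint16_t toy_round(std::uint16_t half, std::uint32_t key) {
    std::uint32_t x = (static_cast<std::uint32_t>(half) ^ key) * 0x9E3779B1u;
    return static_cast<std::uint16_t>(x >> 16);
}

// 4-round Feistel network: a bijection on 32-bit values for any round
// function, so permute32(0), permute32(1), ... never repeats.
std::uint32_t permute32(std::uint32_t counter, const std::uint32_t (&keys)[4]) {
    std::uint16_t left = static_cast<std::uint16_t>(counter >> 16);
    std::uint16_t right = static_cast<std::uint16_t>(counter & 0xFFFFu);
    for (std::uint32_t key : keys) {
        std::uint16_t next = static_cast<std::uint16_t>(left ^ toy_round(right, key));
        left = right;
        right = next;
    }
    return (static_cast<std::uint32_t>(left) << 16) | right;
}
```

Because the block is 32 bits wide, the outputs cover 0 to 2^32-1; mapping them into a narrower range without introducing duplicates needs extra care that this sketch does not attempt.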
Use the generic HashSet class. A HashSet never contains duplicate values. You can add all of your generated numbers to it and then use them from the HashSet. You can also check whether a value already exists. A HashSet determines whether an item exists very quickly, and lookups do not slow down as the collection grows, which is its biggest advantage.
For example :
HashSet<int> array = new HashSet<int>();
array.Add(1);
array.Add(2);
array.Add(1);
foreach (var item in array)
{
    Console.WriteLine(item);
}
Console.ReadKey();
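Since the question is about C++, the same hash-set idea might be sketched with `std::unordered_set`, whose `insert` reports whether the value was new. The function name and parameters are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

// Draws `count` distinct values from [0, max_value], retrying on duplicates.
// Hypothetical helper for illustration.
template <typename Rng>
std::vector<std::uint32_t> unique_randoms_hashed(std::uint32_t max_value,
                                                 std::size_t count, Rng& rng) {
    assert(count <= static_cast<std::uint64_t>(max_value) + 1);
    std::unordered_set<std::uint32_t> seen;
    seen.reserve(count);
    std::vector<std::uint32_t> out;
    out.reserve(count);
    std::uniform_int_distribution<std::uint32_t> dist(0, max_value);
    while (out.size() < count) {
        std::uint32_t v = dist(rng);
        // insert() returns {iterator, bool}; the bool is true for a new value
        if (seen.insert(v).second) out.push_back(v);
    }
    return out;
}
```

For the numbers in the question (100.000 values from 0 to 1.000.000.000) collisions are rare, so the loop runs only slightly more than `count` times, with expected constant time per lookup.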