Random Permutations - c++

I am having trouble figuring out a decent way of randomly shuffling the elements in an std::vector and, after some operations, restoring the original order. I know that this should be a rather trivial algorithm, but I guess I'm too tired...
Since I am constrained to use a custom random number generator class, I guess I can't use std::random_shuffle, which doesn't help anyway, because I also need to preserve the original order. So, my approach was to create an std::map which serves as a mapping between the original positions and the random ones, like this:
std::map<unsigned int, unsigned int> getRandomPermutation (const unsigned int &numberOfElements)
{
std::map<unsigned int, unsigned int> permutation;
//populate the map
for (unsigned int i = 0; i < numberOfElements; i++)
{
permutation[i] = i;
}
//randomize it
for (unsigned int i = 0; i < numberOfElements; i++)
{
//generate a random number in the interval [0, numberOfElements)
unsigned long randomValue = GetRandomInteger(numberOfElements - 1U);
//broken swap implementation
//permutation[i] = randomValue;
//permutation[randomValue] = i;
//use this instead:
std::swap(permutation[i], permutation[randomValue]);
}
return permutation;
}
I am not sure that the above algorithm is a proper implementation for a random permutation, so any improvements are welcome.
Now, here is how I've managed to make use of this permutation map:
std::vector<BigInteger> doStuff (const std::vector<BigInteger> &input)
{
/// Permute the values in a random order
std::map<unsigned int, unsigned int> permutation = getRandomPermutation(static_cast<unsigned int>(input.size()));
std::vector<BigInteger> temp;
//permute values
for (unsigned int i = 0; i < static_cast<unsigned int>(input.size()); ++i)
{
temp.push_back(input[permutation[i]]);
}
//do all sorts of stuff with temp
/// Reverse the permutation
std::vector<BigInteger> output;
for (unsigned int i = 0; i < static_cast<unsigned int>(input.size()); ++i)
{
output.push_back(temp[permutation[i]]);
}
return output;
}
Something tells me that I should be able to use only one std::vector<BigInteger> for this algorithm, but, right now, I just can't figure out the optimal solution. Honestly, I don't really care about the data in input, so I could even make it non-const, overwrite it, and skip creating a copy of it, but the question is how to implement the algorithm?
If I do something like this, I end up shooting myself in the foot, right? :)
for (unsigned int i = 0; i < static_cast<unsigned int>(input.size()); ++i)
{
BigInteger aux = input[i];
input[i] = input[permutation[i]];
input[permutation[i]] = aux;
}
EDIT: Following Steve's remark about using "Fisher-Yates" shuffle, I changed my getRandomPermutation function accordingly:
std::map<unsigned int, unsigned int> getRandomPermutation (const unsigned int &numberOfElements)
{
std::map<unsigned int, unsigned int> permutation;
//populate the map
for (unsigned int i = 0; i < numberOfElements; i++)
{
permutation[i] = i;
}
//randomize it
for (unsigned int i = numberOfElements - 1; i > 0; --i)
{
//generate a random number in the interval [0, numberOfElements)
unsigned long randomValue = GetRandomInteger(i);
std::swap(permutation[i], permutation[randomValue]);
}
return permutation;
}

If you're "randomising" a vector of n elements, you can create another std::vector<size_t> index(n), set index[x] = x for 0 <= x < n, then shuffle index. Then your lookups take the form: original_vector[index[i]]. The order of the original vector's never changed so no need to restore ordering.
...constrained to use a custom random number generator class, I guess I can't use std::random_shuffle...
Have you noticed this overload?
template <class RandomAccessIterator, class RandomNumberGenerator>
void random_shuffle ( RandomAccessIterator first, RandomAccessIterator last,
RandomNumberGenerator& rand );
For details of how to wrap your random number generator with a compatible object, see http://www.sgi.com/tech/stl/RandomNumberGenerator.html

If you're looking for specific errors in your code:
permutation[i] = randomValue;
permutation[randomValue] = i;
is wrong. Observe that once you're finished, each value does not necessarily appear exactly once among the values of the map. So it's not a permutation, let alone a uniformly-distributed random one.
The proper means to generate a random permutation is what Tony says, use std::random_shuffle on a vector that initially represents the identity permutation. Or if you want to know how a shuffle is properly performed, look up "Fisher-Yates". In general, any approach that makes N random selections uniformly from 0 .. N-1 is doomed to failure, because that means it has N^N possible ways it can run. But there are N! possible permutations of N items, and N^N is generally not divisible by N!. Hence it's impossible for each permutation to be the result of an equal number of random selections, i.e. the distribution is not uniform.
the question is how to implement the algorithm?
So, you have your permutation, and you want to re-order the elements of input in-place, according to that permutation.
The key thing to know is that every permutation is a composition of "cycles". That is to say, if you repeatedly follow the permutation from a given starting point, you come back to where you started (and this path is the cycle to which that starting point belongs). There may be more than one such cycle in a given permutation, and if permutation[i] == i for some i, then the cycle of i has length 1.
The cycles are all disjoint, that is to say each element appears in exactly one cycle. Because cycles don't "interfere" with each other, we can apply a permutation by applying each cycle, and we can do the cycles in any order. So, for each index i we need to:
check whether we've already done i. If so, move on to the next index.
set current = i
swap index[current] with index[permutation[current]]. So index[current] is set to its correct value (the next element in the cycle), and its old value is "pushed" forward along the cycle.
mark current as "done"
if permutuation[current] is i, we've finished the cycle. So the first value of the cycle ends up in the spot formerly occupied by the last element of the cycle, which is right. Move on to the next index.
set current = permutation[current] and go back to the swap step.
Depending on the types involved, you can optimize around the swaps - it may be better to copy/move to a temporary variable and the start of each cycle, then do a copy/move instead of a swap at each step of the cycle, and finally copy/move the temporary to the end of the cycle.
Reversing the process is the same, but using the "inverse" of the permutation. The inverse inv of a permutation perm, is the permutation such that inv[perm[i]] == i for each i. You can either compute the inverse and use the exact code above, or you can use code similar to the above, except move the elements in the opposite direction along each cycle.
An alternative to all that, since you implemented Fisher-Yates yourself -- as you're running Fisher-Yates, for each swap you perform record the two indices swapped in a vector<pair<size_t,size_t>>. Then you don't have to worry about cycles. You can apply the permutation to the vector by applying the same sequence of swaps. You can reverse the permutation by applying the reversed sequence of swaps.

Note that, depending on your application, if it is important that you have a truly uniformly distributed permutation, you cannot use any algorithm that calls a typical pseudo-random number generator more that once.
The reason is that most pseudo-random number generators, such as the one in clib, are Linear congruential. Those have a weakness where they'll generate numbers that cluster in planes - so your permutations will not be perfectly uniformly distributed. Using a higher-quality generator should get around that.
See http://en.wikipedia.org/wiki/Linear_congruential_generator
Alternatively, you could just generate a single random number in the range 0..(n!-1) and pass it to the unrank function for permutations. For small enough n, you can store those and get a constant time algorithm, but if n is too large for that, the best unrank function is O(n). Applying the resulting permutation is going to be O(n) anyway.

Given a ordered sequence of elements a,b,c,d,e you first create a new indexed sequence: X=(0,a),(1,b),(2,c),(3,d),(4,e). Then, you randomly shuffle that sequence and obtain the second element of each pair to get the random sequence. To restore the original sequence you sort the X set incrementally using the first element of each pair.

Related

C++ How to generate 10,000 UNIQUE random integers to store in a BST?

I am trying to generate 10,000 unique random integers in the range of 1 to 20,000 to store in a BST, but not sure the best way to do this.
I saw some good suggestions on how to do it with an array or a vector, but not for a BST. I have a contains method but I don't believe it will work in this scenario as it is used to search and return results on how many tries it took to find the desired number. Below is the closest I've gotten but it doesn't like my == operator. Would it be better to use an array and just store the array in the BST? Or is there a better way to use the below code so that while it's generating the numbers it's just storing them right in the tree?
for (int i = 0; i < 10000; i++)
{
int random = rand() % 20000;
tree1Ptr->add(random);
for (int j = 0; j < i; j++) {
if (tree1Ptr[j]==random) i--;
}
}
There are a couple of problems in your code. But let's go straight to the hurting point.
What's the main problem ?
From your code, it is obvious that tree1Ptr is a pointer. In principle, it should point to a node of the tree, which has two pointers, one to the left node and one to the right node.
So somewhere in your code, you should have:
tree1Ptr = new Node; // or whatever the type of your node is called
However, in your inner loop, you are just using it as if it was an array:
for (int i = 0; i < 10000; i++)
{
int random = rand() % 20000;
tree1Ptr->add(random);
for (int j = 0; j < i; j++) {
if (tree1Ptr[j]==random) //<============ OUCH !!
i--;
}
}
The compiler won't complain, because it's valid syntax: you can use array indexing on a pointer. But it's up to you to ensure that you don not go out of bounds (so here, that j remains <1).
Other remarks
By the way, in the inner loop, you just want to say that you have to retry if the number is found. You can break the inner loop if the number is already found, in order not to continue.
You should also seed your random number generator, to avoid running the program always with the same sequence.
How to solve it ?
You really need to deepen your understanding of BST. Navigating through the node requires make comparison with the value in the current node, and depending on the result, iterate continuing either with the left or the right pointer, not using indexing. But it would be too long to explain here. So may be you should look for a tutorial, like this one
For a lot of unique 'random' numbers I usually use a Format Preserving Encryption. Since encryption is one-to-one, you are guaranteed unique outputs as long as the inputs are unique. A different encryption key will generate a different set of outputs, i.e. a different permutation of the inputs. Simply encrypt 0, 1, 2, 3, 4, ... and the outputs are guaranteed unique.
You want numbers in the range [1 .. 20,000]. Unfortunately 20,000 needs 21 bits and most encryption schemes have an even number of bits: 22 bits in your case. That means you will need to cycle walk; re-encrypt the output if the number is too big until you get a number in the desired range. Since your inputs only go up to 10,000 and you will be cycle walking above 20,000 you will still avoid duplicates.
The only standard cipher I know of which allows a 22 bit block size is Hasty Pudding cipher. Alternatively it is easy enough to write your own simple Feistel cipher. Four rounds are enough if you do not want cryptographic security. For crypto level security you will need to use AES/FFX, which is NIST approved.
There are two ways where you can pick random unique numbers out of a sequence without checking against the numbers previously picked (i.e. already in your BST).
Use random_shuffle
A simple way is to shuffle a sorted array of 1 ... 20,000 and simply pick the first 10,000 items:
#include <algorithm>
#include <vector>
std::vector<int> values(20000);
for (int i = 0; i < 20000; ++i) {
values[i] = i+1;
}
std::random_shuffle(values.begin(), values.end());
for (int i = 0; i < 10000; ++i) {
// Insert values[i] into your BST
}
This method works well if the size of random numbers (10,000) to pick is comparable to the size of total numbers (20,000), because the complexity of random shuffling is amortized over a larger result set.
Use uniform_int_distribution
If the size of random numbers to pick is much smaller than the size of total numbers, then an alternative way can be used:
#include <chrono>
#include <random>
#include <vector>
// Use timed seed so every run produces different random picks.
std::default_random_engine reng(
std::chrono::steady_clock::now().time_since_epoch().count());
int num_pick = 1000; // # of random numbers remained to pick
int num_total = 20000; // Total # of numbers to pick from
int cur_value = 1; // Current prospective number to be picked
while (num_pick > 0) {
// Probability to pick `cur_value` is num_pick / (num_total-cur_value+1)
std::uniform_int_distribution<int> distrib(0, num_total-cur_value);
if (distrib(reng) < num_pick) {
bst.insert(cur_value); // insert `cur_value` to your BST
--num_pick;
}
++cur_value;
}

Efficiently find an integer not in a set of size 40, 400, or 4000

Related to the classic problem find an integer not among four billion given ones but not exactly the same.
To clarify, by integers what I really mean is only a subset of its mathemtical definition. That is, assume there are only finite number of integers. Say in C++, they are int in the range of [INT_MIN, INT_MAX].
Now given a std::vector<int> (no duplicates) or std::unordered_set<int>, whose size can be 40, 400, 4000 or so, but not too large, how to efficiently generate a number that is guaranteed to be not among the given ones?
If there is no worry for overflow, then I could multiply all nonzero ones together and add the product by 1. But there is. The adversary test cases could delibrately contain INT_MAX.
I am more in favor of simple, non-random approaches. Is there any?
Thank you!
Update: to clear up ambiguity, let's say an unsorted std::vector<int> which is guaranteed to have no duplicates. So I am asking if there is anything better than O(n log(n)). Also please note that test cases may contain both INT_MIN and INT_MAX.
You could just return the first of N+1 candidate integers not contained in your input. The simplest candidates are the numbers 0 to N. This requires O(N) space and time.
int find_not_contained(container<int> const&data)
{
const int N=data.size();
std::vector<char> known(N+1, 0); // one more candidates than data
for(int i=0; i< N; ++i)
if(data[i]>=0 && data[i]<=N)
known[data[i]]=1;
for(int i=0; i<=N; ++i)
if(!known[i])
return i;
assert(false); // should never be reached.
}
Random methods can be more space efficient, but may require more passes over the data in the worst case.
Random methods are indeed very efficient here.
If we want to use a deterministic method and by assuming the size n is not too large, 4000 for example, then we can create a vector x of size m = n + 1 (or a little bit larger, 4096 for example to facilitate calculation), initialised with 0.
For each i in the range, we just set x[array[i] modulo m] = 1.
Then a simple O(n) search in x will provide a value which is not in array
Note: the modulo operation is not exactly the "%" operation
Edit: I mentioned that calculations are made easier by selecting here a size of 4096. To be more concrete, this implies that the modulo operation is performed with a simple & operation
You can find the smallest unused integer in O(N) time using O(1) auxiliary space if you are allowed to reorder the input vector, using the following algorithm. [Note 1] (The algorithm also works if the vector contains repeated data.)
size_t smallest_unused(std::vector<unsigned>& data) {
size_t N = data.size(), scan = 0;
while (scan < N) {
auto other = data[scan];
if (other < scan && data[other] != other) {
data[scan] = data[other];
data[other] = other;
}
else
++scan;
}
for (scan = 0; scan < N && data[scan] == scan; ++scan) { }
return scan;
}
The first pass guarantees that if some k in the range [0, N) was found after position k, then it is now present at position k. This rearrangement is done by swapping in order to avoid losing data. Once that scan is complete, the first entry whose value is not the same as its index is not referenced anywhere in the array.
That assertion may not be 100% obvious, since a entry could be referenced from an earlier index. However, in that case the entry could not be the first entry unequal to its index, since the earlier entry would be meet that criterion.
To see that this algorithm is O(N), it should be observed that the swap at lines 6 and 7 can only happen if the target entry is not equal to its index, and that after the swap the target entry is equal to its index. So at most N swaps can be performed, and the if condition at line 5 will be true at most N times. On the other hand, if the if condition is false, scan will be incremented, which can also only happen N times. So the if statement is evaluated at most 2N times (which is O(N)).
Notes:
I used unsigned integers here because it makes the code clearer. The algorithm can easily be adjusted for signed integers, for example by mapping signed integers from [INT_MIN, 0) onto unsigned integers [INT_MAX, INT_MAX - INT_MIN) (The subtraction is mathematical, not according to C semantics which wouldn't allow the result to be represented.) In 2's-complement, that's the same bit pattern. That changes the order of the numbers, of course, which affects the semantics of "smallest unused integer"; an order-preserving mapping could also be used.
Make random x (INT_MIN..INT_MAX) and test it against all. Test x++ on failure (very rare case for 40/400/4000).
Step 1: Sort the vector.
That can be done in O(n log(n)), you can find a few different algorithms online, use the one you like the most.
Step 2: Find the first int not in the vector.
Easily iterate from INT_MIN to INT_MIN + 40/400/4000 checking if the vector has the current int:
Pseudocode:
SIZE = 40|400|4000 // The one you are using
for (int i = 0; i < SIZE; i++) {
if (array[i] != INT_MIN + i)
return INT_MIN + i;
The solution would be O(n log(n) + n) meaning: O(n log(n))
Edit: just read your edit asking for something better than O(n log(n)), sorry.
For the case in which the integers are provided in an std::unordered_set<int> (as opposed to a std::vector<int>), you could simply traverse the range of integer values until you come up against one integer value that is not present in the unordered_set<int>. Searching for the presence of an integer in an std::unordered_set<int> is quite straightforward, since std::unodered_set does provide searching through its find() member function.
The space complexity of this approach would be O(1).
If you start traversing at the lowest possible value for an int (i.e., std::numeric_limits<int>::min()), you will obtain the lowest int not contained in the std::unordered_set<int>:
int find_lowest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::min(); ; ++i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Analogously, if you start traversing at the greatest possible value for an int (i.e., std::numeric_limits<int>::max()), you will obtain the lowest int not contained in the std::unordered_set<int>:
int find_greatest_not_contained(const std::unordered_set<int>& set) {
for (auto i = std::numeric_limits<int>::max(); ; --i) {
auto it = set.find(i); // search in set
if (it == set.end()) // integer not in set?
return *it;
}
}
Assuming that the ints are uniformly mapped by the hash function into the unordered_set<int>'s buckets, a search operation on the unordered_set<int> can be achieved in constant time. The run-time complexity would then be O(M ), where M is the size of the integer range you are looking for a non-contained value. M is upper-bounded by the size of the unordered_set<int> (i.e., in your case M <= 4000).
Indeed, with this approach, selecting any integer range whose size is greater than the size of the unordered_set, is guaranteed to come up against an integer value which is not present in the unordered_set<int>.

How do I print out vectors in different order every time

I'm trying to make two vectors. Where vector1 (total1) is containing some strings and vector2(total2) is containing some random unique numbers(that are between 0 and total1.size() - 1)
I want to make a program that print out total1s strings, but in different order every turn. I don't want to use iterators or something because I want to improve my problem solving capacity.
Here is the specific function that crash the program.
for (unsigned i = 0; i < total1.size();)
{
v1 = rand() % total1.size();
for (unsigned s = 0; s < total1.size(); ++s)
{
if (v1 == total2[s])
;
else
{
total2.push_back(v1);
++i;
}
}
}
I'm very grateful for any help that I can get!
Can I suggest you change of algorithm?. Because, even if your current one is correctly implemented ("s", in your code, must go from 0 to total2.size not total1.size and if element is found, break and generate a new random), it has the following drawback: assume vectors of 1.000.000 elements and you are trying the last random number. You have one probability in 1.000.000 of find a random number not previously used. That is a very small amount.Last but one number has a probability of 2 in 1.000.000 also small. In conclusion, your program will loop and expend lots of CPU resources.
Your best alternative is follow #NathanOliver suggestion and look for function std::shuffle. The manual page shows the implementation algorithm, that is what you are looking for.
Another simple algorithm, with some pros and cons, is:
init total2 with sequence 0,1,2,...,n where n is the size total1 - 1
choice two random numbers, i1 and i2, in range [0,n-1].
Swap elements i1 and i2 in total2.
repeat from (2) a fixed number of times "R".
This method allows to known a priori the necessary steps and to control the level of "randomness" of the final vector (bigger R is more random). However, it is far to be good in its randomness quality.
Another method, better in the probabilistic distribution:
fill a list L with number 0,1,2,...size total1-1.
choice a random number i between 0 and the size of list L - 1 .
Store in total2 the i-th element in list L.
Remove this element from L.
repeat from (2) until L is empty.
If you just want to shuffle vector<string> total1, you can do this without using helping vector<int> total2. Here is an implementation based on Fisher–Yates shuffle.
for(int i=n-1; i>=1; i--) {
int j=rand()%(i+1);
swap(total1[j], total1[i]); // your prof might not allow use of swap:)
}
If you must use vector<int> total2 then shuffle it using above algorithm. Next you can use it to create a new vector<string> result from total1 where result[i]=total1[total2[i]].

Randomly select index from a STL vector from truth value

I have a vector that looks like:
vector<int> A = {0, 1, 1, 0, 0, 1, 0, 1};
I'd like to select a random index from the non-zero values of A. Using this example A, I want to randomly select an element from the array {1,2,5,7}.
Currently I do this by creating another array
vector<int> b;
for(int i=0;i<A.size();i++)
if(A[i])
b.push_back(i);
Once b is created, I find the index by using this answer:
get random element from container
Is there a more STL-like (or C++11) way of doing this, perhaps one that does not create an intermediate array? In this example A is small, but in my production code this selection process is in an inner-loop and A is non-static and thousands of elements long.
A great way to do this is Reservoir Sampling.
In short, you walk your array until you find the first non-zero value, and record that index as the first possible answer you might return.
Then, you continue to walk the array. Every time you find a non-zero value, you randomly might change which new index is your possible answer, with decreasing probability.
This algorithm also works great if you need M random index values from your array.
What's great about this, is that you walk each element only one time, and you don't need a separate memory structure to record the non-zero elements. It's O(N) in speed, and O(M) in memory, in your case it's O(1) in memory, since you only want 1 random value.
On the flip side, random number generators are traditionally quite slow. So, you might want to performance test this against any other ideas people come up with here, to see if the trade-off of speed-vs-memory is worth it for you.
With a single pass through the array, you can determine how many false (or true) values there are. If you are doing this kind of thing often, you can even write a class to keep track of this for you.
Regardless, you can then pick a random number i between 0 and num_false (or num_true). Then with another pass through the array, you can return the ith false (or true) index.
We can loop through each non-zero value and assign it a random number. The index with the largest random number is the one we select.
int value = 0;
int index = 0;
while(int i = 0; i < A.size(); i++) {
if(!A[i]) continue;
auto j = rand();
if(j > value) {
index = i;
value = j;
}
}
vector<int> A = {0,1,1,0,0,1,0,1};
random_shuffle(A.begin(),A.end());
auto it = find_if(A.begin(),A.end(),[](const int elem){return elem;});

Fast generation of random set, Monte Carlo Simulation

I have a set of numbers ~100, I wish to perform MC simulation on this set, the basic idea is I fully randomize the set, do some comparison/checks on the first ~20 values, store the result and repeat.
Now the actual comparison/check algorithm is extremely fast it actually completes in about 50 CPU cycles. With this in mind, and in order to optimize these simulations I need to generate the random sets as fast as possible.
Currently I'm using a Multiply With Carry algorithm by George Marsaglia which provides me with a random integer in 17 CPU cycles, quite fast. However, using the Fisher-Yates shuffling algorithm I have to generate 100 random integers, ~1700 CPU cycles. This overshadows my comparison time by a long ways.
So my question is are there other well known/robust techniques for doing this type of MC simulation, where I can avoid the long random set generation time?
I thought about just randomly choosing 20 values from the set, but I would then have to do collision checks to ensure that 20 unique entries were chosen.
Update:
Thanks for the responses. I have another question with regards to a method I just came up with after my post. The question is, will this provide a robust truly (assuming the RNG is good) random output. Basically my method is to set up an array of integer values the same length as my input array, set every value to zero. Now I begin randomly choosing 20 values from the input set like so:
int pcfast[100];
memset(pcfast,0,sizeof(int)*100);
int nchosen = 0;
while (nchosen<20)
{
int k = rand(100); //[0,100]
if ( pcfast[k] == 0 )
{
pcfast[k] = 1;
r[nchosen++] = s[k]; // r is the length 20 output, s the input set.
}
}
Basically what I mentioned above, choosing 20 values at random, except it seems like a somewhat optimized way of ensuring no collisions. Will this provide good random output? Its quite fast.
If you only use the first 20 values in the randomised array, then you only need to do 20 steps of the Fisher-Yates algorithm (Knuth's version). Then 20 values have been randomised (actually at the end of the array rather than at the beginning, in the usual formulation), in the sense that the remaining 80 steps of the algorithm are guaranteed not to move them. The other 80 positions aren't fully shuffled, but who cares?
C++ code (iterators should be random-access):
using std::swap;
template <typename Iterator, typename Rand> // you didn't specify the type
void partial_shuffle(Iterator first, Iterator middle, Iterator last, Rand rnd) {
size_t n = last - first;
while (first != middle) {
size_t k = rnd(n); // random integer from 0 to n-1
swap(*(first+k),*first);
--n;
++first;
}
}
On return, the values from first through to middle-1 are shuffled. Use it like this:
int arr[100];
for (int i = 0; i < 100; ++i) arr[i] = i;
while (need_more_samples()) {
partial_shuffle(arr, arr+20, arr+100, my_prng);
process_sample(arr, arr+20);
}
The Ross simulation book suggests something like the following:
double return[10];
for(int i=0, n=100; i < 10; i++) {
int x = rand(n); //pseudocode - generate an integer on [0,n]
return[i] = arr[x];
arr[x] = arr[n];
n--;
}