How do I print out vectors in different order every time - c++

I'm trying to make two vectors. Where vector1 (total1) is containing some strings and vector2(total2) is containing some random unique numbers(that are between 0 and total1.size() - 1)
I want to make a program that print out total1s strings, but in different order every turn. I don't want to use iterators or something because I want to improve my problem solving capacity.
Here is the specific function that crash the program.
for (unsigned i = 0; i < total1.size();)
{
v1 = rand() % total1.size();
for (unsigned s = 0; s < total1.size(); ++s)
{
if (v1 == total2[s])
;
else
{
total2.push_back(v1);
++i;
}
}
}
I'm very grateful for any help that I can get!

Can I suggest you change of algorithm?. Because, even if your current one is correctly implemented ("s", in your code, must go from 0 to total2.size not total1.size and if element is found, break and generate a new random), it has the following drawback: assume vectors of 1.000.000 elements and you are trying the last random number. You have one probability in 1.000.000 of find a random number not previously used. That is a very small amount.Last but one number has a probability of 2 in 1.000.000 also small. In conclusion, your program will loop and expend lots of CPU resources.
Your best alternative is follow #NathanOliver suggestion and look for function std::shuffle. The manual page shows the implementation algorithm, that is what you are looking for.
Another simple algorithm, with some pros and cons, is:
init total2 with sequence 0,1,2,...,n where n is the size total1 - 1
choice two random numbers, i1 and i2, in range [0,n-1].
Swap elements i1 and i2 in total2.
repeat from (2) a fixed number of times "R".
This method allows to known a priori the necessary steps and to control the level of "randomness" of the final vector (bigger R is more random). However, it is far to be good in its randomness quality.
Another method, better in the probabilistic distribution:
fill a list L with number 0,1,2,...size total1-1.
choice a random number i between 0 and the size of list L - 1 .
Store in total2 the i-th element in list L.
Remove this element from L.
repeat from (2) until L is empty.

If you just want to shuffle vector<string> total1, you can do this without using helping vector<int> total2. Here is an implementation based on Fisher–Yates shuffle.
for(int i=n-1; i>=1; i--) {
int j=rand()%(i+1);
swap(total1[j], total1[i]); // your prof might not allow use of swap:)
}
If you must use vector<int> total2 then shuffle it using above algorithm. Next you can use it to create a new vector<string> result from total1 where result[i]=total1[total2[i]].

Related

efficiently mask-out exactly 30% of array with 1M entries

My question's header is similar to this link, however that one wasn't answered to my expectations.
I have an array of integers (1 000 000 entries), and need to mask exactly 30% of elements.
My approach is to loop over elements and roll a dice for each one. Doing it in a non-interrupted manner is good for cache coherency.
As soon as I notice that exactly 300 000 of elements were indeed masked, I need to stop. However, I might reach the end of an array and have only 200 000 elements masked, forcing me to loop a second time, maybe even a third, etc.
What's the most efficient way to ensure I won't have to loop a second time, and not being biased towards picking some elements?
Edit:
//I need to preserve the order of elements.
//For instance, I might have:
[12, 14, 1, 24, 5, 8]
//Masking away 30% might give me:
[0, 14, 1, 24, 0, 8]
The result of masking must be the original array, with some elements set to zero
Just do a fisher-yates shuffle but stop at only 300000 iterations. The last 300000 elements will be the randomly chosen ones.
std::size_t size = 1000000;
for(std::size_t i = 0; i < 300000; ++i)
{
std::size_t r = std::rand() % size;
std::swap(array[r], array[size-1]);
--size;
}
I'm using std::rand for brevity. Obviously you want to use something better.
The other way is this:
for(std::size_t i = 0; i < 300000;)
{
std::size_t r = rand() % 1000000;
if(array[r] != 0)
{
array[r] = 0;
++i;
}
}
Which has no bias and does not reorder elements, but is inferior to fisher yates, especially for high percentages.
When I see a massive list, my mind always goes first to divide-and-conquer.
I won't be writing out a fully-fleshed algorithm here, just a skeleton. You seem like you have enough of a clue to take decent idea and run with it. I think I only need to point you in the right direction. With that said...
We'd need an RNG that can return a suitably-distributed value for how many masked values could potentially be below a given cut point in the list. I'll use the halfway point of the list for said cut. Some statistician can probably set you up with the right RNG function. (Anyone?) I don't want to assume it's just uniformly random [0..mask_count), but it might be.
Given that, you might do something like this:
// the magic RNG your stats homework will provide
int random_split_sub_count_lo( int count, int sub_count, int split_point );
void mask_random_sublist( int *list, int list_count, int sub_count )
{
if (list_count > SOME_SMALL_THRESHOLD)
{
int list_count_lo = list_count / 2; // arbitrary
int list_count_hi = list_count - list_count_lo;
int sub_count_lo = random_split_sub_count_lo( list_count, mask_count, list_count_lo );
int sub_count_hi = list_count - sub_count_lo;
mask( list, list_count_lo, sub_count_lo );
mask( list + sub_count_lo, list_count_hi, sub_count_hi );
}
else
{
// insert here some simple/obvious/naive implementation that
// would be ludicrous to use on a massive list due to complexity,
// but which works great on very small lists. I'm assuming you
// can do this part yourself.
}
}
Assuming you can find someone more informed on statistical distributions than I to provide you with a lead on the randomizer you need to split the sublist count, this should give you O(n) performance, with 'n' being the number of masked entries. Also, since the recursion is set up to traverse the actual physical array in constantly-ascending-index order, cache usage should be as optimal as it's gonna get.
Caveat: There may be minor distribution issues due to the discrete nature of the list versus the 30% fraction as you recurse down and down to smaller list sizes. In practice, I suspect this may not matter much, but whatever person this solution is meant for may not be satisfied that the random distribution is truly uniform when viewed under the microscope. YMMV, I guess.
Here's one suggestion. One million bits is only 128K which is not an onerous amount.
So create a bit array with all items initialised to zero. Then randomly select 300,000 of them (accounting for duplicates, of course) and mark those bits as one.
Then you can run through the bit array and, any that are set to one (or zero, if your idea of masking means you want to process the other 700,000), do whatever action you wish to the corresponding entry in the original array.
If you want to ensure there's no possibility of duplicates when randomly selecting them, just trade off space for time by using a Fisher-Yates shuffle.
Construct an collection of all the indices and, for each of the 700,000 you want removed (or 300,000 if, as mentioned, masking means you want to process the other ones) you want selected:
pick one at random from the remaining set.
copy the final element over the one selected.
reduce the set size.
This will leave you with a random subset of indices that you can use to process the integers in the main array.
You want reservoir sampling. Sample code courtesy of Wikipedia:
(*
S has items to sample, R will contain the result
*)
ReservoirSample(S[1..n], R[1..k])
// fill the reservoir array
for i = 1 to k
R[i] := S[i]
// replace elements with gradually decreasing probability
for i = k+1 to n
j := random(1, i) // important: inclusive range
if j <= k
R[j] := S[i]

Generating random integers with a difference constraint

I have the following problem:
Generate M uniformly random integers from the range 0-N, where N >> M, and where no pair has a difference less than K. where M >> K.
At the moment the best method I can think of is to maintain a sorted list, then determine the lower bound of the current generated integer and test it with the lower and upper elements, if it's ok to then insert the element in between. This is of complexity O(nlogn).
Would there happen to be a more efficient algorithm?
An example of the problem:
Generate 1000 uniformly random integers between zero and 100million where the difference between any two integers is no less than 1000
A comprehensive way to solve this would be to:
Determine all the combinations of n-choose-m that satisfy the constraint, lets called it set X
Select a uniformly random integer i in the range [0,|X|).
Select the i'th combination from X as the result.
This solution is problematic when the n-choose-m is large, as enumerating and storing all possible combinations will be extremely costly. Hence an efficient online generating solution is sought.
Note: The following is a C++ implementation of the solution provided by pentadecagon
std::vector<int> generate_random(const int n, const int m, const int k)
{
if ((n < m) || (m < k))
return std::vector<int>();
std::random_device source;
std::mt19937 generator(source());
std::uniform_int_distribution<> distribution(0, n - (m - 1) * k);
std::vector<int> result_list;
result_list.reserve(m);
for (int i = 0; i < m; ++i)
{
result_list.push_back(distribution(generator));
}
std::sort(std::begin(result_list),std::end(result_list));
for (int i = 0; i < m; ++i)
{
result_list[i] += (i * k);
}
return result_list;
}
http://ideone.com/KOeR4R
.
EDIT: I adapted the text for the requirement to create ordered sequences, each with the same probability.
Create random numbers a_i for i=0..M-1 without duplicates. Sort them. Then create numbers
b_i=a_i + i*(K-1)
Given the construction, those numbers b_i have the required gaps, because the a_i already have gaps of at least 1. In order to make sure those b values cover exactly the required range [1..N], you must ensure a_i are picked from a range [1..N-(M-1)*(K-1)]. This way you get truly independent numbers. Well, as independent as possible given the required gap. Because of the sorting you get O(M log M) performance again, but this shouldn't be too bad. Sorting is typically very fast. In Python it looks like this:
import random
def random_list( N, M, K ):
s = set()
while len(s) < M:
s.add( random.randint( 1, N-(M-1)*(K-1) ) )
res = sorted( s )
for i in range(M):
res[i] += i * (K-1)
return res
First off: this will be an attempt to show that there's a bijection between the (M+1)- compositions (with the slight modification that we will allow addends to be 0) of the value N - (M-1)*K and the valid solutions to your problem. After that, we only have to pick one of those compositions uniformly at random and apply the bijection.
Bijection:
Let
Then the xi form an M+1-composition (with 0 addends allowed) of the value on the left (notice that the xi do not have to be monotonically increasing!).
From this we get a valid solution
by setting the values mi as follows:
We see that the distance between mi and mi + 1 is at least K, and mM is at most N (compare the choice of the composition we started out with). This means that every (M+1)-composition that fulfills the conditions above defines exactly one valid solution to your problem. (You'll notice that we only use the xM as a way to make the sum turn out right, we don't use it for the construction of the mi.)
To see that this gives a bijection, we need to see that the construction can be reversed; for this purpose, let
be a given solution fulfilling your conditions. To get the composition this is constructed from, define the xi as follows:
Now first, all xi are at least 0, so that's alright. To see that they form a valid composition (again, every xi is allowed to be 0) of the value given above, consider:
The third equality follows since we have this telescoping sum that cancels out almost all mi.
So we've seen that the described construction gives a bijection between the described compositions of N - (M-1)*K and the valid solutions to your problem. All we have to do now is pick one of those compositions uniformly at random and apply the construction to get a solution.
Picking a composition uniformly at random
Each of the described compositions can be uniquely identified in the following way (compare this for illustration): reserve N - (M-1)*K spaces for the unary notation of that value, and another M spaces for M commas. We get an (M+1)- composition of N - (M-1)*K by choosing M of the N - (M-1)*K + M spaces, putting commas there, and filling the rest with |. Then let x0 be the number of | before the first comma, xM+1 the number of | after the last comma, and all other xi the number of | between commas i and i+1. So all we have to do is pick an M-element subset of the integer interval[1; N - (M-1)*K + M] uniformly at random, which we can do for example with the Fisher-Yates shuffle in O(N + M log M) (we need to sort the M delimiters to build the composition) since M*K needs to be in O(N) for any solutions to exist. So if N is bigger than M by at least a logarithmic factor, then this is linear in N.
Note: #DavidEisenstat suggested that there are more space efficient ways of picking the M-element subset of that interval; I'm not aware of any, I'm afraid.
You can get an error-proof algorithm out of this by doing the simple input validation we get from the construction above that N ≥ (M-1) * K and that all three values are at least 1 (or 0, if you define the empty set as a valid solution for that case).
Why not do this:
for (int i = 0; i < M; ++i) {
pick a random number between K and N/M
add this number to (N/M)* i;
Now you have M random numbers, distributed evenly along N, all of which have a difference of at least K. It's in O(n) time. As an added bonus, it's already sorted. :-)
EDIT:
Actually, the "pick a random number" part shouldn't be between K and N/M, but between min(K, [K - (N/M * i - previous value)]). That would ensure that the differences are still at least K, and not exclude values that should not be missed.
Second EDIT:
Well, the first case shouldn't be between K and N/M - it should be between 0 and N/M. Just like you need special casing for when you get close to the N/M*i border, we need special initial casing.
Aside from that, the issue you brought up in your comments was fair representation, and you're right. As my pseudocode is presented, it currently completely misses the excess between N/M*M and N. It's another edge case; simply change the random values of your last range.
Now, in this case, your distribution will be different for the last range. Since you have more numbers, you have slightly less chance for each number than you do for all the other ranges. My understanding is that because you're using ">>", this shouldn't really impact the distribution, i.e. the difference in size in the sample set should be nominal. But if you want to make it more fair, you divide the excess equally among each range. This makes your initial range calculation more complex - you'll have to augment each range based on how much remainder there is divided by M.
There are lots of special cases to look out for, but they're all able to be handled. I kept the pseudocode very basic just to make sure that the general concept came through clearly. If nothing else, it should be a good starting point.
Third and Final EDIT:
For those worried that the distribution has a forced evenness, I still claim that there's nothing saying it can't. The selection is uniformly distributed in each segment. There is a linear way to keep it uneven, but that also has a trade-off: if one value is selected extremely high (which should be unlikely given a very large N), then all the other values are constrained:
int prevValue = 0;
int maxRange;
for (int i = 0; i < M; ++i) {
maxRange = N - (((M - 1) - i) * K) - prevValue;
int nextValue = random(0, maxRange);
prevValue += nextValue;
store previous value;
prevValue += K;
}
This is still linear and random and allows unevenness, but the bigger prevValue gets, the more constrained the other numbers become. Personally, I prefer my second edit answer, but this is an available option that given a large enough N is likely to satisfy all the posted requirements.
Come to think of it, here's one other idea. It requires a lot more data maintenance, but is still O(M) and is probably the most fair distribution:
What you need to do is maintain a vector of your valid data ranges and a vector of probability scales. A valid data range is just the list of high-low values where K is still valid. The idea is you first use the scaled probability to pick a random data range, then you randomly pick a value within that range. You remove the old valid data range and replace it with 0, 1 or 2 new data ranges in the same position, depending on how many are still valid. All of these actions are constant time other than handling the weighted probability, which is O(M), done in a loop M times, so the total should be O(M^2), which should be much better than O(NlogN) because N >> M.
Rather than pseudocode, let me work an example using OP's original example:
0th iteration: valid data ranges are from [0...100Mill], and the weight for this range is 1.0.
1st iteration: Randomly pick one element in the one element vector, then randomly pick one element in that range.
If the element is, e.g. 12345678, then we remove the [0...100Mill] and replace it with [0...12344678] and [12346678...100Mill]
If the element is, e.g. 500, then we remove the [0...100Mill] and replace it with just [1500...100Mill], since [0...500] is no longer a valid range. The only time we will replace it with 0 ranges is in the unlikely event that you have a range with only one number in it and it gets picked. (In that case, you'll have 3 numbers in a row that are exactly K apart from each other.)
The weight for the ranges are their length over the total length, e.g. 12344678/(12344678 + (100Mill - 12346678)) and (100Mill - 12346678)/(12344678 + (100Mill - 12346678))
In the next iterations, you do the same thing: randomly pick a number between 0 and 1 and determine which of the ranges that scale falls into. Then randomly pick a number in that range, and replace your ranges and scales.
By the time it's done, we're no longer acting in O(M), but we're still only dependent on the time of M instead of N. And this actually is both uniform and fair distribution.
Hope one of these ideas works for you!

What are practical uses for STL's 'partial_sum'?

What/where are the practical uses of the partial_sum algorithm in STL?
What are some other interesting/non-trivial examples or use-cases?
I used it to reduce memory usage of a simple mark-sweep garbage collector in my toy lambda calculus interpreter.
The GC pool is an array of objects of identical size. The goal is to eliminate objects that aren't linked to other objects, and condense the remaining objects into the beginning of the array. Since the objects are moved in memory, each link needs to be updated. This necessitates an object remapping table.
partial_sum allows the table to be stored in compressed format (as little as one bit per object) until the sweep is complete and memory has been freed. Since the objects are small, this significantly reduces memory use.
Recursively mark used objects and populate the Boolean array.
Use remove_if to condense the marked objects to the beginning of the pool.
Use partial_sum over the Boolean values to generate a table of pointers/indexes into the new pool.
This works because the Nth marked object has N preceding 1's in the array and acquires pool index N.
Sweep over the pool again and replace each link using the remap table.
It's especially friendly to the data cache to put the remap table in the just-freed, thus still hot, memory.
One thing to note about partial sum is that it is the operation that undoes adjacent difference much like - undoes +. Or better yet if you remember calculus the way differentiation undoes integration. Better because adjacent difference is essentially differentiation and partial sum is integration.
Let's say you have simulation of a car and at each time step you need to know the position, velocity, and acceleration. You only need to store one of those values as you can compute the other two. Say you store the position at each time step you can take the adjacent difference of the position to give the velocity and the adjacent difference of the velocity to give the acceleration. Alternatively, if you store the acceleration you can take the partial sum to give the velocity and the partial sum of the velocity gives the position.
Partial sum is one of those functions that doesn't come up too often for most people but is enormously useful when you find the right situation. A lot like calculus.
Last time I (would have) used it is when converting a discrete probability distribution (an array of p(X = k)) into a cumulative distribution (an array of p(X <= k)). To select once from the distribution, you can pick a number from [0-1) randomly, then binary search into the cumulative distribution.
That code wasn't in C++, though, so I did the partial sum myself.
You can use it to generate a monotonically increasing sequence of numbers. For example, the following generates a vector containing the numbers 1 through 42:
std::vector<int> v(42, 1);
std::partial_sum(v.begin(), v.end(), v.begin());
Is this an everyday use case? Probably not, though I've found it useful on several occasions.
You can also use std::partial_sum to generate a list of factorials. (This is even less useful, though, since the number of factorials that can be represented by a typical integer data type is quite limited. It is fun, though :-D)
std::vector<int> v(10, 1);
std::partial_sum(v.begin(), v.end(), v.begin());
std::partial_sum(v.begin(), v.end(), v.begin(), std::multiplies<int>());
Personal Use Case: Roulette-Wheel-Selection
I'm using partial_sum in a roulette-wheel-selection algorithm (link text). This algorithm choses randomly elements from a container with a probability which is linear to some value given beforehands.
Because all my elements to choose from bringing a not-necessarily normalized value, I use the partial_sum algorithm for constructing something like a "roulette-wheel", because I sum up all the elements. Then I chose a random variable in this range (the last partial_sum is the sum of all) and use stl::lower_bound for searching "the wheel" where my random search landed. The element returned by the lower_bound algorithm is the chosen one.
Besides the advantage of clear and expressive code with the use of partial_sum, I could also gain some speed when experimenting with the GCC parallel mode which brings parallelized versions for some algorithms and one of them is the partial_sum (link text).
Another use I know of: One of the most important algorithmic primitives in parallel processing (but maybe a little bit away from STL)
If you're interested in heavy optimized algorithms which are using partial_sum (in this case maybe more results under the synonyms "scan" or "prefix_sum"), than go to the parallel algorithms community. They need it all the time. You won't find a parallel sorting algorithm based on quicksort or mergesort without using it. This operation is one of the most important parallel primitives used. I think it is most commonly used for calculating offsets in dynamic algorithms. Think of a partition step in quicksort, which is split and fed to the parallel threads. You don't know the number of elements in each slot of the partition before calculating it. So you need some offsets for all the threads for later access.
Maybe you will find more informatin in the now-hot topic of GPU processing. One short article regarding Nvidia's CUDA and the scan-primitive with a few application examples you will find in Chapter 39. Parallel Prefix Sum (Scan) with CUDA.
Personal Use Case: intermediate step in counting sort from CLRS:
COUNTING_SORT (A, B, k)
for i ← 1 to k do
c[i] ← 0
for j ← 1 to n do
c[A[j]] ← c[A[j]] + 1
//c[i] now contains the number of elements equal to i
// std::partial_sum here
for i ← 2 to k do
c[i] ← c[i] + c[i-1]
// c[i] now contains the number of elements ≤ i
for j ← n downto 1 do
B[c[A[i]]] ← A[j]
c[A[i]] ← c[A[j]] - 1
You could build a "moving sum" (precursor to a moving average):
template <class T>
void moving_sum (const vector<T>& in, int num, vector<T>& out)
{
// cummulative sum
partial_sum (in.begin(), in.end(), out.begin());
// shift and subtract
int j;
for (int i = out.size() - 1; i >= 0; i--) {
j = i - num;
if (j >= 0)
out[i] -= out[j];
}
}
And then call it with:
vector<double> v(10);
// fill in v
vector<double> v2 (v.size());
moving_sum (v, 3, v2);
You know, I actually did use partial_sum() once... It was this interesting little problem that I was asked on a job interview. I enjoyed it so much, I went home and coded it up.
The problem was: Given a sequential sequence of integers, find the shortest sub-sequence with the highest value. E.g. Given:
Value: -1 2 3 -1 4 -2 -4 5
Index: 0 1 2 3 4 5 6 7
We would find the subsequence [1,4]
Now the obvious solution is to run with 3 for loops, iterating over all possible starts & ends, and adding up the value of each possible subsequence in turn. Inefficient, but quick to code up and hard to make mistakes. (Especially when the third for loop is just accumulate(start,end,0).)
The correct solution involves a divide-and-conquer / bottom up approach. E.g. Divide the problem space in half, and for each half compute the largest subsequence contained within that section, the largest subsequence including the starting number, the largest subsequence including the ending number, and the entire section's subsequence. Armed with this data we can then combine the two halves together without any further evaluation of either one. Obviously the data for each half can be computed by further breaking each half into halves (quarters), each quarter into halves (eighths), and so on until we have trivial singleton cases. It's all quite efficient.
But all that aside, there's a third (somewhat less efficient) option that I wanted to explore. It's similar to the 3-for-loop case, only we add the adjacent numbers to avoid so much work. The idea is that there's no need to add a+b, a+b+c, and a+b+c+d when we can add t1=a+b, t2=t1+c, and t3=t2+d. It's a space/computation tradeoff thing. It works by transforming the sequence:
Index: 0 1 2 3 4
FROM: 1 2 3 4 5
TO: 1 3 6 10 15
Thereby giving us all possible substrings starting at index=0 and ending at indexes=0,1,2,3,4.
Then we iterate over this set subtracting the successive possible "start" points...
FROM: 1 3 6 10 15
TO: - 2 5 9 14
TO: - - 3 7 12
TO: - - - 4 9
TO: - - - - 5
Thereby giving us the values (sums) of all possible subsequences.
We can find the maximum value of each iteration via max_element().
The first step is most easily accomplished via partial_sum().
The remaining steps via a for loop and transform(data+i,data+size,data+i,bind2nd(minus<TYPE>(),data[i-1])).
Clearly O(N^2). But still interesting and fun...
Partial sums are often useful in parallel algorithms. Consider the code
for (int i=0; N>i; ++i) {
sum += x[i];
do_something(sum);
}
If you want to parallelise this code, you need to know the partial sums. I am using GNUs parallel version of partial_sum for something very similar.
I often use partial sum not to sum but to calculate the current value in the sequence depending on the previous.
For example, if you integrate a function. Each new step is a previous step, vt += dvdt or vt = integrate_step(dvdt, t_prev, t_prev+dt);.
In nonparametric Bayesian methods there is a Metropolis-Hastings step (per observation) that determines to sample a new or an existing cluster. If an existing cluster has to be sampled this needs to be done with different weights. These weighted likelihoods are simulated in the following example code.
#include <random>
#include <iostream>
#include <algorithm>
int main() {
std::default_random_engine generator(std::random_device{}());
std::uniform_real_distribution<double> distribution(0.0,1.0);
int K = 8;
std::vector<double> weighted_likelihood(K);
for (int i = 0; i < K; ++i) {
weighted_likelihood[i] = i*10;
}
std::cout << "Weighted likelihood: ";
for (auto i: weighted_likelihood) std::cout << i << ' ';
std::cout << std::endl;
std::vector<double> cumsum_likelihood(K);
std::partial_sum(weighted_likelihood.begin(), weighted_likelihood.end(), cumsum_likelihood.begin());
std::cout << "Cumulative sum of weighted likelihood: ";
for (auto i: cumsum_likelihood) std::cout << i << ' ';
std::cout << std::endl;
std::vector<int> frequency(K);
int N = 280000;
for (int i = 0; i < N; ++i) {
double pick = distribution(generator) * cumsum_likelihood.back();
auto lower = std::lower_bound(cumsum_likelihood.begin(), cumsum_likelihood.end(), pick);
int index = std::distance(cumsum_likelihood.begin(), lower);
frequency[index]++;
}
std::cout << "Frequencies: ";
for (auto i: frequency) std::cout << i << ' ';
std::cout << std::endl;
}
Note that this is not different from the answer by https://stackoverflow.com/users/13005/steve-jessop. It's added to give a bit more context about a particular situation (nonparametric Bayesian mehods, e.g. the algorithms by Neal using the Dirichlet process as prior) and the actual code which uses partial_sum in combination with lower_bound.

How to get 2 random (different) elements from a c++ vector

I would like to get 2 random different elements from an std::vector. How can I do this so that:
It is fast (it is done thousands of times in my algorithm)
It is elegant
The elements selection is really uniformly distributed
For elegance and simplicty:
void Choose (const int size, int &first, int &second)
{
// pick a random element
first = rand () * size / MAX_RAND;
// pick a random element from what's left (there is one fewer to choose from)...
second = rand () * (size - 1) / MAX_RAND;
// ...and adjust second choice to take into account the first choice
if (second >= first)
{
++second;
}
}
using first and second to index the vector.
For uniformness, this is very tricky since as size approaches RAND_MAX there will be a bias towards the lower values and if size exceeds RAND_MAX then there will be elements that are never chosen. One solution to overcome this is to use a binary search:
int GetRand (int size)
{
int lower = 0, upper = size;
do
{
int mid = (lower + upper) / 2;
if (rand () > RAND_MAX / 2) // not a great test, perhaps use parity of rand ()?
{
lower = mid;
}
else
{
upper = mid;
}
} while (upper != lower); // this is just to show the idea,
// need to cope with lower == mid and lower != upper
// and all the other edge conditions
return lower;
}
What you need is to generate M uniformly distributed random numbers from [0, N) range, but there is one caveat here.
One needs to note that your statement of the problem is ambiguous. What is meant by the uniformly distributed selection? One thing is to say that each index has to be selected with equal probability (of M/N, of course). Another thing is to say that each two-index combination has to be selected with equal probability. These two are not the same. Which one did you have in mind?
If M is considerably smaller than N, the classic algorithm for selecting M numbers out of [0, N) range is Bob Floyd algorithm that can be found in Bentley's "Programming Peals" book. It looks as follows (a sketch)
for (int j = N - M; i < N; ++j) {
int rand = random(0, j); // generate a random integer in range [0, j]
if (`rand` has not been generated before)
output rand;
else
output j;
}
In order to implement the check of whether rand has already been generated or not for relatively high M some kind of implementation of a set is necessary, but in your case M=2 it is straightforward and easy.
Note that this algorithm distributes the sets of M numbers uniformly. Also, this algorithm requires exactly M iterations (attempts) to generate M random numbers, i.e. it doesn't follow that flawed "trial-and-error" approach often used in various ad-hoc algorithms intended to solve the same problem.
Adapting the above to your specific situation, the correct algorithm will look as follows
first = random(0, N - 2);
second = random(0, N - 1);
if (second == first)
second = N - 1;
(I leave out the internal details of random(a, b) as an implementation detail).
It might not be immediately obvious why the above works correctly and produces a truly uniform distribution, but it really does :)
How about using a std::queue and doing std::random_shuffle on them. Then just pop til your hearts content?
Not elegant, but extreamly simple: just draw a random number in [0, vector.size()[ and check it's not twice the same.
Simplicity is also in some way elegance ;)
What do you call fast ? I guess this can be done thousands of times within a millisecond.
Whenever need something random, you are going to have various questions about the random number properties regarding uniformity, distribution and so on.
Assuming you've found a suitable source of randomness for your application, then the simplest way to generate pairs of uncorrelated entries is just to pick two random indexes and test them to ensure they aren't equal.
Given a vector of N+1 entries, another option is to generate an index i in the range 0..N. element[i] is choice one. Swap elements i and N. Generate an index j in the range 0..(N-1). element[j] is your second choice. This slowly shuffles your vector which may be problematical, but it can be avoided by using a second vector which holds indexes into the first, and shuffling that. This method trades a swap for the index comparison and tends to be more efficient for small vectors (a dozen or fewer elements, typically) as it avoids having to do multiple comparisons as the number of collisions increase.
You might wanna look into the gnu scientific library. There are some pretty nice random number generators in there that are guaranteed to be random down to the bit level.

Fast generation of random set, Monte Carlo Simulation

I have a set of numbers ~100, I wish to perform MC simulation on this set, the basic idea is I fully randomize the set, do some comparison/checks on the first ~20 values, store the result and repeat.
Now the actual comparison/check algorithm is extremely fast it actually completes in about 50 CPU cycles. With this in mind, and in order to optimize these simulations I need to generate the random sets as fast as possible.
Currently I'm using a Multiply With Carry algorithm by George Marsaglia which provides me with a random integer in 17 CPU cycles, quite fast. However, using the Fisher-Yates shuffling algorithm I have to generate 100 random integers, ~1700 CPU cycles. This overshadows my comparison time by a long ways.
So my question is are there other well known/robust techniques for doing this type of MC simulation, where I can avoid the long random set generation time?
I thought about just randomly choosing 20 values from the set, but I would then have to do collision checks to ensure that 20 unique entries were chosen.
Update:
Thanks for the responses. I have another question with regards to a method I just came up with after my post. The question is, will this provide a robust truly (assuming the RNG is good) random output. Basically my method is to set up an array of integer values the same length as my input array, set every value to zero. Now I begin randomly choosing 20 values from the input set like so:
int pcfast[100];
memset(pcfast,0,sizeof(int)*100);
int nchosen = 0;
while (nchosen<20)
{
int k = rand(100); //[0,100]
if ( pcfast[k] == 0 )
{
pcfast[k] = 1;
r[nchosen++] = s[k]; // r is the length 20 output, s the input set.
}
}
Basically what I mentioned above, choosing 20 values at random, except it seems like a somewhat optimized way of ensuring no collisions. Will this provide good random output? Its quite fast.
If you only use the first 20 values in the randomised array, then you only need to do 20 steps of the Fisher-Yates algorithm (Knuth's version). Then 20 values have been randomised (actually at the end of the array rather than at the beginning, in the usual formulation), in the sense that the remaining 80 steps of the algorithm are guaranteed not to move them. The other 80 positions aren't fully shuffled, but who cares?
C++ code (iterators should be random-access):
using std::swap;
template <typename Iterator, typename Rand> // you didn't specify the type
void partial_shuffle(Iterator first, Iterator middle, Iterator last, Rand rnd) {
size_t n = last - first;
while (first != middle) {
size_t k = rnd(n); // random integer from 0 to n-1
swap(*(first+k),*first);
--n;
++first;
}
}
On return, the values from first through to middle-1 are shuffled. Use it like this:
int arr[100];
for (int i = 0; i < 100; ++i) arr[i] = i;
while (need_more_samples()) {
partial_shuffle(arr, arr+20, arr+100, my_prng);
process_sample(arr, arr+20);
}
The Ross simulation book suggests something like the following:
double return[10];
for(int i=0, n=100; i < 10; i++) {
int x = rand(n); //pseudocode - generate an integer on [0,n]
return[i] = arr[x];
arr[x] = arr[n];
n--;
}