C++ Fast Percentile Calculation

I'm trying to write a percentile function that takes 2 vectors as input and 1 vector as output. One of the input vectors (Distr) is a distribution of random numbers. The other input vector (Tests) holds the values whose percentiles I want to compute against Distr. The output is a vector (the same size as Tests) that returns the percentile for each value in Tests.
The following is an example of what I want:
Input Distr = {3, 5, 8, 12}
Input Tests = {4, 9}
Output Percentile = {0.375, 0.8125}
Following is my implementation in C++:
vector<double> Percentile(vector<double> Distr, vector<double> Tests)
{
    double prevValue, nextValue;
    vector<double> result;
    unsigned distrSize = Distr.size();
    std::sort(Distr.begin(), Distr.end());

    for (vector<double>::iterator test = Tests.begin(); test != Tests.end(); test++)
    {
        if (*test <= Distr.front())
        {
            result.push_back((double) 1 / distrSize); // min percentile returned (not important)
        }
        else if (Distr.back() <= *test)
        {
            result.push_back(1); // max percentile returned (not important)
        }
        else
        {
            prevValue = Distr[0];
            for (unsigned sortedDistrIdx = 1; sortedDistrIdx < distrSize; sortedDistrIdx++)
            {
                nextValue = Distr[sortedDistrIdx];
                if (nextValue <= *test)
                {
                    prevValue = nextValue;
                }
                else
                {
                    // linear interpolation
                    result.push_back(((*test - prevValue) / (nextValue - prevValue) + sortedDistrIdx) / distrSize);
                    break;
                }
            }
        }
    }
    return result;
}
The size of both Distr and Tests can be from 2,000 to 30,000.
Are there any existing libraries that can calculate percentiles as shown above (or similar)? If not, how can I make the above code faster?

I would do something like
vector<double> Percentile(vector<double> Distr, vector<double> Tests)
{
    double prevValue, nextValue;
    vector<double> result;
    unsigned distrSize = Distr.size();
    std::sort(Distr.begin(), Distr.end());

    for (vector<double>::iterator test = Tests.begin(); test != Tests.end(); test++)
    {
        if (*test <= Distr.front())
        {
            result.push_back((double) 1 / distrSize); // min percentile returned (not important)
        }
        else if (Distr.back() <= *test)
        {
            result.push_back(1); // max percentile returned (not important)
        }
        else
        {
            // first element not less than *test; its predecessor is the lower neighbour
            auto it = std::lower_bound(Distr.begin(), Distr.end(), *test);
            prevValue = *(it - 1);
            nextValue = *it;
            // linear interpolation
            result.push_back(((*test - prevValue) / (nextValue - prevValue) + (it - Distr.begin())) / distrSize);
        }
    }
    return result;
}
Note that instead of making a linear search on Distr for each test, I leverage the fact that Distr is sorted and make a binary search instead (using lower_bound).

There is an essentially linear algorithm for your problem (linear after an N log N sort of each vector). You need to sort both vectors, and then walk through each with an iterator (itDistr, itTest). There are three possibilities:
1.
*itDistr < *itTest
Here, you have nothing to do except increment itDistr.
2.
*itDistr >= *itTest
This is the case when you have found a test value: *itTest is an element of the interval [ *(itDistr-1), *itDistr ). So you have to do the (linear) interpolation you have used, and then increment itTest.
The third possibility is when either of them reaches the end of its container vector. You also have to define what happens at the beginning and at the end; it depends on how you define the distribution from your series of numbers.
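A minimal sketch of that two-iterator walk (my code, not from the answer): both vectors are sorted up front, the interpolation matches the question's, and the results come back in the order of the sorted Tests, so you would remap them if the original ordering matters. An index into Distr plays the role of itDistr.
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: sort both inputs once, then advance a single cursor through Distr
// while walking the sorted Tests, interpolating exactly as in the question.
std::vector<double> PercentileSorted(std::vector<double> Distr, std::vector<double> Tests)
{
    std::sort(Distr.begin(), Distr.end());
    std::sort(Tests.begin(), Tests.end());

    std::vector<double> result;
    result.reserve(Tests.size());

    const double n = static_cast<double>(Distr.size());
    std::size_t d = 1;                      // Distr[d-1] is the "prev" value

    for (double t : Tests)
    {
        if (t <= Distr.front())  { result.push_back(1.0 / n); continue; }
        if (Distr.back() <= t)   { result.push_back(1.0);     continue; }

        while (d < Distr.size() && Distr[d] <= t) ++d;   // advance the Distr cursor

        const double prev = Distr[d - 1];
        const double next = Distr[d];
        result.push_back(((t - prev) / (next - prev) + d) / n);
    }
    return result;
}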
Are there any existing libraries that can calculate percentile as shown above (or similar)?
Probably, but it is easy to implement it, and you can have fine control over the interpolation technique.

The linear search of Distr for each element of Tests would be the major amount of time if both of those are large.
When Distr is much larger, it is much faster to do a binary search instead of linear. There is a binary search algorithm available in std. You don't need to write one.
When Tests is nearly as big as Distr, or bigger, it is faster to do an index sort of Tests and then sequence through the two sorted lists together storing the results, then output the stored results in a next pass.
Edit: I see the answer by Csaba Balint gives a little more detail on what I meant by "sequence through the two sorted lists together".
Edit: The basic methods being discussed are:
1) Sort both lists and then process linearly together, time NlogN+MlogM
2) Sort just one list (of size M) and binary search each element of the other, time (N+M)logM
3) Sort just the other list and partition; I haven't figured out the exact time, but when N and M are similar it has to be larger than either method 1 or 2, and when N is sufficiently tiny it has to be smaller than methods 1 or 2.

This answer is relevant to the case that input is initially random (not sorted) and test.size() smaller than input.size(), which is the most common situation.
Suppose there is only one test value. Then you only have to partition the input with respect to this value and obtain the upper (lower) bound of the lower (upper) partition to compute the respective percentile. This is much faster than a full sort of the input (which quicksort implements as a recursion of partitions).
If test.size() > 1, then you first sort test (ideally test is already sorted and you can skip this step) and subsequently proceed with the test elements in increasing order, each time partitioning only the upper part left over from the previous partition. Since we also keep track of the lower bound of the upper partition (as well as the upper bound of the lower partition), we can detect when no input data lie between consecutive test elements and avoid partitioning.
This algorithm should be near-optimal, since no unnecessary information is generated (as it would be with a full sort of input).
If subsequent partitioning splits the input roughly in half, the algorithm would be optimal. This could be approximated by proceeding not in increasing order of test, but by subsequent halving of test, i.e. starting with the median test element, then the first & third quartile, etc..
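A rough sketch of that successive-partition idea (my code, not the answerer's): test values are assumed to be already sorted, and the interpolation between neighbours is left out to keep the sketch short, so each result is just the fraction of the input strictly below the test value.
#include <algorithm>
#include <vector>

std::vector<double> PercentileByPartition(std::vector<double> input, const std::vector<double>& tests)
{
    std::vector<double> result;
    result.reserve(tests.size());

    auto partStart = input.begin();                  // start of the not-yet-partitioned tail
    const double n = static_cast<double>(input.size());

    for (double t : tests)                           // tests must be in increasing order
    {
        // everything before partStart is already known to be below the previous
        // (smaller) test value, hence below t as well; only the tail needs work
        partStart = std::partition(partStart, input.end(),
                                   [t](double x) { return x < t; });
        result.push_back(static_cast<double>(partStart - input.begin()) / n);
    }
    return result;
}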

Optimizing this 'statistical coincidence' finding algorithm

Goal
The code below is designed to take in a vector<vector<float> > of random numbers from a Gaussian distribution, and perform the following:
Iterate simultaneously through all n columns of the vector until you encounter the first value exceeding some threshold.
Continue iterating until either a) you encounter a second value exceeding that threshold such that it comes from a different column than the first found value, or b) you exceed some maximum number of iterations.
In the case of a), continue iterating until either c) you find a third value exceeding the threshold such that it comes from a different column than both the first and the second found values, or d) you exceed some maximum number of iterations from the first found value. In the case of b), start over again, except this time start iterating at one row after the first found value.
In the case of c), add one to a counter, and jump forward some x rows. In the case of d), start over, except this time start iterating at one row after the first found value.
How I accomplish this:
In my opinion, the most challenging part is making sure all three values come from unique columns. To tackle this, I used std::set. I iterate through each row of the vector<vector<float> >, then iterate through each column of that row. I check each column for a value exceeding the threshold, and store its column number in a std::set.
I continue iterating. If I reach max_iterations, I jump back to one after the first-found value, empty the set, and reset the counter. If the std::set has size 3, I add one to the counter.
My issue:
This code will need to run on multidimensional vectors of sizes on the order of tens of columns and hundreds of thousands to millions of rows. As of now, that's excruciatingly slow. I'd like to improve performance significantly, if possible.
My code:
void findRate(float thresholdVolts){
    set<size_t> cache;
    vector<size_t> index;
    size_t count = 0, found = 0;
    for(auto rowItr = waveform.begin(); rowItr != waveform.end(); ++rowItr){
        auto &row = *rowItr;
        for(auto colnItr = row.begin(); colnItr != row.end(); ++colnItr){
            auto &cell = *colnItr;
            if(abs(cell/rmsVoltage) >= (thresholdVolts/rmsVoltage)){
                cache.insert(std::distance(row.begin(), colnItr));
                index.push_back(std::distance(row.begin(), colnItr));
            }
        }
        if(cache.size() == 0) count == 0;
        if(cache.size() == 3){
            ++found;
            cache.clear();
            if(std::distance(rowItr, output.end()) > ((4000 - count) + 4E+6)){
                std::advance(rowItr, ((4000 - count) + 4E+6));
            }
        }
    }
}
One thing you could do right away is in your inner loop. I understand that rmsVoltage is an external variable that is constant during execution of the function.
for(auto colnItr = row.begin(); colnItr != row.end(); ++colnItr){
    auto &cell = *colnItr;
    // you can remove 2 divisions here. Divisions are the slowest
    // arithmetic instructions on any cpu
    //
    // this:
    // if(abs(cell/rmsVoltage) >= (thresholdVolts/rmsVoltage)){
    //
    // becomes this
    if (abs(cell) >= thresholdVolts) {
        cache.insert(std::distance(row.begin(), colnItr));
        index.push_back(std::distance(row.begin(), colnItr));
    }
}
And a bit below: why are you adding a floating-point constant to a size_t?
This may cause unnecessary conversions of size_t to double and then back to size_t; some compilers may handle this, but definitely not all.
These are relatively costly operations.
// this:
// if(std::distance(rowItr, output.end()) > ((4000 - count) + 4E+6)){
//     std::advance(rowItr, ((4000 - count) + 4E+6));
// }
if (std::distance(rowItr, output.end()) > (4'004'000 - count))
    std::advance(rowItr, 4'004'000 - count);
Also, after observing the memory needs of your function, you should preallocate some reasonable space for the container index using vector<>::reserve(). (Note that std::set has no reserve(); if cache becomes a bottleneck, you could consider std::unordered_set, which does.)
Did you give us the entire algorithm? The contents of container index are not used anywhere.
Please let me know how much time you've gained with these changes.

efficiently mask-out exactly 30% of array with 1M entries

My question's header is similar to this link; however, that one wasn't answered to my expectations.
I have an array of integers (1 000 000 entries), and need to mask exactly 30% of elements.
My approach is to loop over elements and roll a dice for each one. Doing it in a non-interrupted manner is good for cache coherency.
As soon as I notice that exactly 300,000 elements have been masked, I need to stop. However, I might reach the end of the array with only 200,000 elements masked, forcing me to loop a second time, maybe even a third, and so on.
What's the most efficient way to ensure I won't have to loop a second time, while not being biased towards picking some elements?
Edit:
I need to preserve the order of elements. For instance, I might have:
[12, 14, 1, 24, 5, 8]
Masking away 30% might give me:
[0, 14, 1, 24, 0, 8]
The result of masking must be the original array, with some elements set to zero.
Just do a Fisher-Yates shuffle but stop after only 300,000 iterations. The last 300,000 elements will be the randomly chosen ones.
std::size_t size = 1000000;
for(std::size_t i = 0; i < 300000; ++i)
{
    std::size_t r = std::rand() % size;
    std::swap(array[r], array[size-1]);
    --size;
}
I'm using std::rand for brevity. Obviously you want to use something better.
The other way is this:
for(std::size_t i = 0; i < 300000;)
{
    std::size_t r = rand() % 1000000;
    if(array[r] != 0)
    {
        array[r] = 0;
        ++i;
    }
}
This has no bias and does not reorder elements, but it is inferior to Fisher-Yates, especially for high percentages.
When I see a massive list, my mind always goes first to divide-and-conquer.
I won't be writing out a fully-fleshed algorithm here, just a skeleton. You seem like you have enough of a clue to take decent idea and run with it. I think I only need to point you in the right direction. With that said...
We'd need an RNG that can return a suitably-distributed value for how many masked values could potentially be below a given cut point in the list. I'll use the halfway point of the list for said cut. Some statistician can probably set you up with the right RNG function. (Anyone?) I don't want to assume it's just uniformly random [0..mask_count), but it might be.
Given that, you might do something like this:
// the magic RNG your stats homework will provide
int random_split_sub_count_lo( int count, int sub_count, int split_point );
void mask_random_sublist( int *list, int list_count, int sub_count )
{
    if (list_count > SOME_SMALL_THRESHOLD)
    {
        int list_count_lo = list_count / 2; // arbitrary
        int list_count_hi = list_count - list_count_lo;
        int sub_count_lo  = random_split_sub_count_lo( list_count, sub_count, list_count_lo );
        int sub_count_hi  = sub_count - sub_count_lo;
        mask_random_sublist( list,                 list_count_lo, sub_count_lo );
        mask_random_sublist( list + list_count_lo, list_count_hi, sub_count_hi );
    }
    else
    {
        // insert here some simple/obvious/naive implementation that
        // would be ludicrous to use on a massive list due to complexity,
        // but which works great on very small lists. I'm assuming you
        // can do this part yourself.
    }
}
Assuming you can find someone more informed on statistical distributions than I to provide you with a lead on the randomizer you need to split the sublist count, this should give you O(n) performance, with 'n' being the number of masked entries. Also, since the recursion is set up to traverse the actual physical array in constantly-ascending-index order, cache usage should be as optimal as it's gonna get.
Caveat: There may be minor distribution issues due to the discrete nature of the list versus the 30% fraction as you recurse down and down to smaller list sizes. In practice, I suspect this may not matter much, but whatever person this solution is meant for may not be satisfied that the random distribution is truly uniform when viewed under the microscope. YMMV, I guess.
Here's one suggestion. One million bits is only 128K which is not an onerous amount.
So create a bit array with all items initialised to zero. Then randomly select 300,000 of them (accounting for duplicates, of course) and mark those bits as one.
Then you can run through the bit array and, any that are set to one (or zero, if your idea of masking means you want to process the other 700,000), do whatever action you wish to the corresponding entry in the original array.
If you want to ensure there's no possibility of duplicates when randomly selecting them, just trade off space for time by using a Fisher-Yates shuffle.
Construct a collection of all the indices and, for each of the 700,000 you want removed (or 300,000 if, as mentioned, masking means you want to process the other ones):
pick one at random from the remaining set.
copy the final element over the one selected.
reduce the set size.
This will leave you with a random subset of indices that you can use to process the integers in the main array.
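A sketch combining the two suggestions above (a bit mask plus a partial Fisher-Yates over an index vector); the function name, the use of std::mt19937, and the parameter names are mine, with K = 300,000 and N = 1,000,000 in the question's terms.
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a mask with exactly K of N bits set, chosen uniformly without duplicates.
std::vector<bool> MakeMask(std::size_t N, std::size_t K, std::mt19937& rng)   // requires K <= N
{
    std::vector<std::size_t> idx(N);
    std::iota(idx.begin(), idx.end(), 0);

    std::vector<bool> mask(N, false);
    for (std::size_t i = 0; i < K; ++i)
    {
        std::uniform_int_distribution<std::size_t> pick(i, N - 1);
        std::swap(idx[i], idx[pick(rng)]);   // classic partial Fisher-Yates step
        mask[idx[i]] = true;                 // mark the i-th chosen index
    }
    return mask;
}
You would then walk the original array once, zeroing element j wherever mask[j] is set, which preserves the element order as the question requires.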
You want reservoir sampling. Sample code courtesy of Wikipedia:
(*
S has items to sample, R will contain the result
*)
ReservoirSample(S[1..n], R[1..k])
  // fill the reservoir array
  for i = 1 to k
      R[i] := S[i]
  // replace elements with gradually decreasing probability
  for i = k+1 to n
      j := random(1, i) // important: inclusive range
      if j <= k
          R[j] := S[i]
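Translated into C++ (my rendering of the pseudocode above, sampling k indices out of 0..n-1 so the original array order can be preserved when masking):
#include <cstddef>
#include <random>
#include <vector>

// Algorithm R: each index ends up in the reservoir with probability k/n.
std::vector<std::size_t> SampleIndices(std::size_t n, std::size_t k, std::mt19937& rng)   // requires k <= n
{
    std::vector<std::size_t> r(k);
    for (std::size_t i = 0; i < k; ++i)          // fill the reservoir
        r[i] = i;

    for (std::size_t i = k; i < n; ++i)          // replace with gradually decreasing probability
    {
        std::uniform_int_distribution<std::size_t> dist(0, i);   // important: inclusive range
        std::size_t j = dist(rng);
        if (j < k)
            r[j] = i;
    }
    return r;
}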

Pick a matrix cell according to its probability

I have a 2D matrix of positive real values, stored as follow:
vector<vector<double>> matrix;
Each cell can have a value equal to or greater than 0, and this value represents the likelihood of the cell being chosen. In particular, for example, a cell with a value equal to 3 is three times as likely to be chosen as a cell with value 1.
I need to select N cells of the matrix (0 <= N <= total number of cells) randomly, but according to their probability of being selected.
How can I do that?
The algorithm should be as fast as possible.
I describe two methods, A and B.
A works in time approximately N * number of cells, and uses space O(log number of cells). It is good when N is small.
B works in time approximately (number of cells + N) * O(log number of cells), and uses space O(number of cells). So, it is good when N is large (or even, 'medium') but uses a lot more memory, in practice it might be slower in some regimes for that reason.
Method A:
The first thing you need to do is normalize the entries. (It's not clear to me whether you assume they are normalized or not.) That means summing all the entries and dividing each one by the sum. (This part is potentially slow, so it's better if you can assume or require that it has already happened.)
Then you sample like this:
Choose a random [i,j] entry of the matrix (by choosing i,j each uniformly randomly from the range of integers 0 to n-1).
Choose a uniformly random real number p in the range [0, 1].
Check if matrix[i][j] > p. If so, return the pair [i][j]. If not, go back to step 1.
Why does this work? The probability that we end at step 3 with any particular output is the probability that [i][j] was selected (the same for each entry) times the probability that the number p was small enough, which is proportional to the value matrix[i][j]; so the sampling chooses each entry with the correct proportions. It's also possible that at step 3 we go back to the start -- does that bias things? Basically, no. The reason is, suppose we arbitrarily choose a number k and then consider the distribution of the algorithm conditioned on stopping exactly after k rounds. No matter what value k we choose, that conditional distribution has to be exactly right by the above argument: once we eliminate the case that p is too small, the other possibilities all keep their correct proportions. Since the distribution is perfect for each value of k that we might condition on, and the overall distribution (not conditioned on k) is an average of the distributions for each value of k, the overall distribution is perfect also.
If you want to analyze the number of rounds typically needed in a rigorous way, you can do it by analyzing the probability that we actually stop at step 3 in any particular round. Since the rounds are independent, this is the same for every round, and statistically it means the number of rounds is geometrically distributed. That means it is concentrated around its mean, and we can determine the mean by knowing that probability.
The probability that we stop at step 3 can be determined by considering the conditional probability that we stop at step 3, given that we chose any particular entry [i][j]. By the formulas for conditional expectation, you get that
Pr[ stop at step 3 ] = sum_{i,j} ( 1/(n^2) * Matrix[i,j] )
Since we assumed the matrix is normalized, this sum reduces to just 1/n^2. So, the expected number of rounds is about n^2 (that is, n^2 up to a constant factor) no matter what the entries in the matrix are. You can't hope to do a lot better than that I think -- that's about the same amount of time it takes to just read all the entries of the matrix, and it's hard to sample from a distribution that you cannot even read all of.
Note: What I described is a way to correctly sample a single element -- to get N elements from one matrix, you can just repeat it N times.
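As a concrete sketch of method A for a single draw (my code; it assumes a square n x n matrix whose entries have already been normalised to sum to 1, so every entry is at most 1 as the acceptance test requires):
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

std::pair<std::size_t, std::size_t>
SampleCell(const std::vector<std::vector<double>>& matrix, std::mt19937& rng)
{
    const std::size_t n = matrix.size();
    std::uniform_int_distribution<std::size_t> cell(0, n - 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);

    for (;;)
    {
        std::size_t i = cell(rng);      // step 1: a uniformly random cell
        std::size_t j = cell(rng);
        double p = unit(rng);           // step 2: a uniformly random real in [0, 1]
        if (matrix[i][j] > p)           // step 3: accept, otherwise retry
            return {i, j};
    }
}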
Method B:
Basically you just want to compute a histogram and sample inversely from it, so that you know you get exactly the right distribution. Computing the histogram is expensive, but once you have it, getting samples is cheap and easy.
In C++ it might look like this:
// Make histogram
typedef unsigned int uint;
typedef std::pair<uint, uint> upair;
typedef std::map<double, upair> histogram_type;

histogram_type histogram;
double cumulative = 0.0;
for (uint i = 0; i < Matrix.size(); ++i) {
    for (uint j = 0; j < Matrix[i].size(); ++j) {
        cumulative += Matrix[i][j];
        histogram[cumulative] = std::make_pair(i, j);
    }
}

std::vector<upair> result;
for (uint k = 0; k < N; ++k) {
    // Do a sample (this should never repeat... if it does not find a lower bound
    // you could also assert false quite reasonably, since it means something is
    // wrong with the rand() implementation)
    while (1) {
        // For best results use std::mt19937 or boost::mt19937 and sample a real in [0, 1] here.
        double p = cumulative * (rand() / static_cast<double>(RAND_MAX));
        histogram_type::iterator it = histogram.lower_bound(p);
        if (it != histogram.end()) {
            result.push_back(it->second);
            break;
        }
    }
}
return result;
Here the time to make the histogram is something like number of cells * O(log number of cells) since inserting into the map takes time O(log n). You need an ordered data structure in order to get cheap lookup N * O(log number of cells) later when you do repeated sampling. Possibly you could choose a more specialized data structure to go faster, but I think there's only limited room for improvement.
Edit: As @Bob__ points out in the comments, in method (B) as written there is potentially going to be some error due to floating-point round-off if the matrices are quite large, even using type double, at this line:
cumulative += Matrix[i][j];
The problem is that if cumulative is much larger than Matrix[i][j], beyond what the floating-point precision can handle, then each time this statement is executed you may observe small errors which accumulate into significant inaccuracy.
As he suggests, if that happens, the most straightforward way to fix it is to sort the values Matrix[i][j] first. You could even do this in the general implementation to be safe -- sorting them isn't going to take more time asymptotically than you already spend anyway.

How to figure out "progress" while sorting?

I'm using stable_sort to sort a large vector.
The sorting takes on the order of a few seconds (say, 5-10 seconds), and I would like to display a progress bar to the user showing how much of the sorting is done so far.
But (even if I was to write my own sorting routine) how can I tell how much progress I have made, and how much more there is left to go?
I don't need it to be exact, but I need it to be "reasonable" (i.e. reasonably linear, not faked, and certainly not backtracking).
Standard library sort uses a user-supplied comparison function, so you can insert a comparison counter into it. The total number of comparisons for either quicksort/introsort or mergesort will be very close to log2N * N (where N is the number of elements in the vector). So that's what I'd export to a progress bar: number of comparisons / N*log2N
Since you're using mergesort, the comparison count will be a very precise measure of progress. It might be slightly non-linear if the implementation spends time permuting the vector between comparison runs, but I doubt your users will see the non-linearity (and anyway, we're all used to inaccurate non-linear progress bars :) ).
Quicksort/introsort would show more variance, depending on the nature of the data, but even in that case it's better than nothing, and you could always add a fudge factor on the basis of experience.
A simple counter in your compare class will cost you practically nothing. Personally I wouldn't even bother locking it (the locks would hurt performance); it's unlikely to get into an inconsistent state, and anyway the progress bar won't go start radiating lizards just because it gets an inconsistent progress number.
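For concreteness, here is one shape such a counting comparator might take (my sketch, not code from the answer; the N*log2(N) denominator and how you poll the counter are up to you):
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Comparator that counts its calls; the caller turns the count into an
// approximate progress fraction: comparisons / (N * log2 N).
struct CountingLess
{
    std::uint64_t* counter;     // shared, deliberately unsynchronised (see above)
    bool operator()(double a, double b) const
    {
        ++*counter;
        return a < b;
    }
};

double EstimateProgress(std::uint64_t comparisons, std::size_t n)
{
    double expected = n * std::log2(static_cast<double>(n));
    return std::min(1.0, comparisons / expected);
}

// usage sketch: poll `count` periodically (e.g. from a UI timer) while sorting
// std::uint64_t count = 0;
// std::stable_sort(v.begin(), v.end(), CountingLess{&count});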
Split the vector into several equal sections, the quantity depending upon the granularity of progress reporting you desire. Sort each section separately. Then start merging with std::merge. You can report your progress after sorting each section, and after each merge. You'll need to experiment to determine what share of the progress the sorting of the sections should count for compared to the merging.
Edit:
I did some experiments of my own and found the merging to be insignificant compared to the sorting, and this is the function I came up with:
template<typename It, typename Comp, typename Reporter>
void ReportSort(It ibegin, It iend, Comp cmp, Reporter report, double range_low = 0.0, double range_high = 1.0)
{
    double range_span = range_high - range_low;
    double range_mid = range_low + range_span / 2.0;
    using namespace std;
    auto size = iend - ibegin;
    if (size < 32768) {
        stable_sort(ibegin, iend, cmp);
    } else {
        ReportSort(ibegin, ibegin + size/2, cmp, report, range_low, range_mid);
        report(range_mid);
        ReportSort(ibegin + size/2, iend, cmp, report, range_mid, range_high);
        inplace_merge(ibegin, ibegin + size/2, iend, cmp); // pass cmp so the merge matches the sort order
    }
}

int main()
{
    std::vector<int> v(100000000);
    std::iota(v.begin(), v.end(), 0);
    std::random_shuffle(v.begin(), v.end());
    std::cout << "starting...\n";

    double percent_done = 0.0;
    auto report = [&](double d) {
        if (d - percent_done >= 0.05) {
            percent_done += 0.05;
            std::cout << static_cast<int>(percent_done * 100) << "%\n";
        }
    };
    ReportSort(v.begin(), v.end(), std::less<int>(), report);
}
Stable sort is based on merge sort. If you wrote your own version of merge sort then (ignoring some speed-up tricks) you would see that it consists of log N passes. Each pass starts with 2^k sorted lists and produces 2^(k-1) lists, with the sort finished when it merges two lists into one. So you could use the value of k as an indication of progress.
If you were going to run experiments, you might instrument the comparison object to count the number of comparisons made and try and see if the number of comparisons made is some reasonably predictable multiple of n log n. Then you could keep track of progress by counting the number of comparisons done.
(Note that with the C++ stable sort, you have to hope that it finds enough store to hold a copy of the data. Otherwise the cost goes from N log N to perhaps N (log N)^2 and your predictions will be far too optimistic).
Select a small subset of indices and count inversions among them. You know the maximal possible value, and you know that when the sort is done the count is zero. So you can use this value as a "progressor". You can think of it as a measure of entropy.
Easiest way to do it: sort a small vector and extrapolate the time assuming O(n log n) complexity.
t(n) = C * n * log(n) ⇒ t(n1) / t(n2) = n1/n2 * log(n1)/log(n2)
If sorting 10 elements takes 1 μs, then 100 elements will take 1 μs * 100/10 * log(100)/log(10) = 20 μs.
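A rough sketch of that extrapolation (my code; the sample should be a decent-sized random subset of the real data, or the timing will be meaningless):
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Time a sort of m sample elements, then scale by (n log n) / (m log m).
double EstimateSortSeconds(std::vector<double> sample, std::size_t n)   // needs sample.size() >= 2
{
    const std::size_t m = sample.size();
    auto t0 = std::chrono::steady_clock::now();
    std::sort(sample.begin(), sample.end());
    auto t1 = std::chrono::steady_clock::now();

    double tm = std::chrono::duration<double>(t1 - t0).count();
    return tm * (n * std::log(static_cast<double>(n)))
              / (m * std::log(static_cast<double>(m)));
}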
Quicksort is basically:
1. partition the input using a pivot element
2. sort the smallest part recursively
3. sort the largest part using tail recursion
All the work is done in the partition step. You could do the outer partition directly and then report progress as the smallest part is done.
So there would be an additional step between 2 and 3 above:
Update the progressor.
Here is some code.
template <typename RandomAccessIterator>
void sort_wReporting(RandomAccessIterator first, RandomAccessIterator last)
{
    double done = 0;
    double whole = static_cast<double>(std::distance(first, last));
    typedef typename std::iterator_traits<RandomAccessIterator>::value_type value_type;

    while (first != last && first + 1 != last)
    {
        auto d = std::distance(first, last);
        value_type pivot = *(first + std::rand() % d);
        auto iter = std::partition(first, last,
            [pivot](const value_type& x){ return x < pivot; });
        auto lower = distance(first, iter);
        auto upper = distance(iter, last);
        if (lower < upper)
        {
            std::sort(first, iter);
            done += lower;
            first = iter;
        }
        else
        {
            std::sort(iter, last);
            done += upper;
            last = iter;
        }
        std::cout << done / whole << std::endl;
    }
}
I spent almost a day figuring out how to display progress for shell sort, so I will leave my simple formula here. Given an array of colors, it displays the progress by blending the colors from red to yellow and then to green; when the array is sorted, the last entry of the color array (blue) is shown. For shell sort, the passes through the array take roughly proportional amounts of time, so the progress indication is pretty accurate.
(Code in Dart/Flutter)
List<Color> colors = [
  Color(0xFFFF0000),
  Color(0xFFFF5500),
  Color(0xFFFFAA00),
  Color(0xFFFFFF00),
  Color(0xFFAAFF00),
  Color(0xFF55FF00),
  Color(0xFF00FF00),
  Colors.blue,
];
[...]
style: TextStyle(
    color: colors[(((pass - 1) * (colors.length - 1)) / (log(a.length) / log(2)).floor()).floor()]),
It is basically a cross-multiplication.
a means array. (log(a.length) / log(2)).floor() means rounding down the log2(N), where N means the number of items. I tested this with several combinations of array sizes, array numbers, and sizes for the array of colors, so I think it is good to go.

How to get 2 random (different) elements from a c++ vector

I would like to get 2 random different elements from an std::vector. How can I do this so that:
It is fast (it is done thousands of times in my algorithm)
It is elegant
The elements selection is really uniformly distributed
For elegance and simplicity:
void Choose (const int size, int &first, int &second)
{
    // pick a random element
    first = rand () * size / RAND_MAX;
    // pick a random element from what's left (there is one fewer to choose from)...
    second = rand () * (size - 1) / RAND_MAX;
    // ...and adjust second choice to take into account the first choice
    if (second >= first)
    {
        ++second;
    }
}
using first and second to index the vector.
For uniformity, this is very tricky, since as size approaches RAND_MAX there will be a bias towards the lower values, and if size exceeds RAND_MAX then there will be elements that are never chosen. One solution to overcome this is to use a binary search:
int GetRand (int size)
{
    int lower = 0, upper = size;
    do
    {
        int mid = (lower + upper) / 2;
        if (rand () > RAND_MAX / 2) // not a great test, perhaps use parity of rand ()?
        {
            lower = mid;
        }
        else
        {
            upper = mid;
        }
    } while (upper != lower); // this is just to show the idea,
                              // need to cope with lower == mid and lower != upper
                              // and all the other edge conditions
    return lower;
}
What you need is to generate M uniformly distributed random numbers from [0, N) range, but there is one caveat here.
One needs to note that your statement of the problem is ambiguous. What is meant by the uniformly distributed selection? One thing is to say that each index has to be selected with equal probability (of M/N, of course). Another thing is to say that each two-index combination has to be selected with equal probability. These two are not the same. Which one did you have in mind?
If M is considerably smaller than N, the classic algorithm for selecting M numbers out of the [0, N) range is Bob Floyd's algorithm, which can be found in Bentley's "Programming Pearls" book. It looks as follows (a sketch):
for (int j = N - M; j < N; ++j) {
    int rand = random(0, j); // generate a random integer in range [0, j]
    if (`rand` has not been generated before)
        output rand;
    else
        output j;
}
In order to implement the check of whether rand has already been generated, some kind of set implementation is necessary for relatively high M, but in your case of M = 2 it is straightforward and easy.
Note that this algorithm distributes the sets of M numbers uniformly. Also, this algorithm requires exactly M iterations (attempts) to generate M random numbers, i.e. it doesn't follow that flawed "trial-and-error" approach often used in various ad-hoc algorithms intended to solve the same problem.
Adapting the above to your specific situation, the correct algorithm will look as follows
first = random(0, N - 2);
second = random(0, N - 1);
if (second == first)
    second = N - 1;
(I leave out the internal details of random(a, b) as an implementation detail).
It might not be immediately obvious why the above works correctly and produces a truly uniform distribution, but it really does :)
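For what it's worth, here is how that adaptation might look with <random> instead of rand() (my sketch; random(a, b) is taken to mean an inclusive uniform integer range, and N must be at least 2):
#include <random>
#include <utility>

// Two distinct indices in [0, N), each unordered pair equally likely.
std::pair<int, int> ChooseTwo(int N, std::mt19937& rng)
{
    std::uniform_int_distribution<int> firstDist(0, N - 2);
    std::uniform_int_distribution<int> secondDist(0, N - 1);

    int first  = firstDist(rng);
    int second = secondDist(rng);
    if (second == first)
        second = N - 1;      // remap the collision to the one value `first` can never take
    return {first, second};
}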
How about putting the elements into a std::deque (std::queue itself has no iterators to shuffle) and doing std::random_shuffle on it? Then just pop till your heart's content.
Not elegant, but extremely simple: just draw a random number in [0, vector.size()) and check that the two draws aren't the same.
Simplicity is also, in some way, elegance ;)
What do you call fast ? I guess this can be done thousands of times within a millisecond.
Whenever need something random, you are going to have various questions about the random number properties regarding uniformity, distribution and so on.
Assuming you've found a suitable source of randomness for your application, then the simplest way to generate pairs of uncorrelated entries is just to pick two random indexes and test them to ensure they aren't equal.
Given a vector of N+1 entries, another option is to generate an index i in the range 0..N. element[i] is choice one. Swap elements i and N. Generate an index j in the range 0..(N-1). element[j] is your second choice. This slowly shuffles your vector which may be problematical, but it can be avoided by using a second vector which holds indexes into the first, and shuffling that. This method trades a swap for the index comparison and tends to be more efficient for small vectors (a dozen or fewer elements, typically) as it avoids having to do multiple comparisons as the number of collisions increase.
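A sketch of that index-vector variant (my code; the names are illustrative, and the vectors must hold at least two elements):
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Pick two distinct positions by partially shuffling a reusable index vector,
// leaving the data vector itself untouched.
template <typename T>
std::pair<T, T> PickTwo(const std::vector<T>& data,
                        std::vector<std::size_t>& idx,   // same size as data, reused across calls
                        std::mt19937& rng)
{
    const std::size_t n = idx.size();
    std::uniform_int_distribution<std::size_t> d1(0, n - 1);
    std::swap(idx[d1(rng)], idx[n - 1]);                 // move the first pick out of the way
    T first = data[idx[n - 1]];

    std::uniform_int_distribution<std::size_t> d2(0, n - 2);
    T second = data[idx[d2(rng)]];
    return {first, second};
}

// setup sketch:
// std::vector<std::size_t> idx(data.size());
// std::iota(idx.begin(), idx.end(), 0);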
You might wanna look into the gnu scientific library. There are some pretty nice random number generators in there that are guaranteed to be random down to the bit level.