STL random_sample replace with a decreasing probability - c++

The STL random_sample function use a replace strategy to sample from the given interval. Why we need a decreasing probability, I've seen a similar algorithm without decrease the replacement probability. What is the difference. The code is as:
/*This is an excerpt from STL implementation*/
template <class InputIterator, class RandomAccessIterator, class Distance>
RandomAccessIterator __random_sample(InputIterator first, InputIterator last,
RandomAccessIterator out,
const Distance n)
{
Distance m = 0;
Distance t = n;
for ( ; first != last && m < n; ++m, ++first) //the strategy is also used in mahout
out[m] = *first;//fill it
while (first != last) {
++t;
Distance M = lrand48() % t;
if (M < n)
out[M] = *first;//replace it with a decreasing probability
++first;
}
return out + m;
}

It is obvious that the first element (indeed, the first n elements) has to be chosen with the probability of 1 (the "fill it" stage). In order to remain in the final sample, this first element needs then to survive m-n possible replacements - that's what brings the probability of it being in the sample down to n/m. On the other hand, the last element only participates in one replacement; it should thus be added to the sample with the probability of n/m from the start.
For simplicity, suppose that you only need to pick one element, out of m, using this replacement strategy (note that you don't know up front what m is: you just iterate until you suddenly reach the end). You take the first element, and keep it with a probability of 1 (for all you know, this is the only element). Then you discover the second element, and you throw a coin and either keep it or discard it with the probability of 1/2. At this point, each of the first two elements has 1/2 probability to be the chosen one.
Now you see the third element, and you keep it with probability of 1/3. Each of the first two elements had 1/2 probability of participating in this encounter, and 2/3 of surviving it - for a total of 1/2 * 2/3 == 1/3 probability of still being around. So, again, each of the first three elements has 1/3 probability of being chosen at this point.
The inductive step of proving that, after tth element is inspected, each of the first t elements has 1/t probability of being chosen is left as an exercise for the reader. So is the extension of the proof to a sample of size n > 1.

Related

How to erase elements more efficiently from a vector or set?

Problem statement:
Input:
First two inputs are integers n and m. n is the number of knights fighting in the tournament (2 <= n <= 100000, 1 <= m <= n-1). m is the number of battles that will take place.
The next line contains n power levels.
The next m lines contain two integers l and r, indicating the range of knight positions to compete in the ith battle.
After each battle, all nights apart from the one with the highest power level will be eliminated.
The range for each battle is given in terms of the new positions of the knights, not the original positions.
Output:
Output m lines, the ith line containing the original positions (indices) of the knights from that battle. Each line is in ascending order.
Sample Input:
8 4
1 0 5 6 2 3 7 4
1 3
2 4
1 3
0 1
Sample Output:
1 2
4 5
3 7
0
Here is a visualisation of this process.
1 2
[(1,0),(0,1),(5,2),(6,3),(2,4),(3,5),(7,6),(4,7)]
-----------------
4 5
[(1,0),(6,3),(2,4),(3,5),(7,6),(4,7)]
-----------------
3 7
[(1,0),(6,3),(7,6),(4,7)]
-----------------
0
[(1,0),(7,6)]
-----------
[(7,6)]
I have solved this problem. My program produces the correct output, however, it is O(n*m) = O(n^2). I believe that if I erase knights more efficiently from the vector, efficiency can be increased. Would it be more efficient to erase elements using a set? I.e. erase contiguous segments rather that individual knights. Is there an alternative way to do this that is more efficient?
#define INPUT1(x) scanf("%d", &x)
#define INPUT2(x, y) scanf("%d%d", &x, &y)
#define OUTPUT1(x) printf("%d\n", x);
int main(int argc, char const *argv[]) {
int n, m;
INPUT2(n, m);
vector< pair<int,int> > knights(n);
for (int i = 0; i < n; i++) {
int power;
INPUT(power);
knights[i] = make_pair(power, i);
}
while(m--) {
int l, r;
INPUT2(l, r);
int max_in_range = knights[l].first;
for (int i = l+1; i <= r; i++) if (knights[i].first > max_in_range) {
max_in_range = knights[i].first;
}
int offset = l;
int range = r-l+1;
while (range--) {
if (knights[offset].first != max_in_range) {
OUTPUT1(knights[offset].second));
knights.erase(knights.begin()+offset);
}
else offset++;
}
printf("\n");
}
}
Well, removing from vector wouldn't be efficient for sure. Removing from set, or unordered set would be more effective (use iterators instead of indexes).
Yet the problem will still remain O(n^2), because you have two nested whiles running n*m times.
--EDIT--
I believe I understand the question now :)
First let's calculate the complexity of your code above. Your worst case would be the case that max range in all battles is 1 (two nights for each battle) and the battles are not ordered with respect to the position. Which means you have m battles (in this case m = n-1 ~= O(n))
The first while loop runs n times
For runs for once every time which makes it n*1 = n in total
The second while loop runs once every time which makes it n again.
Deleting from vector means n-1 shifts that makes it O(n).
Thus with the complexity of the vector total complexity is O(n^2)
First of all, you don't really need the inner for loop. Take the first knight as the max in range, compare the rest in the range one-by-one and remove the defeated ones.
Now, i believe it can be done in O(nlogn) with using std::map. The key to the map is the position and the value is the level of the knight.
Before proceeding, finding and removing an element in map is logarithmic, iterating is constant.
Finally, your code should look like:
while(m--) // n times
strongest = map.find(first_position); // find is log(n) --> n*log(n)
for (opponent = next of strongest; // this will run 1 times, since every range is 1
opponent in range;
opponent = next opponent) // iterating is constant
// removing from map is log(n) --> n * 1 * log(n)
if strongest < opponent
remove strongest, opponent is the new strongest
else
remove opponent, (be careful to remove it after iterating to next)
Ok, now the upper bound would be O(2*nlogn) = O(nlogn). If the ranges increases, that makes the run time of upper loop decrease but increases the number of remove operations. I'm sure the upper bound won't change, let's make it a homework for you to calculate :)
A solution with a treap is pretty straightforward.
For each query, you need to split the treap by implicit key to obtain the subtree that corresponds to the [l, r] range (it takes O(log n) time).
After that, you can iterate over the subtree and find the knight with the maximum strength. After that, you just need to merge the [0, l) and [r + 1, end) parts of the treap with the node that corresponds to this knight.
It's clear that all parts of the solution except for the subtree traversal and printing work in O(log n) time per query. However, each operation reinserts only one knight and erase the rest from the range, so the size of the output (and the sum of sizes of subtrees) is linear in n. So the total time complexity is O(n log n).
I don't think you can solve with standard stl containers because there'no standard container that supports getting an iterator by index quickly and removing arbitrary elements.

Pick a matrix cell according to its probability

I have a 2D matrix of positive real values, stored as follow:
vector<vector<double>> matrix;
Each cell can have a value equal or greater to 0, and this value represents the possibility of the cell to be chosen. In particular, for example, a cell with a value equals to 3 has three times the probability to be chosen compared to a cell with value 1.
I need to select N cells of the matrix (0 <= N <= total number of cells) randomly, but according to their probability to be selected.
How can I do that?
The algorithm should be as fast as possible.
I describe two methods, A and B.
A works in time approximately N * number of cells, and uses space O(log number of cells). It is good when N is small.
B works in time approximately (number of cells + N) * O(log number of cells), and uses space O(number of cells). So, it is good when N is large (or even, 'medium') but uses a lot more memory, in practice it might be slower in some regimes for that reason.
Method A:
The first thing you need to do is normalize the entries. (It's not clear to me if you assume they are normalized or not.) That means, sum all the entries and divide by the sum. (This part is potentially slow, so it's better if you assume or require that it already happened.)
Then you sample like this:
Choose a random [i,j] entry of the matrix (by choosing i,j each uniformly randomly from the range of integers 0 to n-1).
Choose a uniformly random real number p in the range [0, 1].
Check if matrix[i][j] > p. If so, return the pair [i][j]. If not, go back to step 1.
Why does this work? The probability that we end at step 3 with any particular output, is equal to, the probability that [i][j] was selected (this is the same for each entry), times the probality that the number p was small enough. This is proportional to the value matrix[i][j], so the sampling is choosing each entry with the correct proportions. It's also possible that at step 3 we go back to the start -- does that bias things? Basically, no. The reason is, suppose we arbitrarily choose a number k and then consider the distribution of the algorithm, conditioned on stopping exactly after k rounds. Conditioned on the assumption that we stop at the k'th round, no matter what value k we choose, the distribution we sample has to be exactly right by the above argument. Since if we eliminate the case that p is too small, the other possibilities all have their proportions correct. Since the distribution is perfect for each value of k that we might condition on, and the overall distribution (not conditioned on k) is an average of the distributions for each value of k, the overall distribution is perfect also.
If you want to analyze the number of rounds that typically needed in a rigorous way, you can do it by analyzing the probability that we actually stop at step 3 for any particular round. Since the rounds are independent, this is the same for every round, and statistically, it means that the running time of the algorithm is poisson distributed. That means it is tightly concentrated around its mean, and we can determine the mean by knowing that probability.
The probability that we stop at step 3 can be determined by considering the conditional probability that we stop at step 3, given that we chose any particular entry [i][j]. By the formulas for conditional expectation, you get that
Pr[ stop at step 3 ] = sum_{i,j} ( 1/(n^2) * Matrix[i,j] )
Since we assumed the matrix is normalized, this sum reduces to just 1/n^2. So, the expected number of rounds is about n^2 (that is, n^2 up to a constant factor) no matter what the entries in the matrix are. You can't hope to do a lot better than that I think -- that's about the same amount of time it takes to just read all the entries of the matrix, and it's hard to sample from a distribution that you cannot even read all of.
Note: What I described is a way to correctly sample a single element -- to get N elements from one matrix, you can just repeat it N times.
Method B:
Basically you just want to compute a histogram and sample inversely from it, so that you know you get exactly the right distribution. Computing the histogram is expensive, but once you have it, getting samples is cheap and easy.
In C++ it might look like this:
// Make histogram
typedef unsigned int uint;
typedef std::pair<uint, uint> upair;
typedef std::map<double, upair> histogram_type;
histogram_type histogram;
double cumulative = 0.0f;
for (uint i = 0; i < Matrix.size(); ++i) {
for (uint j = 0; j < Matrix[i].size(); ++j) {
cumulative += Matrix[i][j];
histogram[cumulative] = std::make_pair(i,j);
}
}
std::vector<upair> result;
for (uint k = 0; k < N; ++k) {
// Do a sample (this should never repeat... if it does not find a lower bound you could also assert false quite reasonably since it means something is wrong with rand() implementation)
while(1) {
double p = cumulative * rand(); // Or, for best results use std::mt19937 or boost::mt19937 and sample a real in the range [0,1] here.
histogram_type::iterator it = histogram::lower_bound(p);
if (it != histogram.end()) {
result.push_back(it->second);
break;
}
}
}
return result;
Here the time to make the histogram is something like number of cells * O(log number of cells) since inserting into the map takes time O(log n). You need an ordered data structure in order to get cheap lookup N * O(log number of cells) later when you do repeated sampling. Possibly you could choose a more specialized data structure to go faster, but I think there's only limited room for improvement.
Edit: As #Bob__ points out in comments, in method (B) a written there is potentially going to be some error due to floating point round-off if the matrices are quite large, even using type double, at this line:
cumulative += Matrix[i][j];
The problem is that, if cumulative is much larger than Matrix[i][j] beyond what the floating point precision can handle then these each time this statement is executed you may observe significant errors which accumulate to introduce significant inaccuracy.
As he suggests, if that happens, the most straightforward way to fix it is to sort the values Matrix[i][j] first. You could even do this in the general implementation to be safe -- sorting these guys isn't going to take more time asymptotically than you already have anyways.

Something faster than std::nth_element

I'm working on a kd-tree implementation and I'm currently using std::nth_element for partition a vector of elements by their median. However std::nth_element takes 90% of the time of tree construction. Can anyone suggest a more efficient alternative?
Thanks in advance
Do you really need the nth element, or do you need an element "near" the middle?
There are faster ways to get an element "near" the middle. One example goes roughly like:
function rough_middle(container)
divide container into subsequences of length 5
find median of each subsequence of length 5 ~ O(k) * O(n/5)
return rough_middle( { median of each subsequence} ) ~ O(rough_middle(n/5))
The result should be something that is roughly in the middle. A real nth element algorithm might use something like the above, and then clean it up afterwards to find the actual nth element.
At n=5, you get the middle.
At n=25, you get the middle of the short sequence middles. This is going to be greater than all of the lesser of each short sequence, or at least the 9th element and no more than the 16th element, or 36% away from edge.
At n=125, you get the rough middle of each short sequence middle. This is at least the 9th middle, so there are 8*3+2=26 elements less than your rough middle, or 20.8% away from edge.
At n=625, you get the rough middle of each short sequence middle. This is at least the 26th middle, so there are 77 elements less than your rough middle, or 12% away from the edge.
At n=5^k, you get the rough middle of the 5^(k-1) rough middles. If the rough middle of a 5^k sequence is r(k), then r(k+1) = r(k)*3-1 ~ 3^k.
3^k grows slower than 5^k in O-notation.
3^log_5(n)
= e^( ln(3) ln(n)/ln(5) )
= n^(ln(3)/ln(5))
=~ n^0.68
is a very rough estimate of the lower bound of where the rough_middle of a sequence of n elements ends up.
In theory, it may take as many as approx n^0.33 iterations of reductions to reach a single element, which isn't really that good. (the number of bits in n^0.68 is ~0.68 times the number of bits in n. If we shave that much off each rough middle, we need to repeat it very roughly n^0.33 times number of bits in n to consume all the bits -- more, because as we subtract from the n, the next n gets a slightly smaller value subtracted from it).
The way that the nth element solutions I've seen solve this is by doing a partition and repair at each level: instead of recursing into rough_middle, you recurse into middle. The real middle of the medians is then guaranteed to be pretty close to the actual middle of your sequence, and you can "find the real middle" relatively quickly (in O-notation) from this.
Possibly we can optimize this process by doing a more accurate rough_middle iterations when there are more elements, but never forcing it to be the actual middle? The bigger the end n is, the closer to the middle we need the recursive calls to be to the middle for the end result to be reasonably close to the middle.
But in practice, the probability that your sequence is a really bad one that actually takes n^0.33 steps to partition down to nothing might be really low. Sort of like the quicksort problem: median of 3 elements is usually good enough.
A quick stats analysis.
You pick 5 elements at random, and pick the middle one.
The median index of a set of 2m+1 random sample of a uniform distribution follows the beta distribution with parameters of roughly (m+1, m+1), with maybe some scaling factors for non-[0,1] intervals.
The mean of the median is clearly 1/2. The variance is:
(3*3)^2 / ( (3+3)^2 (3+3+1) )
= 81 / (36 * 7)
=~ 0.32
Figuring out the next step is beyond my stats. I'll cheat.
If we imagine that taking the median index element from a bunch of items with mean 0.5 and variance 0.32 is as good as averaging their index...
Let n now be the number of elements in our original set.
Then the sum of the indexes of medians of the short sequences has an average of n times n/5*0.5 = 0.1 * n^2. The variance of the sum of the indexes of the medians of the short sequences is n times n/5*0.32 = 0.064 * n^2.
If we then divide the value by n/5 we get:
So mean of n/2 and variance of 1.6.
Oh, if that was true, that would be awesome. Variance that doesn't grow with the size of n means that as n gets large, the average index of the medians of the short sequences gets ridiculously tightly distributed. I guess it makes some sense. Sadly, we aren't quite doing that -- we want the distribution of the pseudo-median of the medians of the short sequences. Which is almost certainly worse.
Implementation detail. We can with logarithmic number of memory overhead do an in-place rough median. (we might even be able to do it without the memory overhead!)
We maintain a vector of 5 indexes with a "nothing here" placeholder.
Each is a successive layer.
At each element, we advance the bottom index. If it is full, we grab the median, and insert it on the next level up, and clear the bottom layer.
At the end, we complete.
using target = std::pair<size_t,std::array<size_t, 5>>;
bool push( target& t, size_t i ) {
t.second[t.first]=i;
++t.first;
if (t.first==5)
return true;
}
template<class Container>
size_t extract_median( Container const& c, target& t ) {
Assert(t.first != 0);
std::sort( t.data(), t.data()+t.first, [&c](size_t lhs, size_t rhs){
return c[lhs]<c[rhs];
} );
size_t r = t[(t.first+1)/2];
t.first = 0;
return r;
}
template<class Container>
void advance(Container const& c, std::vector<target>& targets, size_t i) {
size_t height = 0;
while(true) {
if (targets.size() <= height)
targets.push_back({});
if (!push(targets[height], i))
return;
i = extract_median(c, targets[height]);
}
}
template<class Container>
size_t collapse(Container const& c, target* b, target* e) {
if (b==e) return -1;
size_t before = collapse(c, b, e-1);
target& last = (*e-1);
if (before!=-1)
push(before, last);
if (last.first == 0)
return -1;
return extract_median(c, last);
}
template<class Container>
size_t rough_median_index( Container const& c ) {
std::vector<target> targets;
for (auto const& x:c) {
advance(c, targets, &x-c.data());
}
return collapse(c, targets.data(), targets.data()+targets.size());
}
which sketches out how it could work on random access containers.
If you have more lookups than insertions into the vector you could consider using a data structure which sorts on insertion -- such as std::set -- and then use std::advance() to get the n'th element in sorted order.

Generating random integers with a difference constraint

I have the following problem:
Generate M uniformly random integers from the range 0-N, where N >> M, and where no pair has a difference less than K. where M >> K.
At the moment the best method I can think of is to maintain a sorted list, then determine the lower bound of the current generated integer and test it with the lower and upper elements, if it's ok to then insert the element in between. This is of complexity O(nlogn).
Would there happen to be a more efficient algorithm?
An example of the problem:
Generate 1000 uniformly random integers between zero and 100million where the difference between any two integers is no less than 1000
A comprehensive way to solve this would be to:
Determine all the combinations of n-choose-m that satisfy the constraint, lets called it set X
Select a uniformly random integer i in the range [0,|X|).
Select the i'th combination from X as the result.
This solution is problematic when the n-choose-m is large, as enumerating and storing all possible combinations will be extremely costly. Hence an efficient online generating solution is sought.
Note: The following is a C++ implementation of the solution provided by pentadecagon
std::vector<int> generate_random(const int n, const int m, const int k)
{
if ((n < m) || (m < k))
return std::vector<int>();
std::random_device source;
std::mt19937 generator(source());
std::uniform_int_distribution<> distribution(0, n - (m - 1) * k);
std::vector<int> result_list;
result_list.reserve(m);
for (int i = 0; i < m; ++i)
{
result_list.push_back(distribution(generator));
}
std::sort(std::begin(result_list),std::end(result_list));
for (int i = 0; i < m; ++i)
{
result_list[i] += (i * k);
}
return result_list;
}
http://ideone.com/KOeR4R
.
EDIT: I adapted the text for the requirement to create ordered sequences, each with the same probability.
Create random numbers a_i for i=0..M-1 without duplicates. Sort them. Then create numbers
b_i=a_i + i*(K-1)
Given the construction, those numbers b_i have the required gaps, because the a_i already have gaps of at least 1. In order to make sure those b values cover exactly the required range [1..N], you must ensure a_i are picked from a range [1..N-(M-1)*(K-1)]. This way you get truly independent numbers. Well, as independent as possible given the required gap. Because of the sorting you get O(M log M) performance again, but this shouldn't be too bad. Sorting is typically very fast. In Python it looks like this:
import random
def random_list( N, M, K ):
s = set()
while len(s) < M:
s.add( random.randint( 1, N-(M-1)*(K-1) ) )
res = sorted( s )
for i in range(M):
res[i] += i * (K-1)
return res
First off: this will be an attempt to show that there's a bijection between the (M+1)- compositions (with the slight modification that we will allow addends to be 0) of the value N - (M-1)*K and the valid solutions to your problem. After that, we only have to pick one of those compositions uniformly at random and apply the bijection.
Bijection:
Let
Then the xi form an M+1-composition (with 0 addends allowed) of the value on the left (notice that the xi do not have to be monotonically increasing!).
From this we get a valid solution
by setting the values mi as follows:
We see that the distance between mi and mi + 1 is at least K, and mM is at most N (compare the choice of the composition we started out with). This means that every (M+1)-composition that fulfills the conditions above defines exactly one valid solution to your problem. (You'll notice that we only use the xM as a way to make the sum turn out right, we don't use it for the construction of the mi.)
To see that this gives a bijection, we need to see that the construction can be reversed; for this purpose, let
be a given solution fulfilling your conditions. To get the composition this is constructed from, define the xi as follows:
Now first, all xi are at least 0, so that's alright. To see that they form a valid composition (again, every xi is allowed to be 0) of the value given above, consider:
The third equality follows since we have this telescoping sum that cancels out almost all mi.
So we've seen that the described construction gives a bijection between the described compositions of N - (M-1)*K and the valid solutions to your problem. All we have to do now is pick one of those compositions uniformly at random and apply the construction to get a solution.
Picking a composition uniformly at random
Each of the described compositions can be uniquely identified in the following way (compare this for illustration): reserve N - (M-1)*K spaces for the unary notation of that value, and another M spaces for M commas. We get an (M+1)- composition of N - (M-1)*K by choosing M of the N - (M-1)*K + M spaces, putting commas there, and filling the rest with |. Then let x0 be the number of | before the first comma, xM+1 the number of | after the last comma, and all other xi the number of | between commas i and i+1. So all we have to do is pick an M-element subset of the integer interval[1; N - (M-1)*K + M] uniformly at random, which we can do for example with the Fisher-Yates shuffle in O(N + M log M) (we need to sort the M delimiters to build the composition) since M*K needs to be in O(N) for any solutions to exist. So if N is bigger than M by at least a logarithmic factor, then this is linear in N.
Note: #DavidEisenstat suggested that there are more space efficient ways of picking the M-element subset of that interval; I'm not aware of any, I'm afraid.
You can get an error-proof algorithm out of this by doing the simple input validation we get from the construction above that N ≥ (M-1) * K and that all three values are at least 1 (or 0, if you define the empty set as a valid solution for that case).
Why not do this:
for (int i = 0; i < M; ++i) {
pick a random number between K and N/M
add this number to (N/M)* i;
Now you have M random numbers, distributed evenly along N, all of which have a difference of at least K. It's in O(n) time. As an added bonus, it's already sorted. :-)
EDIT:
Actually, the "pick a random number" part shouldn't be between K and N/M, but between min(K, [K - (N/M * i - previous value)]). That would ensure that the differences are still at least K, and not exclude values that should not be missed.
Second EDIT:
Well, the first case shouldn't be between K and N/M - it should be between 0 and N/M. Just like you need special casing for when you get close to the N/M*i border, we need special initial casing.
Aside from that, the issue you brought up in your comments was fair representation, and you're right. As my pseudocode is presented, it currently completely misses the excess between N/M*M and N. It's another edge case; simply change the random values of your last range.
Now, in this case, your distribution will be different for the last range. Since you have more numbers, you have slightly less chance for each number than you do for all the other ranges. My understanding is that because you're using ">>", this shouldn't really impact the distribution, i.e. the difference in size in the sample set should be nominal. But if you want to make it more fair, you divide the excess equally among each range. This makes your initial range calculation more complex - you'll have to augment each range based on how much remainder there is divided by M.
There are lots of special cases to look out for, but they're all able to be handled. I kept the pseudocode very basic just to make sure that the general concept came through clearly. If nothing else, it should be a good starting point.
Third and Final EDIT:
For those worried that the distribution has a forced evenness, I still claim that there's nothing saying it can't. The selection is uniformly distributed in each segment. There is a linear way to keep it uneven, but that also has a trade-off: if one value is selected extremely high (which should be unlikely given a very large N), then all the other values are constrained:
int prevValue = 0;
int maxRange;
for (int i = 0; i < M; ++i) {
maxRange = N - (((M - 1) - i) * K) - prevValue;
int nextValue = random(0, maxRange);
prevValue += nextValue;
store previous value;
prevValue += K;
}
This is still linear and random and allows unevenness, but the bigger prevValue gets, the more constrained the other numbers become. Personally, I prefer my second edit answer, but this is an available option that given a large enough N is likely to satisfy all the posted requirements.
Come to think of it, here's one other idea. It requires a lot more data maintenance, but is still O(M) and is probably the most fair distribution:
What you need to do is maintain a vector of your valid data ranges and a vector of probability scales. A valid data range is just the list of high-low values where K is still valid. The idea is you first use the scaled probability to pick a random data range, then you randomly pick a value within that range. You remove the old valid data range and replace it with 0, 1 or 2 new data ranges in the same position, depending on how many are still valid. All of these actions are constant time other than handling the weighted probability, which is O(M), done in a loop M times, so the total should be O(M^2), which should be much better than O(NlogN) because N >> M.
Rather than pseudocode, let me work an example using OP's original example:
0th iteration: valid data ranges are from [0...100Mill], and the weight for this range is 1.0.
1st iteration: Randomly pick one element in the one element vector, then randomly pick one element in that range.
If the element is, e.g. 12345678, then we remove the [0...100Mill] and replace it with [0...12344678] and [12346678...100Mill]
If the element is, e.g. 500, then we remove the [0...100Mill] and replace it with just [1500...100Mill], since [0...500] is no longer a valid range. The only time we will replace it with 0 ranges is in the unlikely event that you have a range with only one number in it and it gets picked. (In that case, you'll have 3 numbers in a row that are exactly K apart from each other.)
The weight for the ranges are their length over the total length, e.g. 12344678/(12344678 + (100Mill - 12346678)) and (100Mill - 12346678)/(12344678 + (100Mill - 12346678))
In the next iterations, you do the same thing: randomly pick a number between 0 and 1 and determine which of the ranges that scale falls into. Then randomly pick a number in that range, and replace your ranges and scales.
By the time it's done, we're no longer acting in O(M), but we're still only dependent on the time of M instead of N. And this actually is both uniform and fair distribution.
Hope one of these ideas works for you!

How to figure out "progress" while sorting?

I'm using stable_sort to sort a large vector.
The sorting takes on the order of a few seconds (say, 5-10 seconds), and I would like to display a progress bar to the user showing how much of the sorting is done so far.
But (even if I was to write my own sorting routine) how can I tell how much progress I have made, and how much more there is left to go?
I don't need it to be exact, but I need it to be "reasonable" (i.e. reasonably linear, not faked, and certainly not backtracking).
Standard library sort uses a user-supplied comparison function, so you can insert a comparison counter into it. The total number of comparisons for either quicksort/introsort or mergesort will be very close to log2N * N (where N is the number of elements in the vector). So that's what I'd export to a progress bar: number of comparisons / N*log2N
Since you're using mergesort, the comparison count will be a very precise measure of progress. It might be slightly non-linear if the implementation spends time permuting the vector between comparison runs, but I doubt your users will see the non-linearity (and anyway, we're all used to inaccurate non-linear progress bars :) ).
Quicksort/introsort would show more variance, depending on the nature of the data, but even in that case it's better than nothing, and you could always add a fudge factor on the basis of experience.
A simple counter in your compare class will cost you practically nothing. Personally I wouldn't even bother locking it (the locks would hurt performance); it's unlikely to get into an inconsistent state, and anyway the progress bar won't go start radiating lizards just because it gets an inconsistent progress number.
Split the vector into several equal sections, the quantity depending upon the granularity of progress reporting you desire. Sort each section seperately. Then start merging with std::merge. You can report your progress after sorting each section, and after each merge. You'll need to experiment to determine how much percentage the sorting of the sections should be counted compared to the mergings.
Edit:
I did some experiments of my own and found the merging to be insignificant compared to the sorting, and this is the function I came up with:
template<typename It, typename Comp, typename Reporter>
void ReportSort(It ibegin, It iend, Comp cmp, Reporter report, double range_low=0.0, double range_high=1.0)
{
double range_span = range_high - range_low;
double range_mid = range_low + range_span/2.0;
using namespace std;
auto size = iend - ibegin;
if (size < 32768) {
stable_sort(ibegin,iend,cmp);
} else {
ReportSort(ibegin,ibegin+size/2,cmp,report,range_low,range_mid);
report(range_mid);
ReportSort(ibegin+size/2,iend,cmp,report,range_mid,range_high);
inplace_merge(ibegin, ibegin + size/2, iend);
}
}
int main()
{
std::vector<int> v(100000000);
std::iota(v.begin(), v.end(), 0);
std::random_shuffle(v.begin(), v.end());
std::cout << "starting...\n";
double percent_done = 0.0;
auto report = [&](double d) {
if (d - percent_done >= 0.05) {
percent_done += 0.05;
std::cout << static_cast<int>(percent_done * 100) << "%\n";
}
};
ReportSort(v.begin(), v.end(), std::less<int>(), report);
}
Stable sort is based on merge sort. If you wrote your own version of merge sort then (ignoring some speed-up tricks) you would see that it consists of log N passes. Each pass starts with 2^k sorted lists and produces 2^(k-1) lists, with the sort finished when it merges two lists into one. So you could use the value of k as an indication of progress.
If you were going to run experiments, you might instrument the comparison object to count the number of comparisons made and try and see if the number of comparisons made is some reasonably predictable multiple of n log n. Then you could keep track of progress by counting the number of comparisons done.
(Note that with the C++ stable sort, you have to hope that it finds enough store to hold a copy of the data. Otherwise the cost goes from N log N to perhaps N (log N)^2 and your predictions will be far too optimistic).
Select a small subset of indices and count inversions. You know its maximal value, and you know when you are done the value is zero. So, you can use this value as a "progressor". You can think of it as a measure of entropy.
Easiest way to do it: sort a small vector and extrapolate the time assuming O(n log n) complexity.
t(n) = C * n * log(n) ⇒ t(n1) / t(n2) = n1/n2 * log(n1)/log(n2)
If sorting 10 elements takes 1 μs, then 100 elements will take 1 μs * 100/10 * log(100)/log(10) = 20 μs.
Quicksort is basically
partition input using a pivot element
sort smallest part recursively
sort largest part using tail recursion
All the work is done in the partition step. You could do the outer partition directly and then report progress as the smallest part is done.
So there would be an additional step between 2 and 3 above.
Update progressor
Here is some code.
template <typename RandomAccessIterator>
void sort_wReporting(RandomAccessIterator first, RandomAccessIterator last)
{
double done = 0;
double whole = static_cast<double>(std::distance(first, last));
typedef typename std::iterator_traits<RandomAccessIterator>::value_type value_type;
while (first != last && first + 1 != last)
{
auto d = std::distance(first, last);
value_type pivot = *(first + std::rand() % d);
auto iter = std::partition(first, last,
[pivot](const value_type& x){ return x < pivot; });
auto lower = distance(first, iter);
auto upper = distance(iter, last);
if (lower < upper)
{
std::sort(first, iter);
done += lower;
first = iter;
}
else
{
std::sort(iter, last);
done += upper;
last = iter;
}
std::cout << done / whole << std::endl;
}
}
I spent almost one day to figure out how to display the progress for shell sort, so I will leave here my simple formula. Given an array of colors, it will display the progress. It is blending the colors from red to yellow and then to green. When it is Sorted, it is the last position of the array that is blue. For shell sort, the iterations each time it passes through the array are quite proportional, so the progress becomes pretty accurate.
(Code in Dart/Flutter)
List<Color> colors = [
Color(0xFFFF0000),
Color(0xFFFF5500),
Color(0xFFFFAA00),
Color(0xFFFFFF00),
Color(0xFFAAFF00),
Color(0xFF55FF00),
Color(0xFF00FF00),
Colors.blue,
];
[...]
style: TextStyle(
color: colors[(((pass - 1) * (colors.length - 1)) / (log(a.length) / log(2)).floor()).floor()]),
It is basically a cross-multiplication.
a means array. (log(a.length) / log(2)).floor() means rounding down the log2(N), where N means the number of items. I tested this with several combinations of array sizes, array numbers, and sizes for the array of colors, so I think it is good to go.