Generate N random numbers within a range with a constant sum - c++

I want to generate N random numbers drawn from a specific distribution (e.g. uniform random) between [a,b] which sum to a constant C. I have tried a couple of solutions I could think of myself, and some proposed on similar threads, but most of them either work only for a limited form of the problem or I can't prove the outcome still follows the desired distribution.
What I have tried:
Generate N random numbers, divide each of them by their sum, and multiply by the desired constant. This seems to work, but the result does not follow the rule that the numbers should be within [a,b].
Generate N-1 random numbers, add 0 and the desired constant C, and sort them. Then calculate the difference between each pair of consecutive numbers; those differences are the result. This again sums to C but has the same problem as the previous method (the range can be bigger than [a,b]).
I also tried generating the numbers one at a time while keeping track of the allowed min and max so that the desired sum and range are maintained, and came up with this code:
#include <functional>
#include <vector>
using std::function;

bool generate(function<int(int,int)> randomGenerator, int min, int max,
              int len, int sum, std::vector<int> &output) {
    /**
     * Not possible to produce such a sequence
     */
    if (min * len > sum)
        return false;
    if (max * len < sum)
        return false;

    int curSum = 0;
    int left = sum - curSum;
    int leftIndexes = len - 1;
    int curMax = left - leftIndexes * min;  // largest value we may pick now
    int curMin = left - leftIndexes * max;  // smallest value we may pick now

    for (int i = 0; i < len; i++) {
        int num = randomGenerator((curMin < min) ? min : curMin,
                                  (curMax > max) ? max : curMax);
        output.push_back(num);
        curSum += num;
        left = sum - curSum;
        leftIndexes--;
        curMax = left - leftIndexes * min;
        curMin = left - leftIndexes * max;
    }
    return true;
}
This seems to work, but the results are sometimes very skewed and I don't think they follow the original distribution (e.g. uniform). For example:
//10 numbers within [1:10] which sum to 50:
generate(uniform,1,10,10,50,output);
//result:
2,7,2,5,2,10,5,8,4,5 => sum=50
//This looks reasonable for uniform, but let's change to
//10 numbers within [1:25] which sum to 50:
generate(uniform,1,25,10,50,output);
//result:
24,12,6,2,1,1,1,1,1,1 => sum= 50
Notice how many ones appear in the output. This might seem reasonable because the range is larger, but the values really don't look uniformly distributed.
I am not even sure whether what I want is achievable; maybe the constraints make the problem unsolvable.

In case you want the sample to follow a uniform distribution, the problem reduces to generating N random numbers with sum 1. This, in turn, is a special case of the Dirichlet distribution, but it can also be computed more easily using the exponential distribution. Here is how:
Take a uniform sample v1 … vN with all vi between 0 and 1.
For all i, 1<=i<=N, define ui := -ln vi (notice that ui > 0).
Normalize the ui as pi := ui/s where s is the sum u1+...+uN.
The p1..pN are uniformly distributed (in the simplex of dim N-1) and their sum is 1.
You can now multiply these pi by the constant C you want and translate them by adding some other constant A, like this:
qi := A + pi*C.
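A minimal C++ sketch of these four steps (the function name and engine choice are mine; the vi = 0 corner case discussed in EDIT 3 below is omitted for brevity):

#include <cmath>
#include <random>
#include <vector>

// Sketch: returns q_i = A + p_i*C, where the p_i are uniform on the
// (N-1)-simplex via the negative-log trick described above.
std::vector<double> simplex_sample(int N, double A, double C, std::mt19937 &eng) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::vector<double> u(N);
    double s = 0.0;
    for (double &ui : u) {
        ui = -std::log(unif(eng));  // u_i := -ln v_i  (v_i = 0 not handled here)
        s += ui;
    }
    std::vector<double> q(N);
    for (int i = 0; i < N; i++)
        q[i] = A + (u[i] / s) * C;  // p_i := u_i/s, then q_i := A + p_i*C
    return q;
}

Calling simplex_sample(N, a, b - a, eng) then gives values in [a,b], per the choice of A and C explained in EDIT 3 below.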
EDIT 3
In order to address some issues raised in the comments, let me add the following:
To ensure that the final random sequence falls in the interval [a,b] choose the constants A and C above as A := a and C := b-a, i.e., take qi = a + pi*(b-a). Since pi is in the range (0,1) all qi will be in the range [a,b].
One cannot take the (negative) logarithm -ln(vi) if vi happens to be 0, because ln() is not defined at 0. The probability of such an event is extremely low. However, in order to ensure that no error is signaled, the generation of v1 ... vN in item 1 above must treat any occurrence of 0 in a special way: consider -ln(0) as +infinity (remember: ln(x) -> -infinity when x -> 0). Thus the sum s = +infinity, which means that pi = 1 and all other pj = 0. Without this convention the sequence (0...1...0) would never be generated (many thanks to @Severin Pappadeux for this interesting remark).
As explained in the 4th comment attached to the question by @Neil Slater, it is logically impossible to fulfill all the requirements of the original framing. Therefore any solution must relax the constraints to a proper subset of the original ones. Other comments by @Behrooz seem to confirm that this would suffice in this case.
EDIT 2
One more issue has been raised in the comments:
Why rescaling a uniform sample does not suffice?
In other words, why should I bother to take negative logarithms?
The reason is that if we just rescale, the resulting sample won't be distributed uniformly across the segment (0,1) (or [a,b] for the final sample).
To visualize this let's think 2D, i.e., let's consider the case N=2. A uniform sample (v1,v2) corresponds to a random point in the square with origin (0,0) and corner (1,1). Now, when we normalize such a point dividing it by the sum s=v1+v2 what we are doing is projecting the point onto the diagonal as shown in the picture (keep in mind that the diagonal is the line x + y = 1):
But given that the green lines, which are closer to the principal diagonal from (0,0) to (1,1), are longer than the orange ones, which are closer to the axes x and y, the projections tend to accumulate around the center of the projection line (in blue), where the scaled sample lives. This shows that a simple scaling won't produce a uniform sample on the depicted diagonal. On the other hand, it can be proven mathematically that the negative logarithms do produce the desired uniformity. So, instead of copy-pasting a mathematical proof, I would invite everyone to implement both algorithms and check that the resulting plots behave as this answer describes.
(Note: here is a blog post on this interesting subject with an application to the Oil & Gas industry)

Let's try to simplify the problem.
By subtracting the lower bound a from each number, we can reduce it to finding N numbers in [0,b-a] such that their sum is C-Na.
Renaming the parameters, we can look for N numbers in [0,m] whose sum is S.
Now the problem is akin to partitioning a segment of length S into N distinct sub-segments whose lengths lie in [0,m].
I think the problem is simply not solvable.
If S=1, N=1000, and m is anything above 0, the only possible integer repartition is one 1 and 999 zeroes, which is nothing like a random spread.
There is a correlation between N, m and S, and even picking random values will not make it disappear.
For the most uniform repartition, the lengths of the sub-segments will follow a Gaussian curve with a mean value of S/N.
If you tweak your random numbers differently, you will just end up with a different bias; in the end you will never have both a uniform repartition over [a,b] and a total of C, unless b happens to equal 2C/N - a (i.e., unless C/N is the midpoint of [a,b]).

For my answer I'll assume that we have a uniform distribution.
Since we have a uniform distribution, every tuple summing to C has the same probability of occurring. For example, for a = 2, b = 4, C = 12, N = 5 we have 15 possible tuples. Of them, 10 start with 2, 4 start with 3, and 1 starts with 4. This gives the idea of selecting a random number from 1 to 15 in order to choose the first element: from 1 to 10 we select 2, from 11 to 14 we select 3, and for 15 we select 4. Then we continue recursively.
#include <time.h>
#include <random>

std::default_random_engine generator(time(0));
int a = 2, b = 4, n = 5, c = 12, numbers[5];

// Calculate how many combinations of n numbers have sum c
int calc_combinations(int n, int c) {
    if (n == 1) return (c >= a) && (c <= b);
    int sum = 0;
    for (int i = a; i <= b; i++) sum += calc_combinations(n - 1, c - i);
    return sum;
}

// Chooses a random array of n elements having sum c
void choose(int n, int c, int *numbers) {
    if (n == 1) { numbers[0] = c; return; }
    int combinations = calc_combinations(n, c);
    std::uniform_int_distribution<int> distribution(0, combinations - 1);
    int s = distribution(generator);
    int sum = 0;
    for (int i = a; i <= b; i++) {
        if ((sum += calc_combinations(n - 1, c - i)) > s) {
            numbers[0] = i;
            choose(n - 1, c - i, numbers + 1);
            return;
        }
    }
}

int main() { choose(n, c, numbers); }
Possible outcome:
2
2
3
2
3
This algorithm won't scale well for large N because of overflows in the calculation of combinations (unless we use a big integer library), the time needed for this calculation and the need for arbitrarily large random numbers.
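If the recomputation cost matters, one mitigation (my own sketch, not part of the original answer) is to memoize calc_combinations on (n, c); it widens the count type to long long but still overflows eventually:

#include <map>
#include <utility>

// Illustrative memoized variant of calc_combinations above; assumes the
// globals a and b from the snippet. Caching makes repeated calls cheap,
// though the counts themselves can still overflow for large inputs.
std::map<std::pair<int, int>, long long> cache;

long long calc_combinations_memo(int n, int c) {
    if (n == 1) return (c >= a) && (c <= b);
    auto it = cache.find({n, c});
    if (it != cache.end()) return it->second;
    long long sum = 0;
    for (int i = a; i <= b; i++) sum += calc_combinations_memo(n - 1, c - i);
    return cache[{n, c}] = sum;
}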

Well, for n = 10000, can't we have one small number in there that is not random?
Maybe generate the sequence until the sum exceeds C - max, and then just add the one number needed to top up the sum.
One value in 10,000 is more like very small noise in the system.

Although this is an old topic, I think I have an idea. Say we want N random numbers whose sum is C, each between a and b. To solve the problem, we create N holes and prepare C balls. On each pass, we ask each hole, "Do you want another ball?". If no, we move on to the next hole; otherwise, we put a ball into the hole. Each hole has a cap value of b-a; once a hole reaches its cap, it is always passed over.
Example:
Three random numbers between 0 and 2 whose sum is 5.
simulation result:
1st run: -+-
2nd run: ++-
3rd run: ---
4th run: +*+
final:221
-:refuse ball
+:accept ball
*:full pass
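Here is one possible C++ rendering of this scheme (a sketch; the names, the fair-coin "accept" rule, and the generalization to a nonzero lower bound a are my own reading of the idea):

#include <random>
#include <vector>

// Sketch of the balls-and-holes scheme above. Assumes n*a <= C <= n*b,
// otherwise the loop cannot terminate. Each hole implicitly starts at a,
// so only C - n*a balls remain to be distributed, each hole capped at b-a.
std::vector<int> balls_in_holes(int n, int a, int b, int C, std::mt19937 &eng) {
    std::bernoulli_distribution accept(0.5);
    const int cap = b - a;
    int balls = C - n * a;
    std::vector<int> fill(n, 0);
    while (balls > 0) {
        for (int i = 0; i < n && balls > 0; i++) {
            if (fill[i] == cap) continue;            // full: always pass
            if (accept(eng)) { fill[i]++; balls--; } // hole accepts a ball
        }
    }
    for (int &f : fill) f += a;                      // shift back to [a, b]
    return fill;
}

For the example above, balls_in_holes(3, 0, 2, 5, eng) yields triples like {2, 2, 1}.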

Related

Numbers of common distinct difference

Given two arrays A and B, the task is to find the number of common distinct differences (the differences between pairs of elements within each array).
Example :
A=[3,6,8]
B=[1,6,10]
so we get the difference set for A:
differenceSetA = [abs(3-6), abs(6-8), abs(8-3)] = [3, 2, 5]
similarly
differenceSetB = [abs(1-6), abs(1-10), abs(6-10)] = [5, 9, 4]
Number of common elements = intersection {differenceSetA, differenceSetB} = {5}
Answer = 1
My approach O(N^2)
#include <cstdlib>
#include <unordered_set>
#include <vector>
using namespace std;

int commonDifference(vector<int> A, vector<int> B) {
    int n = A.size();
    int m = B.size();
    unordered_set<int> differenceSetA;
    unordered_set<int> differenceSetB;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            differenceSetA.insert(abs(A[i] - A[j]));
        }
    }
    for (int i = 0; i < m; i++) {
        for (int j = i + 1; j < m; j++) {
            differenceSetB.insert(abs(B[i] - B[j]));
        }
    }
    int count = 0;
    for (auto &it : differenceSetA) {
        if (differenceSetB.find(it) != differenceSetB.end()) {
            count++;
        }
    }
    return count;
}
Please provide suggestions for optimizing the approach to O(N log N).
If n is the maximum range of an input array, then the set of all differences of a given array can be obtained in O(n log n), as explained in this SO post: find all differences in a array
Here is a brief recall of the method, with a few additional practical implementation details:
Create an array Posi of length 2*n = 2*range = 2*(Vmax - Vmin + 1), where elements whose index matches an element of the input are set to 1 and the other elements are set to 0. This can be done in O(m), where m is the size of the input array.
For example, given the input array [1,4,5] of size m, we create the array [1,0,0,1,1], zero-padded up to length 2*n.
Initialisation: Posi[i] = 0 for all i (i = 0 to 2*n-1)
Posi[A[i] - Vmin] = 1 (i = 0 to m-1)
Calculate the autocorrelation function of the array Posi[]. This can classically be performed in three sub-steps:
2.1 Calculate the FFT (size 2*n) of the Posi[] array: Y[] = FFT(Posi)
2.2 Calculate the squared amplitude of the result: Y2[k] = Y[k] * conj(Y[k])
2.3 Calculate the inverse FFT of the result: Diff[] = IFFT(Y2[])
A few details are worth being mentioned here:
The reason why a size 2*n was selected, and not a size n, is that if d is a valid difference, then -d is also a valid difference. The results corresponding to negative differences are available at positions i >= n.
If you find it easier to perform an FFT whose size is a power of two, then you can replace the size 2*n with a value n2k = 2^k such that n2k >= 2*n.
The non-null differences correspond to non-null values in the array Diff[]:
`d` is a difference if `Diff[d] > 0`
Another important detail is that a classical (floating-point) FFT is used, so you will encounter small numerical errors. To take this into account, it is important to replace the IFFT output Diff[] with the integer-rounded values of its real part.
All of that concerns one array only. As you want to calculate the number of common differences, you then have to:
calculate the arrays Diff_A[] and Diff_B[] for both sets A and B, and then:
count = 0;
for each d: if (Diff_A[d] != 0) and (Diff_B[d] != 0) then count++;
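For concreteness, here is a self-contained C++ sketch of the whole pipeline. It uses a textbook recursive radix-2 FFT (a real library such as FFTW would be used in practice), and the function names are mine:

#include <algorithm>
#include <cmath>
#include <complex>
#include <iostream>
#include <vector>

using cd = std::complex<double>;

// Textbook recursive radix-2 FFT, in place; invert = true gives the IFFT.
void fft(std::vector<cd> &v, bool invert) {
    int n = v.size();
    if (n == 1) return;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (int i = 0; 2 * i < n; i++) {
        even[i] = v[2 * i];
        odd[i] = v[2 * i + 1];
    }
    fft(even, invert);
    fft(odd, invert);
    const double PI = std::acos(-1.0);
    double ang = 2 * PI / n * (invert ? -1 : 1);
    cd w(1), wn(std::cos(ang), std::sin(ang));
    for (int i = 0; 2 * i < n; i++) {
        v[i] = even[i] + w * odd[i];
        v[i + n / 2] = even[i] - w * odd[i];
        if (invert) { v[i] /= 2; v[i + n / 2] /= 2; }
        w *= wn;
    }
}

// Steps 1 and 2 above: build Posi[] and return its rounded autocorrelation,
// so diff[d] > 0 iff d is a difference of two elements of the input.
std::vector<long long> diff_counts(const std::vector<int> &in) {
    int vmin = *std::min_element(in.begin(), in.end());
    int vmax = *std::max_element(in.begin(), in.end());
    int range = vmax - vmin + 1;
    int sz = 1;
    while (sz < 2 * range) sz <<= 1;        // power-of-two size >= 2*n
    std::vector<cd> posi(sz);
    for (int x : in) posi[x - vmin] = 1.0;
    fft(posi, false);
    for (cd &y : posi) y *= std::conj(y);   // Y2[k] = Y[k] * conj(Y[k])
    fft(posi, true);                        // Diff[] = IFFT(Y2[])
    std::vector<long long> diff(range);
    for (int d = 0; d < range; d++)
        diff[d] = std::llround(posi[d].real()); // round away FFT noise
    return diff;
}

int main() {
    std::vector<int> A = {3, 6, 8}, B = {1, 6, 10};
    auto dA = diff_counts(A), dB = diff_counts(B);
    int count = 0;
    for (std::size_t d = 1; d < std::min(dA.size(), dB.size()); d++)
        if (dA[d] != 0 && dB[d] != 0) count++;
    std::cout << count << "\n";             // prints 1 for this example
}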
A little Bonus
In order not to plagiarize the post mentioned above, here is an additional explanation of how to obtain the differences of one set with the help of the FFT.
The input array A = {3, 6, 8} can mathematically be represented by the following z transform:
A(z) = z^3 + z^6 + z^8
Then the corresponding z-transform of the difference array is equal to the polynomial product:
D(z) = A(z) * A(z*) = (z^3 + z^6 + z^8) (z^(-3) + z^(-6) + z^(-8))
= z^(-5) + z^(-3) + z^(-2) + 3 + z^2 + z^3 + z^5
Then we can note that A(z) is equal to an FFT of size N of the sequence [0 0 0 1 0 0 1 0 1] by taking:
z = exp (-i * 2 PI/ N), with i = sqrt(-1)
Note that here we consider the classical FFT in C, the complex field.
It is certainly possible to perform the calculation in a Galois field, where there are no rounding errors, as is done for example to implement "classical" multiplications (with z = 10) of numbers with many digits. That seems like overkill here.

A problem of taking combination for set theory

Given an array A of size N. The value of a subset of A is defined as the product of all numbers in that subset. We have to return the product of the values of all possible non-empty subsets of A, modulo (10^9+7).
E.g., array A = {3,5}:
Value{3} = 3,
Value{5} = 5,
Value{3,5} = 5*3 = 15
answer = 3*5*15 % (10^9+7)
Can someone explain the mathematics behind the problem? I am thinking of using combinatorics to solve it efficiently.
I have tried brute force; it gives the correct answer but is way too slow.
My next approach uses combinatorics. I think that if we take all the subsets and multiply all the numbers in them, we will get the correct answer. Thus I have to find out how many times each number appears in the calculation of the answer. In the example, 5 and 3 each appear 2 times. If we look closely, each number in A will appear the same number of times.
You're heading in the right direction.
Let x be an element of the given array A. In the final answer, x appears p times, where p is the number of subsets of A that include x.
How do we calculate p? Once we have decided to include x in our subset, we have two choices for each of the remaining N-1 elements: include it or not. So we conclude p = 2^(N-1).
So, each element of A appears exactly 2^(N-1) times in the final product. All that remains is to calculate the answer: (a1 * a2 * ... * an)^p. Since the exponent is very large, you can use binary exponentiation for a fast calculation.
As Matt Timmermans suggested in comments below, we can obtain our answer without actually calculating p = 2^(N-1). We first calculate the product a1 * a2 * ... * an. Then, we simply square this product n-1 times.
The corresponding code in C++:
#include <vector>
using std::vector;

int func(vector<int> &a) {
    int n = a.size();
    int m = 1e9 + 7;
    if (n == 0) return 0;
    if (n == 1) return (m + a[0] % m) % m;
    long long ans = 1;
    // first calculate ans = (a1*a2*...*an) % m
    for (int x : a) {
        // negative sign does not matter since we're squaring
        if (x < 0) x *= -1;
        x %= m;
        ans *= x;
        ans %= m;
    }
    // now calculate ans = [ ans^(2^(n-1)) ] % m
    // we do this by squaring ans n-1 times
    for (int i = 1; i < n; i++) {
        ans = ans * ans;
        ans %= m;
    }
    return (int)ans;
}
Let A = {a,b,c}.
All possible subsets of A are {{}, {a}, {b}, {c}, {a,b}, {b,c}, {c,a}, {a,b,c}}.
Here each element occurs 4 times.
So if A = {a,b,c,d}, each element will occur 2^3 times.
In general, if the size of A is n, each element will occur 2^(n-1) times.
So the final result will be a1^p * a2^p * a3^p * ... * an^p,
where p = 2^(n-1).
We need to compute x^(2^(n-1)) % mod.
We can write x^(2^(n-1)) % mod as x^(2^(n-1) % phi(mod)) % mod (by Fermat's little theorem, assuming x is not a multiple of mod).
As mod is prime, phi(mod) = mod - 1.
So first find p = 2^(n-1) % (mod-1).
Then compute Ai^p % mod for each of the numbers and multiply them into the final result.
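A short C++ sketch of this variant (modpow is the usual square-and-multiply; the function names are mine, and it assumes no element is a multiple of the modulus):

#include <vector>

// Square-and-multiply: computes (base^exp) % mod in O(log exp).
long long modpow(long long base, long long exp, long long mod) {
    long long result = 1 % mod;
    base %= mod;
    while (exp > 0) {
        if (exp & 1) result = result * base % mod;
        base = base * base % mod;
        exp >>= 1;
    }
    return result;
}

// Product of subset values as described: p = 2^(n-1) mod (mod-1) by Fermat,
// then multiply a_i^p over all elements (assumes no a_i is a multiple of MOD).
long long subset_product(const std::vector<long long> &a) {
    const long long MOD = 1000000007LL;   // prime, so phi(MOD) = MOD - 1
    long long p = modpow(2, (long long)a.size() - 1, MOD - 1);
    long long ans = 1;
    for (long long x : a)
        ans = ans * modpow(((x % MOD) + MOD) % MOD, p, MOD) % MOD;
    return ans;
}

For the example A = {3,5} this returns 3^2 * 5^2 = 225 = 3*5*15, matching the expected answer.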
I read the previous answers, and here I try to lay out the process of forming the sets as simply as possible, so that it can be applied to similar problems.
Let i be an element of array A. Following the approach given above, i appears p times in the final answer.
Now, how do we form the different subsets? We take subsets containing only one element, then subsets of two elements, then of three, ... up to subsets of n elements.
For a given subset size, say subsets of 3 elements, how many of these subsets contain i?
There are n elements, so the number of 3-element subsets that contain i is (n-1)C(3-1), because from the other n-1 elements we can choose 3-1 elements.
Doing this for every size x from 1 to n, p = sum over x of (n-1)C(x-1) = 2^(n-1).
Similarly, p is the same for every element i. Thus we get
final answer = A[0]^p * A[1]^p * ... * A[n-1]^p

Given an integer n, return the number of ways it can be represented as a sum of 1s and 2s

For example:
5 = 1+1+1+1+1
5 = 1+1+1+2
5 = 1+1+2+1
5 = 1+2+1+1
5 = 2+1+1+1
5 = 1+2+2
5 = 2+2+1
5 = 2+1+2
Can anyone give a hint or pseudocode for how this can be done, please?
I honestly have no clue how to even start.
Also, this looks like an exponential problem; can it be done in linear time?
Thank you.
In the example you have provided, the order of the addends is important (see the last two lines of your example). With this in mind, the answer is related to the Fibonacci numbers. Let F(n) be the number of ways n can be written as 1s and 2s. The last addend is either 1 or 2, so F(n) = F(n-1) + F(n-2). These are the initial values:
F(1) = 1 (1 = 1)
F(2) = 2 (2 = 1 + 1, 2 = 2)
This is actually the (n+1)th Fibonacci number. Here's why:
Let's call f(n) the number of ways to represent n. If you have n, then you can represent it as (n-1)+1 or (n-2)+2. Thus the number of ways to represent n is f(n-1) + f(n-2). This is the same recurrence as the Fibonacci numbers. Furthermore, if n=1 there is 1 way, and if n=2 there are 2 ways. Thus the (n+1)th Fibonacci number is your answer. There are algorithms out there to compute enormous Fibonacci numbers very quickly.
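For example, a direct O(n) rendition of this recurrence in C++ (a sketch; I take f(0) = 1 as the empty-sum convention so that f(2) = 2):

// Number of ordered ways to write n as a sum of 1s and 2s:
// f(n) = f(n-1) + f(n-2), with f(1) = 1 and f(2) = 2.
long long count_ways(int n) {
    if (n <= 0) return n == 0 ? 1 : 0; // f(0) = 1: the empty sum
    long long a = 1, b = 1;            // a = f(0), b = f(1)
    for (int i = 2; i <= n; i++) {
        long long next = a + b;        // f(i) = f(i-1) + f(i-2)
        a = b;
        b = next;
    }
    return b;
}

count_ways(5) returns 8, matching the eight representations listed in the question.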
Permutations
If we want to know how many possible orderings there are in some set of size n without repetition (i.e., elements selected are removed from the available pool), the factorial of n (or n!) gives the answer:
double factorial(int n)
{
    if (n <= 0)
        return 1;
    else
        return n * factorial(n - 1);
}
Note: This also has an iterative solution and can even be approximated using the gamma function:
std::round(std::tgamma(n + 1)); // where n >= 0
The problem set starts with all 1s. Each time the set changes, two 1s are replaced by one 2. We want to find the number of ways k items (the 2s) can be arranged in a set of size n. We can query the number of possible permutations by computing:
double permutation(int n, int k)
{
    return factorial(n) / factorial(n - k);
}
However, this is not quite the result we want. The problem is, permutations consider ordering, e.g., the sequence 2,2,2 would count as six distinct variations.
Combinations
These are essentially permutations which ignore ordering. Since the order no longer matters, many permutations are redundant. Redundancy per permutation can be found by computing k!. Dividing the number of permutations by this value gives the number of combinations:
Note: This is known as the binomial coefficient and should be read as "n choose k."
double combination(int n, int k)
{
    return permutation(n, k) / factorial(k);
}

int solve(int n)
{
    double result = 0;
    if (n > 0) {
        for (int k = 0; k <= n; k += 1, n -= 1)
            result += combination(n, k);
    }
    return std::round(result);
}
This is a general solution. For example, if the problem were instead to find the number of ways an integer can be represented as a sum of 1s and 3s, we would only need to adjust the decrement of the set size (n-2) at each iteration.
Fibonacci numbers
The reason the solution using Fibonacci numbers works has to do with their relation to the binomial coefficients. The binomial coefficients can be arranged to form Pascal's triangle, which, when stored as a lower-triangular matrix, can be accessed using n and k as row/column indices to locate the element equal to combination(n,k).
The pattern of n and k as they change over the lifetime of solve plots a diagonal when viewed as coordinates on a 2-D grid. The result of summing values along a diagonal of Pascal's triangle is a Fibonacci number. If the pattern changes (e.g., when finding sums of 1s and 3s), this will no longer be the case and this solution will fail.
Interestingly, Fibonacci numbers can be computed in constant time, which means we can solve this problem in constant time simply by finding the (n+1)th Fibonacci number.
int fibonacci(int n)
{
    // std::sqrt is not constexpr, so these are plain const
    const double SQRT_5 = std::sqrt(5.0);
    const double GOLDEN_RATIO = (SQRT_5 + 1.0) / 2.0;
    return std::round(std::pow(GOLDEN_RATIO, n) / SQRT_5);
}
int solve(int n)
{
    if (n > 0)
        return fibonacci(n + 1);
    return 0;
}
As a final note, the numbers generated by both the factorial and fibonacci functions can be extremely large, so a big-number library may be needed if n is large.
Here is code using backtracking that solves your problem. At each step we remember the numbers used to reach the sum so far (in a vector). We make a copy of the vector, push 1 onto the copy, and recurse with n-1, printing when n == 0; then we return and repeat the same with 2, which is essentially the backtracking.
#include <stdio.h>
#include <vector>
#include <iostream>
using namespace std;

int n;

void print(vector<int> vect) {
    cout << n << " = ";
    for (int i = 0; i < vect.size(); ++i) {
        if (i > 0)
            cout << "+" << vect[i];
        else
            cout << vect[i];
    }
    cout << endl;
}

void gen(int n, vector<int> vect) {
    if (!n)
        print(vect);
    else {
        for (int i = 1; i <= 2; ++i) {
            if (n - i >= 0) {
                std::vector<int> vect2(vect);
                vect2.push_back(i);
                gen(n - i, vect2);
            }
        }
    }
}

int main() {
    scanf("%d", &n);
    vector<int> vect;
    gen(n, vect);
}
This problem can be easily visualized as follows:
Consider a frog in front of a stairway. It needs to reach the n-th stair, but it can only jump 1 or 2 steps at a time. Find the number of ways in which it can reach the n-th stair.
Let T(n) denote the number of ways to reach the n-th stair.
So, T(1) = 1 and T(2) = 2 (two one-step jumps or one two-step jump, so 2 ways).
To reach the n-th stair, we already know the number of ways to reach the (n-1)th stair and the (n-2)th stair.
One can simply reach the n-th stair by a 1-step jump from the (n-1)th stair or a 2-step jump from the (n-2)th stair.
Hence, T(n) = T(n-1) + T(n-2)
Hope it helps!!!

How to generate a list of ascending random integers

I have an external collection containing n elements, from which I want to select some number (k) at random, outputting the indices of those elements to a serialized data file. I want the indices to be output in strictly ascending order, with no duplicates. Both n and k may be quite large, and it is generally not feasible to simply store entire arrays of that size in memory.
The first algorithm I came up with was to pick a random number r[0] from 1 to n-k, then pick successive random numbers r[i] from r[i-1]+1 to n-k+i, only needing to store two entries of r at any one time. However, a fairly simple analysis reveals that the probability of selecting small numbers is inconsistent with what it would be if the entire set were equally distributed. For example, if n were a billion and k half a billion, the probability of selecting the first entry with the approach just described is tiny (1 in half a billion), whereas since half of the entries are being selected, the first entry should actually be selected 50% of the time. Even if I used external sorting to sort k random numbers, I would have to discard any duplicates and try again; as k approaches n, the number of retries keeps growing, with no guarantee of termination.
I would like to find a O(k) or O(k log k) algorithm to do this, if it is at all possible. The implementation language I will be using is C++11, but descriptions in pseudocode may still be helpful.
If in practice k has the same order of magnitude as n, perhaps a very straightforward O(n) algorithm will suffice:
assert(k <= n);
std::mt19937 engine{std::random_device{}()};  // any standard URBG
std::uniform_real_distribution<double> rnd;   // defaults to [0, 1)
for (int i = 0; i < n; i++) {
    // keep index i with probability (picks still needed) / (items still left)
    if (rnd(engine) * (n - i) < k) {
        std::cout << i << std::endl;
        k--;
    }
}
It produces all ascending sequences with equal probability.
You can solve this recursively in O(k log k) if you partition in the middle of your range, and randomly sample from the hypergeometric probability distribution to choose how many values lie above and below the middle point (i.e. the values of k for each subsequence), then recurse for each:
// Samples the hypergeometric distribution: returns the number of "successes"
// in n draws without replacement from a population of N containing K possible
// successes. Something similar to scipy.stats.hypergeom.rvs in Python.
// In this case, "success" means the selected value lying below the midpoint.
int sample_hypergeometric(int n, int K, int N)
{
    // static, so the engine is created and seeded only once
    static std::default_random_engine generator{std::random_device{}()};
    std::uniform_real_distribution<double> distribution(0.0, 1.0);
    int successes = 0;
    for (int trial = 0; trial < n; trial++)
    {
        if ((int)(distribution(generator) * N) < K)
        {
            successes++;
            K--;
        }
        N--;
    }
    return successes;
}
select_k_from_n(int start, int k, int n)
{
    if (k == 0)
        return;
    if (k == 1)
    {
        output start + random(1 to n);
        return;
    }
    // find the number of results below the mid-point:
    int k1 = sample_hypergeometric(k, n >> 1, n);
    select_k_from_n(start, k1, n >> 1);
    select_k_from_n(start + (n >> 1), k - k1, n - (n >> 1));
}
Sampling from the binomial distribution could also be used to approximate the hypergeometric distribution with p = (n >> 1) / n, rejecting samples where k1 > (n >> 1).
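With the standard std::binomial_distribution, that approximation might look like the following sketch (the function name is mine; the rejection test also guards the symmetric constraint so the sample stays feasible):

#include <random>

// Approximates the hypergeometric draw above by a binomial with p = K/N,
// rejecting infeasible samples. Parameters mirror sample_hypergeometric:
// n draws from a population of N containing K possible successes.
int sample_hypergeometric_approx(int n, int K, int N, std::mt19937 &eng) {
    std::binomial_distribution<int> bin(n, (double)K / N);
    int k1;
    do {
        k1 = bin(eng);
    } while (k1 > K || n - k1 > N - K);  // reject samples outside the support
    return k1;
}

It can be dropped in where sample_hypergeometric is called, at the cost of only approximating the exact distribution.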
As mentioned in my comment, use a std::set<int> to store the randomly generated integers such that the resulting container is inherently sorted and contains no duplicates. Example code snippet:
#include <random>
#include <set>

int main(void) {
    std::set<int> random_set;
    std::random_device rd;
    std::mt19937 mt_eng(rd());
    // min and max of random set range
    const int m = 0;   // min
    const int n = 100; // max
    std::uniform_int_distribution<> dist(m, n);
    // number to generate
    const int k = 50;
    for (int i = 0; i < k; ++i) {
        // only non-previously occurring values will be inserted
        if (!random_set.insert(dist(mt_eng)).second)
            --i;
    }
}
Assuming that you can't store k random numbers in memory, you'll have to generate the numbers in strictly ascending order. One way to do it would be to generate a number between 0 and n/k; call that number x. The next number you have to generate is between x+1 and roughly x + (n-x)/(k-1). Continue in that fashion until you've selected k numbers.
Basically, you're dividing the remaining range by the number of values left to generate, and then generating a number in the first section of that range.
An example. You want to generate 3 numbers between 0 and 99, inclusive. So you first generate a number between 0 and 33. Say you pick 10.
So now you need a number between 11 and 99. The remaining range consists of 89 values, and you have two values left to pick. So, 89/2 = 44. You need a number between 11 and 54. Say you pick 36.
Your remaining range is from 37 to 99, and you have one number left to choose. So pick a number at random between 37 and 99.
This won't give you a properly uniform selection, as once you choose a number it's impossible to get a number less than it in a subsequent choice. But it might be good enough for your purposes.
This pseudocode shows the basic idea.
pick_k_from_n(n, k)
{
    num_left = k
    last_k = 0
    while num_left > 0
    {
        // divide the remaining range into num_left partitions
        range_size = (n - last_k) / num_left
        // pick a number in the first partition
        r = random(range_size) + last_k + 1
        output(r)
        last_k = r
        num_left = num_left - 1
    }
}
Note that this takes O(k) time and requires O(1) extra space.
You can do it in O(k) time with Floyd's algorithm (not Floyd-Warshall, that's a shortest-path thing). The only data structure you need is a table that tells you whether or not a number has already been selected. Searching a hash table can be O(1), so this will not be a burden, and it can be kept in memory even for very large n (if n is truly huge, you'll have to use a B-tree or Bloom filter or something).
To select k items from among n:
for j = n-k+1 to n:
    select random x from 1 to j
    if x is already in hash:
        insert j into hash
    else:
        insert x into hash
That's it. At the end, your hash table will contain a uniformly selected sample of k items from among n. Read them out in order (you may have to pick a type of hash table that allows that).
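In C++, Floyd's algorithm might look like this sketch (using std::set in place of the hash table, so the in-order readout comes for free):

#include <random>
#include <set>

// Floyd's sampling: uniformly selects k distinct values from 1..n.
// std::set keeps them sorted, so iterating yields ascending order.
std::set<int> floyd_sample(int n, int k, std::mt19937 &eng) {
    std::set<int> s;
    for (int j = n - k + 1; j <= n; j++) {
        int x = std::uniform_int_distribution<int>(1, j)(eng);
        if (!s.insert(x).second)  // x already selected:
            s.insert(j);          // take j instead (j cannot be in s yet)
    }
    return s;
}

For example, floyd_sample(10, 5, eng) returns five distinct values from 1..10, read out in ascending order by iteration.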
Could you adjust each ascending index selection in a way that compensates for the probability distortion you are describing?
IANAS, but my guess would be that if you pick a random number r between 0 and 1 (which you'll scale to the full remaining index range after the adjustment), you might be able to adjust it by calculating r^x (keeping the range in 0..1 but increasing the probability of smaller numbers), with x chosen by solving the equation for the probability of the first entry?
Here's an O(k log k + √n)-time algorithm that uses O(√n) words of space. This can be generalized to an O(k + n^(1/c))-time, O(n^(1/c))-space algorithm for any integer constant c.
For intuition, imagine a simple algorithm that uses (e.g.) Floyd's sampling algorithm to generate k of n elements and then radix sorts them in base √n. Instead of remembering what the actual samples are, we do a first pass running a variant of Floyd's in which we remember only the number of samples in each bucket. The second pass is, for each bucket in order, to randomly resample the appropriate number of elements from the bucket's range. There's a short proof involving conditional probability that this gives a uniform distribution.
# untested Python code for illustration
# b is the number of buckets (e.g., b ~ sqrt(n))
import random

def first_pass(n, k, b):
    counts = [0] * b  # list of b zeros
    for j in range(n - k, n):
        t = random.randrange(j + 1)
        if t // b >= counts[t % b]:  # intuitively, "t is not in the set"
            counts[t % b] += 1
        else:
            counts[j % b] += 1
    return counts

Algorithm for finding the maximum number of non-overlapping lines on the x axis

I'm not exactly sure how to ask this, but I'll try to be as specific as possible.
Imagine a tetris screen with only rectangles, of different shapes, falling to the bottom.
I want to compute the maximum number of rectangles that I can fit next to one another without any of them overlapping. I've named them lines in the title because for the computation I'm only interested in each rectangle's length, i.e. the line parallel to the x axis that it is falling towards.
So basically I have a custom type with a start and an end, both integers between 0 and 100. Say we have a list of these rectangles numbered 1 to n; rectangle_n.start (unless it's the rectangle closest to the origin) has to be > rectangle_(n-1).end so that they never overlap.
I'm reading the rectangle coordinates (both are x axis coordinates) from a file with random numbers.
As an example:
consider this list of rectangle type objects
rectangle_list {start, end} = {{1,2}, {3,5}, {4,7}, {9,12}}
We can observe that the 3rd object's start coordinate (4) is less than the previous rectangle's end coordinate (5). So in sorting this list, I would have to remove either the 2nd or the 3rd object so that they don't overlap.
I'm not sure if there is a standard name for this kind of problem, so I didn't know how else to title it. I'm interested in an algorithm that can be applied to a list of such objects and would sort them out accordingly.
I've tagged this with c++ because the code I'm writing is c++ but any language would do for the algorithm.
You are essentially solving the following problem. Suppose we have n intervals {[x_1,y_1),[x_2,y_2),...,[x_n,y_n)} with x_1<=x_2<=...<=x_n. We want to find a maximal subset of these intervals such that there are no overlaps between any intervals in the subset.
The naive solution is dynamic programming, which is guaranteed to find the best solution. Let f(i), 0 <= i <= n, be the size of the maximal subset among the intervals up to [x_i,y_i). We have the recurrence (in LaTeX):
f(i) = \max_{0 \le j < i} \{ f(j) + d(i,j) \}
where d(i,j) = 1 if and only if [x_i,y_i) and [x_j,y_j) do not overlap, and 0 otherwise. You can compute f(i) iteratively, starting from f(0) = 0; f(n) then gives the size of the maximal subset. To recover the actual subset, keep a separate array s(i) = \argmax_{0 \le j < i} \{ f(j) + d(i,j) \} and backtrack along it to get the 'path'.
This is an O(n^2) algorithm: you need to compute each f(i), and each f(i) needs i tests. I think there should be an O(n log n) algorithm, but I am not sure.
EDIT: an implementation in Lua:
function find_max(list)
    local ret, f, b = {}, {}, {}
    f[0], b[0] = 0, 0
    table.sort(list, function(a,b) return a[1] < b[1] end)
    -- dynamic programming
    for i, x in ipairs(list) do
        local max, max_j = 0, -1
        x = list[i]
        for j = 0, i - 1 do
            local e = j > 0 and list[j][2] or 0
            local score = e <= x[1] and 1 or 0
            if f[j] + score > max then
                max, max_j = f[j] + score, j
            end
        end
        f[i], b[i] = max, max_j
    end
    -- backtrack
    local max, max_i = 0, -1
    for i = 1, #list do
        if f[i] > max then -- don't use >= here
            max, max_i = f[i], i
        end
    end
    local i, ret = max_i, {}
    while true do
        table.insert(ret, list[i])
        i = b[i]
        if i == 0 then break end
    end
    return ret
end

local l = find_max({{1,2}, {4,7}, {3,5}, {8,11}, {9,12}})
for _, x in ipairs(l) do
    print(x[1], x[2])
end
The name of this problem is bin packing; it is usually considered a hard problem, but it can be computed reasonably well for a small number of bins.
Here is a video explaining common approaches to this problem
EDIT: By a hard problem, I mean that some kind of brute force has to be employed. You will have to evaluate a lot of solutions and reject most of them, so you usually need some kind of evaluation mechanism. You need to be able to compare solutions, e.g., "this solution packs 4 rectangles with an area of 15" is better than "this solution packs 3 rectangles with an area of 16".
I can't think of a shortcut, so you may have to enumerate the power set in descending order of size and stop on the first match.
The straightforward way to do this is to enumerate combinations of decreasing size. You could do something like this in C++11:
template <typename I>
std::set<Span> find_largest_non_overlapping_subset(I start, I finish) {
    std::set<Span> result;
    // begin one past the full size so that --n tries sizes
    // distance(start, finish) down to 1, stopping at the first match
    for (size_t n = std::distance(start, finish) + 1; --n && result.empty();) {
        enumerate_combinations(start, finish, n, [&](I begin, I end) {
            if (!has_overlaps(begin, end)) {
                result.insert(begin, end);
                return false;
            }
            return true;
        });
    }
    return result;
}
The implementation of enumerate_combinations is left as an exercise. I assume you already have has_overlaps.