How to generate random permutations with CUDA - c++

What parallel algorithms could I use to generate random permutations from a given set?
Especially proposals or links to papers suitable for CUDA would be helpful.
A sequential version of this would be the Fisher-Yates shuffle.
Let S={1, 2, ..., 7} be the set of source indices.
The goal is to generate n random permutations in parallel.
Each of the n permutations contains each of the source indices exactly once,
e.g. {7, 6, ..., 1}.

Fisher-Yates shuffle could be parallelized. For example, 4 concurrent workers need only 3 iterations to shuffle a vector of 8 elements. On the first iteration they swap 0<->1, 2<->3, 4<->5, 6<->7; on the second iteration 0<->2, 1<->3, 4<->6, 5<->7; and on the last iteration 0<->4, 1<->5, 2<->6, 3<->7.
This could be easily implemented as CUDA __device__ code (inspired by standard min/max reduction):
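const int id = threadIdx.x;
__shared__ int perm_shared[2 * BLOCK_SIZE];
perm_shared[2 * id] = 2 * id;
perm_shared[2 * id + 1] = 2 * id + 1;
__syncthreads(); // all elements must be initialized before any swap
unsigned int shift = 1;
unsigned int pos = id * 2;
while (shift <= BLOCK_SIZE)
{
    // each thread randomly decides whether to swap its pair
    if (curand(&curand_state) & 1) swap(perm_shared, pos, pos + shift);
    shift = shift << 1;
    pos = (pos & ~shift) | ((pos & shift) >> 1);
    __syncthreads(); // make this round's swaps visible before the next round
}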
Here the curand initialization code is omitted, and method swap(int *p, int i, int j) exchanges values p[i] and p[j].
Note that the code above has the following assumptions:
The length of permutation is 2 * BLOCK_SIZE, where BLOCK_SIZE is a power of 2.
2 * BLOCK_SIZE integers fit into __shared__ memory of CUDA device
BLOCK_SIZE is a valid size of CUDA block (usually something between 32 and 512)
To generate more than one permutation I would suggest using different CUDA blocks. If the goal is to permute only 7 elements (as mentioned in the original question), then I believe it will be faster to do it in a single thread.
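For the small case mentioned above (7 elements handled by one thread), the sequential Fisher-Yates shuffle from the question is enough. The following is a minimal host-side C++ sketch, not device code; the helper name `random_permutation` is hypothetical, and in a kernel the standard-library generator would have to be replaced by a per-thread curand state.

```cpp
#include <cassert>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Returns a uniformly random permutation of {1, ..., n} using the
// sequential Fisher-Yates (Durstenfeld) shuffle.
// `random_permutation` is a hypothetical helper name, not part of any library.
template <class URBG>
std::vector<int> random_permutation(int n, URBG& rng)
{
    assert(n >= 0);
    std::vector<int> perm(n);
    std::iota(perm.begin(), perm.end(), 1); // 1, 2, ..., n
    for (int i = n - 1; i > 0; --i) {
        // pick a position in [0, i] and move that element into slot i
        std::uniform_int_distribution<int> pick(0, i);
        std::swap(perm[i], perm[pick(rng)]);
    }
    return perm;
}
```

Each of the n permutations would call this once with its own generator state, which is what makes one thread per permutation attractive when the permutation is short.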

If the length of s = s_L, a very crude way of doing this could be implemented in thrust:
First, create a vector val of length s_L x n that repeats s n times.
Create a vector val_keys that associates one of n unique keys with each element of val, each key being repeated s_L times, e.g.,
val = {1,2,...,7,1,2,...,7,....,1,2,...,7}
val_keys = {0,0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,2,...., n-1,n-1,n-1}
Now the fun part. Create a vector of length s_L x n of uniformly distributed random variables
U = {0.24, 0.1, .... , 0.83}
Then you can make a zip iterator over val and val_keys and sort them according to U.
After that sort, both val and val_keys will be all over the place, so you have to put them back together again using thrust::stable_sort_by_key() with val_keys as the key. Stability makes sure that if val[i] and val[j] both belong to key[k] and val[i] precedes val[j] following the random sort, then in the final version val[i] still precedes val[j]. If all goes according to plan, val_keys should look just as before, but val should reflect the shuffling.
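The two sorts described above can be sketched on the host with the standard library. This is only an illustration of the idea, not Thrust code: `batched_shuffle` is a hypothetical name, `std::sort` and `std::stable_sort` stand in for the Thrust sort calls, and an index array plays the role of the zip iterator.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Builds n independent random permutations of {1, ..., s_len}, stored back to back.
// Host-side sketch of the "random sort, then stable sort by key" idea;
// `batched_shuffle` is a hypothetical helper name.
template <class URBG>
std::vector<int> batched_shuffle(int s_len, int n, URBG& rng)
{
    assert(s_len > 0 && n > 0);
    const int total = s_len * n;
    std::vector<int> val(total), val_keys(total);
    std::vector<double> u(total);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    for (int i = 0; i < total; ++i) {
        val[i] = i % s_len + 1;   // s repeated n times
        val_keys[i] = i / s_len;  // key 0..n-1, each repeated s_len times
        u[i] = uniform(rng);      // random sort key
    }
    // Sort 1: order all elements by their random variate.
    std::vector<int> order(total);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return u[a] < u[b]; });
    // Sort 2: stable sort by key regroups the segments while keeping
    // the random relative order inside each segment.
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return val_keys[a] < val_keys[b]; });
    std::vector<int> result(total);
    for (int i = 0; i < total; ++i) result[i] = val[order[i]];
    return result;
}
```

Segment j of the result, i.e. elements [j * s_len, (j + 1) * s_len), is then the j-th permutation.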

For large sets, using a sort primitive on a vector of randomized keys might be efficient enough for your needs. First, set up some vectors:
const int N = 65535;
thrust::device_vector<uint16_t> d_cards(N);
thrust::device_vector<uint16_t> d_keys(N);
thrust::sequence(d_cards.begin(), d_cards.end());
Then, each time you want to shuffle d_cards, call this pair:
thrust::tabulate(d_keys.begin(), d_keys.end(), PRNFunc(rand()*rand()));
thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_cards.begin());
// d_cards now freshly shuffled
The random keys are generated from a functor that uses a seed (evaluated in host-code and copied to the kernel at launch-time) and a key number (which tabulate passes in at thread-creation time):
struct PRNFunc
{
    uint32_t seed;
    PRNFunc(uint32_t s) { seed = s; }
    __device__ __host__ uint32_t operator()(uint32_t kn) const
    {
        thrust::minstd_rand randEng(seed);
        randEng.discard(kn); // skip ahead so each key number gets a different value
        return randEng();
    }
};
I have found that performance could be improved (by probably 30%) if I could figure out how to cache the allocations that thrust::sort_by_key does internally.
Any corrections or suggestions welcome.


Efficient algorithm to produce closest triplet from 3 arrays?

I need to implement an algorithm in C++ that, when given three arrays of unequal sizes, produces triplets a,b,c (one element contributed by each array) such that max(a,b,c) - min(a,b,c) is minimized. The algorithm should produce a list of these triplets, in order of size of max(a,b,c)-min(a,b,c). The arrays are sorted.
I've implemented the following algorithm (note that I now use arrays of type double); however, it runs excruciatingly slowly (even when compiled using GCC with -O3 optimization, and other combinations of optimizations). The dataset (and, therefore, each array) has potentially tens of millions of elements. Is there a faster/more efficient method? A significant speed increase is necessary to accomplish the required task in a reasonable time frame.
void findClosest(vector<double> vec1, vector<double> vec2, vector<double> vec3){
    //calculate size of each array
    int len1 = vec1.size();
    int len2 = vec2.size();
    int len3 = vec3.size();
    int i = 0; int j = 0; int k = 0; int res_i, res_j, res_k;
    double diff = DBL_MAX; //needs <cfloat>; the values are doubles, so diff must be too
    int iter = 0; int iter_bound = min(min(len1,len2),len3);
    while(iter < iter_bound){
        while(i < len1 && j < len2 && k < len3){
            double minimum = min(min(vec1[i], vec2[j]), vec3[k]);
            double maximum = max(max(vec1[i], vec2[j]), vec3[k]);
            //if new difference less than previous difference, update difference, store
            if(fabs(maximum - minimum) < diff){ diff = maximum-minimum; res_i = i; res_j = j; res_k = k;}
            //increment minimum value
            if(vec1[i] == minimum) ++i;
            else if(vec2[j] == minimum) ++j;
            else ++k;
        }
        //"remove" triplet
        vec1.erase(vec1.begin() + res_i);
        vec2.erase(vec2.begin() + res_j);
        vec3.erase(vec3.begin() + res_k);
        --len1; --len2; --len3;
        //restart the scan for the next triplet
        i = 0; j = 0; k = 0; diff = DBL_MAX; ++iter;
    }
}
OK, you're going to need to be clever in a few ways to make this run well.
The first thing that you need is a priority queue, which is usually implemented with a heap. With that, the algorithm in pseudocode is:
Make a priority queue for possible triples in order of max - min, then how close median is to their average.
Make a pass through all 3 arrays, putting reasonable triples for every element into the priority queue
While the priority queue is not empty:
    Pull a triple out
    If none of the three elements of the triple has been used:
        Add triple to output
        Mark the triple used
    If you can construct reasonable triplets for unused elements:
        Add them to the queue
Now for this operation to succeed, you need to efficiently find elements that are currently unused. Doing that at first is easy, just keep an array of bools where you mark off the indexes of the used values. But once a lot have been taken off, your search gets long.
The trick for that is to have a vector of bools for individual elements, a second for whether both in a pair have been used, a third for where all 4 in a quadruple have been used and so on. When you use an element just mark the individual bool, then go up the hierarchy, marking off the next level if the one you're paired with is marked off, else stopping. This additional data structure of size 2n will require an average of marking 2 bools per element used, but allows you to find the next unused index in either direction in at most O(log(n)) steps.
The resulting algorithm will be O(n log(n)).
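The hierarchy of "used" flags described above can be sketched as a complete binary tree stored in one array, where a node is set once everything below it has been used. This is a minimal illustration under my own assumptions: the class name `UsedTracker` and its methods are hypothetical, the tree is padded to a power of two, and only the search toward higher indices is shown (the other direction is symmetric).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: tracks which of n indices are used and finds the next
// unused index at or after a given position in O(log n) steps.
class UsedTracker {
public:
    explicit UsedTracker(std::size_t n) : n_(n), size_(1) {
        while (size_ < n_) size_ <<= 1;           // pad to a power of two
        full_.assign(2 * size_, 0);               // node 1 is the root, leaves start at size_
        for (std::size_t i = n_; i < size_; ++i)  // padding leaves count as used
            full_[size_ + i] = 1;
        for (std::size_t node = size_ - 1; node >= 1; --node)
            full_[node] = full_[2 * node] && full_[2 * node + 1];
    }

    // Mark index i as used, then propagate upward while the sibling is also full.
    void mark_used(std::size_t i) {
        assert(i < n_);
        std::size_t node = size_ + i;
        full_[node] = 1;
        while (node > 1 && full_[node ^ 1]) {
            node >>= 1;
            full_[node] = 1;
        }
    }

    // Smallest unused index >= i, or n if there is none.
    std::size_t next_unused(std::size_t i) const {
        if (i >= n_) return n_;
        std::size_t node = size_ + i;
        if (!full_[node]) return i;
        while (node > 1) {
            // a left child whose right sibling still has room: search there
            if ((node & 1) == 0 && !full_[node + 1]) {
                node = node + 1;
                while (node < size_)              // descend to the leftmost free leaf
                    node = full_[2 * node] ? 2 * node + 1 : 2 * node;
                return node - size_;
            }
            node >>= 1;
        }
        return n_;
    }

private:
    std::size_t n_, size_;
    std::vector<char> full_;
};
```

One tracker per input array would let the main loop skip over elements that were already consumed by earlier triples without scanning them one by one.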

Efficiently randomly shuffling the bits of a sequence of words

Consider the following algorithm from the C++ standard library: std::shuffle that has the following signature:
template <class RandomIt, class URBG>
void shuffle(RandomIt first, RandomIt last, URBG&& g);
It reorders the elements in the given range [first, last) such that each possible permutation of those elements has equal probability of appearance.
I am trying to implement the same algorithm, but one that works at the bit level, randomly shuffling the bits of the words of the input sequence. Considering a sequence of 64-bit words, I am trying to implement:
template <class URBG>
void bit_shuffle(std::uint64_t* first, std::uint64_t* last, URBG&& g)
Question: How to do that as efficiently as possible (using compiler intrinsics if necessary)? I am not necessarily looking for an entire implementation, but more for suggestions/directions of research, because it's really not clear to me if it's even feasible to implement that efficiently.
It's obvious that asymptotically the running time is O(N), where N is the number of bits. Our goal is to improve the constants involved.
Disclaimer: the description of the proposed algorithm is a rough sketch. A lot needs to be added and, especially, many details need to be taken care of to make it work correctly. The approximate execution time will not differ from what is claimed here, though.
Baseline Algorithm
The most obvious one is the textbook approach, which takes N operations, each of which involves calling the random_generator which takes R milliseconds, and accessing the bit's value of two different bits, and set new value to them in total of 4 * A milliseconds (A is time to read/write one bit). Suppose that the array lookup operations takes C milliseconds. So the total time of this algorithm is N * (R + 4 * A + 2 * C) milliseconds (approximately). It is also reasonable to assume that the random number generation takes more time, i.e. R >> A == C.
Proposed Algorithm
Suppose the bits are stored in a byte storage, i.e. we will work with blocks of bytes.
unsigned char bit_field[field_size = N / 8];
First, let's count the number of 1 bits in our bitset. For that, we can use a lookup-table and iterate through the bitset as byte array:
// Generate lookup-table, you may modify it with `constexpr`
// to make it run in compile time.
int bitcount_lookup[256];
for (int i = 0; i < 256; ++i) {
    bitcount_lookup[i] = 0;
    for (int b = 0; b < 8; ++b)
        bitcount_lookup[i] += (i >> b) & 1;
}
We can treat this as preprocessing overhead (as it may as well be calculated at compile-time) and say that it takes 0 milliseconds. Now, counting number of 1 bits is easy (the following will take (N / 8) * C milliseconds):
int bitcount = 0;
for (auto *it = bit_field; it != bit_field + field_size; ++it)
bitcount += bitcount_lookup[*it];
Now, we randomly generate N / 8 numbers (let's call the resulting array gencnt[N / 8]), each in the range [0..8], such that they sum to bitcount. This is a bit tricky and hard to do uniformly (the "correct" algorithm for generating a uniform distribution is quite slow compared to the baseline algorithm). A reasonably uniform but quick solution is roughly:
Fill the gencnt[N / 8] array with values v = bitcount / (N / 8).
Randomly choose N / 16 "black" cells. The rest are "white". The algorithm is similar to a random permutation, but of only half of the array.
Generate N / 16 random numbers in the range [0..v]. Let's call them tmp[N / 16].
Increase "black" cells by tmp[i] values, and decrease "white" cells by tmp[i]. This will ensure that the overall sum is bitcount.
After that, we will have a uniform-ish, random-ish array gencnt[N / 8], whose values are the numbers of 1 bits in each "cell". It was all generated in:
    (N / 8) * C                  (filling step)
  + (N / 16) * (4 * C)           (random coloring)
  + (N / 16) * (R + 2 * C)       (filling)
milliseconds (this estimate is made with a concrete implementation in mind). Lastly, we can have a lookup table of the bytes with a specified number of bits set to 1 (it can be treated as precomputed overhead, or even built at compile time as constexpr, so let's assume that this takes 0 milliseconds):
std::vector<std::vector<unsigned char>> random_lookup(9); // one entry per bit count 0..8
for (int c = 0; c <= 8; c++)
    random_lookup[c] = { /* numbers with `c` bits set to `1` */ };
Then, we can fill our bit_field as follows (which takes roughly (N / 8) * (R + 3 * C) milliseconds):
for (int i = 0; i < field_size; i++) {
    // pick a random byte among those with exactly gencnt[i] bits set
    bit_field[i] = random_lookup[gencnt[i]][rand() % random_lookup[gencnt[i]].size()];
}
Summing everything up, we have the total execution time:
T = (N / 8) * C
  + (N / 8) * C + (N / 16) * (4 * C) + (N / 16) * (R + 2 * C)
  + (N / 8) * (R + 3 * C)
  = N * (C + (3/16) * R)         (proposed algorithm)
  < N * (R + 4 * A + 2 * C)      (naive baseline algo)
Although it's not truly uniformly random, it does spread the bits out quite evenly and randomly, and it's quite fast and hopefully gets the job done in your use case.
Observe that actually shuffling the bits, which involves swapping via Fisher-Yates, is not required to produce the exact equivalent: a random distribution of the bits.
#include <iostream>
#include <vector>
#include <random>
// shuffle a vector of bools. This requires only counting the number of trues in the vector
// followed by clearing the vector and inserting bool trues to produce an equivalent to
// a bit shuffle. This is cache line friendly and doesn't require swapping.
std::vector<bool> DistributeBitsRandomly(std::vector<bool> bvector)
{
    std::random_device rd;
    static std::mt19937 gen(rd()); //mersenne_twister_engine seeded with rd()
    // count the number of set bits and clear bvector
    int set_bits_count = 0;
    for (int i=0; i < bvector.size(); i++)
        if (bvector[i])
        {
            set_bits_count++;
            bvector[i] = 0;
        }
    // set a bit if a random value in range bvector.size()-bit_loc-1 is
    // less than the number of bits remaining to be placed. This produces exactly the same
    // distribution as a random shuffle but only does an insertion of a 1 bit rather than
    // a swap. It requires counting the number of 1 bits. There are efficient ways
    // of doing this.
    for (int bit_loc = 0; set_bits_count; bit_loc++)
    {
        std::uniform_int_distribution<int> dis(0, bvector.size()-bit_loc-1);
        auto x = dis(gen);
        if (x < set_bits_count)
        {
            bvector[bit_loc] = true;
            set_bits_count--; // one fewer bit left to place
        }
    }
    return bvector;
}
This performs the equivalent of shuffling the bools in a vector<bool>. It is cache-line friendly and involves no swapping. It's presented in executable but simple algorithmic form, as requested by the OP. Much can be done to optimize this, such as improving the speed of bit counting and clearing the array.
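The same count-then-redistribute idea can be applied directly to the 64-bit words from the question. The sketch below reuses the question's `bit_shuffle` signature; the use of `std::bitset<64>::count()` for the population count is my own choice for portability (a compiler popcount intrinsic would be the obvious replacement), and the loop still draws one random number per bit position, so it illustrates correctness rather than speed.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <random>

// Redistributes the set bits of [first, last) uniformly at random by
// selection sampling: position p receives a 1 with probability
// (ones still to place) / (positions still to visit).
template <class URBG>
void bit_shuffle(std::uint64_t* first, std::uint64_t* last, URBG&& g)
{
    assert(first <= last);
    // Pass 1: count the set bits and clear every word.
    std::uint64_t ones = 0;
    for (std::uint64_t* p = first; p != last; ++p) {
        ones += std::bitset<64>(*p).count();
        *p = 0;
    }
    // Pass 2: walk the bit positions in order and place the ones.
    std::uint64_t remaining = static_cast<std::uint64_t>(last - first) * 64;
    for (std::uint64_t* p = first; p != last && ones > 0; ++p) {
        for (int b = 0; b < 64 && ones > 0; ++b, --remaining) {
            std::uniform_int_distribution<std::uint64_t> pick(0, remaining - 1);
            if (pick(g) < ones) {
                *p |= std::uint64_t{1} << b;
                --ones;
            }
        }
    }
}
```

Because memory is written sequentially and never swapped, the access pattern stays friendly to the cache even for long sequences.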
The following test sets 4 bits out of 10, calls the DistributeBitsRandomly "shuffle" routine 100,000 times, and prints the number of times a 1 bit occurs in each of the 10 locations. It should be around 40,000 in each position.
int main()
{
    std::vector<bool> initial{ 1,1,1,1,0,0,0,0,0,0 };
    std::vector<int> totals(initial.size());
    for (int i = 0; i < 100000; i++)
    {
        auto a_distribution = DistributeBitsRandomly(initial);
        for (int ii = 0; ii < totals.size(); ii++)
            if (a_distribution[ii])
                totals[ii]++; // count how often position ii received a 1
    }
    for (auto cnt : totals)
        std::cout << cnt << "\n";
}
Possible Output:

How to generate a list of ascending random integers

I have an external collection containing n elements, from which I want to select some number (k) at random, outputting the indices of those elements to some serialized data file. I want the indices to be output in strict ascending order, and for there to be no duplicates. Both n and k may be quite large, and it is generally not feasible to simply store entire arrays of that size in memory.
The first algorithm I came up with was to pick a random number r[0] from 1 to n-k... and then pick successive random numbers r[i] from r[i-1]+1 to n-k+i, only needing to store two entries for 'r' at any one time. However, a fairly simple analysis reveals that the probability of selecting small numbers is inconsistent with what it would be if the entire set were equally distributed. For example, if n was a billion and k was half a billion, the probability of selecting the first entry with the approach I've just described is very tiny (1 in half a billion), whereas in actuality, since half of the entries are being selected, the first should be selected 50% of the time. Even if I use external sorting to sort k random numbers, I would have to discard any duplicates and try again. As k approaches n, the number of retries would continue to grow, with no guarantee of termination.
I would like to find a O(k) or O(k log k) algorithm to do this, if it is at all possible. The implementation language I will be using is C++11, but descriptions in pseudocode may still be helpful.
If in practice k has the same order of magnitude as n, perhaps a very straightforward O(n) algorithm will suffice:
assert(k <= n);
std::uniform_real_distribution<double> rnd; // uniform in [0, 1)
for (int i = 0; i < n; i++) {
    // select index i with probability (still needed) / (still available)
    if (rnd(engine) * (n - i) < k) {
        std::cout << i << std::endl;
        k--;
    }
}
It produces all ascending sequences with equal probability.
You can solve this recursively in O(k log k) if you partition in the middle of your range, and randomly sample from the hypergeometric probability distribution to choose how many values lie above and below the middle point (i.e. the values of k for each subsequence), then recurse for each:
// Samples the hypergeometric distribution and returns the number of "successes"
// where there are n draws without replacement from a population of N with
// K possible successes.
// Something similar to scipy.stats.hypergeom.rvs in Python.
// In this case, "success" means the selected value lying below the midpoint.
int sample_hypergeometric(int n, int K, int N)
{
    static std::default_random_engine generator; // static so each call continues the sequence
    std::uniform_real_distribution<double> distribution(0.0,1.0);
    int successes = 0;
    for(int trial = 0; trial < n; trial++)
    {
        if((int)(distribution(generator) * N) < K)
        {
            successes++;
            K--; // one fewer success left in the population
        }
        N--; // drawing without replacement shrinks the population
    }
    return successes;
}

select_k_from_n(int start, int k, int n)
{
    if(k == 0)
        return;
    if(k == 1)
    {
        output start + random(1 to n);
        return;
    }
    // find the number of results below the mid-point:
    int k1 = sample_hypergeometric(k, n >> 1, n);
    select_k_from_n(start, k1, n >> 1);
    select_k_from_n(start + (n >> 1), k - k1, n - (n >> 1));
}
Sampling from the binomial distribution could also be used to approximate the hypergeometric distribution with p = (n >> 1) / n, rejecting samples where k1 > (n >> 1).
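The recursion above can be turned into a compilable sketch along the following lines. Several details are my own assumptions rather than part of the answer: indices are 0-based (the range is [start, start + n)), results are appended to a vector instead of being written to a file, and the hypergeometric sampler simply simulates the draws one by one, so a single call takes time linear in the number of draws. A faster sampler would be needed to approach the stated bound.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Number of "successes" in `draws` draws without replacement from a
// population of `population` items, `good` of which count as successes.
template <class URBG>
long long sample_hypergeometric(long long draws, long long good,
                                long long population, URBG& rng)
{
    assert(0 <= good && good <= population);
    assert(0 <= draws && draws <= population);
    long long successes = 0;
    for (long long t = 0; t < draws; ++t) {
        std::uniform_int_distribution<long long> pick(0, population - 1);
        if (pick(rng) < good) { ++successes; --good; }
        --population;
    }
    return successes;
}

// Appends k distinct indices from [start, start + n) to `out` in ascending order.
template <class URBG>
void select_k_from_n(long long start, long long k, long long n,
                     URBG& rng, std::vector<long long>& out)
{
    assert(0 <= k && k <= n);
    if (k == 0) return;
    if (k == 1) {
        std::uniform_int_distribution<long long> pick(0, n - 1);
        out.push_back(start + pick(rng));
        return;
    }
    const long long half = n / 2;
    // how many of the k selected indices fall into the lower half
    const long long k1 = sample_hypergeometric(k, half, n, rng);
    select_k_from_n(start, k1, half, rng, out);
    select_k_from_n(start + half, k - k1, n - half, rng, out);
}
```

Since the lower half is always processed before the upper half, the indices come out already sorted and could be streamed to the output file without being kept in memory.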
As mentioned in my comment, use a std::set<int> to store the randomly generated integers such that the resulting container is inherently sorted and contains no duplicates. Example code snippet:
#include <random>
#include <set>
int main(void) {
    std::set<int> random_set;
    std::random_device rd;
    std::mt19937 mt_eng(rd());
    // min and max of random set range
    const int m = 0; // min
    const int n = 100; // max
    std::uniform_int_distribution<> dist(m,n);
    // number to generate
    const int k = 50;
    for (int i = 0; i < k; ++i) {
        // only non-previously occurring values will be inserted
        if (!random_set.insert(dist(mt_eng)).second)
            --i; // duplicate drawn: retry this iteration
    }
}
Assuming that you can't store k random numbers in memory, you'll have to generate the numbers in strictly ascending order. One way to do it would be to generate a number between 0 and n/k. Call that number x. The next number you have to generate is between x+1 and x + (n-x-1)/(k-1). Continue in that fashion until you've selected k numbers.
Basically, you're dividing the remaining range by the number of values left to generate, and then generating a number in the first section of that range.
An example. You want to generate 3 numbers between 0 and 99, inclusive. So you first generate a number between 0 and 33. Say you pick 10.
So now you need a number between 11 and 99. The remaining range consists of 89 values, and you have two values left to pick. So, 89/2 = 44. You need a number between 11 and 54. Say you pick 36.
Your remaining range is from 37 to 99, and you have one number left to choose. So pick a number at random between 37 and 99.
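This won't give you a uniform distribution over all possible selections, because each number is confined to the first partition of the remaining range, and once you choose a number it's impossible to get a number less than that in a subsequent choice. But it might be good enough for your purposes.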
This pseudocode shows the basic idea.
pick_k_from_n(n, k)
    num_left = k
    last_k = 0;
    while num_left > 0
        // divide the remaining range into num_left partitions
        range_size = (n - last_k) / num_left
        // pick a number in the first partition
        r = random(range_size) + last_k + 1
        last_k = r
        num_left = num_left - 1
Note that this takes O(k) time and requires O(1) extra space.
You can do it in O(k) time with Floyd's algorithm (not Floyd-Warshall, that's a shortest path thing). The only data structure you need is a 1-bit table that will tell you whether or not a number has already been selected. Searching a hash table can be O(1), so this will not be a burden, and can be kept in memory even for very large n (if n is truly huge, you'll have to use a b-tree or bloom filter or something).
To select k items from among n:
for j = n-k+1 to n:
    select random x from 1 to j
    if x is already in hash:
        insert j into hash
    else:
        insert x into hash
That's it. At the end, your hash table will contain a uniformly selected sample of k items from among n. Read them out in order (you may have to pick a type of hash table that allows that).
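A compact C++ sketch of Floyd's sampling loop follows. As an assumption made for the sake of ordered output, it stores the sample in a `std::set` rather than a hash table, which makes the whole thing O(k log k) instead of O(k) but lets the result be read out in ascending order directly. The name `floyd_sample` is hypothetical.

```cpp
#include <cassert>
#include <random>
#include <set>

// Returns k distinct values drawn uniformly from {1, ..., n}.
// std::set keeps them sorted, so iterating the result yields ascending order.
template <class URBG>
std::set<long long> floyd_sample(long long n, long long k, URBG& rng)
{
    assert(0 <= k && k <= n);
    std::set<long long> chosen;
    for (long long j = n - k + 1; j <= n; ++j) {
        std::uniform_int_distribution<long long> pick(1, j);
        const long long x = pick(rng);
        // if x was already taken, j itself is guaranteed to be new
        if (!chosen.insert(x).second)
            chosen.insert(j);
    }
    return chosen;
}
```

Exactly one element is added per iteration, so the loop never retries and always terminates after k steps, even when k is close to n.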
Could you adjust each ascending index selection in a way that compensates for the probability distortion you are describing?
IANAS, but my guess would be that if you pick a random number r between 0 and 1 (that you'll scale to the full remaining index range after the adjustment), you might be able to adjust it by calculating r^(x) (keeping the range in 0..1, but increasing the probability of smaller numbers), with x selected by solving the equation for the probability of the first entry?
Here's an O(k log k + √n)-time algorithm that uses O(√n) words of space. This can be generalized to an O(k + n^(1/c))-time, O(n^(1/c))-space algorithm for any integer constant c.
For intuition, imagine a simple algorithm that uses (e.g.) Floyd's sampling algorithm to generate k of n elements and then radix sorts them in base √n. Instead of remembering what the actual samples are, we'll do a first pass where we run a variant of Floyd's where we remember only the number of samples in each bucket. The second pass is, for each bucket in order, to randomly resample the appropriate number of elements from the bucket range. There's a short proof involving conditional probability that this gives a uniform distribution.
# untested Python code for illustration
# b is the number of buckets (e.g., b ~ sqrt(n))
import random
def first_pass(n, k, b):
    counts = [0] * b  # list of b zeros
    for j in range(n - k, n):
        t = random.randrange(j + 1)
        if t // b >= counts[t % b]:  # intuitively, "t is not in the set"
            counts[t % b] += 1
        else:
            counts[j % b] += 1
    return counts

Whats the efficient way to sum up the elements of an array in following way?

Suppose you are given an array A of size n and an integer k.
Now you have to follow this function:
long long sum(int k)
long long sum=0;
for(int i=0;i<n;i++){
return sum;
what is the most efficient way to find sum?
EDIT: if I am given m(<=100000) queries, and given a different k every time, it becomes very time consuming.
If set of queries changes with each k then you can't do better than in O(n). Your only options for optimizing is to use multiple threads (each thread sums some region of array) or at least ensure that your loop is properly vectorized by compiler (or write vectorized version manually using intrinsics).
But if set of queries is fixed and only k is changed, then you may do in O(log n) by using following optimization.
Preprocess array. This is done only once for all ks:
Sort elements
Make another array of the same length which contains partial sums
For example:
inputArray: 5 1 3 8 7
sortedArray: 1 3 5 7 8
partialSums: 1 4 9 16 24
Now, when new k is given, you need to perform following steps:
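Do a binary search for the given k in sortedArray; it returns the index i of the largest element <= k
Result is partialSums[i] + (partialSums.length - i - 1) * k (if every element is larger than k, the result is simply partialSums.length * k)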
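The preprocessing and query steps above can be sketched as follows. The sketch assumes, as this answer's formula implies, that the quantity being summed is min(A[i], k); the class name `CappedSum` is hypothetical. A prefix-sum array with a leading zero is used so that the "no element <= k" case needs no special handling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Answers "sum over i of min(A[i], k)" in O(log n) per query after
// O(n log n) preprocessing. Hypothetical helper for illustration.
class CappedSum {
public:
    explicit CappedSum(std::vector<int> values)
        : sorted_(std::move(values)), prefix_(sorted_.size() + 1, 0)
    {
        std::sort(sorted_.begin(), sorted_.end());
        // prefix_[c] = sum of the c smallest elements
        for (std::size_t i = 0; i < sorted_.size(); ++i)
            prefix_[i + 1] = prefix_[i] + sorted_[i];
    }

    long long query(int k) const
    {
        // count = number of elements <= k; those contribute their own value
        const std::size_t count =
            std::upper_bound(sorted_.begin(), sorted_.end(), k) - sorted_.begin();
        const long long larger = static_cast<long long>(sorted_.size() - count);
        return prefix_[count] + larger * k; // the rest are capped at k
    }

private:
    std::vector<int> sorted_;
    std::vector<long long> prefix_;
};
```

With the example array 5 1 3 8 7 and k = 5, three elements are <= 5 (sum 9) and two are capped, giving 9 + 2 * 5 = 19.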
You can do way better than that if you can sort the array A[i] and have a secondary array prepared once.
The idea is:
Count how many items are less than k, and just compute the equivalent sum by the formula: count*k
Prepare an helper array which will give you the sum of the items superior to k directly
Step 1: sort the array
std::sort(begin(A), end(A));
Step 2: prepare an helper array
// p_sums[i] = A[i] + ... + A[n-1]; the extra slot p_sums[n] stays 0
std::vector<long long> p_sums(A.size() + 1);
std::partial_sum(rbegin(A), rend(A), rbegin(p_sums) + 1);
long long query(int k) {
    // first skip all items whose value is below k strictly
    auto it = std::lower_bound(begin(A), end(A), k);
    // compute the distance (number of items skipped)
    auto index = std::distance(begin(A), it);
    // do the sum: skipped items count as k, the rest keep their value
    long long result = index*k + p_sums[index];
    return result;
}
The complexity of the query is: O(log(N)) where N is the length of the array A.
The complexity of the preparation is: O(N*log(N)). We could go down to O(N) with a radix sort but I don't think it is useful in your case.
What you do seems absolutely fine. Unless this is really absolutely time critical (that is customers complain that your app is too slow and you measured it, and this function is the problem, in which case you can try some non-portable vector instructions, for example).
Often you can do things more efficiently by looking at them from a higher level. For example, if I write
for (n = 0; n < 1000000; ++n)
printf ("%lld\n", sum (100));
then this will take an awful long time (half a trillion additions) and can be done a lot quicker. Same if you change one element of the array A at a time and recalculate sum each time.
Suppose there are x elements of array A which are no larger than k and set B contains those elements which are larger than k and belongs to A.
Then the result of function sum(k) equals
k * x + sum_b
where sum_b is the sum of the elements belonging to B.
You can first sort the array A and calculate the prefix-sum array pre_A, where
pre_A[i] = pre_A[i - 1] + A[i] (i > 0),
pre_A[0] = 0.
Then for each query k, use binary search on A to find the largest element u which is no larger than k. Assume the index of u is index_u, then sum(k) equals
k * index_u + pre_A[n] - pre_A[index_u]
The time complexity for each query is O(log(n)).
In case array A may be dynamically changed, you can use BST to handle it.

C++: function creation using array

Write a function which has:
input: array of pairs (unique id and weight) of length N, and K <= N
output: K random unique ids (from input array)
Note: when the function is called many times, the frequency with which an id appears in the output should be greater the more weight it has.
Example: id with weight of 5 should appear in the output 5 times more often than id with weight of 1. Also, the amount of memory allocated should be known at compile time, i.e. no additional memory should be allocated.
My question is: how to solve this task?
thanks for responses everybody!
currently I can't understand how weight of pair affects frequency of appearance of pair in the output, can you give me more clear, "for dummy" explanation of how it works?
Assuming a good enough random number generator:
Sum the weights (total_weight)
Repeat K times:
Pick a number between 0 and total_weight (selection)
Find the first pair where the sum of all the weights from the beginning of the array to that pair is greater than or equal to selection
Write the first part of the pair to the output
You need enough storage to store the total weight.
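The steps above can be sketched as follows. Two points are assumptions of mine: the selection is drawn from [1, total_weight] so that the "greater than or equal" test gives each id exactly its weight's share, and the output is returned in a `std::vector` for convenience, which ignores the question's no-allocation constraint. As in the steps above, ids may repeat in the output. The names `IdWeight`, `index_for_selection`, and `pick_weighted` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using IdWeight = std::pair<unsigned, unsigned>; // (id, weight)

// Index of the first pair whose running weight total reaches `selection`,
// where selection is expected to lie in [1, total weight].
std::size_t index_for_selection(const std::vector<IdWeight>& items,
                                unsigned long long selection)
{
    unsigned long long running = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        running += items[i].second;
        if (running >= selection) return i;
    }
    assert(false && "selection exceeds the total weight");
    return items.size();
}

// Draws k ids with probability proportional to weight (with repetition).
template <class URBG>
std::vector<unsigned> pick_weighted(const std::vector<IdWeight>& items,
                                    std::size_t k, URBG& rng)
{
    unsigned long long total_weight = 0;
    for (const IdWeight& item : items) total_weight += item.second;
    assert(total_weight > 0);
    std::uniform_int_distribution<unsigned long long> pick(1, total_weight);
    std::vector<unsigned> out;
    out.reserve(k);
    for (std::size_t n = 0; n < k; ++n)
        out.push_back(items[index_for_selection(items, pick(rng))].first);
    return out;
}
```

An id with weight 5 owns five of the possible selection values and an id with weight 1 owns one, which is why the former shows up about five times as often.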
Ok so you are given input as follows:
(3, 7)
(1, 2)
(2, 5)
(4, 1)
(5, 2)
And you want to pick a random number so that the weight of each id is reflected in the picking, i.e. pick a random number from the following list:
3 3 3 3 3 3 3 1 1 2 2 2 2 2 4 5 5
Initially, I created a temporary array but this can be done in memory as well, you can calculate the size of the list by summing all the weights up = X, in this example = 17
Pick a random number in the range [1, X], and calculate which id should be returned by looping through the list, doing a cumulative addition on the weights. Say I have a random number 8
(3, 7) total = 7 which is < 8
(1, 2) total = 9 which is >= 8 **boom** 1 is your id!
Now since you need K random unique ids, you can create a hashtable from the initial array passed to you to work with. Once you find an id, remove it from the hash and proceed with the algorithm. Edit: note that you create the hashmap only once, initially! Your algorithm will work on this instead of looking through the array. I did not put it at the top to keep the answer clear.
As long as your random calculation is not using any extra memory secretly, you will need to store K random pickings, which are <= N and a copy of the original array so max space requirements at runtime are O(2*N)
Asymptotic runtime is:
O(n) : create copy of original array into hashtable +
(
O(n) : calculate sum of weights +
O(1) : calculate random between range +
O(n) : cumulative totals
) * K random pickings
= O(n*k) overall
This is a good question :)
This solution works with non-integer weights and uses constant space (ie: space complexity = O(1)). It does, however modify the input array, but the only difference in the end is that the elements will be in a different order.
Add the weight of each input to the weight of the following input, starting from the bottom working your way up. Now each weight is actually the sum of that input's weight and all of the previous weights.
sum_weights = the sum of all of the weights, and n = N.
K times:
Choose a random number r in the range [0,sum_weights)
binary search the first n elements for the first slot where the (now summed) weight is greater than or equal to r, i.
Add input[i].id to output.
Subtract input[i-1].weight from input[i].weight (unless i == 0). Now subtract input[i].weight from the following (> i) input weights and also from sum_weights.
Move input[i] to position [n-1] (sliding the intervening elements down one slot). This is the expensive part, as it's O(N) and we do it K times. You can skip this step on the last iteration.
subtract 1 from n
Fix back all of the weights from n-1 down to 1 by subtracting the preceding input's weight
Time complexity is O(K*N). The expensive part (of the time complexity) is shuffling the chosen elements. I suspect there's a clever way to avoid that, but haven't thought of anything yet.
It's unclear what the question means by "output: K random unique Ids". The solution above assumes that this meant that the output ids are supposed to be unique/distinct, but if that's not the case then the problem is even simpler:
Add the weight of each input to the weight of the following input, starting from the bottom working your way up. Now each weight is actually the sum of that input's weight and all of the previous weights.
sum_weights = the sum of all of the weights, and n = N.
K times:
Choose a random number r in the range [0,sum_weights)
binary search the first n elements for the first slot where the (now summed) weight is greater than or equal to r, i.
Add input[i].id to output.
Fix back all of the weights from n-1 down to 1 by subtracting the preceding input's weight
Time complexity is O(K*log(N)).
My short answer: in no way.
Just because the problem definition is incorrect. As Axn brilliantly noticed:
There is a little bit of contradiction going on in the requirement. It states that K <= N. But as K approaches N, the frequency requirement will be contradicted by the Uniqueness requirement. Worst case, if K=N, all elements will be returned (i.e appear with same frequency), irrespective of their weight.
Anyway, when K is pretty small relative to N, calculated frequencies will be pretty close to theoretical values.
The task may be split into two subtasks:
Generate random numbers with a given distribution (specified by weights)
Generate unique random numbers
Generate random numbers with a given distribution
Calculate sum of weights (sumOfWeights)
Generate random number from the range [1; sumOfWeights]
Find an array element where the sum of weights from the beginning of the array is greater than or equal to the generated random number
#include <iostream>
#include <cstdlib>
#include <ctime>
// 0 - id, 1 - weight
typedef unsigned Pair[2];
unsigned Random(Pair* i_set, unsigned* i_indexes, unsigned i_size)
{
    // total weight of the elements still in play
    unsigned sumOfWeights = 0;
    for (unsigned i = 0; i < i_size; ++i)
    {
        const unsigned index = i_indexes[i];
        sumOfWeights += i_set[index][1];
    }
    const unsigned random = rand() % sumOfWeights + 1;
    // find the first element whose cumulative weight reaches the random value
    sumOfWeights = 0;
    unsigned i = 0;
    for (; i < i_size; ++i)
    {
        const unsigned index = i_indexes[i];
        sumOfWeights += i_set[index][1];
        if (sumOfWeights >= random)
            break;
    }
    return i;
}
Generate unique random numbers
The well-known Durstenfeld-Fisher-Yates algorithm may be used to generate unique random numbers. See this great explanation.
It requires space for N indices, so if the value of N is defined at compile time, we are able to allocate the necessary space at compile time.
Now, we have to combine these two algorithms. We just need to use our own Random() function instead of standard rand() in unique numbers generation algorithm.
template<unsigned N, unsigned K>
void Generate(Pair (&i_set)[N], unsigned (&o_res)[K])
{
    unsigned deck[N];
    for (unsigned i = 0; i < N; ++i)
        deck[i] = i;
    unsigned max = N - 1;
    for (unsigned i = 0; i < K; ++i)
    {
        const unsigned index = Random(i_set, deck, max + 1);
        std::swap(deck[max], deck[index]);
        o_res[i] = i_set[deck[max]][0];
        --max; // the chosen element is now outside the active range
    }
}
int main()
{
    srand((unsigned)time(0));
    const unsigned c_N = 5; // N
    const unsigned c_K = 2; // K
    Pair input[c_N] = {{0, 5}, {1, 3}, {2, 2}, {3, 5}, {4, 4}}; // input array
    unsigned result[c_K] = {};
    const unsigned c_total = 1000000; // number of iterations
    unsigned counts[c_N] = {0}; // frequency counters
    for (unsigned i = 0; i < c_total; ++i)
    {
        Generate<c_N, c_K>(input, result);
        for (unsigned j = 0; j < c_K; ++j)
            ++counts[result[j]]; // ids coincide with array indexes here
    }
    unsigned sumOfWeights = 0;
    for (unsigned i = 0; i < c_N; ++i)
        sumOfWeights += input[i][1];
    for (unsigned i = 0; i < c_N; ++i)
        std::cout << (double)counts[i]/c_K/c_total // empirical frequency
                  << " | "
                  << (double)input[i][1]/sumOfWeights // expected frequency
                  << std::endl;
    return 0;
}
N = 5, K = 2
Empirical | Expected
0.253813 | 0.263158
0.16584 | 0.157895
0.113878 | 0.105263
0.253582 | 0.263158
0.212888 | 0.210526
Corner case when weights are actually ignored
N = 5, K = 5
Empirical | Expected
0.2 | 0.263158
0.2 | 0.157895
0.2 | 0.105263
0.2 | 0.263158
0.2 | 0.210526
I do assume that the ids in the output must be unique. This makes this problem a specific instance of random sampling problems.
The first approach that I can think of solves this in O(N^2) time, using O(N) memory (The input array itself plus constant memory).
I assume that the weights are positive.
Let A be the array of pairs.
1) Set N to be A.length
2) calculate the sum of all weights W.
3) Loop K times
3.1) r = rand(0,W), i.e. uniform in [0, W)
3.2) loop on A and find the first index i such that r < A[0].w + ... + A[i].w
3.3) add A[i].id to output
3.4) W = W - A[i].w (do this before A[i] is overwritten)
3.5) A[i] = A[N-1] (or swap if the array contents should be preserved)
3.6) N = N - 1
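The loop above can be sketched in C++ as follows. The sketch assumes positive integer weights, 0-based indexing, and a half-open random range [0, W). It works on a by-value copy of the input and returns a `std::vector`, which is a convenience for illustration and sidesteps the question's fixed-memory requirement. The names `Item` and `weighted_sample_unique` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

struct Item {
    unsigned id;
    unsigned weight; // assumed to be positive
};

// Picks k distinct ids; at each step an id is chosen with probability
// proportional to its weight among the ids not yet chosen. O(N * K) time.
template <class URBG>
std::vector<unsigned> weighted_sample_unique(std::vector<Item> a,
                                             std::size_t k, URBG& rng)
{
    assert(k <= a.size());
    std::size_t n = a.size();
    unsigned long long total = 0;
    for (const Item& item : a) {
        assert(item.weight > 0);
        total += item.weight;
    }
    std::vector<unsigned> out;
    out.reserve(k);
    for (std::size_t picked = 0; picked < k; ++picked) {
        std::uniform_int_distribution<unsigned long long> pick(0, total - 1);
        const unsigned long long r = pick(rng);
        // first index whose cumulative weight exceeds r
        std::size_t i = 0;
        unsigned long long running = a[0].weight;
        while (running <= r) {
            ++i;
            running += a[i].weight;
        }
        out.push_back(a[i].id);
        total -= a[i].weight; // update the total before overwriting a[i]
        a[i] = a[n - 1];      // move the last active element into the gap
        --n;
    }
    return out;
}
```

When k equals the array length every id is returned exactly once, which is the corner case where the weights only influence the order of the output.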