Majority element - parts of an array - c++

I have an array, filled with integers. My job is to find majority element quickly for any part of an array, and I need to do it... log n time, not linear, but beforehand I can take some time to prepare the array.
For example:
1 5 2 7 7 7 8 4 6
And queries:
[4, 7] returns 7
[4, 8] returns 7
[1, 2] returns 0 (no majority element), and so on...
I need to have an answer for each query, if possible, it needs to execute fast.
For preparation, I can use O(n log n) time

O(log n) queries and O(n log n) preprocessing/space could be achieved by finding and using majority intervals with following properties:
For each value from input array there may be one or several majority intervals (or there may be none if elements with these values are too sparse; we don't need majority intervals of length 1 because they may be useful only for query intervals of size 1 which are better handled as a special case).
If query interval lies completely inside one of these majority intervals, corresponding value may be the majority element of this query interval.
If there is no majority interval completely containing query interval, corresponding value cannot be the majority element of this query interval.
Each element of input array is covered by O(log n) majority intervals.
In other words, the only purpose of majority intervals is to provide O(log n) majority element candidates for any query interval.
This algorithm uses following data structures:
List of positions for each value from input array (map<Value, vector<Position>>). Alternatively unordered_map may be used here to improve performance (but we'll need to extract all keys and sort them so that structure #3 is filled in proper order).
List of majority intervals for each value (vector<Interval>).
Data structure for handling queries (vector<small_map<Value, Data>>). Where Data contains two indexes of appropriate vector from structure #1 pointing to next/previous positions of elements with given value. Update: Thanks to #justhalf, it is better to store in Data cumulative frequencies of elements with given value. small_map may be implemented as sorted vector of pairs - preprocessing will append elements already in sorted order and query will use small_map only for linear search.
Preprocessing:
Scan input array and push current position to appropriate vector in structure #1.
Perform steps 3 .. 4 for every vector in structure #1.
Transform list of positions into list of majority intervals. See details below.
For each index of input array covered by one of majority intervals, insert data to appropriate element of structure #3: value and positions of previous/next elements with this value (or cumulative frequency of this value).
Query:
If query interval length is 1, return corresponding element of source array.
For starting point of query interval get corresponding element of 3rd structure's vector. For each element of the map perform step 3. Scan all elements of the map corresponding to ending point of query interval in parallel with this map to allow O(1) complexity for step 3 (instead of O(log log n)).
If the map corresponding to ending point of query interval contains matching value, compute s3[stop][value].prev - s3[start][value].next + 1. If it is greater than half of the query interval, return value. If cumulative frequencies are used instead of next/previous indexes, compute s3[stop+1][value].freq - s3[start][value].freq instead.
If nothing found on step 3, return "Nothing".
Main part of the algorithm is getting majority intervals from list of positions:
Assign weight to each position in the list: number_of_matching_values_to_the_left - number_of_nonmatching_values_to_the_left.
Filter only weights in strictly decreasing order (greedily) to the "prefix" array: for (auto x: positions) if (x < prefix.back()) prefix.push_back(x);.
Filter only weights in strictly increasing order (greedily, backwards) to the "suffix" array: reverse(positions); for (auto x: positions) if (x > suffix.back()) suffix.push_back(x);.
Scan "prefix" and "suffix" arrays together and find intervals from every "prefix" element to corresponding place in "suffix" array and from every "suffix" element to corresponding place in "prefix" array. (If all "suffix" elements' weights are less than given "prefix" element or their position is not to the right of it, no interval generated; if there is no "suffix" element with exactly the weight of given "prefix" element, get nearest "suffix" element with larger weight and extend interval with this weight difference to the right).
Merge overlapping intervals.
Properties 1 .. 3 for majority intervals are guaranteed by this algorithm. As for property #4, the only way I could imagine to cover some element with maximum number of majority intervals is like this: 11111111222233455666677777777. Here element 4 is covered by 2 * log n intervals, so this property seems to be satisfied. See more formal proof of this property at the end of this post.
Example:
For input array "0 1 2 0 0 1 1 0" the following lists of positions would be generated:
value positions
0 0 3 4 7
1 1 5 6
2 2
Positions for value 0 will get the following properties:
weights: 0:1 3:0 4:1 7:0
prefix: 0:1 3:0 (strictly decreasing)
suffix: 4:1 7:0 (strictly increasing when scanning backwards)
intervals: 0->4 3->7 4->0 7->3
merged intervals: 0-7
Positions for value 1 will get the following properties:
weights: 1:0 5:-2 6:-1
prefix: 1:0 5:-2
suffix: 1:0 6:-1
intervals: 1->none 5->6+1 6->5-1 1->none
merged intervals: 4-7
Query data structure:
positions value next prev
0 0 0 x
1..2 0 1 0
3 0 1 1
4 0 2 2
4 1 1 x
5 0 3 2
...
Query [0,4]:
prev[4][0]-next[0][0]+1=2-0+1=3
query size=5
3>2.5, returned result 0
Query [2,5]:
prev[5][0]-next[2][0]+1=2-1+1=2
query size=4
2=2, returned result "none"
Note that there is no attempt to inspect element "1" because its majority interval does not include either of these intervals.
Proof of property #4:
Majority intervals are constructed in such a way that strictly more than 1/3 of all their elements have corresponding value. This ratio is nearest to 1/3 for sub-arrays like any*(m-1) value*m any*m, for example, 01234444456789.
To make this proof more obvious, we could represent each interval as a point in 2D: every possible starting point represented by horizontal axis and every possible ending point represented by vertical axis (see diagram below).
All valid intervals are located on or above diagonal. White rectangle represents all intervals covering some array element (represented as unit-size interval on its lower right corner).
Let's cover this white rectangle with squares of size 1, 2, 4, 8, 16, ... sharing the same lower right corner. This divides white area into O(log n) areas similar to yellow one (and single square of size 1 containing single interval of size 1 which is ignored by this algorithm).
Let's count how many majority intervals may be placed into yellow area. One interval (located at the nearest to diagonal corner) occupies 1/4 of elements belonging to interval at the farthest from diagonal corner (and this largest interval contains all elements belonging to any interval in yellow area). This means that smallest interval contains strictly more than 1/12 values available for whole yellow area. So if we try to place 12 intervals to yellow area, we have not enough elements for different values. So yellow area cannot contain more than 11 majority intervals. And white rectangle cannot contain more than 11 * log n majority intervals. Proof completed.
11 * log n is overestimation. As I said earlier, it's hard to imagine more than 2 * log n majority intervals covering some element. And even this value is much greater than average number of covering majority intervals.
C++11 implementation. See it either at ideone or here:
#include <iostream>
#include <vector>
#include <map>
#include <algorithm>
#include <functional>
#include <random>
constexpr int SrcSize = 1000000;
constexpr int NQueries = 100000;
using src_vec_t = std::vector<int>;
using index_vec_t = std::vector<int>;
using weight_vec_t = std::vector<int>;
using pair_vec_t = std::vector<std::pair<int, int>>;
using index_map_t = std::map<int, index_vec_t>;
using interval_t = std::pair<int, int>;
using interval_vec_t = std::vector<interval_t>;
using small_map_t = std::vector<std::pair<int, int>>;
using query_vec_t = std::vector<small_map_t>;
constexpr int None = -1;
constexpr int Junk = -2;
src_vec_t generate_e()
{ // good query length = 3
src_vec_t src;
std::random_device rd;
std::default_random_engine eng{rd()};
auto exp = std::bind(std::exponential_distribution<>{0.4}, eng);
for (int i = 0; i < SrcSize; ++i)
{
int x = exp();
src.push_back(x);
//std::cout << x << ' ';
}
return src;
}
src_vec_t generate_ep()
{ // good query length = 500
src_vec_t src;
std::random_device rd;
std::default_random_engine eng{rd()};
auto exp = std::bind(std::exponential_distribution<>{0.4}, eng);
auto poisson = std::bind(std::poisson_distribution<int>{100}, eng);
while (int(src.size()) < SrcSize)
{
int x = exp();
int n = poisson();
for (int i = 0; i < n; ++i)
{
src.push_back(x);
//std::cout << x << ' ';
}
}
return src;
}
src_vec_t generate()
{
//return generate_e();
return generate_ep();
}
int trivial(const src_vec_t& src, interval_t qi)
{
int count = 0;
int majorityElement = 0; // will be assigned before use for valid args
for (int i = qi.first; i <= qi.second; ++i)
{
if (count == 0)
majorityElement = src[i];
if (src[i] == majorityElement)
++count;
else
--count;
}
count = 0;
for (int i = qi.first; i <= qi.second; ++i)
{
if (src[i] == majorityElement)
count++;
}
if (2 * count > qi.second + 1 - qi.first)
return majorityElement;
else
return None;
}
index_map_t sort_ind(const src_vec_t& src)
{
int ind = 0;
index_map_t im;
for (auto x: src)
im[x].push_back(ind++);
return im;
}
weight_vec_t get_weights(const index_vec_t& indexes)
{
weight_vec_t weights;
for (int i = 0; i != int(indexes.size()); ++i)
weights.push_back(2 * i - indexes[i]);
return weights;
}
pair_vec_t get_prefix(const index_vec_t& indexes, const weight_vec_t& weights)
{
pair_vec_t prefix;
for (int i = 0; i != int(indexes.size()); ++i)
if (prefix.empty() || weights[i] < prefix.back().second)
prefix.emplace_back(indexes[i], weights[i]);
return prefix;
}
pair_vec_t get_suffix(const index_vec_t& indexes, const weight_vec_t& weights)
{
pair_vec_t suffix;
for (int i = indexes.size() - 1; i >= 0; --i)
if (suffix.empty() || weights[i] > suffix.back().second)
suffix.emplace_back(indexes[i], weights[i]);
std::reverse(suffix.begin(), suffix.end());
return suffix;
}
interval_vec_t get_intervals(const pair_vec_t& prefix, const pair_vec_t& suffix)
{
interval_vec_t intervals;
int prev_suffix_index = 0; // will be assigned before use for correct args
int prev_suffix_weight = 0; // same assumptions
for (int ind_pref = 0, ind_suff = 0; ind_pref != int(prefix.size());)
{
auto i_pref = prefix[ind_pref].first;
auto w_pref = prefix[ind_pref].second;
if (ind_suff != int(suffix.size()))
{
auto i_suff = suffix[ind_suff].first;
auto w_suff = suffix[ind_suff].second;
if (w_pref <= w_suff)
{
auto beg = std::max(0, i_pref + w_pref - w_suff);
if (i_pref < i_suff)
intervals.emplace_back(beg, i_suff + 1);
if (w_pref == w_suff)
++ind_pref;
++ind_suff;
prev_suffix_index = i_suff;
prev_suffix_weight = w_suff;
continue;
}
}
// ind_suff out of bounds or w_pref > w_suff:
auto end = prev_suffix_index + prev_suffix_weight - w_pref + 1;
// end may be out-of-bounds; that's OK if overflow is not possible
intervals.emplace_back(i_pref, end);
++ind_pref;
}
return intervals;
}
interval_vec_t merge(const interval_vec_t& from)
{
using endpoints_t = std::vector<std::pair<int, bool>>;
endpoints_t ep(2 * from.size());
std::transform(from.begin(), from.end(), ep.begin(),
[](interval_t x){ return std::make_pair(x.first, true); });
std::transform(from.begin(), from.end(), ep.begin() + from.size(),
[](interval_t x){ return std::make_pair(x.second, false); });
std::sort(ep.begin(), ep.end());
interval_vec_t to;
int start; // will be assigned before use for correct args
int overlaps = 0;
for (auto& x: ep)
{
if (x.second) // begin
{
if (overlaps++ == 0)
start = x.first;
}
else // end
{
if (--overlaps == 0)
to.emplace_back(start, x.first);
}
}
return to;
}
interval_vec_t get_intervals(const index_vec_t& indexes)
{
auto weights = get_weights(indexes);
auto prefix = get_prefix(indexes, weights);
auto suffix = get_suffix(indexes, weights);
auto intervals = get_intervals(prefix, suffix);
return merge(intervals);
}
void update_qv(
query_vec_t& qv,
int value,
const interval_vec_t& intervals,
const index_vec_t& iv)
{
int iv_ind = 0;
int qv_ind = 0;
int accum = 0;
for (auto& interval: intervals)
{
int i_begin = interval.first;
int i_end = std::min<int>(interval.second, qv.size() - 1);
while (iv[iv_ind] < i_begin)
{
++accum;
++iv_ind;
}
qv_ind = std::max(qv_ind, i_begin);
while (qv_ind <= i_end)
{
qv[qv_ind].emplace_back(value, accum);
if (iv[iv_ind] == qv_ind)
{
++accum;
++iv_ind;
}
++qv_ind;
}
}
}
void print_preprocess_stat(const index_map_t& im, const query_vec_t& qv)
{
double sum_coverage = 0.;
int max_coverage = 0;
for (auto& x: qv)
{
sum_coverage += x.size();
max_coverage = std::max<int>(max_coverage, x.size());
}
std::cout << " size = " << qv.size() - 1 << '\n';
std::cout << " values = " << im.size() << '\n';
std::cout << " max coverage = " << max_coverage << '\n';
std::cout << " avg coverage = " << sum_coverage / qv.size() << '\n';
}
query_vec_t preprocess(const src_vec_t& src)
{
query_vec_t qv(src.size() + 1);
auto im = sort_ind(src);
for (auto& val: im)
{
auto intervals = get_intervals(val.second);
update_qv(qv, val.first, intervals, val.second);
}
print_preprocess_stat(im, qv);
return qv;
}
int do_query(const src_vec_t& src, const query_vec_t& qv, interval_t qi)
{
if (qi.first == qi.second)
return src[qi.first];
auto b = qv[qi.first].begin();
auto e = qv[qi.second + 1].begin();
while (b != qv[qi.first].end() && e != qv[qi.second + 1].end())
{
if (b->first < e->first)
{
++b;
}
else if (e->first < b->first)
{
++e;
}
else // if (e->first == b->first)
{
// hope this doesn't overflow
if (2 * (e->second - b->second) > qi.second + 1 - qi.first)
return b->first;
++b;
++e;
}
}
return None;
}
int main()
{
std::random_device rd;
std::default_random_engine eng{rd()};
auto poisson = std::bind(std::poisson_distribution<int>{500}, eng);
int majority = 0;
int nonzero = 0;
int failed = 0;
auto src = generate();
auto qv = preprocess(src);
for (int i = 0; i < NQueries; ++i)
{
int size = poisson();
auto ud = std::uniform_int_distribution<int>(0, src.size() - size - 1);
int start = ud(eng);
int stop = start + size;
auto res1 = do_query(src, qv, {start, stop});
auto res2 = trivial(src, {start, stop});
//std::cout << size << ": " << res1 << ' ' << res2 << '\n';
if (res2 != res1)
++failed;
if (res2 != None)
{
++majority;
if (res2 != 0)
++nonzero;
}
}
std::cout << "majority elements = " << 100. * majority / NQueries << "%\n";
std::cout << " nonzero elements = " << 100. * nonzero / NQueries << "%\n";
std::cout << " queries = " << NQueries << '\n';
std::cout << " failed = " << failed << '\n';
return 0;
}
Related work:
As pointed in other answer to this question, there is other work where this problem is already solved: "Range majority in constant time and linear space" by S. Durocher, M. He, I Munro, P.K. Nicholson, M. Skala.
Algorithm presented in this paper has better asymptotic complexities for query time: O(1) instead of O(log n) and for space: O(n) instead of O(n log n).
Better space complexity allows this algorithm to process larger data sets (comparing to the algorithm proposed in this answer). Less memory needed for preprocessed data and more regular data access pattern, most likely, allow this algorithm to preprocess data more quickly. But it is not so easy with query time...
Let's suppose we have input data most favorable to algorithm from the paper: n=1000000000 (it's hard to imagine a system with more than 10..30 gigabytes of memory, in year 2013).
Algorithm proposed in this answer needs to process up to 120 (or 2 query boundaries * 2 * log n) elements for each query. But it performs very simple operations, similar to linear search. And it sequentially accesses two contiguous memory areas, so it is cache-friendly.
Algorithm from the paper needs to perform up to 20 operations (or 2 query boundaries * 5 candidates * 2 wavelet tree levels) for each query. This is 6 times less. But each operation is more complex. Each query for succinct representation of bit counters itself contains a linear search (which means 20 linear searches instead of one). Worst of all, each such operation should access several independent memory areas (unless query size and therefore quadruple size is very small), so query is cache-unfriendly. Which means each query (while is a constant-time operation) is pretty slow, probably slower than in algorithm proposed here. If we decrease input array size, increased are the chances that proposed here algorithm is quicker.
Practical disadvantage of algorithm in the paper is wavelet tree and succinct bit counter implementation. Implementing them from scratch may be pretty time consuming. Using a pre-existing implementation is not always convenient.

the trick
When looking for a majority element, you may discard intervals that do not have a majority element. See Find the majority element in array. This allows you to solve this quite simply.
preparation
At preparation time, recursively keep dividing the array into two halves and store these array intervals in a binary tree. For each node, count the occurrence of each element in the array interval. You need a data structure that offers O(1) inserts and reads. I suggest using an unsorted_multiset, which on average behaves as needed (but worst case inserts are linear). Also check if the interval has a majority element and store it if it does.
runtime
At runtime, when asked to compute the majority element for a range, dive into the tree to compute the set of intervals that covers the given range exactly. Use the trick to combine these intervals.
If we have array interval 7 5 5 7 7 7, with majority element 7, we can split off and discard 5 5 7 7 since it has no majority element. Effectively the fives have gobbled up two of the sevens. What's left is an array 7 7, or 2x7. Call this number 2 the majority count of the majority element 7:
The majority count of a majority element of an array interval is the
occurrence count of the majority element minus the combined occurrence
of all other elements.
Use the following rules to combine intervals to find the potential majority element:
Discard the intervals that have no majority element
Combining two arrays with the same majority element is easy, just add up the element's majority counts. 2x7 and 3x7 become 5x7
When combining two arrays with different majority elements, the higher majority count wins. Subtract the lower majority count from the higher to find the resulting majority count. 3x7 and 2x3 become 1x7.
If their majority elements are different but have have equal majority counts, disregard both arrays. 3x7 and 3x5 cancel each other out.
When all intervals have been either discarded or combined, you are either left with nothing, in which case there is no majority element. Or you have one combined interval containing a potential majority element. Lookup and add this element's occurrence counts in all array intervals (also the previously discarded ones) to check if it really is the majority element.
example
For the array 1,1,1,2,2,3,3,2,2,2,3,2,2, you get the tree (majority count x majority element listed in brackets)
1,1,1,2,2,3,3,2,2,2,3,2,2
(1x2)
/ \
1,1,1,2,2,3,3 2,2,2,3,2,2
(4x2)
/ \ / \
1,1,1,2 2,3,3 2,2,2 3,2,2
(2x1) (1x3) (3x2) (1x2)
/ \ / \ / \ / \
1,1 1,2 2,3 3 2,2 2 3,2 2
(1x1) (1x3) (2x2) (1x2) (1x2)
/ \ / \ / \ / \ / \
1 1 1 2 2 3 2 2 3 2
(1x1) (1x1)(1x1)(1x2)(1x2)(1x3) (1x2)(1x2) (1x3) (1x2)
Range [5,10] (1-indexed) is covered by the set of intervals 2,3,3 (1x3), 2,2,2 (3x2). They have different majority elements. Subtract their majority counts, you're left with 2x2. So 2 is the potential majority element. Lookup and sum the actual occurrence counts of 2 in the arrays: 1+3 = 4 out of 6. 2 is the majority element.
Range [1,10] is covered by the set of intervals 1,1,1,2,2,3,3 (no majority element) and 2,2,2 (3x2). Disregard the first interval since it has no majority element, so 2 is the potential majority element. Sum the occurrence counts of 2 in all intervals: 2+3 = 5 out of 10. There is no majority element.

Actually, it can be done in constant time and linear space(!)
See https://cs.stackexchange.com/questions/16671/range-majority-queries-most-freqent-element-in-range and S. Durocher, M. He, I Munro, P.K. Nicholson, M. Skala, Range majority in constant time and linear space, Information and Computation 222 (2013) 169–179, Elsevier.
Their preparation time is O(n log n), the space needed is O(n) and queries are O(1). It is a theoretical paper and I don't claim to understand all of it but it seems far from impossible to implement. They're using wavelet trees.
For an implementation of wavelet trees, see https://github.com/fclaude/libcds

If you have unlimited memory you can and limited data range (like short int) do it even in O(N) time.
Go through array and count number of 1s, 2s, 3s, eta (number of entries for each value you have in array). You will need additional array X with sizeof(YouType) elements for this.
Go through array X and find maximum.
In total O(1) + O(N) operations.
Also you can limit yourself with O(N) memory, if you use map instead of array X.
But then you will need to find element on each iteration at stage 1. Therefore you will need O(N*log(N)) time in total.

You can use MAX Heap, with frequency of number as a deciding factor for Keeping Max Heap property,
I meant, e.g. for following input array
1 5 2 7 7 7 8 4 6 5
Heap would have all distinct elements with their frequency associated with them
Element = 1 Frequency = 1,
Element = 5 Frequency = 2,
Element = 2 Frequency = 1,
Element = 7 Frequency = 3,
Element = 8 Frequency = 1,
Element = 4 Frequency = 1,
Element = 6 Frequency = 1
As its MAX heap, Element 7 with frequency 3 would be at the root level,
Just check whether input range contains this element, if yes then this is the answer if no, then go to left subtree or right subtree as per input range and perform same checks.
O(N) would be required only once while creating a heap, but once its created, searching will be efficient.

Edit: Sorry, I was solving a different problem.
Sort the array and build an ordered list of pairs (value, number_of_occurrences) - it's O(N log N). Starting with
1 5 2 7 7 7 8 4 6
it will be
(1,1) (2,1) (4,1) (5,1) (6,1) (7,3) (8,1)
On top of this array, build a binary tree with pairs (best_value_or_none, max_occurrences). It will look like:
(1,1) (2,1) (4,1) (5,1) (6,1) (7,3) (8,1)
\ / \ / \ / |
(0,1) (0,1) (7,3) (8,1)
\ / \ /
(0,1) (7,3)
\ /
(7,3)
This structure definitely has a fancy name, but I don't remember it :)
From here, it's O(log N) to fetch the mode of any interval. Any interval can be split into O(log N) precomputed intervals; for example:
[4, 7] = [4, 5] + [6, 7]
f([4,5]) = (0,1)
f([6,7]) = (7,3)
and the result is (7,3).

Related

find difference between max and min element within a range of an array

Say I got following array
a = [4 5 2 1 3 4]
and I want to find the difference between two elements (the max and min) excluding some consecutive elements.
For example, excluding the 2nd and 3rd so that I need to find the difference of max/min of:
a = [4 1 3 4]
which in this case is
diff = 4-1
Now I am looking for an efficient algorithm that can do this over and over.
I was considering having a prefix and suffix but not sure how to move forward
So we have an array a of size N and M queries of this form: "Excluding the interval [L, R], what's the difference between the max and min element in a?". Consider we have an efficient way to query the min and max value on an arbitrary interval [X, Y], then we can use the following algorithm:
Query min/max on the interval [0, L)
Query min/max on the interval (R, N)
Combine the min and max from both intervals
Subtract the two
Later edit: I initially proposed a solution much more complicated than necessary. I will leave it if you are curious about other approaches, but here it is a simpler one.
As we know that we will always query only intervals of form [0, something) and (something, N), we could precompute the min/max values in linear time. Consider storing them like this:
min_from_left[i] = minimum item considering only items 0..i
min_from_right[i] = minimum item considering only items i..N
same for max
Now, the minimum value excluding the interval [L, R] is min(min_from_left[L-1], min_from_right[R+1]) (I omitted the bound checks). Thus, you can achieve O(N) for precomputing and O(1) per query.
The generic (but unnecessary approach)
To find the min/max value in a given interval you have plenty of options, it just depends what are your time and memory constraints and if you need updates or not:
Segment trees (support updates)
Sparse tables (does not support updates)
Generally, you can obtain O(N + M * log N) time complexity if you need updates or O(N log N + M) if you don't need them.
See more options here https://cp-algorithms.com/sequences/rmq.html
Just scan the array. Store min and max values while ignoring ones that are excluded.
Does this work?
#include <iostream>
using namespace std;
int main()
{
int arr[6] = { 4, 5, 2, 1, 3, 4 };
int max = arr[0], min = arr[0];
int first, second;
cout << "Enter Range (first value): ";
cin >> first; // in your example, first = 1
cout << "Enter Range (last value): ";
cin >> second; // second = 2, in your example.
for (int i = 0; i < 6; i++)
{
if (i >= first && i <= second)
continue;
if (arr[i] < min)
min = arr[i];
if (arr[i] > max)
max = arr[i];
}
cout << "Difference between max and min is " << max - min << endl;
return 0;
}

How to erase elements more efficiently from a vector or set?

Problem statement:
Input:
First two inputs are integers n and m. n is the number of knights fighting in the tournament (2 <= n <= 100000, 1 <= m <= n-1). m is the number of battles that will take place.
The next line contains n power levels.
The next m lines contain two integers l and r, indicating the range of knight positions to compete in the ith battle.
After each battle, all nights apart from the one with the highest power level will be eliminated.
The range for each battle is given in terms of the new positions of the knights, not the original positions.
Output:
Output m lines, the ith line containing the original positions (indices) of the knights from that battle. Each line is in ascending order.
Sample Input:
8 4
1 0 5 6 2 3 7 4
1 3
2 4
1 3
0 1
Sample Output:
1 2
4 5
3 7
0
Here is a visualisation of this process.
1 2
[(1,0),(0,1),(5,2),(6,3),(2,4),(3,5),(7,6),(4,7)]
-----------------
4 5
[(1,0),(6,3),(2,4),(3,5),(7,6),(4,7)]
-----------------
3 7
[(1,0),(6,3),(7,6),(4,7)]
-----------------
0
[(1,0),(7,6)]
-----------
[(7,6)]
I have solved this problem. My program produces the correct output, however, it is O(n*m) = O(n^2). I believe that if I erase knights more efficiently from the vector, efficiency can be increased. Would it be more efficient to erase elements using a set? I.e. erase contiguous segments rather that individual knights. Is there an alternative way to do this that is more efficient?
#define INPUT1(x) scanf("%d", &x)
#define INPUT2(x, y) scanf("%d%d", &x, &y)
#define OUTPUT1(x) printf("%d\n", x);
int main(int argc, char const *argv[]) {
int n, m;
INPUT2(n, m);
vector< pair<int,int> > knights(n);
for (int i = 0; i < n; i++) {
int power;
INPUT(power);
knights[i] = make_pair(power, i);
}
while(m--) {
int l, r;
INPUT2(l, r);
int max_in_range = knights[l].first;
for (int i = l+1; i <= r; i++) if (knights[i].first > max_in_range) {
max_in_range = knights[i].first;
}
int offset = l;
int range = r-l+1;
while (range--) {
if (knights[offset].first != max_in_range) {
OUTPUT1(knights[offset].second));
knights.erase(knights.begin()+offset);
}
else offset++;
}
printf("\n");
}
}
Well, removing from vector wouldn't be efficient for sure. Removing from set, or unordered set would be more effective (use iterators instead of indexes).
Yet the problem will still remain O(n^2), because you have two nested whiles running n*m times.
--EDIT--
I believe I understand the question now :)
First let's calculate the complexity of your code above. Your worst case would be the case that max range in all battles is 1 (two nights for each battle) and the battles are not ordered with respect to the position. Which means you have m battles (in this case m = n-1 ~= O(n))
The first while loop runs n times
For runs for once every time which makes it n*1 = n in total
The second while loop runs once every time which makes it n again.
Deleting from vector means n-1 shifts that makes it O(n).
Thus with the complexity of the vector total complexity is O(n^2)
First of all, you don't really need the inner for loop. Take the first knight as the max in range, compare the rest in the range one-by-one and remove the defeated ones.
Now, i believe it can be done in O(nlogn) with using std::map. The key to the map is the position and the value is the level of the knight.
Before proceeding, finding and removing an element in map is logarithmic, iterating is constant.
Finally, your code should look like:
while(m--) // n times
strongest = map.find(first_position); // find is log(n) --> n*log(n)
for (opponent = next of strongest; // this will run 1 times, since every range is 1
opponent in range;
opponent = next opponent) // iterating is constant
// removing from map is log(n) --> n * 1 * log(n)
if strongest < opponent
remove strongest, opponent is the new strongest
else
remove opponent, (be careful to remove it after iterating to next)
Ok, now the upper bound would be O(2*nlogn) = O(nlogn). If the ranges increases, that makes the run time of upper loop decrease but increases the number of remove operations. I'm sure the upper bound won't change, let's make it a homework for you to calculate :)
A solution with a treap is pretty straightforward.
For each query, you need to split the treap by implicit key to obtain the subtree that corresponds to the [l, r] range (it takes O(log n) time).
After that, you can iterate over the subtree and find the knight with the maximum strength. After that, you just need to merge the [0, l) and [r + 1, end) parts of the treap with the node that corresponds to this knight.
It's clear that all parts of the solution except for the subtree traversal and printing work in O(log n) time per query. However, each operation reinserts only one knight and erase the rest from the range, so the size of the output (and the sum of sizes of subtrees) is linear in n. So the total time complexity is O(n log n).
I don't think you can solve with standard stl containers because there'no standard container that supports getting an iterator by index quickly and removing arbitrary elements.

How to generate a list of ascending random integers

I have an external collection containing n elements that I want to select some number (k) of them at random, outputting the indices of those elements to some serialized data file. I want the indices to be output in strict ascending order, and for there to be no duplicates. Both n and k may be quite large, and it is generally not feasible to simply store entire arrays in memory of that size.
The first algorithm I came up with was to pick a random number r[0] from 1 to n-k... and then pick a successive random numbers r[i] from r[i-1]+1 to n-k+i, only needing to store two entries for 'r' at any one time. However, a fairly simple analysis reveals the the probability for selecting small numbers is inconsistent with what could have been if the entire set was equally distributed. For example, if n was a billion and k was half a billion, the probability of selecting the first entry with the approach I've just described is very tiny (1 in half a billion), where in actuality since half of the entries are being selected, the first should be selected 50% of the time. Even if I use external sorting to sort k random numbers, I would have to discard any duplicates, and try again. As k approaches n, the number of retries would continue to grow, with no guarantee of termination.
I would like to find a O(k) or O(k log k) algorithm to do this, if it is at all possible. The implementation language I will be using is C++11, but descriptions in pseudocode may still be helpful.
If in practice k has the same order of magnitude as n, perhaps very straightforward O(n) algorithm will suffice:
assert(k <= n);
std::uniform_real_distribution rnd;
for (int i = 0; i < n; i++) {
if (rnd(engine) * (n - i) < k) {
std::cout << i << std::endl;
k--;
}
}
It produces all ascending sequences with equal probability.
You can solve this recursively in O(k log k) if you partition in the middle of your range, and randomly sample from the hypergeometric probability distribution to choose how many values lie above and below the middle point (i.e. the values of k for each subsequence), then recurse for each:
int sample_hypergeometric(int n, int K, int N) // samples hypergeometric distribution and
// returns number of "successes" where there are n draws without replacement from
// a population of N with K possible successes.
// Something similar to scipy.stats.hypergeom.rvs in Python.
// In this case, "success" means the selected value lying below the midpoint.
{
std::default_random_engine generator;
std::uniform_real_distribution<double> distribution(0.0,1.0);
int successes = 0;
for(int trial = 0; trial < n; trial++)
{
if((int)(distribution(generator) * N) < K)
{
successes++;
K--;
}
N--;
}
return successes;
}
select_k_from_n(int start, int k, int n)
{
if(k == 0)
return;
if(k == 1)
{
output start + random(1 to n);
return;
}
// find the number of results below the mid-point:
int k1 = sample_hypergeometric(k, n >> 1, n);
select_k_from_n(start, k1, n >> 1);
select_k_from_n(start + (n >> 1), k - k1, n - (n >> 1));
}
Sampling from the binomial distribution could also be used to approximate the hypergeometric distribution with p = (n >> 1) / n, rejecting samples where k1 > (n >> 1).
As mentioned in my comment, use a std::set<int> to store the randomly generated integers such that the resulting container is inherently sorted and contains no duplicates. Example code snippet:
#include <random>
#include <set>
int main(void) {
std::set<int> random_set;
std::random_device rd;
std::mt19937 mt_eng(rd());
// min and max of random set range
const int m = 0; // min
const int n = 100; // max
std::uniform_int_distribution<> dist(m,n);
// number to generate
const int k = 50;
for (int i = 0; i < k; ++i) {
// only non-previously occurring values will be inserted
if (!random_set.insert(dist(mt_eng)).second)
--i;
}
}
Assuming that you can't store k random numbers in memory, you'll have to generate the numbers in strict random order. One way to do it would be to generate a number between 0 and n/k. Call that number x. The next number you have to generate is between x+1 and (n-x)/(k-1). Continue in that fashion until you've selected k numbers.
Basically, you're dividing the remaining range by the number of values left to generate, and then generating a number in the first section of that range.
An example. You want to generate 3 numbers between 0 and 99, inclusive. So you first generate a number between 0 and 33. Say you pick 10.
So now you need a number between 11 and 99. The remaining range consists of 89 values, and you have two values left to pick. So, 89/2 = 44. You need a number between 11 and 54. Say you pick 36.
Your remaining range is from 37 to 99, and you have one number left to choose. So pick a number at random between 37 and 99.
This won't give you a normal distribution, as once you choose a number it's impossible to get a number less than that in a subsequent choice. But it might be good enough for your purposes.
This pseudocode shows the basic idea.
pick_k_from_n(n, k)
{
num_left = k
last_k = 0;
while num_left > 0
{
// divide the remaining range into num_left partitions
range_size = (n - last_k) / num_left
// pick a number in the first partition
r = random(range_size) + last_k + 1
output(r)
last_k = r
num_left = num_left - 1
}
}
Note that this takes O(k) time and requires O(1) extra space.
You can do it in O(k) time with Floyd's algorithm (not Floyd-Warshall, that's a shortest path thing). The only data structure you need is a 1-bit table that will tell you whether or not a number has already been selected. Searching a hash table can be O(1), so this will not be a burden, and can be kept in memory even for very large n (if n is truly huge, you'll have to use a b-tree or bloom filter or something).
To select k items from among n:
for j = n-k+1 to n:
select random x from 1 to j
if x is already in hash:
insert j into hash
else
insert x into hash
That's it. At the end, your hash table will contain a uniformly selected sample of k items from among n. Read them out in order (you may have to pick a type of hash table that allows that).
Could you adjust each ascending index selection in a way that compensates for the probability distortion you are describing?
IANAS, but my guess would be that if you pick a random number r between 0 and 1 (that you'll scale to the full remaining index range after the adjustment), you might be able to adjust it by calculating r^(x) (keeping the range in 0..1, but increasing the probability of smaller numbers), with x selected by solving the equation for the probability of the first entry?
Here's an O(k log k + √n)-time algorithm that uses O(√n) words of space. This can be generalized to an O(k + n^(1/c))-time, O(n^(1/c))-space algorithm for any integer constant c.
For intuition, imagine a simple algorithm that uses (e.g.) Floyd's sampling algorithm to generate k of n elements and then radix sorts them in base √n. Instead of remembering what the actual samples are, we'll do a first pass where we run a variant of Floyd's where we remember only the number of samples in each bucket. The second pass is, for each bucket in order, to randomly resample the appropriate number of elements from the bucket range. There's a short proof involving conditional probability that this gives a uniform distribution.
# untested Python code for illustration
# b is the number of buckets (e.g., b ~ sqrt(n))
import random
def first_pass(n, k, b):
counts = [0] * b # list of b zeros
for j in range(n - k, n):
t = random.randrange(j + 1)
if t // b >= counts[t % b]: # intuitively, "t is not in the set"
counts[t % b] += 1
else:
counts[j % b] += 1
return counts

n-th or Arbitrary Combination of a Large Set

Say I have a set of numbers from [0, ....., 499]. Combinations are currently being generated sequentially using the C++ std::next_permutation. For reference, the size of each tuple I am pulling out is 3, so I am returning sequential results such as [0,1,2], [0,1,3], [0,1,4], ... [497,498,499].
Now, I want to parallelize the code that this is sitting in, so a sequential generation of these combinations will no longer work. Are there any existing algorithms for computing the ith combination of 3 from 500 numbers?
I want to make sure that each thread, regardless of the iterations of the loop it gets, can compute a standalone combination based on the i it is iterating with. So if I want the combination for i=38 in thread 1, I can compute [1,2,5] while simultaneously computing i=0 in thread 2 as [0,1,2].
EDIT Below statement is irrelevant, I mixed myself up
I've looked at algorithms that utilize factorials to narrow down each individual element from left to right, but I can't use these as 500! sure won't fit into memory. Any suggestions?
Here is my shot:
int k = 527; //The kth combination is calculated
int N=500; //Number of Elements you have
int a=0,b=1,c=2; //a,b,c are the numbers you get out
while(k >= (N-a-1)*(N-a-2)/2){
k -= (N-a-1)*(N-a-2)/2;
a++;
}
b= a+1;
while(k >= N-1-b){
k -= N-1-b;
b++;
}
c = b+1+k;
cout << "["<<a<<","<<b<<","<<c<<"]"<<endl; //The result
Got this thinking about how many combinations there are until the next number is increased. However it only works for three elements. I can't guarantee that it is correct. Would be cool if you compare it to your results and give some feedback.
If you are looking for a way to obtain the lexicographic index or rank of a unique combination instead of a permutation, then your problem falls under the binomial coefficient. The binomial coefficient handles problems of choosing unique combinations in groups of K with a total of N items.
I have written a class in C# to handle common functions for working with the binomial coefficient. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
Converts the K-indexes to the proper lexicographic index or rank of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the set.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it is also faster than older iterative solutions.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
The following tested code will iterate through each unique combinations:
public void Test10Choose5()
{
String S;
int Loop;
int N = 500; // Total number of elements in the set.
int K = 3; // Total number of elements in each group.
// Create the bin coeff object required to get all
// the combos for this N choose K combination.
BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
// The Kindexes array specifies the indexes for a lexigraphic element.
int[] KIndexes = new int[K];
StringBuilder SB = new StringBuilder();
// Loop thru all the combinations for this N choose K case.
for (int Combo = 0; Combo < NumCombos; Combo++)
{
// Get the k-indexes for this combination.
BC.GetKIndexes(Combo, KIndexes);
// Verify that the Kindexes returned can be used to retrive the
// rank or lexigraphic order of the KIndexes in the table.
int Val = BC.GetIndex(true, KIndexes);
if (Val != Combo)
{
S = "Val of " + Val.ToString() + " != Combo Value of " + Combo.ToString();
Console.WriteLine(S);
}
SB.Remove(0, SB.Length);
for (Loop = 0; Loop < K; Loop++)
{
SB.Append(KIndexes[Loop].ToString());
if (Loop < K - 1)
SB.Append(" ");
}
S = "KIndexes = " + SB.ToString();
Console.WriteLine(S);
}
}
You should be able to port this class over fairly easily to C++. You probably will not have to port over the generic part of the class to accomplish your goals. Your test case of 500 choose 3 yields 20,708,500 unique combinations, which will fit in a 4 byte int. If 500 choose 3 is simply an example case and you need to choose combinations greater than 3, then you will have to use longs or perhaps fixed point int.
You can describe a particular selection of 3 out of 500 objects as a triple (i, j, k), where i is a number from 0 to 499 (the index of the first number), j ranges from 0 to 498 (the index of the second, skipping over whichever number was first), and k ranges from 0 to 497 (index of the last, skipping both previously-selected numbers). Given that, it's actually pretty easy to enumerate all the possible selections: starting with (0,0,0), increment k until it gets to its maximum value, then increment j and reset k to 0 and so on, until j gets to its maximum value, and so on, until j gets to its own maximum value; then increment i and reset both j and k and continue.
If this description sounds familiar, it's because it's exactly the same way that incrementing a base-10 number works, except that the base is much funkier, and in fact the base varies from digit to digit. You can use this insight to implement a very compact version of the idea: for any integer n from 0 to 500*499*498, you can get:
struct {
int i, j, k;
} triple;
triple AsTriple(int n) {
triple result;
result.k = n % 498;
n = n / 498;
result.j = n % 499;
n = n / 499;
result.i = n % 500; // unnecessary, any legal n will already be between 0 and 499
return result;
}
void PrintSelections(triple t) {
int i, j, k;
i = t.i;
j = t.j + (i <= j ? 1 : 0);
k = t.k + (i <= k ? 1 : 0) + (j <= k ? 1 : 0);
std::cout << "[" << i << "," << j << "," << k << "]" << std::endl;
}
void PrintRange(int start, int end) {
for (int i = start; i < end; ++i) {
PrintSelections(AsTriple(i));
}
}
Now to shard, you can just take the numbers from 0 to 500*499*498, divide them into subranges in any way you'd like, and have each shard compute the permutation for each value in its subrange.
This trick is very handy for any problem in which you need to enumerate subsets.

C++: function creation using array

Write a function which has:
input: array of pairs (unique id and weight) length of N, K =< N
output: K random unique ids (from input array)
Note: being called many times frequency of appearing of some Id in the output should be greater the more weight it has.
Example: id with weight of 5 should appear in the output 5 times more often than id with weight of 1. Also, the amount of memory allocated should be known at compile time, i.e. no additional memory should be allocated.
My question is: how to solve this task?
EDIT
thanks for responses everybody!
currently I can't understand how weight of pair affects frequency of appearance of pair in the output, can you give me more clear, "for dummy" explanation of how it works?
Assuming a good enough random number generator:
Sum the weights (total_weight)
Repeat K times:
Pick a number between 0 and total_weight (selection)
Find the first pair where the sum of all the weights from the beginning of the array to that pair is greater than or equal to selection
Write the first part of the pair to the output
You need enough storage to store the total weight.
Ok so you are given input as follows:
(3, 7)
(1, 2)
(2, 5)
(4, 1)
(5, 2)
And you want to pick a random number so that the weight of each id is reflected in the picking, i.e. pick a random number from the following list:
3 3 3 3 3 3 3 1 1 2 2 2 2 2 4 5 5
Initially, I created a temporary array but this can be done in memory as well, you can calculate the size of the list by summing all the weights up = X, in this example = 17
Pick a random number between [0, X-1], and calculate which which id should be returned by looping through the list, doing a cumulative addition on the weights. Say I have a random number 8
(3, 7) total = 7 which is < 8
(1, 2) total = 9 which is >= 8 **boom** 1 is your id!
Now since you need K random unique ids you can create a hashtable from initial array passed to you to work with. Once you find an id, remove it from the hash and proceed with algorithm. Edit Note that you create the hashmap initially only once! You algorithm will work on this instead of looking through the array. I did not put in in the top to keep the answer clear
As long as your random calculation is not using any extra memory secretly, you will need to store K random pickings, which are <= N and a copy of the original array so max space requirements at runtime are O(2*N)
Asymptotic runtime is :
O(n) : create copy of original array into hastable +
(
O(n) : calculate sum of weights +
O(1) : calculate random between range +
O(n) : cumulative totals
) * K random pickings
= O(n*k) overall
This is a good question :)
This solution works with non-integer weights and uses constant space (ie: space complexity = O(1)). It does, however modify the input array, but the only difference in the end is that the elements will be in a different order.
Add the weight of each input to the weight of the following input, starting from the bottom working your way up. Now each weight is actually the sum of that input's weight and all of the previous weights.
sum_weights = the sum of all of the weights, and n = N.
K times:
Choose a random number r in the range [0,sum_weights)
binary search the first n elements for the first slot where the (now summed) weight is greater than or equal to r, i.
Add input[i].id to output.
Subtract input[i-1].weight from input[i].weight (unless i == 0). Now subtract input[i].weight from to following (> i) input weights and also sum_weight.
Move input[i] to position [n-1] (sliding the intervening elements down one slot). This is the expensive part, as it's O(N) and we do it K times. You can skip this step on the last iteration.
subtract 1 from n
Fix back all of the weights from n-1 down to 1 by subtracting the preceding input's weight
Time complexity is O(K*N). The expensive part (of the time complexity) is shuffling the chosen elements. I suspect there's a clever way to avoid that, but haven't thought of anything yet.
Update
It's unclear what the question means by "output: K random unique Ids". The solution above assumes that this meant that the output ids are supposed to be unique/distinct, but if that's not the case then the problem is even simpler:
Add the weight of each input to the weight of the following input, starting from the bottom working your way up. Now each weight is actually the sum of that input's weight and all of the previous weights.
sum_weights = the sum of all of the weights, and n = N.
K times:
Choose a random number r in the range [0,sum_weights)
binary search the first n elements for the first slot where the (now summed) weight is greater than or equal to r, i.
Add input[i].id to output.
Fix back all of the weights from n-1 down to 1 by subtracting the preceding input's weight
Time complexity is O(K*log(N)).
My short answer: in no way.
Just because the problem definition is incorrect. As Axn brilliantly noticed:
There is a little bit of contradiction going on in the requirement. It states that K <= N. But as K approaches N, the frequency requirement will be contradicted by the Uniqueness requirement. Worst case, if K=N, all elements will be returned (i.e appear with same frequency), irrespective of their weight.
Anyway, when K is pretty small relative to N, calculated frequencies will be pretty close to theoretical values.
The task may be splitted on two subtasks:
Generate random numbers with a given distribution (specified by weights)
Generate unique random numbers
Generate random numbers with a given distribution
Calculate sum of weights (sumOfWeights)
Generate random number from the range [1; sumOfWeights]
Find an array element where the sum of weights from the beginning of the array is greater than or equal to the generated random number
Code
#include <iostream>
#include <cstdlib>
#include <ctime>
// 0 - id, 1 - weight
typedef unsigned Pair[2];
unsigned Random(Pair* i_set, unsigned* i_indexes, unsigned i_size)
{
unsigned sumOfWeights = 0;
for (unsigned i = 0; i < i_size; ++i)
{
const unsigned index = i_indexes[i];
sumOfWeights += i_set[index][2];
}
const unsigned random = rand() % sumOfWeights + 1;
sumOfWeights = 0;
unsigned i = 0;
for (; i < i_size; ++i)
{
const unsigned index = i_indexes[i];
sumOfWeights += i_set[index][3];
if (sumOfWeights >= random)
{
break;
}
}
return i;
}
Generate unique random numbers
Well known Durstenfeld-Fisher-Yates algorithm may be used for generation unique random numbers. See this great explanation.
It requires N bytes of space, so if N value is defined at compiled time, we are able to allocate necessary space at compile time.
Now, we have to combine these two algorithms. We just need to use our own Random() function instead of standard rand() in unique numbers generation algorithm.
Code
template<unsigned N, unsigned K>
void Generate(Pair (&i_set)[N], unsigned (&o_res)[K])
{
unsigned deck[N];
for (unsigned i = 0; i < N; ++i)
{
deck[i] = i;
}
unsigned max = N - 1;
for (unsigned i = 0; i < K; ++i)
{
const unsigned index = Random(i_set, deck, max + 1);
std::swap(deck[max], deck[index]);
o_res[i] = i_set[deck[max]][0];
--max;
}
}
Usage
int main()
{
srand((unsigned)time(0));
const unsigned c_N = 5; // N
const unsigned c_K = 2; // K
Pair input[c_N] = {{0, 5}, {1, 3}, {2, 2}, {3, 5}, {4, 4}}; // input array
unsigned result[c_K] = {};
const unsigned c_total = 1000000; // number of iterations
unsigned counts[c_N] = {0}; // frequency counters
for (unsigned i = 0; i < c_total; ++i)
{
Generate<c_N, c_K>(input, result);
for (unsigned j = 0; j < c_K; ++j)
{
++counts[result[j]];
}
}
unsigned sumOfWeights = 0;
for (unsigned i = 0; i < c_N; ++i)
{
sumOfWeights += input[i][1];
}
for (unsigned i = 0; i < c_N; ++i)
{
std::cout << (double)counts[i]/c_K/c_total // empirical frequency
<< " | "
<< (double)input[i][1]/sumOfWeights // expected frequency
<< std::endl;
}
return 0;
}
Output
N = 5, K = 2
Frequencies
Empiricical | Expected
0.253813 | 0.263158
0.16584 | 0.157895
0.113878 | 0.105263
0.253582 | 0.263158
0.212888 | 0.210526
Corner case when weights are actually ignored
N = 5, K = 5
Frequencies
Empiricical | Expected
0.2 | 0.263158
0.2 | 0.157895
0.2 | 0.105263
0.2 | 0.263158
0.2 | 0.210526
I do assume that the ids in the output must be unique. This makes this problem a specific instance of random sampling problems.
The first approach that I can think of solves this in O(N^2) time, using O(N) memory (The input array itself plus constant memory).
I Assume that the weights are possitive.
Let A be the array of pairs.
1) Set N to be A.length
2) calculate the sum of all weights W.
3) Loop K times
3.1) r = rand(0,W)
3.2) loop on A and find the first index i such that A[1].w + ...+ A[i].w <= r < A[1].w + ... + A[i+1].w
3.3) add A[i].id to output
3.4) A[i] = A[N-1] (or swap if the array contents should be preserved)
3.5) N = N - 1
3.6) W = W - A[i].w