C++: function creation using array - c++

Write a function which has:
input: an array of N pairs (unique id, weight), with K <= N
output: K random unique ids (taken from the input array)
Note: when the function is called many times, an id should appear in the output more frequently the larger its weight is.
Example: id with weight of 5 should appear in the output 5 times more often than id with weight of 1. Also, the amount of memory allocated should be known at compile time, i.e. no additional memory should be allocated.
My question is: how to solve this task?
EDIT
thanks for responses everybody!
currently I can't understand how the weight of a pair affects how often that pair appears in the output. Can you give me a clearer, "for dummies" explanation of how it works?

Assuming a good enough random number generator:
Sum the weights (total_weight)
Repeat K times:
Pick a number between 0 and total_weight (selection)
Find the first pair where the sum of all the weights from the beginning of the array to that pair is greater than or equal to selection
Write the first part of the pair to the output
You need enough storage to store the total weight.
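As an illustration (not part of the original answer), here is a minimal C++ sketch of those steps. The Pair struct, the WeightedPick name and the use of std::mt19937 are my own choices, and it only covers the weighted pick itself; the uniqueness and compile-time-memory requirements are handled by the later answers.

#include <iostream>
#include <random>
#include <vector>

struct Pair { unsigned id; unsigned weight; };

// Pick one id at random, with probability proportional to its weight.
unsigned WeightedPick(const std::vector<Pair>& items, std::mt19937& rng)
{
    unsigned total_weight = 0;
    for (const Pair& p : items)
        total_weight += p.weight;

    std::uniform_int_distribution<unsigned> dist(1, total_weight);
    const unsigned selection = dist(rng);

    unsigned running = 0;
    for (const Pair& p : items)
    {
        running += p.weight;
        if (running >= selection)   // first cumulative weight >= selection
            return p.id;
    }
    return items.back().id;         // never reached if total_weight > 0
}

int main()
{
    std::mt19937 rng(std::random_device{}());
    std::vector<Pair> items = {{1, 5}, {2, 1}, {3, 2}};
    for (int k = 0; k < 5; ++k)     // 5 draws, with replacement
        std::cout << WeightedPick(items, rng) << ' ';
    std::cout << '\n';
}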

Ok so you are given input as follows:
(3, 7)
(1, 2)
(2, 5)
(4, 1)
(5, 2)
And you want to pick a random number so that the weight of each id is reflected in the picking, i.e. pick a random number from the following list:
3 3 3 3 3 3 3 1 1 2 2 2 2 2 4 5 5
Initially I thought of building this list as a temporary array, but you don't actually need to materialize it: it is enough to know its size, which you get by summing up all the weights, = X (17 in this example).
Pick a random number in [1, X] and work out which id it corresponds to by looping through the pairs and accumulating the weights. Say the random number is 8:
(3, 7) total = 7 which is < 8
(1, 2) total = 9 which is >= 8 **boom** 1 is your id!
Now, since you need K random unique ids, you can build a hash table from the initial array passed to you and work with that. Once you find an id, remove it from the hash and proceed with the algorithm. Edit: note that you create the hash map only once; the algorithm then works on it instead of scanning the array. I did not put this at the top to keep the answer clear.
As long as your random number calculation is not secretly using any extra memory, you will need to store the K picks (K <= N) and a copy of the original array, so the maximum space requirement at runtime is about 2*N, i.e. O(N).
Asymptotic runtime is :
O(n) : create copy of original array into hash table +
(
O(n) : calculate sum of weights +
O(1) : calculate random between range +
O(n) : cumulative totals
) * K random pickings
= O(n*k) overall
This is a good question :)

This solution works with non-integer weights and uses constant extra space (i.e. space complexity = O(1)). It does, however, modify the input array, but the only difference in the end is that the elements will be in a different order.
Add the weight of each input to the weight of the following input, working from the first element to the last (w[i] += w[i-1] for i = 1 .. N-1). Now each weight is actually the sum of that input's weight and all of the previous weights.
sum_weights = the sum of all of the weights, and n = N.
K times:
Choose a random number r in the range [0,sum_weights)
binary search the first n elements for the first slot i where the (now summed) weight is greater than or equal to r.
Add input[i].id to output.
Subtract input[i-1].weight from input[i].weight (unless i == 0), restoring that element's original weight. Now subtract input[i].weight from the following (> i) input weights and also from sum_weights.
Move input[i] to position [n-1] (sliding the intervening elements down one slot). This is the expensive part, as it's O(N) and we do it K times. You can skip this step on the last iteration.
subtract 1 from n
Fix back all of the weights from n-1 down to 1 by subtracting the preceding input's weight
Time complexity is O(K*N). The expensive part (of the time complexity) is shuffling the chosen elements. I suspect there's a clever way to avoid that, but haven't thought of anything yet.
Update
It's unclear what the question means by "output: K random unique Ids". The solution above assumes that this meant that the output ids are supposed to be unique/distinct, but if that's not the case then the problem is even simpler:
Add the weight of each input to the weight of the following input, working from the first element to the last (w[i] += w[i-1] for i = 1 .. N-1). Now each weight is actually the sum of that input's weight and all of the previous weights.
sum_weights = the sum of all of the weights, and n = N.
K times:
Choose a random number r in the range [0,sum_weights)
binary search the first n elements for the first slot i where the (now summed) weight is greater than or equal to r.
Add input[i].id to output.
Fix back all of the weights from n-1 down to 1 by subtracting the preceding input's weight
Time complexity is O(K*log(N)).
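A sketch of this simpler (non-distinct) variant in C++, assuming the weights have already been turned into running prefix sums as described in step 1; std::upper_bound performs the binary search. The Item struct and the function name are illustrative, not from the answer.

#include <algorithm>
#include <random>
#include <vector>

struct Item { int id; double weight; };

// input[i].weight is assumed to already hold w[0] + ... + w[i] (a prefix sum).
void SampleWithReplacement(const std::vector<Item>& input, int K,
                           std::vector<int>& output, std::mt19937& rng)
{
    const double sum_weights = input.back().weight;
    std::uniform_real_distribution<double> dist(0.0, sum_weights);
    for (int k = 0; k < K; ++k)
    {
        const double r = dist(rng);
        // First element whose prefix sum is greater than r.
        auto it = std::upper_bound(input.begin(), input.end(), r,
            [](double value, const Item& item) { return value < item.weight; });
        output.push_back(it->id);
    }
}

Each query costs one O(log N) binary search, matching the O(K*log(N)) bound above.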

My short answer: you can't, at least not exactly as stated.
That is simply because the problem definition is self-contradictory. As Axn brilliantly noticed:
There is a little bit of contradiction going on in the requirement. It states that K <= N. But as K approaches N, the frequency requirement will be contradicted by the Uniqueness requirement. Worst case, if K=N, all elements will be returned (i.e appear with same frequency), irrespective of their weight.
Anyway, when K is pretty small relative to N, calculated frequencies will be pretty close to theoretical values.
The task can be split into two subtasks:
Generate random numbers with a given distribution (specified by weights)
Generate unique random numbers
Generate random numbers with a given distribution
Calculate sum of weights (sumOfWeights)
Generate random number from the range [1; sumOfWeights]
Find an array element where the sum of weights from the beginning of the array is greater than or equal to the generated random number
Code
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <algorithm> // std::swap

// 0 - id, 1 - weight
typedef unsigned Pair[2];

unsigned Random(Pair* i_set, unsigned* i_indexes, unsigned i_size)
{
    unsigned sumOfWeights = 0;
    for (unsigned i = 0; i < i_size; ++i)
    {
        const unsigned index = i_indexes[i];
        sumOfWeights += i_set[index][1];
    }

    const unsigned random = rand() % sumOfWeights + 1;

    sumOfWeights = 0;
    unsigned i = 0;
    for (; i < i_size; ++i)
    {
        const unsigned index = i_indexes[i];
        sumOfWeights += i_set[index][1];
        if (sumOfWeights >= random)
        {
            break;
        }
    }
    return i;
}
Generate unique random numbers
The well-known Fisher-Yates shuffle (Durstenfeld's variant) can be used to generate unique random numbers. See this great explanation.
It requires an array of N indices, so if N is known at compile time, we are able to allocate the necessary space at compile time.
Now, we have to combine these two algorithms. We just need to use our own Random() function instead of the standard rand() in the unique-number generation algorithm.
Code
template<unsigned N, unsigned K>
void Generate(Pair (&i_set)[N], unsigned (&o_res)[K])
{
    unsigned deck[N];
    for (unsigned i = 0; i < N; ++i)
    {
        deck[i] = i;
    }

    unsigned max = N - 1;
    for (unsigned i = 0; i < K; ++i)
    {
        const unsigned index = Random(i_set, deck, max + 1);
        std::swap(deck[max], deck[index]);
        o_res[i] = i_set[deck[max]][0];
        --max;
    }
}
Usage
int main()
{
    srand((unsigned)time(0));

    const unsigned c_N = 5; // N
    const unsigned c_K = 2; // K
    Pair input[c_N] = {{0, 5}, {1, 3}, {2, 2}, {3, 5}, {4, 4}}; // input array
    unsigned result[c_K] = {};

    const unsigned c_total = 1000000; // number of iterations
    unsigned counts[c_N] = {0}; // frequency counters

    for (unsigned i = 0; i < c_total; ++i)
    {
        Generate<c_N, c_K>(input, result);
        for (unsigned j = 0; j < c_K; ++j)
        {
            ++counts[result[j]];
        }
    }

    unsigned sumOfWeights = 0;
    for (unsigned i = 0; i < c_N; ++i)
    {
        sumOfWeights += input[i][1];
    }

    for (unsigned i = 0; i < c_N; ++i)
    {
        std::cout << (double)counts[i]/c_K/c_total // empirical frequency
                  << " | "
                  << (double)input[i][1]/sumOfWeights // expected frequency
                  << std::endl;
    }
    return 0;
}
Output
N = 5, K = 2
Frequencies
Empirical | Expected
0.253813 | 0.263158
0.16584 | 0.157895
0.113878 | 0.105263
0.253582 | 0.263158
0.212888 | 0.210526
Corner case when weights are actually ignored
N = 5, K = 5
Frequencies
Empirical | Expected
0.2 | 0.263158
0.2 | 0.157895
0.2 | 0.105263
0.2 | 0.263158
0.2 | 0.210526

I do assume that the ids in the output must be unique. This makes this problem a specific instance of random sampling problems.
The first approach that I can think of solves this in O(N^2) time, using O(N) memory (The input array itself plus constant memory).
I assume that the weights are positive.
Let A be the array of pairs.
1) Set N to be A.length
2) calculate the sum of all weights W.
3) Loop K times
3.1) r = rand(0,W)
3.2) loop on A and find the first index i such that A[1].w + ...+ A[i].w <= r < A[1].w + ... + A[i+1].w
3.3) add A[i].id to output
3.4) W = W - A[i].w (do this before overwriting A[i])
3.5) A[i] = A[N-1] (or swap A[i] and A[N-1] if the array contents should be preserved)
3.6) N = N - 1
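For illustration, a C++ sketch of this procedure (the struct and function names are mine; the array is taken by value, so the caller's copy is not modified):

#include <random>
#include <vector>

struct Pair { int id; double w; };

// Draw K distinct ids with probability proportional to weight, removing each
// chosen element by overwriting it with the last element. O(K*N) time.
std::vector<int> SampleUnique(std::vector<Pair> A, int K, std::mt19937& rng)
{
    std::vector<int> output;
    int N = static_cast<int>(A.size());
    double W = 0;
    for (const Pair& p : A)
        W += p.w;

    for (int k = 0; k < K && N > 0; ++k)
    {
        const double r = std::uniform_real_distribution<double>(0.0, W)(rng);
        int i = 0;
        double running = A[0].w;
        while (running <= r && i + 1 < N)   // step 3.2: walk the cumulative weights
            running += A[++i].w;

        output.push_back(A[i].id);          // step 3.3
        W -= A[i].w;                        // step 3.4: shrink the total first...
        A[i] = A[N - 1];                    // step 3.5: ...then overwrite with the last element
        --N;                                // step 3.6
    }
    return output;
}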

Related

Numbers of common distinct difference

Given two arrays A and B, the task is to find the number of distinct differences (absolute differences between pairs of elements within each array) that are common to both arrays.
Example :
A=[3,6,8]
B=[1,6,10]
so we get differenceSet for A
differenceSetA=[abs(3-6),abs(6-8),abs(8-3)]=[3,5,2]
similarly
differenceSetB=[abs(1-6),abs(1-10),abs(6-10)]=[5,9,4]
Number of common elements=Intersection :{differenceSetA,differenceSetB}={5}
Answer= 1
My approach O(N^2)
#include <vector>
#include <unordered_set>
#include <cstdlib>
using namespace std;

int commonDifference(vector<int> A, vector<int> B) {
    int n = A.size();
    int m = B.size();
    unordered_set<int> differenceSetA;
    unordered_set<int> differenceSetB;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            differenceSetA.insert(abs(A[i] - A[j]));
        }
    }
    for (int i = 0; i < m; i++) {
        for (int j = i + 1; j < m; j++) {
            differenceSetB.insert(abs(B[i] - B[j]));
        }
    }
    int count = 0;
    for (auto& it : differenceSetA) {
        if (differenceSetB.find(it) != differenceSetB.end()) {
            count++;
        }
    }
    return count;
}
Please provide suggestions for optimizing the approach in O(N log N)
If n is the maximum range of an input array, then the set of all differences of a given array can be obtained in O(n log n), as explained in this SO post: find all differences in a array
Here is a brief recall of the method, with a few additional practical implementation details:
Create an array Posi of length 2*n = 2*range = 2*(Vmax - Vmin + 1), where elements whose index matches an element of the input are set to 1, other elements are set to 0. This can be created in O(m), where m is the size of the array.
For example, given the input array [1,4,5] of size m, we create the indicator array [1,0,0,1,1] (zero-padded up to the full length 2*n).
Initialisation: Posi[i] = 0 for all i (i = 0 to 2*n)
Posi[A[i] - Vmin] = 1 (i = 0 to m)
Calculate the autocorrelation function of array Posi[]. This can be classically performed in three sub-steps
2.1 Calculate the FFT (size 2*n) of the Posi[] array: Y[] = FFT(Posi)
2.2 Calculate the square amplitude of the result: Y2[k] = Y[k] * conj(Y[k])
2.3 Calculate the Inverse FFT of the result: Diff[] = IFFT(Y2[])
A few details are worth being mentioned here:
The reason why a size 2*n was selected, and not a size n, is that if d is a valid difference, then -d is also a valid difference. The results corresponding to negative differences are available at positions i >= n.
If you find it easier to perform an FFT whose size is a power of two, you can replace the size 2*n with a value n2k = 2^k, with n2k >= 2*n.
The non-null differences correspond to non-null values in the array Diff[]:
`d` is a difference if `Diff[d] > 0`
Another important detail is that a classical FFT is used (floating-point calculations), so you will encounter small rounding errors. To take them into account, it is important to replace the IFFT output Diff[] with the real parts rounded to the nearest integers.
All of this concerns one array only. Since you want to calculate the number of common differences, you then have to:
calculate the arrays Diff_A[] and Diff_B[] for both sets A and B and then:
count = 0;
for each d: if (Diff_A[d] != 0) and (Diff_B[d] != 0) then count++;
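A sketch of the whole pipeline in C++ (my own illustration, not from the answer): it uses a standard iterative Cooley-Tukey FFT, builds the indicator array, multiplies the spectrum by its conjugate, rounds the IFFT output, and counts the common positive differences. For the example sets A and B it prints 1.

#include <algorithm>
#include <cmath>
#include <complex>
#include <iostream>
#include <vector>

using cd = std::complex<double>;

// Standard in-place iterative Cooley-Tukey FFT; invert == true gives the inverse transform.
void fft(std::vector<cd>& a, bool invert)
{
    const int n = static_cast<int>(a.size());
    for (int i = 1, j = 0; i < n; ++i)   // bit-reversal permutation
    {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    for (int len = 2; len <= n; len <<= 1)
    {
        const double ang = 2 * std::acos(-1.0) / len * (invert ? 1 : -1);
        const cd wlen(std::cos(ang), std::sin(ang));
        for (int i = 0; i < n; i += len)
        {
            cd w(1);
            for (int j = 0; j < len / 2; ++j)
            {
                const cd u = a[i + j], v = a[i + j + len / 2] * w;
                a[i + j] = u + v;
                a[i + j + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
    if (invert)
        for (cd& x : a) x /= n;
}

// diff[d] == 1 iff d is a (positive) difference of two elements of v.
std::vector<int> differenceTable(const std::vector<int>& v)
{
    const int vmin = *std::min_element(v.begin(), v.end());
    const int vmax = *std::max_element(v.begin(), v.end());
    const int range = vmax - vmin + 1;
    int sz = 1;
    while (sz < 2 * range) sz <<= 1;            // power-of-two FFT size >= 2*range
    std::vector<cd> posi(sz, 0.0);
    for (int x : v) posi[x - vmin] = 1.0;       // indicator array Posi[]
    fft(posi, false);                           // Y = FFT(Posi)
    for (cd& y : posi) y *= std::conj(y);       // Y2[k] = Y[k] * conj(Y[k])
    fft(posi, true);                            // Diff = IFFT(Y2) (autocorrelation)
    std::vector<int> diff(range, 0);
    for (int d = 1; d < range; ++d)             // keep positive differences, rounded
        diff[d] = std::llround(posi[d].real()) > 0 ? 1 : 0;
    return diff;
}

int main()
{
    const std::vector<int> A = {3, 6, 8}, B = {1, 6, 10};
    const std::vector<int> dA = differenceTable(A), dB = differenceTable(B);
    int count = 0;
    for (std::size_t d = 1; d < std::min(dA.size(), dB.size()); ++d)
        if (dA[d] && dB[d]) ++count;
    std::cout << count << std::endl;            // prints 1 (only the difference 5 is common)
}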
A little Bonus
In order to avoid plagiarising the mentioned post, here is an additional explanation of how to get the differences of one set with the help of the FFT.
The input array A = {3, 6, 8} can mathematically be represented by the following z transform:
A(z) = z^3 + z^6 + z^8
Then the corresponding z-transform of the difference array is equal to the polynomial product:
D(z) = A(z) * A(z*) = (z^3 + z^6 + z^8) (z^(-3) + z^(-6) + z^(-8))
= z^(-5) + z^(-3) + z^(-2) + 3 + z^2 + z^3 + z^5
Then, we can note that A(z) is equal to a FFT of size N of the sequence [0 0 0 1 0 0 1 0 1] by taking:
z = exp (-i * 2 PI/ N), with i = sqrt(-1)
Note that here we consider the classical FFT over C, the complex field.
It is certainly possible to perform the calculation in a Galois field instead, and then there are no rounding errors, as is done for example to implement "classical" multiplications (with z = 10) for large numbers of digits. That seems like overkill here.

How to generate a list of ascending random integers

I have an external collection containing n elements that I want to select some number (k) of them at random, outputting the indices of those elements to some serialized data file. I want the indices to be output in strict ascending order, and for there to be no duplicates. Both n and k may be quite large, and it is generally not feasible to simply store entire arrays in memory of that size.
The first algorithm I came up with was to pick a random number r[0] from 1 to n-k... and then pick successive random numbers r[i] from r[i-1]+1 to n-k+i, only needing to store two entries for 'r' at any one time. However, a fairly simple analysis reveals that the probability of selecting small numbers is inconsistent with what it would be if every valid subset were equally likely. For example, if n was a billion and k was half a billion, the probability of selecting the first entry with the approach I've just described is very tiny (1 in half a billion), whereas in actuality, since half of the entries are being selected, the first entry should be selected 50% of the time. Even if I use external sorting to sort k random numbers, I would have to discard any duplicates and try again. As k approaches n, the number of retries would continue to grow, with no guarantee of termination.
I would like to find a O(k) or O(k log k) algorithm to do this, if it is at all possible. The implementation language I will be using is C++11, but descriptions in pseudocode may still be helpful.
If in practice k has the same order of magnitude as n, perhaps a very straightforward O(n) algorithm will suffice:
assert(k <= n);
std::uniform_real_distribution<double> rnd;
for (int i = 0; i < n; i++) {
    if (rnd(engine) * (n - i) < k) {
        std::cout << i << std::endl;
        k--;
    }
}
It produces all ascending sequences with equal probability.
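For reference, a self-contained version of that snippet (the engine definition and the concrete n and k values are my additions):

#include <iostream>
#include <random>

int main()
{
    const int n = 20;   // size of the index range [0, n)
    int k = 5;          // how many indices to output
    std::mt19937 engine(std::random_device{}());
    std::uniform_real_distribution<double> rnd(0.0, 1.0);

    for (int i = 0; i < n; i++) {
        // keep index i with probability (picks still needed) / (items still left)
        if (rnd(engine) * (n - i) < k) {
            std::cout << i << std::endl;
            k--;
        }
    }
}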
You can solve this recursively in O(k log k) if you partition in the middle of your range, and randomly sample from the hypergeometric probability distribution to choose how many values lie above and below the middle point (i.e. the values of k for each subsequence), then recurse for each:
int sample_hypergeometric(int n, int K, int N) // samples hypergeometric distribution and
// returns number of "successes" where there are n draws without replacement from
// a population of N with K possible successes.
// Something similar to scipy.stats.hypergeom.rvs in Python.
// In this case, "success" means the selected value lying below the midpoint.
{
    // static, so the same engine (and random sequence) is reused across calls
    static std::default_random_engine generator;
    std::uniform_real_distribution<double> distribution(0.0, 1.0);
    int successes = 0;
    for (int trial = 0; trial < n; trial++)
    {
        if ((int)(distribution(generator) * N) < K)
        {
            successes++;
            K--;
        }
        N--;
    }
    return successes;
}
select_k_from_n(int start, int k, int n)
{
    if (k == 0)
        return;
    if (k == 1)
    {
        output start + random(1 to n);
        return;
    }
    // find the number of results below the mid-point:
    int k1 = sample_hypergeometric(k, n >> 1, n);
    select_k_from_n(start, k1, n >> 1);
    select_k_from_n(start + (n >> 1), k - k1, n - (n >> 1));
}
Sampling from the binomial distribution could also be used to approximate the hypergeometric distribution with p = (n >> 1) / n, rejecting samples where k1 > (n >> 1).
As mentioned in my comment, use a std::set<int> to store the randomly generated integers such that the resulting container is inherently sorted and contains no duplicates. Example code snippet:
#include <random>
#include <set>

int main(void) {
    std::set<int> random_set;
    std::random_device rd;
    std::mt19937 mt_eng(rd());
    // min and max of random set range
    const int m = 0;   // min
    const int n = 100; // max
    std::uniform_int_distribution<> dist(m, n);
    // number to generate
    const int k = 50;
    for (int i = 0; i < k; ++i) {
        // only non-previously occurring values will be inserted
        if (!random_set.insert(dist(mt_eng)).second)
            --i;
    }
}
Assuming that you can't store k random numbers in memory, you'll have to generate the numbers in strictly ascending order as you go. One way to do it would be to generate a number between 0 and n/k. Call that number x. The next number you have to generate is between x+1 and (n-x)/(k-1). Continue in that fashion until you've selected k numbers.
Basically, you're dividing the remaining range by the number of values left to generate, and then generating a number in the first section of that range.
An example. You want to generate 3 numbers between 0 and 99, inclusive. So you first generate a number between 0 and 33. Say you pick 10.
So now you need a number between 11 and 99. The remaining range consists of 89 values, and you have two values left to pick. So, 89/2 = 44. You need a number between 11 and 54. Say you pick 36.
Your remaining range is from 37 to 99, and you have one number left to choose. So pick a number at random between 37 and 99.
This won't give you a uniform distribution over all possible selections, as once you choose a number it's impossible to get a number less than that in a subsequent choice. But it might be good enough for your purposes.
This pseudocode shows the basic idea.
pick_k_from_n(n, k)
{
    num_left = k
    last_k = 0
    while num_left > 0
    {
        // divide the remaining range into num_left partitions
        range_size = (n - last_k) / num_left
        // pick a number in the first partition
        r = random(range_size) + last_k + 1
        output(r)
        last_k = r
        num_left = num_left - 1
    }
}
Note that this takes O(k) time and requires O(1) extra space.
You can do it in O(k) time with Floyd's algorithm (not Floyd-Warshall, that's a shortest path thing). The only data structure you need is a table with one bit per candidate (or a hash set) that tells you whether or not a number has already been selected. Searching a hash table can be O(1), so this will not be a burden, and it can be kept in memory even for very large n (if n is truly huge, you'll have to use a b-tree or bloom filter or something).
To select k items from among n:
for j = n-k+1 to n:
    select random x from 1 to j
    if x is already in hash:
        insert j into hash
    else:
        insert x into hash
That's it. At the end, your hash table will contain a uniformly selected sample of k items from among n. Read them out in order (you may have to pick a type of hash table that allows that).
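A small C++ sketch of Floyd's algorithm using std::set, which both checks membership and lets you read the result out in ascending order (0-based indices; the names are mine):

#include <iostream>
#include <random>
#include <set>

// Floyd's algorithm: uniformly sample k distinct integers from [0, n).
std::set<int> floyd_sample(int n, int k, std::mt19937& rng)
{
    std::set<int> chosen;
    for (int j = n - k; j < n; ++j)
    {
        const int x = std::uniform_int_distribution<int>(0, j)(rng); // random in [0, j]
        if (!chosen.insert(x).second)   // x was already selected
            chosen.insert(j);           // so take j instead
    }
    return chosen;
}

int main()
{
    std::mt19937 rng(std::random_device{}());
    for (int idx : floyd_sample(100, 10, rng))  // iteration is already in ascending order
        std::cout << idx << ' ';
    std::cout << '\n';
}

Note that std::set costs O(log k) per operation and O(k) memory; with a bitmap or hash set the per-step cost drops to O(1), as described above.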
Could you adjust each ascending index selection in a way that compensates for the probability distortion you are describing?
IANAS, but my guess would be that if you pick a random number r between 0 and 1 (that you'll scale to the full remaining index range after the adjustment), you might be able to adjust it by calculating r^(x) (keeping the range in 0..1, but increasing the probability of smaller numbers), with x selected by solving the equation for the probability of the first entry?
Here's an O(k log k + √n)-time algorithm that uses O(√n) words of space. This can be generalized to an O(k + n^(1/c))-time, O(n^(1/c))-space algorithm for any integer constant c.
For intuition, imagine a simple algorithm that uses (e.g.) Floyd's sampling algorithm to generate k of n elements and then radix sorts them in base √n. Instead of remembering what the actual samples are, we'll do a first pass where we run a variant of Floyd's where we remember only the number of samples in each bucket. The second pass is, for each bucket in order, to randomly resample the appropriate number of elements from the bucket range. There's a short proof involving conditional probability that this gives a uniform distribution.
# untested Python code for illustration
# b is the number of buckets (e.g., b ~ sqrt(n))
import random

def first_pass(n, k, b):
    counts = [0] * b  # list of b zeros
    for j in range(n - k, n):
        t = random.randrange(j + 1)
        if t // b >= counts[t % b]:  # intuitively, "t is not in the set"
            counts[t % b] += 1
        else:
            counts[j % b] += 1
    return counts

How to calculate the minimum cost to convert all n numbers in an array to m?

I have been given the following assignment:
Given N integers in the form of A(i) where 1≤i≤N, make each number A(i) in the N numbers equal to M. To convert a number A(i) to M, it will cost |M−A(i)| units. Find out the minimum cost to convert all the N numbers to M, so you should choose the best M to get the minimum cost.
Given:
1 <= N <= 10^5
1 <= A(i) <= 10^9
My approach was to calculate the sum of all the numbers, take avg = sum / n, and then add up the differences between each number and avg to get the cost.
But this fails in many test cases. How can I find the optimal solution for this?
You should take the median of the numbers (or either of the two numbers nearest the middle if the list has even length), not the mean.
An example where the mean fails to minimize is: [1, 2, 3, 4, 100]. The mean is 110 / 5 = 22, and the total cost is 21 + 20 + 19 + 18 + 78 = 156. Choosing the median (3) gives total cost: 2 + 1 + 0 + 1 + 97 = 101.
An example where the median lies between two items in the list is [1, 2, 3, 4, 5, 100]. Here the median is 3.5, and it's ok to either use M=3 or M=4. For M=3, the total cost is 2 + 1 + 0 + 1 + 2 + 97 = 103. For M=4, the total cost is 3 + 2 + 1 + 0 + 1 + 96 = 103.
A formal proof of correctness can be found on Mathematics SE, although you may convince yourself of the result by noting that if you nudge M a small amount delta in one direction (but not past one of the data points) -- and for example's sake let's say it's in the positive direction, the total cost increases by delta times the number of points to the left of M minus delta times the number of points to the right of M. So M is minimized when the number of points to its left and the right are equal in number, otherwise you could move it a small amount one way or the other to decrease the total cost.
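A short C++ sketch of the median approach (the function name is mine); std::nth_element finds a median in O(N) on average, and a 64-bit accumulator is used because N * max|A(i)| can exceed the range of int for the stated constraints:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <vector>

// Minimum total cost of making every element equal to some M:
// choose M as a median and sum the absolute differences.
long long minConversionCost(std::vector<int> a)
{
    std::nth_element(a.begin(), a.begin() + a.size() / 2, a.end());
    const long long m = a[a.size() / 2];   // an (upper) median
    long long cost = 0;
    for (int x : a)
        cost += std::abs(static_cast<long long>(x) - m);
    return cost;
}

int main()
{
    std::cout << minConversionCost({1, 2, 3, 4, 100}) << std::endl;    // 101
    std::cout << minConversionCost({1, 2, 3, 4, 5, 100}) << std::endl; // 103
}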
@PaulHankin already provided a perfect answer. Anyway, when thinking about the problem, I didn't think of the median being the solution. But even if you don't know about the median, you can come up with a programming solution.
I made similar observations to @PaulHankin's in the last paragraph of his answer. This made me realize that I have to eliminate outliers iteratively in order to find m. So I wrote a program that first sorts the input array (vector) A and then analyzes the minimum and maximum values.
The idea is to move the minimum values towards the second smallest values and the maximum values towards the second largest values. You always move either the minimum or maximum values, depending on whether you have less minimum values than maximum values or not. If all array items end up being the same value, then you found your m:
#include <vector>
#include <algorithm>
#include <iostream>

using namespace std;

int getMinCount(vector<int>& A);
int getMaxCount(vector<int>& A);

int main()
{
    // Example as given by @PaulHankin
    vector<int> A;
    A.push_back(1);
    A.push_back(2);
    A.push_back(3);
    A.push_back(4);
    A.push_back(100);

    sort(A.begin(), A.end());

    int minCount = getMinCount(A);
    int maxCount = getMaxCount(A);

    while (minCount != A.size() && maxCount != A.size())
    {
        if (minCount <= maxCount)
        {
            for (int i = 0; i < minCount; i++)
                A[i] = A[minCount];
            // Recalculate the count of the minimum value, because we changed the minimum.
            minCount = getMinCount(A);
        }
        else
        {
            for (int i = 0; i < maxCount; i++)
                A[A.size() - 1 - i] = A[A.size() - 1 - maxCount];
            // Recalculate the count of the maximum value, because we changed the maximum.
            maxCount = getMaxCount(A);
        }
    }

    // Print out the one and only remaining value, which is m.
    cout << A[0] << endl;
    return 0;
}

int getMinCount(vector<int>& A)
{
    // Count how often the minimum value exists.
    int minCount = 1;
    int pos = 1;
    while (pos < A.size() && A[pos++] == A[0])
        minCount++;
    return minCount;
}

int getMaxCount(vector<int>& A)
{
    // Count how often the maximum value exists.
    int maxCount = 1;
    int pos = A.size() - 2;
    while (pos >= 0 && A[pos--] == A[A.size() - 1])
        maxCount++;
    return maxCount;
}
If you think about the algorithm, you will come to the conclusion that it actually calculates the median of the values in the array A. As example input I took the first example given by @PaulHankin. As expected, the code provides the correct result (3) for it.
I hope my approach helps you to understand how to tackle such kind of problems even if you don't know the correct solution. This is especially helpful when you are in an interview, for example.

Sum of difference of a number to an array of numbers

This is my problem.
Given an array of integers and another integer k, find the sum of absolute differences between each element of the array and k.
For example if the array is 2, 4, 6, 8, 10 and k is 3
Sum of difference
= abs(2 - 3) + abs(4-3) + abs(6 - 3) + abs(8 - 3) + abs(10 - 3)
= 1 + 1 + 3 + 5 + 7
= 17
The array remains the same throughout and can contain up to 100000 elements and there will be 100000 different values of k to be tested. k may or may not be an element of the array. This has to be done within 1s or about 100M operations. How do I achieve this?
You can run multiple queries for sums of absolute differences in O(log N) if you add a preprocessing step which costs O(N * log N).
Sort the array, then for each item in the array store the sum of all numbers that are smaller than or equal to the corresponding item. This can be done in O(N * log N). Now you have a pair of arrays that look like this:
2 4 6 8 10 // <<== Original data
2 6 12 20 30 // <<== Partial sums
In addition, store the total T of all numbers in the array.
Now you can get sums of absolute differences by running a binary search on the original array and using the partial sums array to compute the answer: the left part is the count of numbers to the left of the target k times k, minus the sum of those numbers; the right part is the sum of the numbers to the right of the target, minus their count times k; add the two parts together. The partial sum of the numbers to the right of the target can be computed by subtracting the partial sum on the left from the total T.
For k=3 binary search gets you to position 1.
Partial sum on the left is 2
Count of items on the left is 1
Partial sum on the right is (30-2)=28
Count of items on the right is 4
You compute (1*3-2) + (28-4*3) = 1 + 16 = 17
First sort the array and then compute an array that stores the sums of the prefixes of the resulting sorted array. Let's denote this array p; you can compute p in linear time so that p[i] = a[0] + a[1] + ... + a[i]. Now, having this array, you can answer in constant time the question of what the sum of elements a[x] + a[x+1] + ... + a[y] is (i.e. with indices x to y). To do that you simply compute p[y] - p[x-1] (take special care when x is 0).
Now, to answer a query of the type "what is the sum of absolute differences with k", we will split the problem into two parts: the sum over the numbers greater than k, and the sum over the numbers smaller than k. In order to compute these, perform a binary search to find the position of k in the sorted a (denote that idx), and compute the sum of the values in a before idx (denote that s) and after idx (denote that S). Now the sum of absolute differences with k is idx * k - s + S - (a.length - idx) * k. This of course is pseudocode, and what I mean by a.length is the number of elements in a.
After performing a linearithmic precomputation, you will be able to answer a query in O(log(n)). Please note this approach only makes sense if you plan to perform multiple queries. If you are only going to perform a single query, you cannot possibly go faster than O(n).
Just implementing dasblinkenlight's solution in "contest C++":
It does exactly what he says: it reads the values, sorts them, and stores the accumulated sums in V[i].second, except that here V[i].second is the accumulated sum up to index i-1 (to simplify the algorithm). It also stores a sentinel in V[n] for the case where the query is greater than max(V).
Then, for each query, it binary searches for the value. In this case V[a].second is the sum of values less than the query, and V[n].second - V[a].second is the sum of values greater than it.
#include <iostream>
#include <algorithm>
#define pii pair<int, int>
using namespace std;

pii V[100001];

int main() {
    int n;
    while (cin >> n) {
        for (int i = 0; i < n; i++)
            cin >> V[i].first;
        sort(V, V + n);

        V[0].second = 0;
        for (int i = 1; i <= n; i++)
            V[i].second = V[i - 1].first + V[i - 1].second;

        int k; cin >> k;
        for (int i = 0; i < k; i++) {
            int query; cin >> query;
            pii* res = upper_bound(V, V + n, pii(query, 0));
            int a = res - V, b = n - (res - V);
            int left = query * a - V[a].second;
            int right = V[n].second - V[a].second - query * b;
            cout << left + right << endl;
        }
    }
}
It assumes a file with a format like this:
5
10 2 8 4 6
2
3 5
Then, for each query, it answers like this:
17
13

n-th or Arbitrary Combination of a Large Set

Say I have a set of numbers from [0, ....., 499]. Combinations are currently being generated sequentially using the C++ std::next_permutation. For reference, the size of each tuple I am pulling out is 3, so I am returning sequential results such as [0,1,2], [0,1,3], [0,1,4], ... [497,498,499].
Now, I want to parallelize the code that this is sitting in, so a sequential generation of these combinations will no longer work. Are there any existing algorithms for computing the ith combination of 3 from 500 numbers?
I want to make sure that each thread, regardless of the iterations of the loop it gets, can compute a standalone combination based on the i it is iterating with. So if I want the combination for i=38 in thread 1, I can compute [1,2,5] while simultaneously computing i=0 in thread 2 as [0,1,2].
EDIT Below statement is irrelevant, I mixed myself up
I've looked at algorithms that utilize factorials to narrow down each individual element from left to right, but I can't use these as 500! sure won't fit into memory. Any suggestions?
Here is my shot:
int k = 527; // the kth combination is calculated
int N = 500; // number of elements you have
int a = 0, b = 1, c = 2; // a, b, c are the numbers you get out

while (k >= (N - a - 1) * (N - a - 2) / 2) {
    k -= (N - a - 1) * (N - a - 2) / 2;
    a++;
}
b = a + 1;
while (k >= N - 1 - b) {
    k -= N - 1 - b;
    b++;
}
c = b + 1 + k;
cout << "[" << a << "," << b << "," << c << "]" << endl; // the result
I got this by thinking about how many combinations there are until the next number is increased. However, it only works for three elements. I can't guarantee that it is correct. It would be cool if you compared it to your results and gave some feedback.
If you are looking for a way to obtain the lexicographic index or rank of a unique combination instead of a permutation, then your problem falls under the binomial coefficient. The binomial coefficient handles problems of choosing unique combinations in groups of K with a total of N items.
I have written a class in C# to handle common functions for working with the binomial coefficient. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
Converts the K-indexes to the proper lexicographic index or rank of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the set.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it is also faster than older iterative solutions.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
The following tested code will iterate through each unique combinations:
public void Test10Choose5()
{
String S;
int Loop;
int N = 500; // Total number of elements in the set.
int K = 3; // Total number of elements in each group.
// Create the bin coeff object required to get all
// the combos for this N choose K combination.
BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
// The Kindexes array specifies the indexes for a lexigraphic element.
int[] KIndexes = new int[K];
StringBuilder SB = new StringBuilder();
// Loop thru all the combinations for this N choose K case.
for (int Combo = 0; Combo < NumCombos; Combo++)
{
// Get the k-indexes for this combination.
BC.GetKIndexes(Combo, KIndexes);
// Verify that the Kindexes returned can be used to retrive the
// rank or lexigraphic order of the KIndexes in the table.
int Val = BC.GetIndex(true, KIndexes);
if (Val != Combo)
{
S = "Val of " + Val.ToString() + " != Combo Value of " + Combo.ToString();
Console.WriteLine(S);
}
SB.Remove(0, SB.Length);
for (Loop = 0; Loop < K; Loop++)
{
SB.Append(KIndexes[Loop].ToString());
if (Loop < K - 1)
SB.Append(" ");
}
S = "KIndexes = " + SB.ToString();
Console.WriteLine(S);
}
}
You should be able to port this class over fairly easily to C++. You probably will not have to port over the generic part of the class to accomplish your goals. Your test case of 500 choose 3 yields 20,708,500 unique combinations, which will fit in a 4 byte int. If 500 choose 3 is simply an example case and you need to choose combinations greater than 3, then you will have to use longs or perhaps fixed point int.
You can describe a particular selection of 3 out of 500 objects as a triple (i, j, k), where i is a number from 0 to 499 (the index of the first number), j ranges from 0 to 498 (the index of the second, skipping over whichever number was first), and k ranges from 0 to 497 (index of the last, skipping both previously-selected numbers). Given that, it's actually pretty easy to enumerate all the possible selections: starting with (0,0,0), increment k until it gets to its maximum value, then increment j and reset k to 0, and so on, until j gets to its own maximum value; then increment i, reset both j and k, and continue.
If this description sounds familiar, it's because it's exactly the same way that incrementing a base-10 number works, except that the base is much funkier, and in fact the base varies from digit to digit. You can use this insight to implement a very compact version of the idea: for any integer n from 0 to 500*499*498 - 1, you can get:
#include <algorithm>
#include <iostream>

struct triple {
    int i, j, k;
};

triple AsTriple(int n) {
    triple result;
    result.k = n % 498;
    n = n / 498;
    result.j = n % 499;
    n = n / 499;
    result.i = n % 500; // unnecessary, any legal n will already be between 0 and 499
    return result;
}

void PrintSelections(triple t) {
    int i = t.i;
    int j = t.j + (t.j >= i ? 1 : 0);           // skip over the first selection
    int lo = std::min(i, j), hi = std::max(i, j);
    int k = t.k;
    if (k >= lo) ++k;                           // skip the smaller of the two earlier picks...
    if (k >= hi) ++k;                           // ...then the larger one
    std::cout << "[" << i << "," << j << "," << k << "]" << std::endl;
}

void PrintRange(int start, int end) {
    for (int i = start; i < end; ++i) {
        PrintSelections(AsTriple(i));
    }
}
Now to shard, you can just take the numbers from 0 to 500*499*498, divide them into subranges in any way you'd like, and have each shard compute the permutation for each value in its subrange.
This trick is very handy for any problem in which you need to enumerate subsets.