Efficient way to check if sum is possible from a given set of numbers [duplicate] - c++

I've been tasked with helping some accountants solve a common problem they have - given a list of transactions and a total deposit, which transactions are part of the deposit? For example, say I have this list of numbers:
1.00
2.50
3.75
8.00
And I know that my total deposit is 10.50, so I can easily see that it's made up of the 8.00 and 2.50 transactions. However, given a hundred transactions and a deposit in the millions, it quickly becomes much more difficult.
In testing a brute force solution (which takes way too long to be practical), I had two questions:
With a list of about 60 numbers, it seems to find a dozen or more combinations for any reasonable total. I was expecting a single combination to satisfy my total, or maybe a few possibilities, but there always seem to be a ton of combinations. Is there a math principle that describes why this is? It seems that given a collection of random numbers of even moderate size, you can find multiple combinations that add up to just about any total you want.
I built a brute force solution for the problem, but it's clearly exponential (it examines on the order of 2^n subsets) and quickly grows out of control. Aside from the obvious shortcuts (excluding numbers larger than the total itself), is there a way to shorten the time to calculate this?
Details on my current (super-slow) solution:
The list of detail amounts is sorted largest to smallest, and then the following process runs recursively:
Take the next item in the list and see if adding it to your running total makes your total match the target. If it does, set aside the current chain as a match. If it falls short of your target, add it to your running total, remove it from the list of detail amounts, and then call this process again
This way it excludes the larger numbers quickly, cutting the list down to only the numbers it needs to consider. However, it's still exponential, and larger lists never seem to finish, so I'm interested in any shortcuts I might be able to take to speed this up. I suspect that even cutting 1 number out of the list would cut the calculation time in half.
Thanks for your help!

This special case of the Knapsack problem is called Subset Sum.
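Since the amounts are money with two decimal places, the standard attack is the pseudo-polynomial dynamic program over the target sum. Here is a minimal C++ sketch of that idea, assuming the amounts are first converted to integer cents (e.g. 10.50 becomes 1050); the function name and types are illustrative, not from the thread:

#include <cstdint>
#include <vector>

// Classic 0/1 subset-sum DP: reachable[s] ends up true if some
// subset of 'cents' sums exactly to s.
bool subset_sum_exists(const std::vector<int64_t>& cents, int64_t target) {
    std::vector<bool> reachable(target + 1, false);
    reachable[0] = true;                        // the empty subset
    for (int64_t c : cents) {
        for (int64_t s = target; s >= c; --s) { // backwards, so each item is used at most once
            if (reachable[s - c]) reachable[s] = true;
        }
    }
    return reachable[target];
}

This only reports whether a match exists; to recover which transactions were used, store for each reachable sum the item that first reached it and walk backwards from the target. The table needs one entry per cent of the deposit, which is why the running time is called pseudo-polynomial: a deposit in the millions means a table with hundreds of millions of entries.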

C# version
setup test:
using System;
using System.Collections.Generic;

public class Program
{
    public static void Main(string[] args)
    {
        // subtotal list
        List<double> totals = new List<double>(new double[] { 1, -1, 18, 23, 3.50, 8, 70, 99.50, 87, 22, 4, 4, 100.50, 120, 27, 101.50, 100.50 });
        // get matches
        List<double[]> results = Knapsack.MatchTotal(100.50, totals);
        // print results
        foreach (var result in results)
        {
            Console.WriteLine(string.Join(",", result));
        }
        Console.WriteLine("Done.");
        Console.ReadKey();
    }
}
code:
using System.Collections.Generic;
using System.Linq;

public class Knapsack
{
    internal static List<double[]> MatchTotal(double theTotal, List<double> subTotals)
    {
        List<double[]> results = new List<double[]>();

        while (subTotals.Contains(theTotal))
        {
            results.Add(new double[1] { theTotal });
            subTotals.Remove(theTotal);
        }

        // if no subtotals were passed
        // or all matched the total, return
        if (subTotals.Count == 0)
            return results;

        subTotals.Sort();
        double mostNegativeNumber = subTotals[0];
        if (mostNegativeNumber > 0)
            mostNegativeNumber = 0;

        // if there aren't any negative values
        // we can remove any values bigger than the total
        if (mostNegativeNumber == 0)
            subTotals.RemoveAll(d => d > theTotal);

        // if there aren't any negative values
        // and the sum is less than the total, no need to look further
        if (mostNegativeNumber == 0 && subTotals.Sum() < theTotal)
            return results;

        // get the combinations for the remaining subTotals
        // skip 1 since we already removed subTotals that match
        for (int choose = 2; choose <= subTotals.Count; choose++)
        {
            // get combinations for each length
            IEnumerable<IEnumerable<double>> combos = Combination.Combinations(subTotals.AsEnumerable(), choose);
            // add combinations where the sum matches the total to the result list
            results.AddRange(from combo in combos
                             where combo.Sum() == theTotal
                             select combo.ToArray());
        }

        return results;
    }
}

public static class Combination
{
    public static IEnumerable<IEnumerable<T>> Combinations<T>(this IEnumerable<T> elements, int choose)
    {
        return choose == 0 ?                    // if choose = 0
            new[] { new T[0] } :                // return empty Type array
            elements.SelectMany((element, i) => // else recursively iterate over the array to create combinations
                elements.Skip(i + 1).Combinations(choose - 1).Select(combo => (new[] { element }).Concat(combo)));
    }
}
results:
100.5
100.5
-1,101.5
1,99.5
3.5,27,70
3.5,4,23,70
3.5,4,23,70
-1,1,3.5,27,70
1,3.5,4,22,70
1,3.5,4,22,70
1,3.5,8,18,70
-1,1,3.5,4,23,70
-1,1,3.5,4,23,70
1,3.5,4,4,18,70
-1,3.5,8,18,22,23,27
-1,3.5,4,4,18,22,23,27
Done.
If subTotals are repeated, there will appear to be duplicate results (the desired effect). In reality, you will probably want to use each subTotal tupled with some ID, so you can relate it back to your data.

If I understand your problem correctly, you have a set of transactions, and you merely wish to know which of them could have been included in a given total. So if there are 4 possible transactions, then there are 2^4 = 16 possible sets to inspect. The problem is that for 100 possible transactions, the search space has 2^100 = 1267650600228229401496703205376 possible combinations to search over. For 1000 potential transactions in the mix, it grows to a total of
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
sets that you must test. Brute force will hardly be a viable solution on these problems.
Instead, use a solver that can handle knapsack problems. But even then, I'm not sure that you can generate a complete enumeration of all possible solutions without some variation of brute force.

There is a cheap Excel Add-in that solves this problem: SumMatch

The Excel Solver Addin as posted over on superuser.com has a great solution (if you have Excel) https://superuser.com/questions/204925/excel-find-a-subset-of-numbers-that-add-to-a-given-total

It's kind of like the 0-1 Knapsack problem, which is NP-complete but can be solved through dynamic programming in pseudo-polynomial time.
http://en.wikipedia.org/wiki/Knapsack_problem
But at the end of the algorithm you also need to check that the sum is what you wanted.

Depending on your data you could first look at the cents portion of each transaction. Like in your initial example, you know that 2.50 has to be part of the total because it is the only transaction with a non-zero cents part, and some subset of the cents parts has to produce the .50 in the total.
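A small C++ sketch of that pruning idea, assuming the amounts are stored as integer cents (the struct and function names here are made up for illustration):

#include <vector>

// Split out the transactions with a non-zero cents part: only they
// can influence the cents digits of the deposit total (mod 100).
struct Split {
    std::vector<long> wholeDollar; // cents % 100 == 0
    std::vector<long> fractional;  // cents % 100 != 0
};

Split splitByCents(const std::vector<long>& cents) {
    Split s;
    for (long c : cents) {
        (c % 100 == 0 ? s.wholeDollar : s.fractional).push_back(c);
    }
    return s;
}

Any matching subset must contain a sub-subset of the fractional amounts whose cents sum is congruent to the deposit's cents modulo 100, so you can enumerate those candidates first and only then fill in whole-dollar amounts, which is usually a much smaller search.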

Not a super efficient solution, but here's an implementation in CoffeeScript.
combinations returns all possible combinations of the elements in list
combinations = (list) ->
  permutations = Math.pow(2, list.length) - 1
  out = []
  combinations = []
  while permutations
    out = []
    for i in [0...list.length]
      y = ( 1 << i )
      if( y & permutations and (y isnt permutations))
        out.push(list[i])
    if out.length <= list.length and out.length > 0
      combinations.push(out)
    permutations--
  return combinations
and then find_components makes use of it to determine which numbers add up to total
find_components = (total, list) ->
  # given a list that is assumed to have only unique elements
  list_combinations = combinations(list)
  for combination in list_combinations
    sum = 0
    for number in combination
      sum += number
    if sum is total
      return combination
  return []
Here's an example:
list = [7.2, 3.3, 4.5, 6.0, 2, 4.1]
total = 7.2 + 2 + 4.1
console.log(find_components(total, list))
which returns [ 7.2, 2, 4.1 ]

#include <stdio.h>
#include <stdlib.h>

/* Takes at least 3 numbers as arguments.
 * First number is the desired sum.
 * Find the subset of the rest that comes closest
 * to the desired sum without going over.
 */

static long *elements;
static int nelements;

/* A linked list of some elements, not necessarily all */
/* The list represents the optimal subset for elements in the range [index..nelements-1] */
struct status {
    long sum;            /* sum of all the elements in the list */
    struct status *next; /* points to next element in the list */
    int index;           /* index into elements array of this element */
};

/* find the subset of elements[startingat .. nelements-1] whose sum is closest to but does not exceed desiredsum */
struct status *reportoptimalsubset(long desiredsum, int startingat) {
    struct status *sumcdr = NULL;
    struct status *sumlist = NULL;

    /* sum of zero elements or summing to zero */
    if (startingat == nelements || desiredsum == 0) {
        return NULL;
    }

    /* optimal sum using the current element */
    /* if the current elements[startingat] is too big, it won't fit, don't try it */
    if (elements[startingat] <= desiredsum) {
        sumlist = malloc(sizeof(struct status));
        sumlist->index = startingat;
        sumlist->next = reportoptimalsubset(desiredsum - elements[startingat], startingat + 1);
        sumlist->sum = elements[startingat] + (sumlist->next ? sumlist->next->sum : 0);
        if (sumlist->sum == desiredsum)
            return sumlist;
    }

    /* optimal sum not using the current element */
    sumcdr = reportoptimalsubset(desiredsum, startingat + 1);

    if (!sumcdr) return sumlist;
    if (!sumlist) return sumcdr;
    return (sumcdr->sum < sumlist->sum) ? sumlist : sumcdr;
}

int main(int argc, char **argv) {
    struct status *result = NULL;
    long desiredsum = strtol(argv[1], NULL, 10);

    nelements = argc - 2;
    elements = malloc(sizeof(long) * nelements);
    for (int i = 0; i < nelements; i++) {
        elements[i] = strtol(argv[i + 2], NULL, 10);
    }

    result = reportoptimalsubset(desiredsum, 0);
    if (result)
        printf("optimal subset = %ld\n", result->sum);
    while (result) {
        printf("%ld + ", elements[result->index]);
        result = result->next;
    }
    printf("\n");
}

By the way, it's best to avoid floats and doubles when doing arithmetic and equality comparisons like this; rounding error can make an exact-sum test fail, so prefer integer cents.

Related

Is Coin Change Algorithm That Output All Combinations Still Solvable By DP?

For example, if the total amount should be 5 and I have coins with values of 1 and 2, then there are 3 combinations:
1 1 1 1 1
1 1 1 2
1 2 2
I've seen some posts about how to calculate the total number of combinations with dynamic programming or with recursion, but I want to output all the combinations, as in my example above. I've come up with a recursive solution below.
It's basically a backtracking algorithm: I start with the smallest coins first and try to get to the total amount, then I remove some coins and try using the second smallest coins... You can run my code below at http://cpp.sh/
The total amount is 10 and the available coin values are 1, 2, 5 in my code.
#include <iostream>
#include <stdlib.h>
#include <iomanip>
#include <cmath>
#include <vector>
using namespace std;

vector<vector<int>> res;
vector<int> values;
int total = 0;

void helper(vector<int>& curCoins, int current, int i){
    int old = current;
    if(i==values.size())
        return;
    int val = values[i];
    while(current<total){
        current += val;
        curCoins.push_back(val);
    }
    if(current==total){
        res.push_back(curCoins);
    }
    while (current>old) {
        current -= val;
        curCoins.pop_back();
        if (current>=0) {
            helper(curCoins, current, i+1);
        }
    }
}

int main(int argc, const char * argv[]) {
    total = 10;
    values = {1,2,5};
    vector<int> chosenCoins;
    helper(chosenCoins, 0, 0);

    cout<<"number of combinations: "<<res.size()<<endl;
    for (int i=0; i<res.size(); i++) {
        for (int j=0; j<res[i].size(); j++) {
            if(j!=0)
                cout<<" ";
            cout<<res[i][j];
        }
        cout<<endl;
    }
    return 0;
}
Is there a better solution to output all the combinations for this problem? Dynamic programming?
EDIT:
My question is: is this problem solvable using dynamic programming?
Thanks for the help. I've implemented the DP version here: Coin Change DP Algorithm Print All Combinations
A DP solution:
We have
{solutions(n)} = Union({solutions(n - 1) + coin1},
                       {solutions(n - 2) + coin2},
                       {solutions(n - 5) + coin5})
So in code:
#include <array>
#include <set>
#include <vector>

using combi_set = std::set<std::array<int, 3u>>;

void append(combi_set& res, const combi_set& prev, const std::array<int, 3u>& values)
{
    for (const auto& p : prev) {
        res.insert({{{p[0] + values[0], p[1] + values[1], p[2] + values[2]}}});
    }
}

combi_set computeCombi(int total)
{
    std::vector<combi_set> combis(total + 1);

    combis[0].insert({{{0, 0, 0}}});
    for (int i = 1; i <= total; ++i) {
        append(combis[i], combis[i - 1], {{1, 0, 0}});
        if (i - 2 >= 0) { append(combis[i], combis[i - 2], {{0, 1, 0}}); }
        if (i - 5 >= 0) { append(combis[i], combis[i - 5], {{0, 0, 1}}); }
    }
    return combis[total];
}
Exhaustive search is unlikely to be made 'better' by dynamic programming, but here's a possible solution:
Start with a 2D array of combination strings, arr[value][index], where value is the total worth of the coins. Let X be the target value;
starting from arr[0][0] = "";
for each coin denomination n, from i = 0 to X-n, you copy all the strings from arr[i] to arr[i+n] and append n to each of the strings.
for example with n=5 you would end up with
arr[0][0] = "", arr[5][0] = "5" and arr[10][0] = "5 5"
Hope that made sense. Typical DP would just count instead of building strings (you can also replace the strings with an int vector of counts).
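For reference, here is a minimal C++ sketch of that counting variant (the function name is illustrative). Keeping the loop over denominations on the outside is what makes the count order-insensitive, i.e. it counts combinations rather than permutations:

#include <cstdint>
#include <vector>

// ways[s] = number of order-insensitive coin combinations summing to s.
uint64_t count_combinations(int total, const std::vector<int>& coins) {
    std::vector<uint64_t> ways(total + 1, 0);
    ways[0] = 1;                          // one way to make 0: use no coins
    for (int c : coins) {                 // outer loop over denominations
        for (int s = c; s <= total; ++s) {
            ways[s] += ways[s - c];
        }
    }
    return ways[total];
}

For total = 5 and coins {1, 2} this returns 3, matching the example at the top of the question.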
Assume that you have K, the total size of the output you are expecting (the total number of coins in all the combinations). Obviously you cannot have a solution that runs faster than O(K) if you actually need to output them all. As K can be very large, this will be a very long running time, and in the worst case you will get little profit from the dynamic programming.
However, you still can do better than your straightforward recursive solution. Namely, you can have a solution running in O(N*S+K), where N is the number of coins you have and S is the total sum. This will not be better than straightforward solution for the worst possible K, but if K is not so big, you will get it running faster than your recursive solution.
This O(N*S+K) solution can be relatively simply coded. First you run the standard DP solution to find out, for each sum current and each i, whether the sum current can be composed of the first i coin types. You do not yet calculate all the solutions; you just find out whether at least one solution exists for each current and i. Then, you write a recursive function similar to what you have already written, but before you try each combination, you check using your DP table whether it is worth trying, that is, whether at least one solution exists. Something like:
void helper(vector<int>& curCoins, int current, int i){
    if (!solutionExists[current][i]) return;
    // then your code goes
This way each branch of the recursion tree will end in finding a solution, so the total recursion tree size will be O(K), and the total running time will be O(N*S+K).
Note also that all this is worthwhile only if you really need to output all the combinations. If you need to do something else with the combinations you get, it is very probable that you do not actually need all of them, and you may be able to adapt the DP solution instead. For example, if you want to print only the m-th of all solutions, this can be done in O(N*S).
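To make the pruning concrete, here is one possible way to build such a table in C++ (a sketch only; the names are chosen to line up with the question's helper, where i indexes into values and every coin may be reused):

#include <vector>

// exists[i][s] == true if sum s is reachable using coin types i..n-1,
// with unlimited repetition of each coin.
std::vector<std::vector<bool>> buildSolutionExists(const std::vector<int>& values, int total) {
    const int n = static_cast<int>(values.size());
    std::vector<std::vector<bool>> exists(n + 1, std::vector<bool>(total + 1, false));
    exists[n][0] = true;  // the empty suffix of coins can only make 0
    for (int i = n - 1; i >= 0; --i) {
        for (int s = 0; s <= total; ++s) {
            exists[i][s] = exists[i + 1][s]                      // skip coin type i
                || (s >= values[i] && exists[i][s - values[i]]); // use coin type i (again)
        }
    }
    return exists;
}

The recursion would then start with if (!exists[i][total - current]) return; so that every surviving branch is guaranteed to lead to at least one solution.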
You just need to make two passes over the data structure (a hash table will work well as long as you've got a relatively small number of coins).
The first one finds all unique sums less than the desired total (actually you could stop perhaps at 1/2 the desired total) and records the simplest way (least additions required) to obtain that sum. This is essentially the same as the DP.
The second pass then starts at the desired total and works its way backwards through the data to output all the ways that the total can be generated.
This ends up being a two-stage version of what Petr is suggesting.
The actual number of non-distinct valid combinations for amounts {1, 2, 5} and N = 10 is 128, using a pure recursive exhaustive technique (code below). My question is: can an exhaustive search be improved with memoization/dynamic programming? If so, how can I modify the algorithm below to incorporate such techniques?
public class Recursive {
    static int[] combo = new int[100];

    public static void main(String argv[]) {
        int n = 10;
        int[] amounts = {1, 2, 5};
        ways(n, amounts, combo, 0, 0, 0);
    }

    public static void ways(int n, int[] amounts, int[] combo, int startIndex, int sum, int index) {
        if(sum == n) {
            printArray(combo, index);
        }
        if(sum > n) {
            return;
        }
        for(int i=0;i<amounts.length;i++) {
            sum = sum + amounts[i];
            combo[index] = amounts[i];
            ways(n, amounts, combo, startIndex, sum, index + 1);
            sum = sum - amounts[i];
        }
    }

    public static void printArray(int[] combo, int index) {
        for(int i=0;i < index; i++) {
            System.out.print(combo[i] + " ");
        }
        System.out.println();
    }
}

Understanding Sum of subsets

I've just started learning backtracking algorithms at college. Somehow I've managed to make a program for the Subset-Sum problem. It works fine, but then I discovered that my program doesn't give out all the possible combinations.
For example: there might be a hundred combinations for a target sum, but my program gives only 30.
Here is the code. It would be a great help if anyone could point out my mistake.
#include <iostream>
using namespace std;

int tot=0;      // tot is the total sum of all the numbers in the set
int prob[500], d, s[100], top = -1, n; // n = number of elements in the set; prob[] holds the set

void show()     // displays the numbers chosen so far (the prob values at indexes s[0..top])
{
    for(int j=0;j<=top;j++)
        cout<<prob[s[j]]<<" ";
    cout<<"\n";
}

void subset()
{
    int i=0,sum=0; // sum is updated at every iteration and checked against 'd'
    while(i<n)
    {
        if((sum+prob[i] <= d)&&(prob[i] <= d))
        {
            s[++top] = i;
            sum+=prob[i];
        }
        if(sum == d)   // d is the target sum
        {
            show();
            top = -1;  // top points to the most recently added position in 's'
            i = s[top+1];
            sum = 0;
        }
        i++;
        while(i == n && top!=-1)
        {
            sum-=prob[s[top]];
            i = s[top--]+1;
        }
    }
}

int main()
{
    cout<<"Enter number of elements : ";cin>>n;
    cout<<"Enter required sum : ";cin>>d;
    cout<<"Enter SET :\n";
    for(int i=0;i<n;i++)
    {
        cin>>prob[i];
        tot+=prob[i];
    }
    if(d <= tot)
    {
        subset();
    }
    return 0;
}
When I run the program:
Enter number of elements : 7
Enter required sum : 12
Enter SET :
4 3 2 6 8 12 21
SOLUTION 1 : 4, 2, 6
SOLUTION 2 : 12
Although 4, 8 is also a solution, my program doesn't show it.
It's even worse with the number of inputs at 100 or more. There will be at least 10000 combinations, but my program shows 100.
The logic which I am trying to follow:
Take the elements of the main SET into a subset as long as the sum of the subset remains less than or equal to the target sum.
If the addition of a particular number to the subset sum makes it larger than the target, don't take it.
Once it reaches the end of the set and the answer has not been found, remove the most recently taken number from the subset and start looking at the numbers in the positions after the position of the number just removed (since what I store in the array 's' is the positions of the selected numbers from the main SET).
The solutions you are going to find depend on the order of the entries in the set due to your "as long as" clause in step 1.
If you take entries as long as they don't get you over the target, once you've taken e.g. '4' and '2', '8' will take you over the target, so as long as '2' is in your set before '8', you'll never get a subset with '4' and '8'.
You should either add the possibility of skipping an entry (or adding it to one subset but not to another), or change the order of your set and re-examine it.
It may be that a stack-free solution is possible, but the usual (and generally easiest!) way to implement backtracking algorithms is through recursion, e.g.:
int i = 0, n; // i needs to be visible to show()
int s[100];

// Considering only the subset of prob[] values whose indexes are >= start,
// print all subsets that sum to total.
void new_subsets(int start, int total) {
    if (total == 0) show(); // total == 0 means we already have a solution

    // Look for the next number that could fit
    while (start < n && prob[start] > total) {
        ++start;
    }

    if (start < n) {
        // We found a number, prob[start], that can be added without overflow.
        // Try including it by solving the subproblem that results.
        s[i++] = start;
        new_subsets(start + 1, total - prob[start]);
        i--;

        // Now try excluding it by solving the subproblem that results.
        new_subsets(start + 1, total);
    }
}
You would then call this from main() with new_subsets(0, d);. Recursion can be tricky to understand at first, but it's important to get your head around it -- try easier problems (e.g. generating Fibonacci numbers recursively) if the above doesn't make any sense.
Working instead with the solution you have given, one problem I can see is that as soon as you find a solution, you wipe it out and start looking for a new solution from the number to the right of the first number that was included in this solution (top = -1; i = s[top+1]; implies i = s[0], and there is a subsequent i++;). This will miss solutions that begin with the same first number. You should just do if (sum == d) { show(); } instead, to make sure you get them all.
I initially found your inner while loop pretty confusing, but I think it's actually doing the right thing: once i hits the end of the array, it will delete the last number added to the partial solution, and if this number was the last number in the array, it will loop again to delete the second-to-last number from the partial solution. It can never loop more than twice because numbers included in a partial solution are all at distinct positions.
I haven't analysed the algorithm in detail, but what struck me is that your algorithm doesn't account for the possibility that, after having one solution that starts with number X, there could be multiple solutions starting with that number.
A first improvement would be to avoid resetting your stack s and the running sum after you printed the solution.

n-th or Arbitrary Combination of a Large Set

Say I have a set of numbers from [0, ....., 499]. Combinations are currently being generated sequentially using the C++ std::next_permutation. For reference, the size of each tuple I am pulling out is 3, so I am returning sequential results such as [0,1,2], [0,1,3], [0,1,4], ... [497,498,499].
Now, I want to parallelize the code that this is sitting in, so a sequential generation of these combinations will no longer work. Are there any existing algorithms for computing the ith combination of 3 from 500 numbers?
I want to make sure that each thread, regardless of the iterations of the loop it gets, can compute a standalone combination based on the i it is iterating with. So if I want the combination for i=38 in thread 1, I can compute [1,2,5] while simultaneously computing i=0 in thread 2 as [0,1,2].
EDIT: The statement below is irrelevant; I mixed myself up.
I've looked at algorithms that utilize factorials to narrow down each individual element from left to right, but I can't use these as 500! sure won't fit into memory. Any suggestions?
Here is my shot:
int k = 527;             // the kth combination is calculated
int N = 500;             // number of elements you have
int a = 0, b = 1, c = 2; // a, b, c are the numbers you get out

while(k >= (N-a-1)*(N-a-2)/2){
    k -= (N-a-1)*(N-a-2)/2;
    a++;
}
b = a+1;
while(k >= N-1-b){
    k -= N-1-b;
    b++;
}
c = b+1+k;
cout << "["<<a<<","<<b<<","<<c<<"]"<<endl; // the result
I got this by thinking about how many combinations there are until the next number is increased. However, it only works for three elements, and I can't guarantee that it is correct. It would be cool if you compared it to your results and gave some feedback.
If you are looking for a way to obtain the lexicographic index or rank of a unique combination instead of a permutation, then your problem falls under the binomial coefficient. The binomial coefficient handles problems of choosing unique combinations in groups of K with a total of N items.
I have written a class in C# to handle common functions for working with the binomial coefficient. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
Converts the K-indexes to the proper lexicographic index or rank of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the set.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it is also faster than older iterative solutions.
Uses Mark Dominus's method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coefficient.
The following tested code will iterate through each unique combinations:
public void Test10Choose5()
{
    String S;
    int Loop;
    int N = 500; // Total number of elements in the set.
    int K = 3;   // Total number of elements in each group.

    // Create the bin coeff object required to get all
    // the combos for this N choose K combination.
    BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
    int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);

    // The KIndexes array specifies the indexes for a lexicographic element.
    int[] KIndexes = new int[K];
    StringBuilder SB = new StringBuilder();

    // Loop thru all the combinations for this N choose K case.
    for (int Combo = 0; Combo < NumCombos; Combo++)
    {
        // Get the k-indexes for this combination.
        BC.GetKIndexes(Combo, KIndexes);

        // Verify that the KIndexes returned can be used to retrieve the
        // rank or lexicographic order of the KIndexes in the table.
        int Val = BC.GetIndex(true, KIndexes);
        if (Val != Combo)
        {
            S = "Val of " + Val.ToString() + " != Combo Value of " + Combo.ToString();
            Console.WriteLine(S);
        }

        SB.Remove(0, SB.Length);
        for (Loop = 0; Loop < K; Loop++)
        {
            SB.Append(KIndexes[Loop].ToString());
            if (Loop < K - 1)
                SB.Append(" ");
        }
        S = "KIndexes = " + SB.ToString();
        Console.WriteLine(S);
    }
}
You should be able to port this class over fairly easily to C++. You probably will not have to port over the generic part of the class to accomplish your goals. Your test case of 500 choose 3 yields 20,708,500 unique combinations, which will fit in a 4 byte int. If 500 choose 3 is simply an example case and you need to choose combinations greater than 3, then you will have to use longs or perhaps fixed point int.
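If you would rather not port the whole class, the core ranking/unranking is compact on its own. Here is a hedged C++ sketch of unranking via the combinatorial number system, which as far as I can tell is the same Pascal's Triangle property the class exploits; the names are mine, not the class's API:

#include <cstdint>

// n choose k; the division is exact at every step of the loop.
uint64_t binom(uint64_t n, uint64_t k) {
    if (k > n) return 0;
    uint64_t r = 1;
    for (uint64_t i = 1; i <= k; ++i) {
        r = r * (n - k + i) / i;
    }
    return r;
}

// Write the rank-th (0-based, lexicographic) k-combination of {0..n-1} into out[].
void unrank(uint64_t rank, int n, int k, int out[]) {
    int x = 0;
    for (int pos = 0; pos < k; ++pos) {
        // skip whole blocks of combinations that start with a smaller value
        while (binom(n - x - 1, k - pos - 1) <= rank) {
            rank -= binom(n - x - 1, k - pos - 1);
            ++x;
        }
        out[pos] = x++;
    }
}

For n = 500 and k = 3, unrank(0, 500, 3, out) yields {0, 1, 2} and unrank(1, 500, 3, out) yields {0, 1, 3}, so each thread can jump straight to its own range of ranks.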
You can describe a particular selection of 3 out of 500 objects as a triple (i, j, k), where i is a number from 0 to 499 (the index of the first number), j ranges from 0 to 498 (the index of the second, skipping over whichever number was first), and k ranges from 0 to 497 (index of the last, skipping both previously-selected numbers). Given that, it's actually pretty easy to enumerate all the possible selections: starting with (0,0,0), increment k until it reaches its maximum value, then increment j and reset k to 0, and so on, until j reaches its own maximum value; then increment i, reset both j and k, and continue.
If this description sounds familiar, it's because it's exactly the same way that incrementing a base-10 number works, except that the base is much funkier, and in fact the base varies from digit to digit. You can use this insight to implement a very compact version of the idea: for any integer n from 0 to 500*499*498 - 1, you can get:
#include <algorithm>
#include <iostream>

struct triple {
    int i, j, k;
};

triple AsTriple(int n) {
    triple result;
    result.k = n % 498;
    n = n / 498;
    result.j = n % 499;
    n = n / 499;
    result.i = n % 500; // unnecessary, any legal n will already be between 0 and 499
    return result;
}

void PrintSelections(triple t) {
    int i = t.i;
    int j = t.j + (i <= t.j ? 1 : 0);   // skip over the first index
    int k = t.k;
    k += (std::min(i, j) <= k ? 1 : 0); // skip the smaller of i and j
    k += (std::max(i, j) <= k ? 1 : 0); // then the larger
    std::cout << "[" << i << "," << j << "," << k << "]" << std::endl;
}

void PrintRange(int start, int end) {
    for (int n = start; n < end; ++n) {
        PrintSelections(AsTriple(n));
    }
}
Now to shard, you can just take the numbers from 0 to 500*499*498, divide them into subranges in any way you'd like, and have each shard compute the permutation for each value in its subrange.
This trick is very handy for any problem in which you need to enumerate subsets.

Weighted probability with long doubles

I am working with an array of roughly 2000 elements in C++.
Each element represents the probability of that element being selected randomly.
I then convert this array into a cumulative array, with the intention of using it to work out which element to choose when a dice is rolled.
Example array:
{1,2,3,4,5}
Example cumulative array:
{1,3,6,10,15}
I want to be able to select 3 in the cumulative array when numbers 3, 4 or 5 are rolled.
The added complexity is that my array is made up of long doubles. Here's an example of a few consecutive elements:
0.96930161525189592646367317541056252139242133125662803649902343750
0.96941377254127855667142910078837303444743156433105468750000000000
0.96944321382974149711383993199831365927821025252342224121093750000
0.96946143938926617454089618153290075497352518141269683837890625000
0.96950069444055009509463721739663810694764833897352218627929687500
0.96951751803395748961766908990966840065084397792816162109375000000
This could be a terrible way of doing weighted probabilities with this data set, so I'm open to any suggestions of better ways of working this out.
You can use partial_sum:
#include <numeric>
using std::partial_sum;

const unsigned int SIZE = 5;
int array[SIZE] = {1,2,3,4,5};
int partials[SIZE] = {0};

partial_sum(array, array+SIZE, partials);
// partials is now {1,3,6,10,15}
The value you want from the array is available from the partial sums:
12 == array[2] + array[3] + array[4];
12 == partials[4] - partials[1];
The total is obviously the last value in the partial sums:
15 == partials[4];
Consider storing the information as an integer numerator and denominator so that there is no loss of precision until the final step.
You can actually do this using stream selection without having to compute an array of partial sums. Here's code I have for this in Java:
public static int selectRandomWeighted(double[] wts, Random rnd) {
    int selected = 0;
    double total = wts[0];

    for( int i = 1; i < wts.length; i++ ) {
        total += wts[i];
        if( rnd.nextDouble() <= (wts[i] / total)) {
            selected = i;
        }
    }

    return selected;
}
The above could potentially be further improved using Kahan summation if you want to preserve as many digits of accuracy in the sum as possible.
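Kahan summation is only a few lines if you want to try it; a sketch in C++, since that's the questioner's language (note that aggressive floating-point optimization flags can defeat the compensation):

#include <vector>

// Compensated (Kahan) summation: 'c' carries the low-order bits that
// a plain running sum would round away.
long double kahan_sum(const std::vector<long double>& xs) {
    long double sum = 0.0L, c = 0.0L;
    for (long double x : xs) {
        long double y = x - c;
        long double t = sum + y;
        c = (t - sum) - y;   // what the addition just lost
        sum = t;
    }
    return sum;
}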
However, if you want to draw from this array repeatedly, then pre-computing an array of partial sums and using binary search to find the right index will be faster.
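That repeated-draw setup might look like the following in C++ (a sketch; std::partial_sum and std::upper_bound are the standard pieces, the rest is illustrative):

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Pick an index with probability proportional to its weight by
// binary-searching a precomputed cumulative array.
int weighted_pick(const std::vector<long double>& cumulative, std::mt19937& rng) {
    std::uniform_real_distribution<long double> roll(0.0L, cumulative.back());
    // first element whose cumulative weight exceeds the rolled value
    return static_cast<int>(std::upper_bound(cumulative.begin(), cumulative.end(), roll(rng))
                            - cumulative.begin());
}

int main() {
    std::vector<long double> weights = {1, 2, 3, 4, 5};
    std::vector<long double> cumulative(weights.size());
    std::partial_sum(weights.begin(), weights.end(), cumulative.begin());
    std::mt19937 rng{std::random_device{}()};
    int idx = weighted_pick(cumulative, rng); // e.g. index 2 for rolls in (3, 6]
    (void)idx;
}

Each draw is then O(log n) after the one-time O(n) prefix sum.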
OK, I think I've solved this one.
I just did a binary split search, but instead of just having
if (arr[middle] == value)
I added an OR:
if (arr[middle] == value || (arr[middle] < value && arr[middle+1] > value))
This seems to handle it in the way I was hoping for.

Fastest way to obtain the largest X numbers from a very large unsorted list?

I'm trying to obtain the top, say, 100 scores from a list of scores being generated by my program. Unfortunately the list is huge (on the order of millions to billions), so sorting is a time-intensive portion of the program.
What's the best way of doing the sorting to get the top 100 scores?
The only two methods I can think of so far are either first generating all the scores into a massive array and then sorting it and taking the top 100, or second, generating X number of scores, sorting it and truncating the top 100 scores, then continuing to generate more scores, adding them to the truncated list and then sorting it again.
Either way I do it, it still takes more time than I would like; any ideas on how to do it in an even more efficient way? (I've never taken programming courses before, maybe those of you with comp sci degrees know about efficient algorithms to do this, at least that's what I'm hoping.)
Lastly, what's the sorting algorithm used by the standard sort() function in C++?
Thanks,
-Faken
Edit: Just for anyone who is curious...
I did a few time trials on the before and after and here are the results:
Old program (performs sorting after each outer loop iteration):
top 100 scores: 147 seconds
top 10 scores: 147 seconds
top 1 scores: 146 seconds
Sorting disabled: 55 seconds
new program (implementing tracking of only top scores and using default sorting function):
top 100 scores: 350 seconds <-- hmm...worse than before
top 10 scores: 103 seconds
top 1 scores: 69 seconds
Sorting disabled: 51 seconds
new rewrite (optimizations in data stored, hand written sorting algorithm):
top 100 scores: 71 seconds <-- Very nice!
top 10 scores: 52 seconds
top 1 scores: 51 seconds
Sorting disabled: 50 seconds
Done on a Core 2, 1.6 GHz... I can't wait till my Core i7 860 arrives...
There are a lot of other, even more aggressive optimizations for me to work out (mainly in the area of reducing the number of iterations I run), but as it stands right now, the speed is more than good enough; I might not even bother to work out those algorithm optimizations.
Thanks to everyone for their input!
Take the first 100 scores, and sort them in an array.
Take the next score, and insertion-sort it into the array (starting at the "small" end).
Drop the 101st value.
Continue with the next value, repeating from step 2, until done.
Over time, the list will resemble the 100 largest values more and more, so more often you will find that the insertion sort immediately aborts, finding that the new value is smaller than the smallest of the candidates for the top 100.
You can do this in O(n log k) time for the top k (effectively linear when k is a fixed 100), without sorting the whole list, using a heap:
#!/usr/bin/python

import heapq

def top_n(l, n):
    top_n = []
    smallest = None

    for elem in l:
        if len(top_n) < n:
            top_n.append(elem)
            if len(top_n) == n:
                heapq.heapify(top_n)
                smallest = heapq.nsmallest(1, top_n)[0]
        else:
            if elem > smallest:
                heapq.heapreplace(top_n, elem)
                smallest = heapq.nsmallest(1, top_n)[0]

    return sorted(top_n)

def random_ints(n):
    import random
    for i in range(0, n):
        yield random.randint(0, 10000)

print top_n(random_ints(1000000), 100)
Times on my machine (Core2 Q6600, Linux, Python 2.6, measured with bash time builtin):
100000 elements: .29 seconds
1000000 elements: 2.8 seconds
10000000 elements: 25.2 seconds
Edit/addition: In C++, you can use std::priority_queue in much the same way as Python's heapq module is used here. You'll want to use the std::greater ordering instead of the default std::less, so that the top() member function returns the smallest element instead of the largest one. C++'s priority queue doesn't have the equivalent of heapreplace, which replaces the top element with a new one, so instead you'll want to pop the top (smallest) element and then push the newly seen value. Other than that the algorithm translates quite cleanly from Python to C++.
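A minimal sketch of that translation (the function name and score type are illustrative):

#include <functional>
#include <queue>
#include <vector>

// Min-heap (std::greater) of at most k scores: top() is the smallest
// of the current top k, i.e. the next candidate for eviction.
std::vector<long> top_k(const std::vector<long>& scores, std::size_t k) {
    std::priority_queue<long, std::vector<long>, std::greater<long>> heap;
    for (long s : scores) {
        if (heap.size() < k) {
            heap.push(s);
        } else if (s > heap.top()) {
            heap.pop();   // no heapreplace equivalent: pop, then push
            heap.push(s);
        }
    }
    std::vector<long> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    return result;        // ascending order
}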
Here's the 'natural' C++ way to do this:
std::vector<Score> v;
// fill in v
std::partial_sort(v.begin(), v.begin() + 100, v.end(), std::greater<Score>());
std::sort(v.begin(), v.begin() + 100);
This is linear in the number of scores.
The algorithm used by std::sort isn't specified by the standard, but libstdc++ (used by g++) uses an "adaptive introsort", which is essentially a median-of-3 quicksort down to a certain level, followed by an insertion sort.
Declare an array where you can put the 100 best scores. Loop through the huge list and check for each item whether it qualifies to be inserted in the top 100. Use a simple insertion sort to add an item to the top list.
Something like this (C# code, but you get the idea):
Score[] toplist = new Score[100];
int size = 0;

foreach (Score score in hugeList) {
    int pos = size;
    while (pos > 0 && toplist[pos - 1] < score) {
        pos--;
        if (pos < 99) toplist[pos + 1] = toplist[pos];
    }
    if (size < 100) size++;
    if (pos < size) toplist[pos] = score;
}
I tested it on my computer (Core 2 Duo 2.54 GHz, Win 7 x64) and I can process 100,000,000 items in 369 ms.
Since speed is of the essence here, and 40,000 possible highscore values is totally maintainable by any of today's computers, I'd resort to bucket sort for simplicity. My guess is that it would outperform any of the algorithms proposed thus far. The downside is that you'd have to determine some upper limit for the highscore values.
So, let's assume your max highscore value is 40,000:
Make an array of 40,000 entries. Loop through your highscore values. Each time you encounter highscore x, increase your array[x] by one. After this, all you have to do is count the top entries in your array until you have reached 100 counted highscores.
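A minimal C++ sketch of this bucket-count idea, assuming integer scores with a known upper bound (names are illustrative):

#include <iostream>
#include <vector>

// Count occurrences per score value, then walk down from the top
// bucket until the wanted number of scores has been accounted for.
void report_top(const std::vector<int>& scores, int max_score, int wanted) {
    std::vector<int> counts(max_score + 1, 0);
    for (int s : scores) ++counts[s];
    for (int v = max_score; v >= 0 && wanted > 0; --v) {
        if (counts[v] > 0) {
            std::cout << counts[v] << " score(s) with value " << v << "\n";
            wanted -= counts[v];
        }
    }
}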
You can do it in Haskell like this:
largest100 xs = take 100 $ sortBy (flip compare) xs
This looks like it sorts all the numbers into descending order (the "flip compare" bit reverses the arguments to the standard comparison function) and then returns the first 100 entries from the list. But Haskell is lazily evaluated, so the sortBy function does just enough sorting to find the first 100 numbers in the list, and then stops.
Purists will note that you could also write the function as
largest100 = take 100 . sortBy (flip compare)
This means just the same thing, but illustrates the Haskell style of composing a new function out of the building blocks of other functions rather than handing variables around the place.
You want the absolute largest X numbers, so I'm guessing you don't want some sort of heuristic. How unsorted is the list? If it's pretty random, your best bet really is just to do a quick sort on the whole list and grab the top X results.
If you can filter scores during the list generation, that's way way better. Only ever store X values, and every time you get a new value, compare it to those X values. If it's less than all of them, throw it out. If it's bigger than one of them, throw out the new smallest value.
If X is small enough you can even keep your list of X values sorted so that you are comparing your new number to a sorted list of values, you can make an O(1) check to see if the new value is smaller than all of the rest and thus throw it out. Otherwise, a quick binary search can find where the new value goes in the list and then you can throw away the first value of the array (assuming the first element is the smallest element).
Place the data into a balanced tree structure (probably a red-black tree) that does the sorting in place. Insertions should be O(lg n). Grabbing the highest x scores should be O(lg n) as well.
You can prune the tree every once in a while if you find you need optimizations at some point.
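In C++, std::multiset is typically implemented as a red-black tree, so a pruned version of this idea is short (a sketch; the cap of 100 is the question's):

#include <set>

// Keep at most the 100 best scores seen so far: insert each new score,
// then evict the current minimum whenever the tree grows past 100.
struct TopScores {
    std::multiset<long> scores;
    void add(long s) {
        scores.insert(s);                 // O(log n)
        if (scores.size() > 100) {
            scores.erase(scores.begin()); // begin() is the smallest element
        }
    }
};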
If you only need to report the value of top 100 scores (and not any associated data), and if you know that the scores will all be in a finite range such as [0,100], then an easy way to do it is with "counting sort"...
Basically, create an array representing all possible values (e.g. an array of size 101 if scores can range from 0 to 100 inclusive), and initialize all the elements of the array with a value of 0. Then, iterate through the list of scores, incrementing the corresponding entry in the list of achieved scores. That is, compile the number of times each score in the range has been achieved. Then, working from the end of the array to the beginning, you can pick out the top X scores. Here is some pseudo-code:
let type Score be an integer ranging from 0 to 100, inclusive.
let scores be an array of Score objects
let scorerange be an array of integers of size 101.

for i in [0,100]
    set scorerange[i] = 0
for each score in scores
    set scorerange[score] = scorerange[score] + 1

let top be the number of top scores to report
let idx be an integer initialized to the end of scorerange (i.e. 100)
while (top > 0) and (idx >= 0):
    if scorerange[idx] > 0:
        report "There are " scorerange[idx] " scores with value " idx
        top = top - scorerange[idx]
    idx = idx - 1
I answered this question in response to an interview question in 2008. I implemented a templatized priority queue in C#.
using System;
using System.Collections.Generic;
using System.Text;

namespace CompanyTest
{
    // Based on pre-generics C# implementation at
    // http://www.boyet.com/Articles/WritingapriorityqueueinC.html
    // and wikipedia article
    // http://en.wikipedia.org/wiki/Binary_heap
    class PriorityQueue<T>
    {
        struct Pair
        {
            T val;
            int priority;

            public Pair(T v, int p)
            {
                this.val = v;
                this.priority = p;
            }

            public T Val { get { return this.val; } }
            public int Priority { get { return this.priority; } }
        }

        #region Private members
        private System.Collections.Generic.List<Pair> array = new System.Collections.Generic.List<Pair>();
        #endregion

        #region Constructor
        public PriorityQueue()
        {
        }
        #endregion

        #region Public methods
        public void Enqueue(T val, int priority)
        {
            Pair p = new Pair(val, priority);
            array.Add(p);
            bubbleUp(array.Count - 1);
        }

        public T Dequeue()
        {
            if (array.Count <= 0)
                throw new System.InvalidOperationException("Queue is empty");
            else
            {
                Pair result = array[0];
                array[0] = array[array.Count - 1];
                array.RemoveAt(array.Count - 1);
                if (array.Count > 0)
                    trickleDown(0);
                return result.Val;
            }
        }
        #endregion

        #region Private methods
        private static int ParentOf(int index)
        {
            return (index - 1) / 2;
        }

        private static int LeftChildOf(int index)
        {
            return (index * 2) + 1;
        }

        private static bool ParentIsLowerPriority(Pair parent, Pair item)
        {
            return (parent.Priority < item.Priority);
        }

        // Move high priority items from the bottom up the heap
        private void bubbleUp(int index)
        {
            Pair item = array[index];
            int parent = ParentOf(index);
            while ((index > 0) && ParentIsLowerPriority(array[parent], item))
            {
                // Parent is lower priority -- move it down
                array[index] = array[parent];
                index = parent;
                parent = ParentOf(index);
            }
            // Write the item once in its correct place
            array[index] = item;
        }

        // Push low priority items from the top of the heap down
        private void trickleDown(int index)
        {
            Pair item = array[index];
            int child = LeftChildOf(index);
            while (child < array.Count)
            {
                bool rightChildExists = ((child + 1) < array.Count);
                if (rightChildExists)
                {
                    bool rightChildIsHigherPriority = (array[child].Priority < array[child + 1].Priority);
                    if (rightChildIsHigherPriority)
                        child++;
                }
                // array[child] points at the higher priority sibling -- move it up
                array[index] = array[child];
                index = child;
                child = LeftChildOf(index);
            }
            // Put the former root in its correct place
            array[index] = item;
            bubbleUp(index);
        }
        #endregion
    }
}
Median of medians algorithm: a linear-time selection algorithm that can find the 100th-largest score, and partition the list around it, without sorting everything.
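In C++ you do not have to hand-roll it: std::nth_element performs selection (implementations typically use introselect, which can fall back on median-of-medians-style pivoting) and will move the 100 largest scores to the front in linear time on average. A sketch:

#include <algorithm>
#include <vector>

// Partition so the 100 largest scores occupy the first 100 slots
// (unordered), then sort just those for display.
void keep_top_100(std::vector<long>& scores) {
    if (scores.size() > 100) {
        std::nth_element(scores.begin(), scores.begin() + 99, scores.end(),
                         std::greater<long>());
        scores.resize(100);
    }
    std::sort(scores.begin(), scores.end(), std::greater<long>());
}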