Quicksort weird time complexity, c++

I've been testing the time complexity of different sorting algorithms for different number sequences, and it was all going well until I got quicksort's results (with the pivot in the middle) for sequences that are one half ascending and the other half descending. The graph:
(By "V" I mean a sequence in which the first half is descending, and the other ascending, and by "A" I mean a sequence where the first half is ascending, and the other half is descending.)
Results for other kinds of sequences look as I would expect, but maybe there is something wrong with my algorithm?
void quicksort(int l,int p,int *tab)
{
int i=l,j=p,x=tab[(l+p)/2],w; //x - pivot
do
{
while (tab[i]<x)
{
i++;
}
while (x<tab[j])
{
j--;
}
if (i<=j)
{
w=tab[i];
tab[i]=tab[j];
tab[j]=w;
i++;
j--;
}
}
while (i<=j);
if (l<j)
{
quicksort(l,j,tab);
}
if (i<p)
{
quicksort(i,p,tab);
}
}
Does anybody have an idea what caused such weird results?

TL;DR: The problem is the pivot-choosing strategy, which makes repeatedly poor choices on these types of inputs (A- and V-shaped sequences). These result in quicksort making highly "imbalanced" recursive calls, which in turn result in the algorithm performing very poorly (quadratic time for A-shaped sequences).
Congratulations, you've (re)discovered an adversarial input (or rather a family of inputs) for the version of quicksort that chooses the middle element as the pivot.
For reference, an example of an A-shaped sequence is 1 2 3 4 3 2 1, i.e., a sequence that increases, reaches its peak in the middle, and then decreases; an example of a V-shaped sequence is 4 3 2 1 2 3 4, i.e., a sequence that decreases, reaches its minimum in the middle, and then increases.
Think about what happens when you pick the middle element as the pivot of an A- or a V-shaped sequence. In the first case, when you pass the algorithm the A-shaped sequence 1 2 ... n-1 n n-1 ... 2 1, the pivot is the largest element of the array, because the largest element of an A-shaped sequence is the middle one and you choose the middle element as the pivot. You will therefore make recursive calls on subarrays of sizes 0 (your code doesn't actually make the call on 0 elements) and n-1. In the next call, on the subarray of size n-1, you will pick as the pivot the largest element of the subarray (which is the second-largest element of the original array), and so on. The sizes of the arrays in the recursive calls are highly imbalanced: in each step you pass on almost the whole array (all elements except the pivot), so the running time is O(n)+O(n-1)+...+O(1) = O(n^2).
Here's the trace for the A-shaped sequence 1 2 3 4 5 4 3 2 1:
blazs@blazs:/tmp$ ./test
pivot=5
1 2 3 4 1 4 3 2 5
pivot=4
1 2 3 2 1 3 4 4
pivot=3
1 2 3 2 1 3
pivot=3
1 2 1 2 3
pivot=2
1 2 1 2
pivot=2
1 1 2
pivot=1
1 1
pivot=4
4 4
1 1 2 2 3 3 4 4 5
You can see from the trace that at each recursive call the algorithm chooses a largest element (there can be up to two largest elements, hence the article a, not the) as the pivot. This means that the running time for the A-shaped sequence really is O(n)+O(n-1)+...+O(1) = O(n^2). (In technical jargon, the A-shaped sequence is an example of an adversarial input that forces the algorithm to perform poorly.)
This means that if you plot running times for "perfectly" A-shaped sequences of the form
1 2 3 ... n-1 n n-1 ... 3 2 1
for increasing n, you will see a nice quadratic function. Here's a graph I just computed for n=5,105, 205, 305,...,9905 for A-shaped sequences 1 2 ... n-1 n n-1 ... 2 1:
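The quadratic growth can also be checked without timing anything, by counting work instead. The following is a minimal sketch, assuming we measure "work" as the sum of the subarray lengths over all recursive calls; `SortStats`, `quicksortCounted` and `statsForAShape` are hypothetical names introduced here, and the partitioning loop is the one from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Counters filled in by the instrumented sort (hypothetical helper type).
struct SortStats {
    long long work = 0;  // sum of subarray lengths over all calls
    int maxDepth = 0;    // deepest recursion level reached
};

// Same middle-pivot partitioning as in the question, plus bookkeeping.
void quicksortCounted(int l, int p, std::vector<int>& tab, SortStats& st, int depth = 1) {
    assert(0 <= l && l <= p && p < static_cast<int>(tab.size()));
    st.work += p - l + 1;
    st.maxDepth = std::max(st.maxDepth, depth);
    int i = l, j = p, x = tab[(l + p) / 2];  // x is the pivot
    do {
        while (tab[i] < x) i++;
        while (x < tab[j]) j--;
        if (i <= j) {
            std::swap(tab[i], tab[j]);
            i++;
            j--;
        }
    } while (i <= j);
    if (l < j) quicksortCounted(l, j, tab, st, depth + 1);
    if (i < p) quicksortCounted(i, p, tab, st, depth + 1);
}

// Builds 1 2 ... n-1 n n-1 ... 2 1, sorts it, and returns the counters.
SortStats statsForAShape(int n) {
    assert(n >= 1);
    std::vector<int> tab(2 * n - 1);
    for (int i = 0; i < n; i++) tab[i] = tab[2 * n - 2 - i] = i + 1;
    SortStats st;
    quicksortCounted(0, 2 * n - 2, tab, st);
    assert(std::is_sorted(tab.begin(), tab.end()));
    return st;
}
```

For the nine-element example traced above, `statsForAShape(5).work` is 39, which is the sum of the subarray lengths printed in the trace (9+8+6+5+4+3+2+2). Comparing `statsForAShape(n).work` with `statsForAShape(2*n).work` for larger n gives a quick feel for how fast the work grows.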
In the second case, when you pass the algorithm a V-shaped sequence, you choose the smallest element of the array as the pivot, and will thus make recursive calls on subarrays of sizes n-1 and 0 (your code doesn't actually make the call on 0 elements). In the next call on the subarray of size n-1 you will pick as the pivot the largest element; and so on. (But you won't always make such terrible choices; it's hard to say anything more about this case.) This results in poor performance for similar reasons. This case is slightly more complicated (it depends on how you do the "moving" step).
Here's a graph of running times for V-shaped sequences n n-1 ... 2 1 2 ... n-1 n for n=5,105,205,...,49905. The running times are somewhat less regular---as I said it is more complicated because you don't always pick the smallest element as the pivot. The graph:
Code that I used to measure time:
double seconds(size_t n) {
int *tab = (int *)malloc(sizeof(int) * (2*n - 1));
size_t i;
// construct A-shaped sequence 1 2 3 ... n-1 n n-1 ... 3 2 1
for (i = 0; i < n-1; i++) {
tab[i] = tab[2*n-i-2] = i+1;
// To generate the V-shaped sequence n n-1 ... 2 1 2 ... n-1 n, use tab[i]=tab[2*n-i-2]=n-i;
}
tab[n-1] = n;
// For V-shaped sequence use tab[n-1] = 1;
clock_t start = clock();
quicksort(0, 2*n-2, tab);
clock_t finish = clock();
free(tab);
return (double) (finish - start) / CLOCKS_PER_SEC;
}
I adapted your code to print the "trace" of the algorithm so that you can play with it yourself and gain insight into what's going on:
#include <stdio.h>
void print(int *a, size_t l, size_t r);
void quicksort(int l,int p,int *tab);
int main() {
int tab[] = {1,2,3,4,5,4,3,2,1};
size_t sz = sizeof(tab) / sizeof(int);
quicksort(0, sz-1, tab);
print(tab, 0, sz-1);
return 0;
}
void print(int *a, size_t l, size_t r) {
size_t i;
for (i = l; i <= r; ++i) {
printf("%4d", a[i]);
}
printf("\n");
}
void quicksort(int l,int p,int *tab)
{
int i=l,j=p,x=tab[(l+p)/2],w; //x - pivot
printf("pivot=%d\n", x);
do
{
while (tab[i]<x)
{
i++;
}
while (x<tab[j])
{
j--;
}
if (i<=j)
{
w=tab[i];
tab[i]=tab[j];
tab[j]=w;
i++;
j--;
}
}
while (i<=j);
print(tab, l, p);
if (l<j)
{
quicksort(l,j,tab);
}
if (i<p)
{
quicksort(i,p,tab);
}
}
By the way, I think the graph showing the running times would be smoother if you took the average of, say, 100 running times for each input sequence.
We see that the problem here is the pivot-choosing strategy. Let me note that you can alleviate the problems with adversarial inputs by randomizing the pivot-choosing step. The simplest approach is to pick the pivot uniformly at random (each element is equally likely to be chosen as the pivot); you can then show that the algorithm runs in O(n log n) time with high probability. (Note, however, that to show this sharp tail bound you need some assumptions on the input; the result certainly holds if the numbers are all distinct; see, for example, Motwani and Raghavan's Randomized Algorithms book.)
To corroborate my claims, here's the graph of running times for the same sequences if you choose the pivot at random, with x = tab[l + (rand() % (p-l))]; (this draws from positions l through p-1; use p-l+1 as the modulus if position p should be eligible too, and make sure you call srand(time(NULL)) in main).
For A-shaped sequences:
For V-shaped sequences:
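A randomized pivot can be sketched as follows. This is an illustration under assumptions, not the exact code used for the graphs: it uses `<random>` instead of `rand()`, draws the pivot position uniformly from l through p inclusive, and the names `quicksortRandom` and `sortRandomized` are made up for this example.

```cpp
#include <cassert>
#include <random>
#include <utility>
#include <vector>

// The question's partitioning loop, with the pivot taken from a random position.
void quicksortRandom(int l, int p, std::vector<int>& tab, std::mt19937& rng) {
    assert(0 <= l && l <= p && p < static_cast<int>(tab.size()));
    std::uniform_int_distribution<int> pick(l, p);  // both ends inclusive
    int i = l, j = p, x = tab[pick(rng)];
    do {
        while (tab[i] < x) i++;
        while (x < tab[j]) j--;
        if (i <= j) {
            std::swap(tab[i], tab[j]);
            i++;
            j--;
        }
    } while (i <= j);
    if (l < j) quicksortRandom(l, j, tab, rng);
    if (i < p) quicksortRandom(i, p, tab, rng);
}

// Convenience wrapper that seeds a generator once per sort.
void sortRandomized(std::vector<int>& tab) {
    if (tab.size() < 2) return;
    std::mt19937 rng(std::random_device{}());
    quicksortRandom(0, static_cast<int>(tab.size()) - 1, tab, rng);
}
```

Because the pivot is always a value taken from inside the current range, both inner while loops are guaranteed to stop within the range, just as with the middle-element pivot.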

In quicksort, one of the major things that affects the running time is how random the input is.
Generally, choosing the pivot at a fixed position is not a good idea unless you are certain that the input is randomly shuffled. Median-of-three partitioning is one widely used way to make a bad pivot less likely. Your code doesn't implement it.
Also, recursive quicksort has some overhead because it uses the call stack (each call has to set up a stack frame and pass parameters), so it's advisable to switch to another sorting algorithm such as insertion sort once the remaining data is around 10 to 20 elements; this can make it about 20% faster.
void quicksort(int l,int p,int *tab){
if ( p - l + 1 <= 10 ){ // small range: hand it over to insertion sort
InsertionSort(tab, l, p); // sorts tab[l..p]; not shown here
return;
}
..
..}
Something of this nature.
In general, the best-case running time for quicksort is O(n log n).
The worst-case running time is O(n^2), often caused by non-random inputs or many duplicate values.
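The two suggestions above (median-of-three and an insertion-sort cutoff) could be combined roughly like this. It is a sketch under assumptions: the cutoff of 16 is an arbitrary value from the suggested 10 to 20 range, and `kCutoff`, `insertionSort`, `medianOfThree` and `quicksortHybrid` are names invented for the example.

```cpp
#include <cassert>
#include <utility>
#include <vector>

constexpr int kCutoff = 16;  // assumed threshold, not a tuned value

// Sorts tab[l..p] in place by insertion.
void insertionSort(std::vector<int>& tab, int l, int p) {
    for (int i = l + 1; i <= p; i++) {
        int key = tab[i], j = i - 1;
        while (j >= l && tab[j] > key) {
            tab[j + 1] = tab[j];
            j--;
        }
        tab[j + 1] = key;
    }
}

// Returns the middle value of a, b, c.
int medianOfThree(int a, int b, int c) {
    if (a > b) std::swap(a, b);
    if (b > c) std::swap(b, c);  // c now holds the largest value
    if (a > b) std::swap(a, b);  // a <= b <= c
    return b;
}

void quicksortHybrid(int l, int p, std::vector<int>& tab) {
    assert(0 <= l && p < static_cast<int>(tab.size()));
    if (p - l + 1 <= kCutoff) {  // small range: insertion sort finishes it
        insertionSort(tab, l, p);
        return;
    }
    int i = l, j = p;
    int x = medianOfThree(tab[l], tab[(l + p) / 2], tab[p]);  // pivot value
    do {
        while (tab[i] < x) i++;
        while (x < tab[j]) j--;
        if (i <= j) {
            std::swap(tab[i], tab[j]);
            i++;
            j--;
        }
    } while (i <= j);
    if (l < j) quicksortHybrid(l, j, tab);
    if (i < p) quicksortHybrid(i, p, tab);
}
```

Note that median-of-three is a heuristic, not a guarantee. On the exact A-shaped input 1 2 ... n ... 2 1 the three samples are 1, n and 1, so the first pivot is still an extreme value.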

All answers here have very good points. My idea is that there is nothing wrong with the algorithm (the pivot problem is well known and is a reason for O(n^2)), but there may be something wrong with the way you measure it.
clock() returns the number of processor ticks elapsed from some point (probably program launch? Not important).
Your way of measuring time relies on a constant tick length, which I think isn't guaranteed.
The point is that many (all?) modern processors dynamically change their frequency to save energy. I think this is very non-deterministic, so every time you launch your program, the CPU frequency will depend not only on the size of your input but also on what is happening in your system right now. The way I understand it, the length of one tick can vary a lot during program execution.
I tried to look up what the macro CLOCKS_PER_SEC actually does. Is it the current clocks per second? Does it average over some mysterious time period? Sadly, I wasn't able to find out. Therefore, I think that your way of measuring time may be totally wrong.
Since my argument rests on something I don't know for sure, I might be totally wrong.
One way to find out is to run multiple tests with the same data multiple times under different overall system load and see whether it behaves significantly differently each time. Another way is to set your computer's CPU frequency to some static value and test it in a similar way.
IDEA Wouldn't it be better to measure "time" in ticks?
EDIT 1 Thanks to @BeyelerStudios, we now know for certain that you shouldn't rely on clock() on Windows machines, because it doesn't follow the C standard there. Source
I hope I helped. If I am wrong, please correct me; I am a student and not a HW specialist.
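One way to sidestep the questions about clock() is to time with a monotonic clock from `<chrono>` and average over several runs, as suggested earlier in this thread. The sketch below assumes the sort is passed in as a callable taking `std::vector<int>&`; `averageSeconds` is a hypothetical helper name.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Runs sortFn on a fresh copy of input `repetitions` times and returns the
// mean wall-clock time per run in seconds.
template <class Sort>
double averageSeconds(const std::vector<int>& input, int repetitions, Sort sortFn) {
    assert(repetitions > 0);
    std::chrono::duration<double> total{0.0};
    for (int r = 0; r < repetitions; r++) {
        std::vector<int> copy = input;  // every run sees the same unsorted data
        auto start = std::chrono::steady_clock::now();
        sortFn(copy);
        total += std::chrono::steady_clock::now() - start;
    }
    return total.count() / repetitions;
}
```

A call could look like `averageSeconds(data, 100, [](std::vector<int>& v) { quicksort(0, (int)v.size() - 1, v.data()); })`. This measures elapsed wall-clock time, so other load on the machine still adds noise; averaging only smooths it.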

Quicksort has a worst-case time complexity of O(n^2) and an average of O(n log n) for n entries in a data set. More details on an analysis of the time complexity can be found here:
https://www.khanacademy.org/computing/computer-science/algorithms/quick-sort/a/analysis-of-quicksort
and here:
http://www.cise.ufl.edu/class/cot3100fa07/quicksort_analysis.pdf

Related

Every sum possibilities of elements

From a given array (call it numbers[]), I want another array (results[]) which contains all possible sums of elements of the first array.
For example, if I have numbers[] = {1,3,5}, results[] will be {1,3,5,4,8,6,9,0}.
There are 2^n possibilities.
It doesn't matter if a number appears twice, because results[] will be a set.
I did it for sums of pairs or triplets, and it's very easy. But I don't understand how it works when we sum 0, 1, 2 or n numbers.
This is what I did for pairs :
std::unordered_set<int> pairPossibilities(std::vector<int> &numbers) {
std::unordered_set<int> results;
for(int i=0;i<numbers.size()-1;i++) {
for(int j=i+1;j<numbers.size();j++) {
results.insert(numbers.at(i)+numbers.at(j));
}
}
return results;
}
Also, assuming that the numbers[] is sorted, is there any possibility to sort results[] while we fill it ?
Thanks!

This can be done with Dynamic Programming (DP) in O(n*W) where W = sum{numbers}.
This is basically the same solution of Subset Sum Problem, exploiting the fact that the problem has optimal substructure.
DP[i, 0] = true
DP[-1, w] = false w != 0
DP[i, w] = DP[i-1, w] OR DP[i-1, w - numbers[i]]
Start by following the above solution to find DP[n, sum{numbers}].
As a result, you will get:
DP[n , w] = true if and only if w can be constructed from numbers
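A compact way to write this recurrence is with a single row that is updated in place. The sketch below assumes all numbers are non-negative (the table is indexed by the sum) and uses the invented name `reachableSums`; it returns every w for which the final DP row is true.

```cpp
#include <cassert>
#include <vector>

// Returns all subset sums of `numbers` in ascending order.
// Assumption: every element is >= 0.
std::vector<int> reachableSums(const std::vector<int>& numbers) {
    int total = 0;
    for (int x : numbers) {
        assert(x >= 0);
        total += x;
    }
    std::vector<char> reachable(total + 1, 0);
    reachable[0] = 1;  // DP[i, 0] = true: the empty subset sums to 0
    for (int x : numbers) {
        // Go downwards so that each number is used at most once.
        for (int w = total; w >= x; w--) {
            if (reachable[w - x]) reachable[w] = 1;
        }
    }
    std::vector<int> sums;
    for (int w = 0; w <= total; w++) {
        if (reachable[w]) sums.push_back(w);
    }
    return sums;
}
```

Because the table is scanned from 0 to the total, the result comes out sorted and without duplicates, which also covers the question about keeping results[] ordered.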

Following on from the Dynamic Programming answer, you could go with a recursive solution and then use memoization to cache the results: a top-down approach, in contrast to Amit's bottom-up one.
void generateSubsetSum(vector<int>& ans, int sum, vector<int>& nums, int i); // forward declaration
vector<int> subsetSum(vector<int>& nums)
{
vector<int> ans;
generateSubsetSum(ans,0,nums,0);
return ans;
}
void generateSubsetSum(vector<int>& ans, int sum, vector<int>& nums, int i)
{
if(i == nums.size() )
{
ans.push_back(sum);
return;
}
generateSubsetSum(ans,sum + nums[i],nums,i + 1);
generateSubsetSum(ans,sum,nums,i + 1);
}
Result is : {9 4 6 1 8 3 5 0} for the set {1,3,5}
This simply picks the number at index i, adds it to the sum and recurses. Once that call returns, the second branch follows with sum unchanged, i.e. without nums[i] added. To memoize this you would have a cache keyed on the pair (sum, i).

I would do something like this (seems easier) [I wanted to put this in comment but can't write the shifting and removing an elem at a time - you might need a linked list]
1 3 5
3 5
-----
4 8
1 3 5
5
-----
6
1 3 5
3 5
5
------
9
Add 0 to the list in the end.
Another way to solve this is to generate all subsets of the elements and then sum up the elements of each subset.
e.g.
1 3 5 gives the subsets {1,3}, {1,5}, {3,5} and {1,3,5}, after removing the single-element sets.
Keep in mind that it is always easier said than done. A single tiny mistake in the implemented algorithm can take a lot of debugging time to find. =]]

There has to be a binary chop version, as well. This one is a bit heavy-handed and relies on that set of answers you mention to filter repeated results:
Split the list into 2,
and generate the list of sums for each half
by recursion:
the minimum state is either
2 entries, with 1 result,
or 3 entries with 3 results
alternatively, take it down to 1 entry with 0 results, if you insist
Then combine the 2 halves:
All the returned entries from both halves are legitimate results
There are 4 additional result sets to add to the output result by combining:
The first half inputs vs the second half inputs
The first half outputs vs the second half inputs
The first half inputs vs the second half outputs
The first half outputs vs the second half outputs
Note that the outputs of the two halves may have some elements in common, but they should be treated separately for these combines.
The inputs can be scrubbed from the returned outputs of each recursion if the inputs are legitimate final results. If they are they can either be added back in at the top-level stage or returned by the bottom level stage and not considered again in the combining.
You could use a bitfield instead of a set to filter out the duplicates. There are reasonably efficient ways of stepping through a bitfield to find all the set bits. The max size of the bitfield is the sum of all the inputs.
There is no intelligence here, but lots of opportunity for parallel processing within the recursion and combine steps.
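The bitfield idea mentioned above can be sketched with `std::bitset`: bit w is set when some subset sums to w, and adding a number x is a shift by x followed by an OR. This is an illustration under assumptions: inputs are non-negative, their total does not exceed the made-up compile-time bound `kMaxSum`, and `subsetSumsBitset` is a name chosen for the example.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kMaxSum = 10000;  // assumed upper bound on the total

// Returns all subset sums in ascending order, using one bit per possible sum.
std::vector<int> subsetSumsBitset(const std::vector<int>& numbers) {
    std::bitset<kMaxSum + 1> reach;
    reach.set(0);  // the empty subset
    std::size_t total = 0;
    for (int x : numbers) {
        assert(x >= 0);
        total += static_cast<std::size_t>(x);
        assert(total <= kMaxSum);
        // Every old sum s also yields the new sum s + x.
        reach |= reach << static_cast<std::size_t>(x);
    }
    std::vector<int> sums;
    for (std::size_t w = 0; w <= total; w++) {
        if (reach[w]) sums.push_back(static_cast<int>(w));
    }
    return sums;
}
```

Duplicates are filtered for free, since a bit can only be set once, and the final scan yields the sums in sorted order.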

How to erase elements more efficiently from a vector or set?

Problem statement:
Input:
First two inputs are integers n and m. n is the number of knights fighting in the tournament (2 <= n <= 100000, 1 <= m <= n-1). m is the number of battles that will take place.
The next line contains n power levels.
The next m lines contain two integers l and r, indicating the range of knight positions to compete in the ith battle.
After each battle, all knights apart from the one with the highest power level will be eliminated.
The range for each battle is given in terms of the new positions of the knights, not the original positions.
Output:
Output m lines, the ith line containing the original positions (indices) of the knights from that battle. Each line is in ascending order.
Sample Input:
8 4
1 0 5 6 2 3 7 4
1 3
2 4
1 3
0 1
Sample Output:
1 2
4 5
3 7
0
Here is a visualisation of this process.
1 2
[(1,0),(0,1),(5,2),(6,3),(2,4),(3,5),(7,6),(4,7)]
-----------------
4 5
[(1,0),(6,3),(2,4),(3,5),(7,6),(4,7)]
-----------------
3 7
[(1,0),(6,3),(7,6),(4,7)]
-----------------
0
[(1,0),(7,6)]
-----------
[(7,6)]
I have solved this problem. My program produces the correct output; however, it is O(n*m) = O(n^2). I believe that efficiency can be increased if I erase knights from the vector more efficiently. Would it be more efficient to erase elements using a set, i.e. to erase contiguous segments rather than individual knights? Is there an alternative way to do this that is more efficient?
#define INPUT1(x) scanf("%d", &x)
#define INPUT2(x, y) scanf("%d%d", &x, &y)
#define OUTPUT1(x) printf("%d\n", x);
int main(int argc, char const *argv[]) {
int n, m;
INPUT2(n, m);
vector< pair<int,int> > knights(n);
for (int i = 0; i < n; i++) {
int power;
INPUT1(power);
knights[i] = make_pair(power, i);
}
while(m--) {
int l, r;
INPUT2(l, r);
int max_in_range = knights[l].first;
for (int i = l+1; i <= r; i++) if (knights[i].first > max_in_range) {
max_in_range = knights[i].first;
}
int offset = l;
int range = r-l+1;
while (range--) {
if (knights[offset].first != max_in_range) {
OUTPUT1(knights[offset].second);
knights.erase(knights.begin()+offset);
}
else offset++;
}
printf("\n");
}
}

Well, removing from a vector certainly isn't efficient. Removing from a set or an unordered set would be more efficient (use iterators instead of indexes).
Yet the solution will still be O(n^2), because you have two nested while loops running n*m times.
--EDIT--
I believe I understand the question now :)
First, let's calculate the complexity of your code above. Your worst case is when the range in every battle has length 1 (two knights per battle) and the battles are not ordered with respect to position. That means you have m battles (in this case m = n-1, which is O(n)).
The first while loop runs n times.
The for loop runs once each time, which makes n*1 = n in total.
The second while loop runs once each time, which makes n again.
Deleting from a vector means up to n-1 shifts, which makes it O(n).
Thus, with the cost of the vector erase, the total complexity is O(n^2).
First of all, you don't really need the inner for loop. Take the first knight as the max in range, compare the rest in the range one by one and remove the defeated ones.
Now, I believe it can be done in O(n log n) using std::map. The key of the map is the position and the value is the level of the knight.
Before proceeding: finding and removing an element in a map is logarithmic, and stepping an iterator to the next element is amortized constant.
Finally, your code should look like:
while(m--) // n times
strongest = map.find(first_position); // find is log(n) --> n*log(n)
for (opponent = next of strongest; // this will run 1 times, since every range is 1
opponent in range;
opponent = next opponent) // iterating is constant
// removing from map is log(n) --> n * 1 * log(n)
if strongest < opponent
remove strongest, opponent is the new strongest
else
remove opponent, (be careful to remove it after iterating to next)
OK, now the upper bound would be O(2*n log n) = O(n log n). If the ranges get larger, the outer loop runs fewer times but the number of remove operations increases. I'm sure the upper bound won't change; let's make it homework for you to calculate :)

A solution with a treap is pretty straightforward.
For each query, you need to split the treap by implicit key to obtain the subtree that corresponds to the [l, r] range (it takes O(log n) time).
After that, you can iterate over the subtree and find the knight with the maximum strength. After that, you just need to merge the [0, l) and [r + 1, end) parts of the treap with the node that corresponds to this knight.
It's clear that all parts of the solution except for the subtree traversal and printing work in O(log n) time per query. However, each operation reinserts only one knight and erases the rest of the range, so the size of the output (and the sum of the sizes of the traversed subtrees) is linear in n. So the total time complexity is O(n log n).
I don't think you can solve this with the standard STL containers, because there's no standard container that supports both getting an iterator by index quickly and removing arbitrary elements.
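The treap approach described above might look roughly like the following. It is a sketch, not a tuned implementation: nodes live in a vector and refer to each other by index, priorities come from a fixed-seed `std::mt19937`, ties in power are resolved in favour of the leftmost knight, and the names `Treap` and `runTournament` are invented for this example. It needs C++17 for structured bindings.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

struct Treap {
    struct Node { int power, id, left, right, size; std::uint32_t prio; };
    std::vector<Node> t;  // node pool; index -1 means "no node"
    std::mt19937 rng{12345};

    int make(int power, int id) {
        t.push_back({power, id, -1, -1, 1, static_cast<std::uint32_t>(rng())});
        return static_cast<int>(t.size()) - 1;
    }
    int size(int v) const { return v < 0 ? 0 : t[v].size; }
    void pull(int v) { t[v].size = 1 + size(t[v].left) + size(t[v].right); }

    // Splits v into (first k elements, remaining elements).
    std::pair<int, int> split(int v, int k) {
        if (v < 0) return {-1, -1};
        if (size(t[v].left) >= k) {
            auto [a, b] = split(t[v].left, k);
            t[v].left = b;
            pull(v);
            return {a, v};
        }
        auto [a, b] = split(t[v].right, k - size(t[v].left) - 1);
        t[v].right = a;
        pull(v);
        return {v, b};
    }
    // Concatenates a and b; every element of a stays before every element of b.
    int merge(int a, int b) {
        if (a < 0) return b;
        if (b < 0) return a;
        if (t[a].prio > t[b].prio) {
            t[a].right = merge(t[a].right, b);
            pull(a);
            return a;
        }
        t[b].left = merge(a, t[b].left);
        pull(b);
        return b;
    }
    // In-order traversal: collects node indices in position order.
    void collect(int v, std::vector<int>& out) const {
        if (v < 0) return;
        collect(t[v].left, out);
        out.push_back(v);
        collect(t[v].right, out);
    }
};

// For each battle [l, r] (current positions), returns the original indices
// of the eliminated knights in ascending order.
std::vector<std::vector<int>> runTournament(const std::vector<int>& power,
                                            const std::vector<std::pair<int, int>>& battles) {
    Treap tr;
    int root = -1;
    for (int i = 0; i < static_cast<int>(power.size()); i++)
        root = tr.merge(root, tr.make(power[i], i));
    std::vector<std::vector<int>> eliminated;
    for (auto [l, r] : battles) {
        auto [left, rest] = tr.split(root, l);
        auto [mid, right] = tr.split(rest, r - l + 1);
        std::vector<int> nodes;
        tr.collect(mid, nodes);
        assert(!nodes.empty());
        int best = nodes[0];
        for (int v : nodes)
            if (tr.t[v].power > tr.t[best].power) best = v;
        std::vector<int> out;
        for (int v : nodes)
            if (v != best) out.push_back(tr.t[v].id);
        tr.t[best].left = tr.t[best].right = -1;  // the winner becomes a single node
        tr.t[best].size = 1;
        root = tr.merge(tr.merge(left, best), right);
        eliminated.push_back(out);
    }
    return eliminated;
}
```

Eliminated nodes are simply left unused in the pool, which keeps the sketch short at the cost of not reclaiming memory during the run.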

Varying initializer in a 'for loop' in C++

int i = 0;
for(; i<size-1; i++) {
int temp = arr[i];
arr[i] = arr[i+1];
arr[i+1] = temp;
}
Here I started with the first position of the array. What if, after the loop, I need to execute the for loop again, this time starting at the next position of the array?
Like for first for loop starts from: Array[0]
Second iteration: Array[1]
Third iteration: Array[2]
Example:
For array: 1 2 3 4 5
for i=0: 2 1 3 4 5, 2 3 1 4 5, 2 3 4 1 5, 2 3 4 5 1
for i=1: 1 3 2 4 5, 1 3 4 2 5, 1 3 4 5 2 so on.

You can nest loops inside each other, including the ability for the inner loop to access the iterator value of the outer loop. Thus:
for(int start = 0; start < size-1; start++) {
for(int i = start; i < size-1; i++) {
// Inner code on 'i'
}
}
This repeats your loop with an increasing start value, so i begins at a higher initial value each time until you've gone through your list.

Suppose you have a routine to generate all possible permutations of the array elements for a given length n. Suppose the routine, after processing all n! permutations, leaves the n items of the array in their initial order.
Question: how can we build a routine to make all possible permutations of an array with (n+1) elements?
Answer:
Generate all permutations of the initial n elements, each time process the whole array; this way we have processed all n! permutations with the same last item.
Now, swap the (n+1)-st item with one of those n and repeat permuting n elements – we get another n! permutations with a new last item.
The n elements are left in their previous order, so put that last item back into its initial place and choose another one to put at the end of an array. Reiterate permuting n items.
And so on.
Remember, after each call the routine leaves the n-items array in its initial order. To retain this property at n+1 we need to make sure the same element gets finally placed at the end of an array after the (n+1)-st iteration of n! permutations.
This is how you can do that:
void ProcessAllPermutations(int arr[], int arrLen, int permLen)
{
if(permLen == 1)
ProcessThePermutation(arr, arrLen); // print the permutation
else
{
int lastpos = permLen - 1; // last item position for swaps
for(int pos = lastpos; pos >= 0; pos--) // pos of item to swap with the last
{
swap(arr[pos], arr[lastpos]); // put the chosen item at the end
ProcessAllPermutations(arr, arrLen, permLen - 1);
swap(arr[pos], arr[lastpos]); // put the chosen item back at pos
}
}
}
and here is an example of the routine running: https://ideone.com/sXp35O
Note, however, that this approach is highly inefficient:
It may work in a reasonable time for very small input sizes only. The number of permutations is a factorial function of the array length, which grows faster than exponentially and makes for a really BIG number of tests.
The routine has no short return. Even if the first or second permutation is the correct result, the routine will perform all the rest of n! unnecessary tests, too. Of course one can add a return path to break iteration, but that would make the code somewhat ugly. And it would bring no significant gain, because the routine will have to make n!/2 test on average.
Each generated permutation appears deep in the last level of the recursion. Testing for a correct result requires making a call to ProcessThePermutation from within ProcessAllPermutations, so it is difficult to replace the callee with some other function. The caller function must be modified each time you need another method of testing / processing / whatever. Or one would have to provide a pointer to a processing function (a 'callback') and push it down through all the recursion, down to the place where the call will happen. This might be done indirectly by a virtual function in some context object, so it would look quite nice, but the overhead of passing additional data down the recursive calls cannot be avoided.
The routine has yet another interesting property: it does not rely on the data values. Elements of the array are never compared. This may sometimes be an advantage: the routine can permute any kind of objects, even if they are not comparable. On the other hand it can not detect duplicates, so in case of equal items it will make repeated results. In a degenerate case of all n equal items the result will be n! equal sequences.
So if you ask how to generate all permutations to detect a sorted one, I must answer: DON'T.
Do learn efficient sorting algorithms instead.
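For completeness, the callback variant with an early exit mentioned above could be sketched like this. It assumes the callback returns true to mean "stop"; `forEachPermutation` is a hypothetical name, and the swapping scheme is the same as in ProcessAllPermutations.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Calls visit(arr) for each permutation of the first permLen items.
// Returns true as soon as visit returns true; the array is restored to its
// original order before returning, whether or not the search stopped early.
template <class Visit>
bool forEachPermutation(std::vector<int>& arr, std::size_t permLen, Visit&& visit) {
    assert(permLen <= arr.size());
    if (permLen <= 1) return visit(arr);
    const std::size_t last = permLen - 1;
    for (std::size_t pos = last + 1; pos-- > 0;) {
        std::swap(arr[pos], arr[last]);  // put the chosen item at the end
        bool stop = forEachPermutation(arr, permLen - 1, visit);
        std::swap(arr[pos], arr[last]);  // put the chosen item back at pos
        if (stop) return true;
    }
    return false;
}
```

The caller decides what "processing" means by passing a lambda, for example one that counts permutations or one that returns true when the permutation is sorted. The factorial cost is unchanged; only the coupling to one fixed processing function goes away.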

2 player team knowing maximum moves

Given a list of N players who are to play a 2-player game. Each of them is either well versed in making a particular move or not. Find the maximum number of moves a 2-player team can know.
Also find out how many teams can know that maximum number of moves.
Example: Let us have 4 players and 5 moves, where the ith player is versed in the jth move if a[i][j] is 1 and is not if it is 0.
10101
11100
11010
00101
Here the maximum number of moves a 2-player team can know is 5, and there are two teams that can know that maximum number of moves.
Explanation: (1, 3) and (3, 4) know all the 5 moves. So the maximal number of moves a 2-player team knows is 5, and only 2 teams can achieve this.
My approach: For each pair of players (i, j) I check, for every move k, whether either of the two players is versed in the kth move. For each player i, I maintain the best count he can reach with any partner j > i (his local maximum) and the number of pairs that reach it.
vector<int> pairmemo, maxmemo;
for(int i=0;i<n;i++){
int mymax=INT_MIN;
int countpairs=0;
for(int j=i+1;j<n;j++){
int count=0;
for(int k=0;k<m;k++){
if(arr[i][k]==1 || arr[j][k]==1)
{
count++;
}
}
if(mymax<count){
mymax=count;
countpairs=0;
}
if(mymax==count){
countpairs++;
}
}
pairmemo.push_back(countpairs);
maxmemo.push_back(mymax);
}
The overall maximum over all N players is the answer, and the count is the corresponding sum of the pair counts calculated above.
int maxi=INT_MIN;
for(int i=0;i<n;i++){
if(maxi<maxmemo[i])
maxi=maxmemo[i];
}
int countmaxi=0;
for(int i=0;i<n;i++){
if(maxmemo[i]==maxi){
countmaxi+=pairmemo[i];
}
}
cout<<maxi<<"\n";
cout<<countmaxi<<"\n";
Time complexity : O((N^2)*M)
How can I improve it?
Constraints : N<= 3000 and M<=1000

If you represent each set of moves by a very large integer, the problem boils down to finding pair of players (I, J) which have maximum number of bits set in MovesI OR MovesJ.
So, you can use bit-packing and compress all the information on moves into arrays of unsigned 64-bit integers. According to the constraints, it would take 16 such integers per player (1000 bits / 64, rounded up). So, for each pair of players you OR the corresponding arrays and count the number of ones. This takes O(N^2 * 16), which runs pretty fast given the constraints.
Example:
Lets say given matrix is
11010
00011
and you used 4-bit integer for packing it.
It would look like:
1101-0000
0001-1000
that is,
13,0
1,8
After the OR, the moves array for the 2-player team becomes 13,8; now count the bits that are one. You have to optimize the counting of bits as well (for that, read the accepted answer here), otherwise the factor M would appear in the complexity. Just maintain one count variable and one maxNumberOfBitsSet variable as you process the pairs.
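The bit-packing idea can be sketched with `std::bitset`, which packs the bits into machine words and provides `count()` for the population count. The following is an illustration under assumptions: each player is given as a string of '0' and '1' characters, the bound `kMaxMoves` mirrors the stated constraint M <= 1000, and `bestTeams` is a name made up for the example.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

constexpr std::size_t kMaxMoves = 1000;  // from the constraint M <= 1000

// Returns {maximum number of moves known by a pair, number of pairs reaching it}.
std::pair<std::size_t, std::size_t> bestTeams(const std::vector<std::string>& rows) {
    std::vector<std::bitset<kMaxMoves>> moves;
    moves.reserve(rows.size());
    for (const std::string& row : rows) {
        assert(row.size() <= kMaxMoves);
        moves.emplace_back(row);  // parses the '0'/'1' characters into bits
    }
    std::size_t best = 0, teams = 0;
    for (std::size_t i = 0; i < moves.size(); i++) {
        for (std::size_t j = i + 1; j < moves.size(); j++) {
            std::size_t known = (moves[i] | moves[j]).count();
            if (known > best) {
                best = known;
                teams = 1;
            } else if (known == best) {
                teams++;
            }
        }
    }
    return {best, teams};
}
```

For the four players in the question, `bestTeams({"10101", "11100", "11010", "00101"})` returns the pair (5, 2).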

What I'd do is:
1. Do a logical OR between all possible pairs, O(N^2), and store the number of set bits (the SUM) in a 2D array, ignoring the symmetric half below the diagonal (that way we save half of the calculation, see the example).
2. Find the max value in the 2D array (can be done while doing task 1, so O(1) extra).
3. Count how many cells in the 2D array equal the maximum value from task 2, O(N^2).
Sum: 2*O(N^2) + O(1) => O(N^2)
Example (using the data in the question, with letters as indexes):
A[10101] B[11100] C[11010] D[00101]
Task 1:
[A|B] = 11101 = SUM(4)
[A|C] = 11111 = SUM(5)
[A|D] = 10101 = SUM(3)
[B|C] = 11110 = SUM(4)
[B|D] = 11101 = SUM(4)
[C|D] = 11111 = SUM(5)
Task 2 (Done while is done 1):
Max = 5
Task 3:
Count = 2
By the way, O(N^2) is the minimum possible since you HAVE to check all the possible pairs.

Since you have to find all solutions, unless you find a way to find a count without actually finding the solutions themselves, you have to actually look at or eliminate all possible solutions. So the worst case will always be O(N^2*M), which I'll call O(n^3) as long as N and M are both big and similar size.
However, you can hope for much better performance on the average case by pruning.
Don't check every case. Find ways to eliminate combinations without checking them.
I would sum and store the total number of moves known to each player, and sort the array rows by that value. That should provide an easy check for exiting the loop early. Sorting at O(n log n) should be basically free in an O(n^3) algorithm.
Use Priyank's basic idea, except with bitsets, since you obviously can't use a fixed integer type with 3000 bits.
You may benefit from making a second array of bitsets for the columns, and use that as a mask for pruning players.

What are practical uses for STL's 'partial_sum'?

What/where are the practical uses of the partial_sum algorithm in STL?
What are some other interesting/non-trivial examples or use-cases?

I used it to reduce memory usage of a simple mark-sweep garbage collector in my toy lambda calculus interpreter.
The GC pool is an array of objects of identical size. The goal is to eliminate objects that aren't linked to other objects, and condense the remaining objects into the beginning of the array. Since the objects are moved in memory, each link needs to be updated. This necessitates an object remapping table.
partial_sum allows the table to be stored in compressed format (as little as one bit per object) until the sweep is complete and memory has been freed. Since the objects are small, this significantly reduces memory use.
Recursively mark used objects and populate the Boolean array.
Use remove_if to condense the marked objects to the beginning of the pool.
Use partial_sum over the Boolean values to generate a table of pointers/indexes into the new pool.
This works because the Nth marked object has N preceding 1's in the array and acquires pool index N.
Sweep over the pool again and replace each link using the remap table.
It's especially friendly to the data cache to put the remap table in the just-freed, thus still hot, memory.

One thing to note about partial sum is that it is the operation that undoes adjacent difference much like - undoes +. Or better yet if you remember calculus the way differentiation undoes integration. Better because adjacent difference is essentially differentiation and partial sum is integration.
Let's say you have simulation of a car and at each time step you need to know the position, velocity, and acceleration. You only need to store one of those values as you can compute the other two. Say you store the position at each time step you can take the adjacent difference of the position to give the velocity and the adjacent difference of the velocity to give the acceleration. Alternatively, if you store the acceleration you can take the partial sum to give the velocity and the partial sum of the velocity gives the position.
Partial sum is one of those functions that doesn't come up too often for most people but is enormously useful when you find the right situation. A lot like calculus.
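The inverse relationship is easy to see on a small integer example. This is a minimal sketch with hypothetical wrapper names `differences` and `runningTotals` around `std::adjacent_difference` and `std::partial_sum`; time steps are assumed to have unit length.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// out[0] = values[0], out[k] = values[k] - values[k-1]
std::vector<int> differences(const std::vector<int>& values) {
    std::vector<int> out(values.size());
    std::adjacent_difference(values.begin(), values.end(), out.begin());
    return out;
}

// out[k] = values[0] + ... + values[k]
std::vector<int> runningTotals(const std::vector<int>& values) {
    std::vector<int> out(values.size());
    std::partial_sum(values.begin(), values.end(), out.begin());
    return out;
}
```

With positions {0, 1, 4, 9, 16}, `differences` gives the velocities {0, 1, 3, 5, 7} and, applied again, the accelerations {0, 1, 2, 2, 2}; `runningTotals` applied to the velocities gives back the positions. Note that `std::adjacent_difference` copies the first element unchanged, which is what makes the round trip exact.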

Last time I (would have) used it is when converting a discrete probability distribution (an array of p(X = k)) into a cumulative distribution (an array of p(X <= k)). To select once from the distribution, you can pick a number from [0-1) randomly, then binary search into the cumulative distribution.
That code wasn't in C++, though, so I did the partial sum myself.
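In C++ the same idea could be sketched as below. To keep the example exact it uses integer weights rather than probabilities, which is an assumption made here; `buildWheel` and `pickSlot` are invented names, and the caller is expected to supply a random `ticket` in the range [0, total weight).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Cumulative weights: wheel[k] = weights[0] + ... + weights[k].
std::vector<int> buildWheel(const std::vector<int>& weights) {
    std::vector<int> wheel(weights.size());
    std::partial_sum(weights.begin(), weights.end(), wheel.begin());
    return wheel;
}

// Maps a ticket in [0, wheel.back()) to the index of the slot it falls into.
std::size_t pickSlot(const std::vector<int>& wheel, int ticket) {
    assert(!wheel.empty() && 0 <= ticket && ticket < wheel.back());
    // First cumulative value strictly greater than the ticket.
    auto it = std::upper_bound(wheel.begin(), wheel.end(), ticket);
    return static_cast<std::size_t>(it - wheel.begin());
}
```

With weights {2, 5, 3} the wheel is {2, 7, 10}: tickets 0 and 1 select slot 0, tickets 2 through 6 select slot 1, and tickets 7 through 9 select slot 2, so each slot is hit in proportion to its weight. The ticket itself could come from `std::uniform_int_distribution<int>(0, wheel.back() - 1)`.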

You can use it to generate a monotonically increasing sequence of numbers. For example, the following generates a vector containing the numbers 1 through 42:
std::vector<int> v(42, 1);
std::partial_sum(v.begin(), v.end(), v.begin());
Is this an everyday use case? Probably not, though I've found it useful on several occasions.
You can also use std::partial_sum to generate a list of factorials. (This is even less useful, though, since the number of factorials that can be represented by a typical integer data type is quite limited. It is fun, though :-D)
std::vector<int> v(10, 1);
std::partial_sum(v.begin(), v.end(), v.begin());
std::partial_sum(v.begin(), v.end(), v.begin(), std::multiplies<int>());

Personal Use Case: Roulette-Wheel-Selection
I'm using partial_sum in a roulette-wheel selection algorithm. This algorithm randomly chooses elements from a container, with a probability proportional to a value assigned to each element beforehand.
Because the values attached to my elements are not necessarily normalized, I use partial_sum to build something like a "roulette wheel" by accumulating all the values. Then I draw a random number in this range (the last partial sum is the total of all values) and use std::lower_bound to find where on "the wheel" my random draw landed. The element returned by lower_bound is the chosen one.
Besides the advantage of clear and expressive code with partial_sum, I could also gain some speed when experimenting with the GCC parallel mode, which provides parallelized versions of some algorithms, partial_sum among them.
Another use I know of: One of the most important algorithmic primitives in parallel processing (but maybe a little bit away from STL)
If you're interested in heavily optimized algorithms that use partial_sum (you will probably find more results under the synonyms "scan" or "prefix sum"), then look to the parallel algorithms community. They need it all the time. You won't find a parallel sorting algorithm based on quicksort or mergesort without it. This operation is one of the most important parallel primitives in use. I think it is most commonly used for calculating offsets in dynamic algorithms. Think of a partition step in quicksort that is split up and fed to parallel threads. You don't know the number of elements in each slot of the partition before calculating it, so you need offsets for all the threads for later access.
Maybe you will find more information in the now-hot topic of GPU processing. One short article on Nvidia's CUDA and the scan primitive, with a few application examples, is Chapter 39, Parallel Prefix Sum (Scan) with CUDA.
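To make the offset idea concrete, here is a minimal sequential sketch. The function name write_offsets and the "one count per worker" setup are assumptions for illustration. It computes an exclusive prefix sum with std::partial_sum by shifting the output one slot to the right:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: counts[t] is how many elements worker t will emit.
// Returns the index at which each worker may start writing into a shared
// output buffer, i.e. offsets[t] = counts[0] + ... + counts[t - 1].
std::vector<std::size_t> write_offsets(const std::vector<std::size_t>& counts) {
    std::vector<std::size_t> offsets(counts.size(), 0);
    if (counts.size() > 1) {
        // Sum all counts except the last, writing from offsets[1] onwards.
        std::partial_sum(counts.begin(), counts.end() - 1, offsets.begin() + 1);
    }
    return offsets;
}
```

Once the offsets are known, every worker can write its own slice independently. In C++17, std::exclusive_scan expresses the same operation directly.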

Personal Use Case: intermediate step in counting sort from CLRS:
COUNTING_SORT (A, B, k)
for i ← 1 to k do
    c[i] ← 0
for j ← 1 to n do
    c[A[j]] ← c[A[j]] + 1
// c[i] now contains the number of elements equal to i
// std::partial_sum here
for i ← 2 to k do
    c[i] ← c[i] + c[i-1]
// c[i] now contains the number of elements ≤ i
for j ← n downto 1 do
    B[c[A[j]]] ← A[j]
    c[A[j]] ← c[A[j]] - 1
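A possible C++ rendering of the pseudocode above, as a sketch only: it assumes zero-based arrays and keys in the range [0, k] (the pseudocode is one-based), and the function name counting_sort is chosen here for illustration.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Stable counting sort; every value in a is assumed to lie in [0, k].
std::vector<int> counting_sort(const std::vector<int>& a, int k) {
    std::vector<std::size_t> c(static_cast<std::size_t>(k) + 1, 0);
    for (int x : a) {
        ++c[static_cast<std::size_t>(x)];  // c[i] = number of elements equal to i
    }
    // The middle loop of the pseudocode: c[i] = number of elements <= i.
    std::partial_sum(c.begin(), c.end(), c.begin());
    std::vector<int> b(a.size());
    // Walk backwards so that equal keys keep their relative order.
    for (std::size_t j = a.size(); j-- > 0;) {
        const std::size_t key = static_cast<std::size_t>(a[j]);
        b[--c[key]] = a[j];  // pre-decrement accounts for zero-based output
    }
    return b;
}
```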

You could build a "moving sum" (precursor to a moving average):
#include <numeric>
#include <vector>
template <class T>
void moving_sum (const std::vector<T>& in, int num, std::vector<T>& out)
{
    // cumulative sum (out must already hold at least in.size() elements)
    std::partial_sum (in.begin(), in.end(), out.begin());
    // shift and subtract; walking backwards keeps out[j] an untouched cumulative sum
    for (int i = out.size() - 1; i >= 0; i--) {
        int j = i - num;
        if (j >= 0)
            out[i] -= out[j];
    }
}
And then call it with:
std::vector<double> v(10);
// fill in v
std::vector<double> v2 (v.size());
moving_sum (v, 3, v2);

You know, I actually did use partial_sum() once... It was this interesting little problem that I was asked on a job interview. I enjoyed it so much, I went home and coded it up.
The problem was: Given a sequence of integers, find the shortest contiguous subsequence with the highest sum. E.g. Given:
Value: -1  2  3 -1  4 -2 -4  5
Index:  0  1  2  3  4  5  6  7
We would find the subsequence [1,4], whose sum is 8.
Now the obvious solution is to run with 3 for loops, iterating over all possible starts & ends, and adding up the value of each possible subsequence in turn. Inefficient, but quick to code up and hard to make mistakes. (Especially when the third for loop is just accumulate(start,end,0).)
The correct solution involves a divide-and-conquer / bottom up approach. E.g. Divide the problem space in half, and for each half compute the largest subsequence contained within that section, the largest subsequence including the starting number, the largest subsequence including the ending number, and the entire section's subsequence. Armed with this data we can then combine the two halves together without any further evaluation of either one. Obviously the data for each half can be computed by further breaking each half into halves (quarters), each quarter into halves (eighths), and so on until we have trivial singleton cases. It's all quite efficient.
But all that aside, there's a third (somewhat less efficient) option that I wanted to explore. It's similar to the 3-for-loop case, only we add the adjacent numbers to avoid so much work. The idea is that there's no need to add a+b, a+b+c, and a+b+c+d when we can add t1=a+b, t2=t1+c, and t3=t2+d. It's a space/computation tradeoff thing. It works by transforming the sequence:
Index:  0  1  2  3  4
FROM:   1  2  3  4  5
TO:     1  3  6 10 15
Thereby giving us the sums of all possible subsequences starting at index 0 and ending at indexes 0, 1, 2, 3, 4.
Then we iterate over this set subtracting the successive possible "start" points...
FROM:   1  3  6 10 15
TO:     -  2  5  9 14
TO:     -  -  3  7 12
TO:     -  -  -  4  9
TO:     -  -  -  -  5
Thereby giving us the values (sums) of all possible subsequences.
We can find the maximum value of each iteration via max_element().
The first step is most easily accomplished via partial_sum().
The remaining steps via a for loop and transform(data+i,data+size,data+i,bind2nd(minus<TYPE>(),data[i-1])).
Clearly O(N^2). But still interesting and fun...
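The steps above might be sketched as follows. The names Best and best_subsequence are hypothetical, the input is assumed to be non-empty, and a lambda stands in for bind2nd (which was removed in C++17):

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

// Hypothetical result type: the sum and the inclusive index range [first, last].
struct Best {
    int sum;
    std::size_t first;
    std::size_t last;
};

// Assumes data is non-empty. Takes a copy because the algorithm overwrites it.
Best best_subsequence(std::vector<int> data) {
    // Step 1: data[j] becomes the sum of elements 0..j.
    std::partial_sum(data.begin(), data.end(), data.begin());
    Best best{data[0], 0, 0};
    for (std::size_t i = 0; i < data.size(); ++i) {
        auto start = data.begin() + static_cast<std::ptrdiff_t>(i);
        if (i > 0) {
            // data[i - 1] now holds just element i - 1; subtracting it
            // turns data[j] into the sum of elements i..j.
            const int drop = data[i - 1];
            std::transform(start, data.end(), start,
                           [drop](int x) { return x - drop; });
        }
        // max_element returns the first maximum, i.e. the shortest run from i.
        auto it = std::max_element(start, data.end());
        const std::size_t last =
            static_cast<std::size_t>(std::distance(data.begin(), it));
        const bool shorter = (last - i) < (best.last - best.first);
        if (*it > best.sum || (*it == best.sum && shorter)) {
            best = Best{*it, i, last};
        }
    }
    return best;
}
```

For the example sequence from the problem statement this yields the range [1,4] with sum 8. It is still O(N^2), as noted, but each candidate sum costs one subtraction instead of a fresh accumulate.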

Partial sums are often useful in parallel algorithms. Consider the code
for (int i=0; N>i; ++i) {
sum += x[i];
do_something(sum);
}
If you want to parallelise this code, you need to know the partial sums. I am using GNU's parallel version of partial_sum for something very similar.

I often use partial_sum not to sum, but to calculate the current value in a sequence from the previous one.
For example, when you integrate a function numerically, each new value is computed from the previous one: vt += dvdt, or vt = integrate_step(dvdt, t_prev, t_prev+dt);.
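As an illustration of passing a custom operation, here is a small sketch of explicit Euler integration. The function name integrate_velocity and its parameters are assumptions made for this example, and the lambda plays the role of the integrate_step mentioned above:

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper: v0 is the initial velocity, dvdt[k] is the rate of
// change during step k, and dt is the step size. Returns v0 followed by the
// velocity after each step.
std::vector<double> integrate_velocity(double v0,
                                       const std::vector<double>& dvdt,
                                       double dt) {
    std::vector<double> v;
    v.reserve(dvdt.size() + 1);
    v.push_back(v0);  // partial_sum copies the first element through unchanged
    v.insert(v.end(), dvdt.begin(), dvdt.end());
    // The first argument of the operation is the running value so far,
    // the second is the next input element.
    std::partial_sum(v.begin(), v.end(), v.begin(),
                     [dt](double previous, double rate) {
                         return previous + rate * dt;
                     });
    return v;
}
```

Seeding the range with the initial value works because std::partial_sum never applies the operation to the first element.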

In nonparametric Bayesian methods there is a Metropolis-Hastings step (per observation) that decides whether to sample a new or an existing cluster. If an existing cluster is to be sampled, the clusters have to be drawn with different weights. These weighted likelihoods are simulated in the following example code.
#include <random>
#include <iostream>
#include <algorithm>
#include <iterator>
#include <numeric>
#include <vector>
int main() {
std::default_random_engine generator(std::random_device{}());
std::uniform_real_distribution<double> distribution(0.0,1.0);
int K = 8;
std::vector<double> weighted_likelihood(K);
for (int i = 0; i < K; ++i) {
weighted_likelihood[i] = i*10;
}
std::cout << "Weighted likelihood: ";
for (auto i: weighted_likelihood) std::cout << i << ' ';
std::cout << std::endl;
std::vector<double> cumsum_likelihood(K);
std::partial_sum(weighted_likelihood.begin(), weighted_likelihood.end(), cumsum_likelihood.begin());
std::cout << "Cumulative sum of weighted likelihood: ";
for (auto i: cumsum_likelihood) std::cout << i << ' ';
std::cout << std::endl;
std::vector<int> frequency(K);
int N = 280000;
for (int i = 0; i < N; ++i) {
double pick = distribution(generator) * cumsum_likelihood.back();
auto lower = std::lower_bound(cumsum_likelihood.begin(), cumsum_likelihood.end(), pick);
int index = std::distance(cumsum_likelihood.begin(), lower);
frequency[index]++;
}
std::cout << "Frequencies: ";
for (auto i: frequency) std::cout << i << ' ';
std::cout << std::endl;
}
Note that this is not different from the answer by https://stackoverflow.com/users/13005/steve-jessop. It is added to give a bit more context for a particular situation (nonparametric Bayesian methods, e.g. the algorithms by Neal that use the Dirichlet process as a prior) and actual code that uses partial_sum in combination with lower_bound.
