How to find best matching for all columns of a 2D array?

How to find best matching for all columns of a 2D array? - c++

Let's say that I have a 2D array that looks like:
________________
|10|15|14|20|30|
|14|10|73|71|55|
|73|30|42|84|74|
|14|74|XX|15|10|
----------------
As I showed, the columns don't need to be same size.
Now I need to find the best matching for each column (the one that has most exactly the same items and lowest different). Of course, I could do that in n^2 but it's too slow for me. How can I do it?
I thought about a k-dimension tree and finding the closest neighbor for every one, but I don't know if it's good and it will work as I want (probably not).
Result for example:
First column is most likely third (only three different - 10, 14, 42)
Second column -> fifth (only two different - 15 and 55)
and so on and so on... :)

If you know that all the numbers in the table are 2-digit numbers (i.e. 10 =< x <100), for each column create an array of booleans where you will mark the existing numbers:
bool array[5][100];
std::fill( &array[0][0], &array[0][0] + sizeof(array) , false ); // init to false
for (int i = 0; i < 5; i++)
{
for (int j = 0; j <5; j++)
{
array[i][table[i][j]] = true;
}
}
Should be easy from there.

Related

Perfect sum problem with fixed subset size

I am looking for a least time-complex algorithm that would solve a variant of the perfect sum problem (initially: finding all variable size subset combinations from an array [*] of integers of size n that sum to a specific number x) where the subset combination size is of a fixed size k and return the possible combinations without direct and also indirect (when there's a combination containing the exact same elements from another in another order) duplicates.
I'm aware this problem is NP-hard, so I am not expecting a perfect general solution but something that could at least run in a reasonable time in my case, with n close to 1000 and k around 10
Things I have tried so far:
Finding a combination, then doing successive modifications on it and its modifications
Let's assume I have an array such as:
s = [1,2,3,3,4,5,6,9]
So I have n = 8, and I'd like x = 10 for k = 3
I found thanks to some obscure method (bruteforce?) a subset [3,3,4]
From this subset I'm finding other possible combinations by taking two elements out of it and replacing them with other elements that sum the same, i.e. (3, 3) can be replaced by (1, 5) since both got the same sum and the replacing numbers are not already in use. So I obtain another subset [1,5,4], then I repeat the process for all the obtained subsets... indefinitely?
The main issue as suggested here is that it's hard to determine when it's done and this method is rather chaotic. I imagined some variants of this method but they really are work in progress
Iterating through the set to list all k long combinations that sum to x
Pretty self explanatory. This is a naive method that do not work well in my case since I have a pretty large n and a k that is not small enough to avoid a catastrophically big number of combinations (the magnitude of the number of combinations is 10^27!)
I experimented several mechanism related to setting an area of research instead of stupidly iterating through all possibilities, but it's rather complicated and still work in progress
What would you suggest? (Snippets can be in any language, but I prefer C++)
[*] To clear the doubt about whether or not the base collection can contain duplicates, I used the term "array" instead of "set" to be more precise. The collection can contain duplicate integers in my case and quite much, with 70 different integers for 1000 elements (counts rounded), for example

With reasonable sum limit this problem might be solved using extension of dynamic programming approach for subset sum problem or coin change problem with predetermined number of coins. Note that we can count all variants in pseudopolynomial time O(x*n), but output size might grow exponentially, so generation of all variants might be a problem.
Make 3d array, list or vector with outer dimension x-1 for example: A[][][]. Every element A[p] of this list contains list of possible subsets with sum p.
We can walk through all elements (call current element item) of initial "set" (I noticed repeating elements in your example, so it is not true set).
Now scan A[] list from the last entry to the beginning. (This trick helps to avoid repeating usage of the same item).
If A[i - item] contains subsets with size < k, we can add all these subsets to A[i] appending item.
After full scan A[x] will contain subsets of size k and less, having sum x, and we can filter only those of size k
Example of output of my quick-made Delphi program for the next data:
Lst := [1,2,3,3,4,5,6,7];
k := 3;
sum := 10;
3 3 4
2 3 5 //distinct 3's
2 3 5
1 4 5
1 3 6
1 3 6 //distinct 3's
1 2 7
To exclude variants with distinct repeated elements (if needed), we can use non-first occurence only for subsets already containing the first occurence of item (so 3 3 4 will be valid while the second 2 3 5 won't be generated)
I literally translate my Delphi code into C++ (weird, I think :)
int main()
{
vector<vector<vector<int>>> A;
vector<int> Lst = { 1, 2, 3, 3, 4, 5, 6, 7 };
int k = 3;
int sum = 10;
A.push_back({ {0} }); //fictive array to make non-empty variant
for (int i = 0; i < sum; i++)
A.push_back({{}});
for (int item : Lst) {
for (int i = sum; i >= item; i--) {
for (int j = 0; j < A[i - item].size(); j++)
if (A[i - item][j].size() < k + 1 &&
A[i - item][j].size() > 0) {
vector<int> t = A[i - item][j];
t.push_back(item);
A[i].push_back(t); //add new variant including current item
}
}
}
//output needed variants
for (int i = 0; i < A[sum].size(); i++)
if (A[sum][i].size() == k + 1) {
for (int j = 1; j < A[sum][i].size(); j++) //excluding fictive 0
cout << A[sum][i][j] << " ";
cout << endl;
}
}

Here is a complete solution in Python. Translation to C++ is left to the reader.
Like the usual subset sum, generation of the doubly linked summary of the solutions is pseudo-polynomial. It is O(count_values * distinct_sums * depths_of_sums). However actually iterating through them can be exponential. But using generators the way I did avoids using a lot of memory to generate that list, even if it can take a long time to run.
from collections import namedtuple
# This is a doubly linked list.
# (value, tail) will be one group of solutions. (next_answer) is another.
SumPath = namedtuple('SumPath', 'value tail next_answer')
def fixed_sum_paths (array, target, count):
# First find counts of values to handle duplications.
value_repeats = {}
for value in array:
if value in value_repeats:
value_repeats[value] += 1
else:
value_repeats[value] = 1
# paths[depth][x] will be all subsets of size depth that sum to x.
paths = [{} for i in range(count+1)]
# First we add the empty set.
paths[0][0] = SumPath(value=None, tail=None, next_answer=None)
# Now we start adding values to it.
for value, repeats in value_repeats.items():
# Reversed depth avoids seeing paths we will find using this value.
for depth in reversed(range(len(paths))):
for result, path in paths[depth].items():
for i in range(1, repeats+1):
if count < i + depth:
# Do not fill in too deep.
break
result += value
if result in paths[depth+i]:
path = SumPath(
value=value,
tail=path,
next_answer=paths[depth+i][result]
)
else:
path = SumPath(
value=value,
tail=path,
next_answer=None
)
paths[depth+i][result] = path
# Subtle bug fix, a path for value, value
# should not lead to value, other_value because
# we already inserted that first.
path = SumPath(
value=value,
tail=path.tail,
next_answer=None
)
return paths[count][target]
def path_iter(paths):
if paths.value is None:
# We are the tail
yield []
else:
while paths is not None:
value = paths.value
for answer in path_iter(paths.tail):
answer.append(value)
yield answer
paths = paths.next_answer
def fixed_sums (array, target, count):
paths = fixed_sum_paths(array, target, count)
return path_iter(paths)
for path in fixed_sums([1,2,3,3,4,5,6,9], 10, 3):
print(path)
Incidentally for your example, here are the solutions:
[1, 3, 6]
[1, 4, 5]
[2, 3, 5]
[3, 3, 4]

You should first sort the so called array. Secondly, you should determine if the problem is actually solvable, to save time... So what you do is you take the last k elements and see if the sum of those is larger or equal to the x value, if it is smaller, you are done it is not possible to do something like that.... If it is actually equal yes you are also done there is no other permutations.... O(n) feels nice doesn't it?? If it is larger, than you got a lot of work to do..... You need to store all the permutations in an seperate array.... Then you go ahead and replace the smallest of the k numbers with the smallest element in the array.... If this is still larger than x then you do it for the second and third and so on until you get something smaller than x. Once you reach a point where you have the sum smaller than x, you can go ahead and start to increase the value of the last position you stopped at until you hit x.... Once you hit x that is your combination.... Then you can go ahead and get the previous element so if you had 1,1,5, 6 in your thingy, you can go ahead and grab the 1 as well, add it to your smallest element, 5 to get 6, next you check, can you write this number 6 as a combination of two values, you stop once you hit the value.... Then you can repeat for the others as well.... You problem can be solved in O(n!) time in the worst case.... I would not suggest that you 10^27 combinations, meaning you have more than 10^27 elements, mhmmm bad idea do you even have that much space??? That's like 3bits for the header and 8 bits for each integer you would need 9.8765*10^25 terabytes just to store that clossal array, more memory than a supercomputer, you should worry about whether your computer can even store this monster rather than if you can solve the problem, that many combinations even if you find a quadratic solution it would crash your computer, and you know what quadratic is a long way off from O(n!)...

A brute force method using recursion might look like this...
For example, given variables set, x, k, the following pseudo code might work:
setSumStructure find(int[] set, int x, int k, int setIdx)
{
int sz = set.length - setIdx;
if (sz < x) return null;
if (sz == x) check sum of set[setIdx] -> set[set.size] == k. if it does, return the set together with the sum, else return null;
for (int i = setIdx; i < set.size - (k - 1); i++)
filter(find (set, x - set[i], k - 1, i + 1));
return filteredSets;
}

Dynamic Programming w/ 1D array USACO Training: Subset Sums

While working through the USACO Training problems, I found out about Dynamic Programming. The first training problem that deals with this concept is a problem called Subset Sums.
The Problem Statement Follows:
For many sets of consecutive integers from 1 through N (1 <= N <= 39), one can partition the set into two sets whose sums are identical.
For example, if N=3, one can partition the set {1, 2, 3} in one way so that the sums of both subsets are identical:
{3} and {1,2}
This counts as a single partitioning (i.e., reversing the order counts as the same partitioning and thus does not increase the count of partitions).
If N=7, there are four ways to partition the set {1, 2, 3, ... 7} so that each partition has the same sum:
{1,6,7} and {2,3,4,5}
{2,5,7} and {1,3,4,6}
{3,4,7} and {1,2,5,6}
{1,2,4,7} and {3,5,6}
Given N, your program should print the number of ways a set containing the integers from 1 through N can be partitioned into two sets whose sums are identical. Print 0 if there are no such ways.
Your program must calculate the answer, not look it up from a table.
INPUT FORMAT
The input file contains a single line with a single integer representing N, as above.
SAMPLE INPUT (file subset.in)
7
OUTPUT FORMAT
The output file contains a single line with a single integer that tells how many same-sum partitions can be made from the set {1, 2, ..., N}. The output file should contain 0 if there are no ways to make a same-sum partition.
SAMPLE OUTPUT (file subset.out)
4
After much reading, I found an algorithm that was explained to be a variation of the 0/1 knapsack problem. I implemented it in my code, and I solved the problem. However, I have no idea how my code works or what is going on.
*Main Question: I was wondering if someone could explain to me how the knapsack algorithm works, and how my program could possibly be implementing this in my code?
My code:
#include <iostream>
#include <fstream>
using namespace std;
int main()
{
ifstream fin("subset.in");
ofstream fout("subset.out");
long long num=0, ways[800]={0};
ways[0]=1;
cin >> num;
if(((num*(num+1))/2)%2 == 1)
{
fout << "0" << endl;
return 0;
}
//THIS IS THE BLOCK OF CODE THAT IS SUPPOSED TO BE DERIVED FROM THE
// O/1 KNAPSACK PROBLEM
for (int i = 1; i <= num; i++)
{
for (int j = (num*(num+1))/2 - i; j >= 0; --j)
{
ways[j + i] += ways[j];
}
}
fout << ways[(num*(num+1))/2/2]/2 << endl;
return 0;
}
*note: Just to emphasize, this code does work, I just would like an explanation why it works. Thanks :)

I wonder why numerous sources could not help you.
Trying one more time with my ugly English:
ways[0]=1;
there is a single way to make empty sum
num*(num+1))/2
this is MaxSum - sum of all numbers in range 1..num (sum of arithmetic progression)
if(((num*(num+1))/2)%2 == 1)
there is no chance to divide odd value into two equal parts
for (int i = 1; i <= num; i++)
for every number in range
for (int j = (num*(num+1))/2 - i; j >= 0; --j)
ways[j + i] += ways[j];
sum j + i might be built using sum j and item with value i.
For example, consider that you want make sum 15.
At the first step of outer cycle you are using number 1, and there is ways[14] variants to make this sum.
At the second step of outer cycle you are using number 2, and there is ways[13] new variants to make this sum, you have to add these new variants.
At the third step of outer cycle you are using number 3, and there is ways[12] new variants to make this sum, you have to add these new variants.
ways[(num*(num+1))/2/2]/2
output number of ways to make MaxSum/2, and divide by two to exclude symmetric variants ([1,4]+[2,3]/[2,3]+[1,4])
Question for self-thinking: why inner cycle goes in reverse direction?

How do I exactly solve this GCD?

https://www.codechef.com/problems/MAXGCD
Chef has a set consisting of N integers. Chef calls a subset of this set to be good if the subset has two or more elements. He denotes all the good subsets as S1, S2, S3, ... , S2N-N-1. Now he represents the GCD of the elements of each good subset Si as Gi.
Chef wants to find the maximum Gi.
Input
The first line of the input contains an integer T denoting the number of test cases. The description of T test cases follows."
The first line of each test case contains a single integer N denoting the number of elements in the set. The second line contains N space-separated integers A1, A2, ..., AN denoting the elements of the set.
Output
For each test case, output the maximum Gi
My solution:
I generate all possible subsets of the given set.
I calculate the GCD of each set using Euclid's algorithm
I tried to find the maximum of all of them.
This is my code for making all possible subsets:
vector< vector<int> > getAllSubsets(vector<int> set)
{
vector< vector<int> > subset;
vector<int> empty;
subset.push_back( empty );
for (int i = 0; i < set.size(); i++)
{
vector< vector<int> > subsetTemp = subset;
for (int j = 0; j < subsetTemp.size(); j++)
subsetTemp[j].push_back( set[i] );
for (int j = 0; j < subsetTemp.size(); j++)
subset.push_back( subsetTemp[j] );
}
return subset;
}
However, I get TLE while going with this approach. Where am I going wrong in this?

One optimization is that you never need to consider subsets larger than 2 elements. This is because if you add another element, the GCD can only decrease.
This leads to an O(n^2) algorithm. The problem statement says that n can be as large as 100 000, so we need to do even better.
The problem also says that the given values are at most 500 000, so the GCD cannot exceed this.
Let count[i] = how many times the value i appears in the array.
Then we can apply something similar to the Sieve of Eratosthenes: for a fixed value v, see if you can find two multiples of v (sum of count[multiple_of_v] > 1). If you can, then you can have a GCD of v. Keep track of the max you can find.
Pseudocode:
V = max(given array)
cnt[i] = how many times value i occurs in given array
for v = V down to 1:
num_multiples_v = 0
for j = v up to V:
num_multiples_v += cnt[j]
if num_multiples_v > 1: # TODO: break the inner loop when this is true
print v as solution
return
Complexity will be O(V log log V), which should be very fast.

You don't need all subsets.
Some basic properties of gcd:
gcd(a,b) == gcd(b,a)
gcd(a,b) <= a
gcd(a,b) <= b
gcd(a,b,c) == gcd(a,gcd(b,c)) == gcd(gcd(a,b),c)
and with this, it's easy to show that
gcd(a,b) >= gcd(a,b,c) >= gcd(a,b,c,d)...
for any natural numbers a,b,c,d.
You want to find the (one of the) subsets with the max. gcd. According to the rules above, one of this subsets has exactly two elements (given that the whole set has at least two elements). So the first optimization is to throw the subset generation away and make something like
max = 0
for all set elements "a"
{
for all set elements "b"
{
if(gcd(a,b) > max)
max = gcd(a,b)
}
}
If that is still not enough, sort the set form the largest to the smallest element first, and for each gcd calculated in the loops, delete every set element smaller than the calculated value.

Trying to multiply the kiddy way

I'm supposed to multiply two 3-digit numbers the way we used to do in childhood.
I need to multiply each digit of a number with each of the other number's digit, calculate the carry, add the individual products and store the result.
I was able to store the 3 products obtained (for I/P 234 and 456):
1404
1170
0936
..in a 2D array.
Now when I try to arrange them in the following manner:
001404
011700
093600
to ease addition to get the result; by:
for(j=5;j>1;j--)
{
xx[0][j]=xx[0][j-2];
}
for(j=4;j>0;j--)
{
xx[1][j]=xx[1][j-1];
}
xx is the 2D array I've stored the 3 products in.
everything seems to be going fine till I do this:
xx[0][0]=0;
xx[0][1]=0;
xx[1][0]=0;
Here's when things go awry. The values get all mashed up. On printing, I get 001400 041700 093604.
What am I doing wrong?

Assuming the first index of xx is the partial sum, that the second index is the digit in that sum, and that the partial sums are stored with the highest digit at the lowest index,
for (int i = 0; i < NUM_DIGITS; i++) // NUM_DIGITS = number of digits in multiplicands
{
for (int j = 5; j >= 0; j--) // Assuming 5 is big enough
{
int index = (j - 1) - (NUM_DIGITS - 1) - i;
xx[i][j] = index >= 0 ? xx[i][index] : 0;
}
}
There are definitely more efficient/logical ways of doing this, of course, such as avoiding storing the digits individually, but within the constraints of the problem, this should give you the right answer.

n-th or Arbitrary Combination of a Large Set

Say I have a set of numbers from [0, ....., 499]. Combinations are currently being generated sequentially using the C++ std::next_permutation. For reference, the size of each tuple I am pulling out is 3, so I am returning sequential results such as [0,1,2], [0,1,3], [0,1,4], ... [497,498,499].
Now, I want to parallelize the code that this is sitting in, so a sequential generation of these combinations will no longer work. Are there any existing algorithms for computing the ith combination of 3 from 500 numbers?
I want to make sure that each thread, regardless of the iterations of the loop it gets, can compute a standalone combination based on the i it is iterating with. So if I want the combination for i=38 in thread 1, I can compute [1,2,5] while simultaneously computing i=0 in thread 2 as [0,1,2].
EDIT Below statement is irrelevant, I mixed myself up
I've looked at algorithms that utilize factorials to narrow down each individual element from left to right, but I can't use these as 500! sure won't fit into memory. Any suggestions?

Here is my shot:
int k = 527; //The kth combination is calculated
int N=500; //Number of Elements you have
int a=0,b=1,c=2; //a,b,c are the numbers you get out
while(k >= (N-a-1)*(N-a-2)/2){
k -= (N-a-1)*(N-a-2)/2;
a++;
}
b= a+1;
while(k >= N-1-b){
k -= N-1-b;
b++;
}
c = b+1+k;
cout << "["<<a<<","<<b<<","<<c<<"]"<<endl; //The result
Got this thinking about how many combinations there are until the next number is increased. However it only works for three elements. I can't guarantee that it is correct. Would be cool if you compare it to your results and give some feedback.

If you are looking for a way to obtain the lexicographic index or rank of a unique combination instead of a permutation, then your problem falls under the binomial coefficient. The binomial coefficient handles problems of choosing unique combinations in groups of K with a total of N items.
I have written a class in C# to handle common functions for working with the binomial coefficient. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
Converts the K-indexes to the proper lexicographic index or rank of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the set.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it is also faster than older iterative solutions.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
The following tested code will iterate through each unique combinations:
public void Test10Choose5()
{
String S;
int Loop;
int N = 500; // Total number of elements in the set.
int K = 3; // Total number of elements in each group.
// Create the bin coeff object required to get all
// the combos for this N choose K combination.
BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
// The Kindexes array specifies the indexes for a lexigraphic element.
int[] KIndexes = new int[K];
StringBuilder SB = new StringBuilder();
// Loop thru all the combinations for this N choose K case.
for (int Combo = 0; Combo < NumCombos; Combo++)
{
// Get the k-indexes for this combination.
BC.GetKIndexes(Combo, KIndexes);
// Verify that the Kindexes returned can be used to retrive the
// rank or lexigraphic order of the KIndexes in the table.
int Val = BC.GetIndex(true, KIndexes);
if (Val != Combo)
{
S = "Val of " + Val.ToString() + " != Combo Value of " + Combo.ToString();
Console.WriteLine(S);
}
SB.Remove(0, SB.Length);
for (Loop = 0; Loop < K; Loop++)
{
SB.Append(KIndexes[Loop].ToString());
if (Loop < K - 1)
SB.Append(" ");
}
S = "KIndexes = " + SB.ToString();
Console.WriteLine(S);
}
}
You should be able to port this class over fairly easily to C++. You probably will not have to port over the generic part of the class to accomplish your goals. Your test case of 500 choose 3 yields 20,708,500 unique combinations, which will fit in a 4 byte int. If 500 choose 3 is simply an example case and you need to choose combinations greater than 3, then you will have to use longs or perhaps fixed point int.

You can describe a particular selection of 3 out of 500 objects as a triple (i, j, k), where i is a number from 0 to 499 (the index of the first number), j ranges from 0 to 498 (the index of the second, skipping over whichever number was first), and k ranges from 0 to 497 (index of the last, skipping both previously-selected numbers). Given that, it's actually pretty easy to enumerate all the possible selections: starting with (0,0,0), increment k until it gets to its maximum value, then increment j and reset k to 0 and so on, until j gets to its maximum value, and so on, until j gets to its own maximum value; then increment i and reset both j and k and continue.
If this description sounds familiar, it's because it's exactly the same way that incrementing a base-10 number works, except that the base is much funkier, and in fact the base varies from digit to digit. You can use this insight to implement a very compact version of the idea: for any integer n from 0 to 500*499*498, you can get:
struct {
int i, j, k;
} triple;
triple AsTriple(int n) {
triple result;
result.k = n % 498;
n = n / 498;
result.j = n % 499;
n = n / 499;
result.i = n % 500; // unnecessary, any legal n will already be between 0 and 499
return result;
}
void PrintSelections(triple t) {
int i, j, k;
i = t.i;
j = t.j + (i <= j ? 1 : 0);
k = t.k + (i <= k ? 1 : 0) + (j <= k ? 1 : 0);
std::cout << "[" << i << "," << j << "," << k << "]" << std::endl;
}
void PrintRange(int start, int end) {
for (int i = start; i < end; ++i) {
PrintSelections(AsTriple(i));
}
}
Now to shard, you can just take the numbers from 0 to 500*499*498, divide them into subranges in any way you'd like, and have each shard compute the permutation for each value in its subrange.
This trick is very handy for any problem in which you need to enumerate subsets.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js