how to find distinct substrings? - c++

Given a string, and a fixed length l, how can I count the number of distinct substrings whose length is l?
The size of character set is also known. (denote it as s)
For example, given a string "PccjcjcZ", s = 4, l = 3,
then there are 5 distinct substrings:
“Pcc”; “ccj”; “cjc”; “jcj”; “jcZ”
I try to use hash table, but the speed is still slow.
In fact I don't know how to use the character size.
I have done things like this
int diffPatterns(const string& src, int len, int setSize) {
int cnt = 0;
node* table[1 << 15];
int tableSize = 1 << 15;
for (int i = 0; i < tableSize; ++i) {
table[i] = NULL;
}
unsigned int hashValue = 0;
int end = (int)src.size() - len;
for (int i = 0; i <= end; ++i) {
hashValue = hashF(src, i, len);
if (table[hashValue] == NULL) {
table[hashValue] = new node(i);
cnt ++;
} else {
if (!compList(src, i, table[hashValue], len)) {
cnt ++;
};
}
}
for (int i = 0; i < tableSize; ++i) {
deleteList(table[i]);
}
return cnt;
}

Hastables are fine and practical, but keep in mind that if the length of substrings is L, and the whole string length is N, then the algorithm is Theta((N+1-L)*L) which is Theta(NL) for most L. Remember, just computing the hash takes Theta(L) time. Plus there might be collisions.
Suffix trees can be used, and provide a guaranteed O(N) time algorithm (count number of paths at depth L or greater), but the implementation is complicated. Saving grace is you can probably find off the shelf implementations in the language of your choice.

The idea of using a hashtable is good. It should work well.
The idea of implementing your own hashtable as an array of length 2^15 is bad. See Hashtable in C++? instead.

You can use an unorder_set and insert the strings into the set and then get the size of the set. Since the values in a set are unique it will take care of not including substrings that are the same as ones previously found. This should give you close to O(StringSize - SubstringSize) complexity
#include <iostream>
#include <string>
#include <unordered_set>
int main()
{
std::string test = "PccjcjcZ";
std::unordered_set<std::string> counter;
size_t substringSize = 3;
for (size_t i = 0; i < test.size() - substringSize + 1; ++i)
{
counter.insert(test.substr(i, substringSize));
}
std::cout << counter.size();
std::cin.get();
return 0;
}

Veronica Kham answered good to the question, but we can improve this method to expected O(n) and still use a simple hash table rather than suffix tree or any other advanced data structure.
Hash function
Let X and Y are two adjacent substrings of length L, more precisely:
X = A[i, i + L - 1]
Y = B[i + 1, i + 1 + L - 1]
Let assign to each letter of our alphabet a single non negative integer, for example a := 1, b := 2 and so on.
Let's define a hash function h now:
h(A[i, j]) := (P^(L-1) * A[i] + P^(L-2) * A[i + 1] + ... + A[j]) % M
where P is a prime number ideally greater than the alphabet size and M is a very big number denoting the number of different possible hashes, for example you can set M to maximum available unsigned long long int in your system.
Algorithm
The crucial observation is the following:
If you have a hash computed for X, you can compute a hash for Y in
O(1) time.
Let assume that we have computed h(X), which can be done in O(L) time obviously. We want to compute h(Y). Notice that since X and Y differ by only 2 characters, and we can do that easily using addition and multiplication:
h(Y) = ((h(X) - P^L * A[i]) * P) + A[j + 1]) % M
Basically, we are subtracting letter A[i] multiplied by its coefficient in h(X), multiplying the result by P in order to get proper coefficients for the rest of letters and at the end, we are adding the last letter A[j + 1].
Notice that we can precompute powers of P at the beginning and we can do it modulo M.
Since our hashing functions returns integers, we can use any hash table to store them. Remember to make all computations modulo M and avoid integer overflow.
Collisions
Of course, there might occur a collision, but since P is prime and M is really huge, it is a rare situation.
If you want to lower the probability of a collision, you can use two different hashing functions, for example by using different modulo in each of them. If probability of a collision is p using one such function, then for two functions it is p^2 and we can make it arbitrary small by this trick.

Use Rolling hashes.
This will make the runtime expected O(n).
This might be repeating pkacprzak's answer, except, it gives a name for easier remembrance etc.

Suffix Automaton also can finish it in O(N).
It's easy to code, but hard to understand.
Here are papers about it http://dl.acm.org/citation.cfm?doid=375360.375365
http://www.sciencedirect.com/science/article/pii/S0304397509002370

Related

Given an integer K and a matrix of size t x t. construct a string s consisting of first t lowercase english letters such that the total cost of s is K

I'm solving this problem and stuck halfway through, looking for help and a better method to tackle such a problem:
problem:
Given an integer K and a matrix of size t x t. we have to construct a string s consisting of the first t lowercase English letters such that the total cost of s is exactly K. it is guaranteed that there exists at least one string that satisfies given conditions. Among all possible string s which is lexicographically smallest.
Specifically the cost of having the ith character followed by jth character of the English alphabet is equal to cost[i][j].
For example, the cost of having 'a' followed by 'a' is denoted by cost[0][0] and the cost of having 'b' followed by 'c' is denoted by cost[1][3].
The total cost of a string is the total cost of two consecutive characters in s. for matrix cost is
[1 2]
[3 4],
and the string is "abba", then we have
the cost of having 'a' followed by 'b' is is cost[0][1]=2.
the cost of having 'b' followed by 'b' is is `cost0=4.
the cost of having 'b' followed by 'a' is cost0=3.
In total, the cost of the string "abba" is 2+4+3=9.
Example:
consider, for example, K is 3,t is 2, the cost matrix is
[2 1]
[3 4]
There are two strings that its total cost is 3. Those strings are:
"aab"
"ba"
our answer will be "aab" as it is lexicographically smallest.
my approach
I tried to find and store all those combinations of i, j such that it sums up to desired value k or is individual equals k.
for above example
v={
{2,1},
{3,4}
}
k = 3
and v[0][0] + v[0][1] = 3 & v[1][0] = 3 . I tried to store the pairs in an array like this std::vector<std::vector<std::pair<int, int>>>. and based on it i will create all possible strings and will store in the set and it will give me the strings in lexicographical order.
i stucked by writing this much code:
#include<iostream>
#include<vector>
int main(){
using namespace std;
vector<vector<int>>v={{2,1},{3,4}};
vector<pair<int,int>>k;
int size=v.size();
for(size_t i=0;i<size;i++){
for(size_t j=0;j<size;j++){
if(v[i][j]==3){
k.push_back(make_pair(i,j));
}
}
}
}
please help me how such a problem can be tackled, Thank you. My code can only find the individual [i,j] pairs that can be equal to desired K. I don't have idea to collect multiple [i,j] pairs which sum's to desired value and it also appears my approach is totally naive and based on brute force. Looking for better perception to solve the problems and implement it in the code. Thank you.
This is a backtracking problem. General approach is :
a) Start with the "smallest" letter for e.g. 'a' and then recurse on all the available letters. If you find a string that sums to K then you have the answer because that will be the lexicographically smallest as we are finding it from smallest to largest letter.
b) If not found in 'a' move to the next letter.
Recurse/backtrack can be done as:
Start with a letter and the original value of K
explore for every j = 0 to t and reducing K by cost[i][j]
if K == 0 you found your string.
if K < 0 then that path is not possible, so remove the last letter in the string, try other paths.
Pseudocode :
string find_smallest() {
for (int i = 0; i < t; i++) {
s = (char)(i+97)
bool value = recurse(i,t,K,s)
if ( value ) return s;
s = ""
}
return ""
}
bool recurse(int i, int t, int K, string s) {
if ( K < 0 ) {
return false;
}
if ( K == 0 ) {
return true;
}
for ( int j = 0; j < t; j++ ) {
s += (char)(j+97);
bool v = recurse(j, t, K-cost[i][j], s);
if ( v ) return true;
s -= (char)(j+97);
}
return false;
}
In your implementation, you would probably need another vector of vectors of pairs to explore all your candidates. Also another vector for updating the current cost of each candidate as it builds up. Following this approach, things start to get a bit messy (IMO).
A more clean and understandable option (IMO again) could be to approach the problem with recursivity:
#include <iostream>
#include <vector>
#define K 3
using namespace std;
string exploreCandidate(int currentCost, string currentString, vector<vector<int>> &v)
{
if (currentCost == K)
return currentString;
int size = v.size();
int lastChar = (int)currentString.back() - 97; // get ASCII code
for (size_t j = 0; j < size; j++)
{
int nextTotalCost = currentCost + v[lastChar][j];
if (nextTotalCost > K)
continue;
string nextString = currentString + (char)(97 + j); // get ASCII char
string exploredString = exploreCandidate(nextTotalCost, nextString, v);
if (exploredString != "00") // It is a valid path
return exploredString;
}
return "00";
}
int main()
{
vector<vector<int>> v = {{2, 1}, {3, 4}};
int size = v.size();
string initialString = "00"; // reserve first two positions
for (size_t i = 0; i < size; i++)
{
for (size_t j = 0; j < size; j++)
{
initialString[0] = (char)(97 + i);
initialString[1] = (char)(97 + j);
string exploredString = exploreCandidate(v[i][j], initialString, v);
if (exploredString != "00") { // It is a valid path
cout << exploredString << endl;
return 0;
}
}
}
}
Let us begin from the main function:
We define our matrix and iterate over it. For each position, we define the corresponding sequence. Notice that we can use indices to get the respective character of the English alphabet, knowing that in ASCII code a=97, b=98...
Having this initial sequence, we can explore candidates recursively, which lead us to the exploreCandidate recursive function.
First, we want to make sure that the current cost is not the value we are looking for. If it is, we leave immediately without even evaluating the following iterations for candidates. We want to do this because we are looking for the lexicographically smallest element, and we are not asked to provide information about all the candidates.
If the cost condition is not satisfied (cost < K), we need to continue exploring our candidate, but not for the whole matrix but only for the row corresponding to the last character. Then we can encounter two scenarios:
The cost condition is met (cost = K): if at some point of recursivity the cost is equal to our value K, then the string is a valid one, and since it will be the first one we encounter, we want to return it and finish the execution.
The cost is not valid (cost > K): If the current cost is greater than K, then we need to abort this branch and see if other branches are luckier. Returning a boolean would be nice, but since we want to output a string (or maybe not, depending on the statement), an option could be to return a string and use "00" as our "false" value, allowing us to know whether the cost condition has been met. Other options could be returning a boolean and using an output parameter (passed by reference) to contain the output string.
EDIT:
The provided code assumes positive non-zero costs. If some costs were to be zero you could encounter infinite recursivity, so you would need to add more constraints in your recursive function.

Please tell me the efficient algorithm of Range Mex Query

I have a question about this problem.
Question
You are given a sequence a[0], a 1],..., a[N-1], and set of range (l[i], r[i]) (0 <= i <= Q - 1).
Calculate mex(a[l[i]], a[l[i] + 1],..., a[r[i] - 1]) for all (l[i], r[i]).
The function mex is minimum excluded value.
Wikipedia Page of mex function
You can assume that N <= 100000, Q <= 100000, and a[i] <= 100000.
O(N * (r[i] - l[i]) log(r[i] - l[i]) ) algorithm is obvious, but it is not efficient.
My Current Approach
#include <bits/stdc++.h>
using namespace std;
int N, Q, a[100009], l, r;
int main() {
cin >> N >> Q;
for(int i = 0; i < N; i++) cin >> a[i];
for(int i = 0; i < Q; i++) {
cin >> l >> r;
set<int> s;
for(int j = l; j < r; j++) s.insert(a[i]);
int ret = 0;
while(s.count(ret)) ret++;
cout << ret << endl;
}
return 0;
}
Please tell me how to solve.
EDIT: O(N^2) is slow. Please tell me more fast algorithm.
Here's an O((Q + N) log N) solution:
Let's iterate over all positions in the array from left to right and store the last occurrences for each value in a segment tree (the segment tree should store the minimum in each node).
After adding the i-th number, we can answer all queries with the right border equal to i.
The answer is the smallest value x such that last[x] < l. We can find by going down the segment tree starting from the root (if the minimum in the left child is smaller than l, we go there. Otherwise, we go to the right child).
That's it.
Here is some pseudocode:
tree = new SegmentTree() // A minimum segment tree with -1 in each position
for i = 0 .. n - 1
tree.put(a[i], i)
for all queries with r = i
ans for this query = tree.findFirstSmaller(l)
The find smaller function goes like this:
int findFirstSmaller(node, value)
if node.isLeaf()
return node.position()
if node.leftChild.minimum < value
return findFirstSmaller(node.leftChild, value)
return findFirstSmaller(node.rightChild)
This solution is rather easy to code (all you need is a point update and the findFisrtSmaller function shown above and I'm sure that it's fast enough for the given constraints.
Let's process both our queries and our elements in a left-to-right manner, something like
for (int i = 0; i < N; ++i) {
// 1. Add a[i] to all internal data structures
// 2. Calculate answers for all queries q such that r[q] == i
}
Here we have O(N) iterations of this loop and we want to do both update of the data structure and query the answer for suffix of currently processed part in o(N) time.
Let's use the array contains[i][j] which has 1 if suffix starting at the position i contains number j and 0 otherwise. Consider also that we have calculated prefix sums for each contains[i] separately. In this case we could answer each particular suffix query in O(log N) time using binary search: we should just find the first zero in the corresponding contains[l[i]] array which is exactly the first position where the partial sum is equal to index, and not to index + 1. Unfortunately, such arrays would take O(N^2) space and need O(N^2) time for each update.
So, we have to optimize. Let's build a 2-dimensional range tree with "sum query" and "assignment" range operations. In such tree we can query sum on any sub-rectangle and assign the same value to all the elements of any sub-rectangle in O(log^2 N) time, which allows us to do the update in O(log^2 N) time and queries in O(log^3 N) time, giving the time complexity O(Nlog^2 N + Qlog^3 N). The space complexity O((N + Q)log^2 N) (and the same time for initialization of the arrays) is achieved using lazy initialization.
UP: Let's revise how the query works in range trees with "sum". For 1-dimensional tree (to not make this answer too long), it's something like this:
class Tree
{
int l, r; // begin and end of the interval represented by this vertex
int sum; // already calculated sum
int overriden; // value of override or special constant
Tree *left, *right; // pointers to children
}
// returns sum of the part of this subtree that lies between from and to
int Tree::get(int from, int to)
{
if (from > r || to < l) // no intersection
{
return 0;
}
if (l <= from && to <= r) // whole subtree lies within the interval
{
return sum;
}
if (overriden != NO_OVERRIDE) // should push override to children
{
left->overriden = right->overriden = overriden;
left->sum = right->sum = (r - l) / 2 * overriden;
overriden = NO_OVERRIDE;
}
return left->get(from, to) + right->get(from, to); // split to 2 queries
}
Given that in our particular case all queries to the tree are prefix sum queries, from is always equal to 0, so, one of the calls to children always return a trivial answer (0 or already computed sum). So, instead of doing O(log N) queries to the 2-dimensional tree in the binary search algorithm, we could implement an ad-hoc procedure for search, very similar to this get query. It should first get the value of the left child (which takes O(1) since it's already calculated), then check if the node we're looking for is to the left (this sum is less than number of leafs in the left subtree) and go to the left or to the right based on this information. This approach will further optimize the query to O(log^2 N) time (since it's one tree operation now), giving the resulting complexity of O((N + Q)log^2 N)) both time and space.
Not sure this solution is fast enough for both Q and N up to 10^5, but it may probably be further optimized.

Implementation of string pattern matching using Suffix Array and LCP(-LR)

During the last weeks I tried to figure out how to efficiently find a string pattern within another string.
I found out that for a long time, the most efficient way would have been using a suffix tree. However, since this data structure is very expensive in space, I studied the use of suffix arrays further (which use far less space). Different papers such as "Suffix Arrays: A new method for on-line string searches" (Manber & Myers, 1993) state, that searching for a substring can be realised in O(P+log(N)) (where P is the length of the pattern and N is length of the string) by using binary search and suffix arrays along with LCP arrays.
I especially studied the latter paper to understand the search algorithm. This answer did a great job in helping me understand the algorithm (and incidentally made it into the LCP Wikipedia Page).
But I am still looking for an way to implement this algorithm. Especially the construction of the mentioned LCP-LR arrays seems very complicated.
References:
Manber & Myers, 1993: Manber, Udi ; Myers, Gene, SIAM Journal on Computing, 1993, Vol.22(5), pp.935-948, http://epubs.siam.org/doi/pdf/10.1137/0222058
UPDATE 1
Just to emphasize on what I am interested in: I understood LCP arrays and I found ways to implement them. However, the "plain" LCP array would not be appropriate for efficient pattern matching (as described in the reference). Thus I am interested in implementing LCP-LR arrays which seems much more complicated than just implementing an LCP array
UPDATE 2
Added link to referenced paper
The termin that can help you: enchanced suffix array, which is used to describe suffix array with various other arrays in order to replace suffix tree (lcp, child).
These can be some of the examples:
https://code.google.com/p/esaxx/ ESAXX
http://bibiserv.techfak.uni-bielefeld.de/mkesa/ MKESA
The esaxx one seems to be doing what you want, plus, it has example enumSubstring.cpp how to use it.
If you take a look at the referenced paper, it mentions an useful property (4.2). Since SO does not support math, there is no point to copy it here.
I've done quick implementation, it uses segment tree:
// note that arrSize is O(n)
// int arrSize = 2 * 2 ^ (log(N) + 1) + 1; // start from 1
// LCP = new int[N];
// fill the LCP...
// LCP_LR = new int[arrSize];
// memset(LCP_LR, maxValueOfInteger, arrSize);
//
// init: buildLCP_LR(1, 1, N);
// LCP_LR[1] == [1..N]
// LCP_LR[2] == [1..N/2]
// LCP_LR[3] == [N/2+1 .. N]
// rangeI = LCP_LR[i]
// rangeILeft = LCP_LR[2 * i]
// rangeIRight = LCP_LR[2 * i + 1]
// ..etc
void buildLCP_LR(int index, int low, int high)
{
if(low == high)
{
LCP_LR[index] = LCP[low];
return;
}
int mid = (low + high) / 2;
buildLCP_LR(2*index, low, mid);
buildLCP_LR(2*index+1, mid + 1, high);
LCP_LR[index] = min(LCP_LR[2*index], LCP_LR[2*index + 1]);
}
Here is a fairly simple implementation in C++, though the build() procedure builds the suffix array in O(N lg^2 N) time. The lcp_compute() procedure has linear complexity. I have used this code in many programming contests, and it has never let me down :)
#include <stdio.h>
#include <string.h>
#include <algorithm>
using namespace std;
const int MAX = 200005;
char str[MAX];
int N, h, sa[MAX], pos[MAX], tmp[MAX], lcp[MAX];
bool compare(int i, int j) {
if(pos[i] != pos[j]) return pos[i] < pos[j]; // compare by the first h chars
i += h, j += h; // if prefvious comparing failed, use 2*h chars
return (i < N && j < N) ? pos[i] < pos[j] : i > j; // return results
}
void build() {
N = strlen(str);
for(int i=0; i<N; ++i) sa[i] = i, pos[i] = str[i]; // initialize variables
for(h=1;;h<<=1) {
sort(sa, sa+N, compare); // sort suffixes
for(int i=0; i<N-1; ++i) tmp[i+1] = tmp[i] + compare(sa[i], sa[i+1]); // bucket suffixes
for(int i=0; i<N; ++i) pos[sa[i]] = tmp[i]; // update pos (reverse mapping of suffix array)
if(tmp[N-1] == N-1) break; // check if done
}
}
void lcp_compute() {
for(int i=0, k=0; i<N; ++i)
if(pos[i] != N-1) {
for(int j=sa[pos[i]+1]; str[i+k] == str[j+k];) k++;
lcp[pos[i]] = k;
if(k) k--;
}
}
int main() {
scanf("%s", str);
build();
for(int i=0; i<N; ++i) printf("%d\n", sa[i]);
return 0;
}
Note: If you want the complexity of the build() procedure to become O(N lg N), you can replace the STL sort with radix sort, but this is going to complicate the code.
Edit: Sorry, I misunderstood your question. Although i haven't implemented string matching with suffix array, I think I can describe you a simple non-standard, but fairly efficient algorithm for string matching. You are given two strings, the text, and the pattern. Given these string you create a new one, lets call it concat, which is the concatenation of the two given strings (first the text, then the pattern). You run the suffix array construction algorithm on concat, and you build the normal lcp array. Then, you search for a suffix of length pattern.size() in the suffix array you just built. Lets call its position in the suffix array pos. You then need two pointers lo and hi. At start lo = hi = pos. You decrease lo while lcp(lo, pos) = pattern.size() and you increase hi while lcp(hi, pos) = pattern.size(). Then you search for a suffix of length at least 2*pattern.size() in the range [lo, hi]. If you find it, you found a match. Otherwise, no match exists.
Edit[2]: I will be back with an implementation as soon as I have one...
Edit[3]:
Here it is:
// It works assuming you have builded the concatenated string and
// computed the suffix and the lcp arrays
// text.length() ---> tlen
// pattern.length() ---> plen
// concatenated string: str
bool match(int tlen, int plen) {
int total = tlen + plen;
int pos = -1;
for(int i=0; i<total; ++i)
if(total-sa[i] == plen)
{ pos = i; break; }
if(pos == -1) return false;
int lo, hi;
lo = hi = pos;
while(lo-1 >= 0 && lcp[lo-1] >= plen) lo--;
while(hi+1 < N && lcp[hi] >= plen) hi++;
for(int i=lo; i<=hi; ++i)
if(total-sa[i] >= 2*plen)
return true;
return false;
}
Here is a nice post including some code to help you better understand LCP array and comparison implementation.
I understand your desire is the code, rather than implementing your own.
Although written in Java this is an implementation of Suffix Array with LCP by Sedgewick and Wayne from their Algorithms booksite. It should save you some time and should not be tremendously hard to port to C/C++.
LCP array construction in pseudo for those who might want more information about the algorithm.
I think #Erti-Chris Eelmaa 's algorithm is wrong.
L ... 'M ... M ... M' ... R
|-----|-----|
Left sub range and right sub range should all contains M. Therefore we cannot do normal segment tree partition for LCP-LR array.
Code should look like
def lcp_from_i_j(i, j): # means [i, j] not [i, j)
if (j-i<1) return lcp_2_elem(i, j)
return lcp_merge(lcp_from_i_j(i, (i+j)/2), lcp_from_i_j((i+j)/2, j)
The left and the right sub ranges overlap. The segment tree supports range-min query. However, range min between [a,b] is not equal to lcp between [a,b]. LCP array is continuous, simple range-min would not work!

n-th or Arbitrary Combination of a Large Set

Say I have a set of numbers from [0, ....., 499]. Combinations are currently being generated sequentially using the C++ std::next_permutation. For reference, the size of each tuple I am pulling out is 3, so I am returning sequential results such as [0,1,2], [0,1,3], [0,1,4], ... [497,498,499].
Now, I want to parallelize the code that this is sitting in, so a sequential generation of these combinations will no longer work. Are there any existing algorithms for computing the ith combination of 3 from 500 numbers?
I want to make sure that each thread, regardless of the iterations of the loop it gets, can compute a standalone combination based on the i it is iterating with. So if I want the combination for i=38 in thread 1, I can compute [1,2,5] while simultaneously computing i=0 in thread 2 as [0,1,2].
EDIT Below statement is irrelevant, I mixed myself up
I've looked at algorithms that utilize factorials to narrow down each individual element from left to right, but I can't use these as 500! sure won't fit into memory. Any suggestions?
Here is my shot:
int k = 527; //The kth combination is calculated
int N=500; //Number of Elements you have
int a=0,b=1,c=2; //a,b,c are the numbers you get out
while(k >= (N-a-1)*(N-a-2)/2){
k -= (N-a-1)*(N-a-2)/2;
a++;
}
b= a+1;
while(k >= N-1-b){
k -= N-1-b;
b++;
}
c = b+1+k;
cout << "["<<a<<","<<b<<","<<c<<"]"<<endl; //The result
Got this thinking about how many combinations there are until the next number is increased. However it only works for three elements. I can't guarantee that it is correct. Would be cool if you compare it to your results and give some feedback.
If you are looking for a way to obtain the lexicographic index or rank of a unique combination instead of a permutation, then your problem falls under the binomial coefficient. The binomial coefficient handles problems of choosing unique combinations in groups of K with a total of N items.
I have written a class in C# to handle common functions for working with the binomial coefficient. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
Converts the K-indexes to the proper lexicographic index or rank of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the set.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it is also faster than older iterative solutions.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coeffieicent.
The following tested code will iterate through each unique combinations:
public void Test10Choose5()
{
String S;
int Loop;
int N = 500; // Total number of elements in the set.
int K = 3; // Total number of elements in each group.
// Create the bin coeff object required to get all
// the combos for this N choose K combination.
BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
// The Kindexes array specifies the indexes for a lexigraphic element.
int[] KIndexes = new int[K];
StringBuilder SB = new StringBuilder();
// Loop thru all the combinations for this N choose K case.
for (int Combo = 0; Combo < NumCombos; Combo++)
{
// Get the k-indexes for this combination.
BC.GetKIndexes(Combo, KIndexes);
// Verify that the Kindexes returned can be used to retrive the
// rank or lexigraphic order of the KIndexes in the table.
int Val = BC.GetIndex(true, KIndexes);
if (Val != Combo)
{
S = "Val of " + Val.ToString() + " != Combo Value of " + Combo.ToString();
Console.WriteLine(S);
}
SB.Remove(0, SB.Length);
for (Loop = 0; Loop < K; Loop++)
{
SB.Append(KIndexes[Loop].ToString());
if (Loop < K - 1)
SB.Append(" ");
}
S = "KIndexes = " + SB.ToString();
Console.WriteLine(S);
}
}
You should be able to port this class over fairly easily to C++. You probably will not have to port over the generic part of the class to accomplish your goals. Your test case of 500 choose 3 yields 20,708,500 unique combinations, which will fit in a 4 byte int. If 500 choose 3 is simply an example case and you need to choose combinations greater than 3, then you will have to use longs or perhaps fixed point int.
You can describe a particular selection of 3 out of 500 objects as a triple (i, j, k), where i is a number from 0 to 499 (the index of the first number), j ranges from 0 to 498 (the index of the second, skipping over whichever number was first), and k ranges from 0 to 497 (index of the last, skipping both previously-selected numbers). Given that, it's actually pretty easy to enumerate all the possible selections: starting with (0,0,0), increment k until it gets to its maximum value, then increment j and reset k to 0 and so on, until j gets to its maximum value, and so on, until j gets to its own maximum value; then increment i and reset both j and k and continue.
If this description sounds familiar, it's because it's exactly the same way that incrementing a base-10 number works, except that the base is much funkier, and in fact the base varies from digit to digit. You can use this insight to implement a very compact version of the idea: for any integer n from 0 to 500*499*498, you can get:
struct {
int i, j, k;
} triple;
triple AsTriple(int n) {
triple result;
result.k = n % 498;
n = n / 498;
result.j = n % 499;
n = n / 499;
result.i = n % 500; // unnecessary, any legal n will already be between 0 and 499
return result;
}
void PrintSelections(triple t) {
int i, j, k;
i = t.i;
j = t.j + (i <= j ? 1 : 0);
k = t.k + (i <= k ? 1 : 0) + (j <= k ? 1 : 0);
std::cout << "[" << i << "," << j << "," << k << "]" << std::endl;
}
void PrintRange(int start, int end) {
for (int i = start; i < end; ++i) {
PrintSelections(AsTriple(i));
}
}
Now to shard, you can just take the numbers from 0 to 500*499*498, divide them into subranges in any way you'd like, and have each shard compute the permutation for each value in its subrange.
This trick is very handy for any problem in which you need to enumerate subsets.

How to find if 3 numbers in a set of size N exactly sum up to M

I want to know how I can implement a better solution than O(N^3). Its similar to the knapsack and subset problems. In my question N<=8000, so i started computing sums of pairs of numbers and stored them in an array. Then I would binary search in the sorted set for each (M-sum[i]) value but the problem arises how will I keep track of the indices which summed up to sum[i]. I know I could declare extra space but my Sums array already has a size of 64 million, and hence I couldn't complete my O(N^2) solution. Please advice if I can do some optimization or if I need some totally different technique.
You could benefit from some generic tricks to improve the performance of your algorithm.
1) Don't store what you use only once
It is a common error to store more than you really need. Whenever your memory requirement seem to blow up the first question to ask yourself is Do I really need to store that stuff ? Here it turns out that you do not (as Steve explained in comments), compute the sum of two numbers (in a triangular fashion to avoid repeating yourself) and then check for the presence of the third one.
We drop the O(N**2) memory complexity! Now expected memory is O(N).
2) Know your data structures, and in particular: the hash table
Perfect hash tables are rarely (if ever) implemented, but it is (in theory) possible to craft hash tables with O(1) insertion, check and deletion characteristics, and in practice you do approach those complexities (tough it generally comes at the cost of a high constant factor that will make you prefer so-called suboptimal approaches).
Therefore, unless you need ordering (for some reason), membership is better tested through a hash table in general.
We drop the 'log N' term in the speed complexity.
With those two recommendations you easily get what you were asking for:
Build a simple hash table: the number is the key, the index the satellite data associated
Iterate in triangle fashion over your data set: for i in [0..N-1]; for j in [i+1..N-1]
At each iteration, check if K = M - set[i] - set[j] is in the hash table, if it is, extract k = table[K] and if k != i and k != j store the triple (i,j,k) in your result.
If a single result is sufficient, you can stop iterating as soon as you get the first result, otherwise you just store all the triples.
There is a simple O(n^2) solution to this that uses only O(1)* memory if you only want to find the 3 numbers (O(n) memory if you want the indices of the numbers and the set is not already sorted).
First, sort the set.
Then for each element in the set, see if there are two (other) numbers that sum to it. This is a common interview question and can be done in O(n) on a sorted set.
The idea is that you start a pointer at the beginning and one at the end, if your current sum is not the target, if it is greater than the target, decrement the end pointer, else increment the start pointer.
So for each of the n numbers we do an O(n) search and we get an O(n^2) algorithm.
*Note that this requires a sort that uses O(1) memory. Hell, since the sort need only be O(n^2) you could use bubble sort. Heapsort is O(n log n) and uses O(1) memory.
Create a "bitset" of all the numbers which makes it constant time to check if a number is there. That is a start.
The solution will then be at most O(N^2) to make all combinations of 2 numbers.
The only tricky bit here is when the solution contains a repeat, but it doesn't really matter, you can discard repeats unless it is the same number 3 times because you will hit the "repeat" case when you pair up the 2 identical numbers and see if the unique one is present.
The 3 times one is simply a matter of checking if M is divisible by 3 and whether M/3 appears 3 times as you create the bitset.
This solution does require creating extra storage, up to MAX/8 where MAX is the highest number in your set. You could use a hash table though if this number exceeds a certain point: still O(1) lookup.
This appears to work for me...
#include <iostream>
#include <set>
#include <algorithm>
using namespace std;
int main(void)
{
set<long long> keys;
// By default this set is sorted
set<short> N;
N.insert(4);
N.insert(8);
N.insert(19);
N.insert(5);
N.insert(12);
N.insert(35);
N.insert(6);
N.insert(1);
typedef set<short>::iterator iterator;
const short M = 18;
for(iterator i(N.begin()); i != N.end() && *i < M; ++i)
{
short d1 = M - *i; // subtract the value at this location
// if there is more to "consume"
if (d1 > 0)
{
// ignore below i as we will have already scanned it...
for(iterator j(i); j != N.end() && *j < M; ++j)
{
short d2 = d1 - *j; // again "consume" as much as we can
// now the remainder must eixst in our set N
if (N.find(d2) != N.end())
{
// means that the three numbers we've found, *i (from first loop), *j (from second loop) and d2 exist in our set of N
// now to generate the unique combination, we need to generate some form of key for our keys set
// here we take advantage of the fact that all the numbers fit into a short, we can construct such a key with a long long (8 bytes)
// the 8 byte key is made up of 2 bytes for i, 2 bytes for j and 2 bytes for d2
// and is formed in sorted order
long long key = *i; // first index is easy
// second index slightly trickier, if it's less than j, then this short must be "after" i
if (*i < *j)
key = (key << 16) | *j;
else
key |= (static_cast<int>(*j) << 16); // else it's before i
// now the key is either: i | j, or j | i (where i & j are two bytes each, and the key is currently 4 bytes)
// third index is a bugger, we have to scan the key in two byte chunks to insert our third short
if ((key & 0xFFFF) < d2)
key = (key << 16) | d2; // simple, it's the largest of the three
else if (((key >> 16) & 0xFFFF) < d2)
key = (((key << 16) | (key & 0xFFFF)) & 0xFFFF0000FFFFLL) | (d2 << 16); // its less than j but greater i
else
key |= (static_cast<long long>(d2) << 32); // it's less than i
// Now if this unique key already exists in the hash, this won't insert an entry for it
keys.insert(key);
}
// else don't care...
}
}
}
// tells us how many unique combinations there are
cout << "size: " << keys.size() << endl;
// prints out the 6 bytes for representing the three numbers
for(set<long long>::iterator it (keys.begin()), end(keys.end()); it != end; ++it)
cout << hex << *it << endl;
return 0;
}
Okay, here is attempt two: this generates the output:
start: 19
size: 4
10005000c
400060008
500050008
600060006
As you can see from there, the first "key" is the three shorts (in hex), 0x0001, 0x0005, 0x000C (which is 1, 5, 12 = 18), etc.
Okay, cleaned up the code some more, realised that the reverse iteration is pointless..
My Big O notation is not the best (never studied computer science), however I think the above is something like, O(N) for outer and O(NlogN) for inner, reason for log N is that std::set::find() is logarithmic - however if you replace this with a hashed set, the inner loop could be as good as O(N) - please someone correct me if this is crap...
I combined the suggestions by #Matthieu M. and #Chris Hopman, and (after much trial and error) I came up with this algorithm that should be O(n log n + log (n-k)! + k) in time and O(log(n-k)) in space (the stack). That should be O(n log n) overall. It's in Python, but it doesn't use any Python-specific features.
import bisect
def binsearch(r, q, i, j): # O(log (j-i))
return bisect.bisect_left(q, r, i, j)
def binfind(q, m, i, j):
while i + 1 < j:
r = m - (q[i] + q[j])
if r < q[i]:
j -= 1
elif r > q[j]:
i += 1
else:
k = binsearch(r, q, i + 1, j - 1) # O(log (j-i))
if not (i < k < j):
return None
elif q[k] == r:
return (i, k, j)
else:
return (
binfind(q, m, i + 1, j)
or
binfind(q, m, i, j - 1)
)
def find_sumof3(q, m):
return binfind(sorted(q), m, 0, len(q) - 1)
Not trying to boast about my programming skills or add redundant stuff here.
Just wanted to provide beginners with an implementation in C++.
Implementation based on the pseudocode provided by Charles Ma at Given an array of numbers, find out if 3 of them add up to 0.
I hope the comments help.
#include <iostream>
using namespace std;
void merge(int originalArray[], int low, int high, int sizeOfOriginalArray){
// Step 4: Merge sorted halves into an auxiliary array
int aux[sizeOfOriginalArray];
int auxArrayIndex, left, right, mid;
auxArrayIndex = low;
mid = (low + high)/2;
right = mid + 1;
left = low;
// choose the smaller of the two values "pointed to" by left, right
// copy that value into auxArray[auxArrayIndex]
// increment either left or right as appropriate
// increment auxArrayIndex
while ((left <= mid) && (right <= high)) {
if (originalArray[left] <= originalArray[right]) {
aux[auxArrayIndex] = originalArray[left];
left++;
auxArrayIndex++;
}else{
aux[auxArrayIndex] = originalArray[right];
right++;
auxArrayIndex++;
}
}
// here when one of the two sorted halves has "run out" of values, but
// there are still some in the other half; copy all the remaining values
// to auxArray
// Note: only 1 of the next 2 loops will actually execute
while (left <= mid) {
aux[auxArrayIndex] = originalArray[left];
left++;
auxArrayIndex++;
}
while (right <= high) {
aux[auxArrayIndex] = originalArray[right];
right++;
auxArrayIndex++;
}
// all values are in auxArray; copy them back into originalArray
int index = low;
while (index <= high) {
originalArray[index] = aux[index];
index++;
}
}
void mergeSortArray(int originalArray[], int low, int high){
int sizeOfOriginalArray = high + 1;
// base case
if (low >= high) {
return;
}
// Step 1: Find the middle of the array (conceptually, divide it in half)
int mid = (low + high)/2;
// Steps 2 and 3: Recursively sort the 2 halves of origianlArray and then merge those
mergeSortArray(originalArray, low, mid);
mergeSortArray(originalArray, mid + 1, high);
merge(originalArray, low, high, sizeOfOriginalArray);
}
//O(n^2) solution without hash tables
//Basically using a sorted array, for each number in an array, you use two pointers, one starting from the number and one starting from the end of the array, check if the sum of the three elements pointed to by the pointers (and the current number) is >, < or == to the targetSum, and advance the pointers accordingly or return true if the targetSum is found.
bool is3SumPossible(int originalArray[], int targetSum, int sizeOfOriginalArray){
int high = sizeOfOriginalArray - 1;
mergeSortArray(originalArray, 0, high);
int temp;
for (int k = 0; k < sizeOfOriginalArray; k++) {
for (int i = k, j = sizeOfOriginalArray-1; i <= j; ) {
temp = originalArray[k] + originalArray[i] + originalArray[j];
if (temp == targetSum) {
return true;
}else if (temp < targetSum){
i++;
}else if (temp > targetSum){
j--;
}
}
}
return false;
}
int main()
{
int arr[] = {2, -5, 10, 9, 8, 7, 3};
int size = sizeof(arr)/sizeof(int);
int targetSum = 5;
//3Sum possible?
bool ans = is3SumPossible(arr, targetSum, size); //size of the array passed as a function parameter because the array itself is passed as a pointer. Hence, it is cummbersome to calculate the size of the array inside is3SumPossible()
if (ans) {
cout<<"Possible";
}else{
cout<<"Not possible";
}
return 0;
}