Given an array of N numbers, find the number of sequences of all lengths having a range of at most R - c++

This is a follow-up question to Given a sequence of N numbers, extract the number of sequences of length K having range less than R.
I basically need a vector V of size N as the answer, where V[i] denotes the number of sequences of length i which have range <= R.

Traditionally, in recursive solutions, you would compute the solution for K = 0, K = 1, and then find some kind of recurrence relation between subsequent elements to avoid recomputing the solution from scratch each time.
However, here I believe that attacking the problem from the other side would be interesting, because of this property of the spread:
Given a sequence of spread R (or less), any subsequence has a spread of at most R as well.
Therefore, I would first establish a list of the longest subsequences of spread at most R beginning at each index. Let's call this list M, and have M[i] = j where j is the highest index in S (the original sequence) for which S[j] - S[i] <= R. This can be done in O(N).
Now, for any i, the number of sequences of length K starting at i is either 0 or 1, and this depends on whether K is greater than M[i] - i or not. A simple linear pass over M (from 0 to N-K) gives us the answer. This is, once again, O(N).
So, if we call V the resulting vector, with V[k] denoting the number of subsequences of length k in S with spread at most R, then we can fill it in a single iteration over M:
for i in [0, len(M)]:
    for k in [0, M[i] - i]:
        ++V[k]
The algorithm is simple; however, the number of updates can be rather daunting. In the worst case, supposing that M[i] - i equals N - i, it is O(N*N) complexity. You would need a better data structure (probably an adaptation of a Fenwick tree) to use this algorithm and lower the cost of computing those numbers.
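Since each update above is a prefix-range increment ("add 1 to V[1..L]"), even a plain difference array brings the total cost down to O(N), without a full Fenwick tree. Here is a minimal sketch of that idea (my own illustration, not from the answer above; it assumes S is sorted ascending, as the definition of M implies, so M can be computed with two pointers):

#include <vector>

// V[len] = number of sequences of length len whose spread is <= R.
// Each start index i contributes one sequence of every length 1..L(i),
// where L(i) is the longest valid run starting at i; a difference array
// turns each such prefix-range increment into two point updates.
std::vector<long long> countByLength(const std::vector<int>& S, int R) {
    int n = S.size();
    std::vector<long long> diff(n + 2, 0);
    for (int i = 0, j = 0; i < n; ++i) {
        if (j < i) j = i;                         // the run never starts before i
        while (j + 1 < n && S[j + 1] - S[i] <= R) ++j;
        int L = j - i + 1;                        // longest valid run starting at i
        ++diff[1];                                // lengths 1..L each gain one sequence
        --diff[L + 1];
    }
    std::vector<long long> V(n + 1, 0);
    long long run = 0;
    for (int len = 1; len <= n; ++len) {          // prefix pass recovers all counts
        run += diff[len];
        V[len] = run;
    }
    return V;
}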

If you are looking for contiguous sequences, try doing it recursively: the set of K-length subsequences having a range less than R is contained in the set of (K-1)-length subsequences.
At K=0, you have N solutions.
Each time you increase K, you append (resp. prepend) the next (resp. previous) element, check whether the range is still less than R, and either store the sequence in a set (look out for duplicates!) or discard it, depending on the result.
I think the complexity of this algorithm is O(n*n) in the worst-case scenario, though it may be better on average.

I think Matthieu has the right answer when looking for all sequences with spread R.
As you are only looking for sequences of length K, you can do a little better.
Instead of looking at the maximum sequence starting at i, just look at the sequence of length K starting at i, and see if it has range R or not. Do this for every i, and you have all sequences of length K with spread R.
You don't need to go through the whole list, as the latest start point for a sequence of length K is n-K+1. So the complexity is something like (n-K+1)*K = n*K - K*K + K. For K=1 this is n,
and for K=n it is n. For K=n/2 it is n*n/2 - n*n/4 + n/2 = n*n/4 + n/2, which I think is the maximum. So while this is still O(n*n), for most values of K you get a little better.
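For reference, a short sketch of this fixed-K scan (my own illustration; it computes each window's min and max directly, so each check costs O(K)):

#include <algorithm>
#include <vector>

// Count contiguous windows of length K whose spread (max - min) is <= R.
int countLengthK(const std::vector<int>& S, int K, int R) {
    int n = S.size(), count = 0;
    for (int i = 0; i + K <= n; ++i) {            // last start index is n - K
        auto lo = std::min_element(S.begin() + i, S.begin() + i + K);
        auto hi = std::max_element(S.begin() + i, S.begin() + i + K);
        if (*hi - *lo <= R) ++count;
    }
    return count;
}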

Start with a simpler problem: for each start index, count the maximal length of the sequence starting there and having a range of at most R.
To do this, let the first pointer point to the first element of the array. Increase the second pointer (also starting from the first element of the array) while the sequence between the pointers has a range less than or equal to R. Push every array element passed by the second pointer into a min-max queue, made of a pair of min-max stacks, as described in this answer. When the difference between the max and min values reported by the min-max queue exceeds R, stop increasing the second pointer, increment V[ptr2 - ptr1], increment the first pointer (removing the element it points to from the min-max queue), and continue increasing the second pointer (keeping the range under control).
When the second pointer leaves the bounds of the array, keep incrementing V and the first pointer as above until the queue's range is at most R, then increment V[N - ptr1] for every remaining ptr1. To account for all the shorter sequences as well, compute the cumulative sum of the array V[], starting from its end.
Both time and space complexities are O(N).
Pseudo-code:
p1 = p2 = 0;
do {
    do {
        min_max_queue.push(a[p2]);
        ++p2;
    } while (p2 < N && min_max_queue.range() <= R);
    while (min_max_queue.range() > R) {
        ++v[p2 - p1 - 1];   // maximal sequence starting at p1 has length p2 - p1 - 1
        min_max_queue.pop();
        ++p1;
    }
} while (p2 < N);
for (i = 1; i <= N - p1; ++i) {
    ++v[i];                 // remaining start points reach the end of the array
}
sum = 0;
for (j = N; j > 0; --j) {   // suffix sums: turn maximal lengths into per-length counts
    value = v[j];
    v[j] += sum;
    sum += value;
}
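For completeness, here is a compilable sketch of the same two-pointer pass (my own; it replaces the pair-of-stacks min-max queue with the equivalent monotonic-deque formulation, which is the more common way to maintain a sliding window's min and max, and it assumes R >= 0):

#include <deque>
#include <vector>

// v[len] = number of contiguous sequences of length len with range <= R.
std::vector<long long> countRanges(const std::vector<int>& a, int R) {
    const int N = a.size();
    std::deque<int> mx, mn;            // indices of decreasing / increasing values
    std::vector<long long> v(N + 1, 0);
    int p1 = 0;
    for (int p2 = 0; p2 < N; ++p2) {
        while (!mx.empty() && a[mx.back()] <= a[p2]) mx.pop_back();
        mx.push_back(p2);
        while (!mn.empty() && a[mn.back()] >= a[p2]) mn.pop_back();
        mn.push_back(p2);
        while (a[mx.front()] - a[mn.front()] > R) {
            ++v[p2 - p1];              // maximal window starting at p1 is [p1, p2-1]
            if (mx.front() == p1) mx.pop_front();
            if (mn.front() == p1) mn.pop_front();
            ++p1;
        }
    }
    for (int i = 1; i <= N - p1; ++i) ++v[i];  // starts whose window reaches the end
    long long sum = 0;                         // suffix sums over maximal lengths
    for (int len = N; len >= 1; --len) {
        long long value = v[len];
        v[len] += sum;
        sum += value;
    }
    return v;
}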

Related

Proving that a two-pointer approach works (pair sum)

I was trying to solve the pair sum problem: given a sorted array, we need to determine whether there exist two indices i and j such that i != j and a[i] + a[j] == k for a given k.
One of the approaches to do the same problem is running two nested for loops, resulting in a complexity of O(n*n).
Another way to solve it is using a two-pointer technique. I wasn't able to solve the problem using the two-pointer method and therefore looked it up, but I couldn't understand why it works. How do I prove that it works?
#define lli long long
// n is the size of the array A
bool f(lli sum) {
    int l = 0, r = n - 1;
    while (l < r) {
        if (A[l] + A[r] == sum) return 1;
        else if (A[l] + A[r] > sum) r--;
        else l++;
    }
    return 0;
}
Well, think of it this way:
You have a sorted array (you didn't mention that the array is sorted, but for this problem, that is generally the case):
{ -1,4,8,12 }
The algorithm starts by choosing the first element in the array and the last element, adding them together and comparing them to the sum you are after.
If our initial sum matches the sum we are looking for, great!! If not, well, we need to continue looking at possible sums either greater than or less than the sum we started with. By starting with the smallest and the largest value in the array for our initial sum, we can eliminate one of those elements as being part of a possible solution.
Let's say we are looking for the sum 3. We see that 3 < 11. Since our big number (12) is paired with the smallest possible number (-1), the fact that our sum is too large means that 12 cannot be part of any possible solution, since any other sum using 12 would have to be larger than 11 (12 + 4 > 12 - 1, 12 + 8 > 12 - 1).
So we know we cannot possibly make a sum of 3 using 12 + one other number in the array; they would all be too big. So we can eliminate 12 from our search by moving down to the next largest number, 8. We do the same thing here. We see 8 + -1 is still too big, so we move down to the next number, 4, and voila! We find a match.
The same logic applies if the sum we get is too small. We can eliminate our small number, because any sum we can get using our current smallest number has to be less than or equal to the sum we get when it is paired with our current largest number.
We keep doing this until we find a match, or until the indices cross each other, since, after they cross, we are simply adding up pairs of numbers we have already checked (i.e. 4 + 8 = 8 + 4).
This may not be a mathematical proof, but hopefully it illustrates how the algorithm works.
Stephen Docy did a great job tracing the program's execution and explaining the rationale behind its decisions. Making the answer closer to a mathematical proof of the algorithm's correctness could make it easier to generalize to problems like the one mentioned by zzzzzzz in the comments.
We are given a sorted array A of length n and an integer sum. We need to find if there are any two indices i and j such that i != j and A[i] + A[j] == sum.
The solutions (i, j) and (j, i) are equivalent, so we can assume that i < j without loss of generality. In the program, the current guess at i is called l and the current guess at j is called r.
We iteratively slice the array till we find a slice that has the two summands that sum to sum at its boundary, or we find there is no such slice. The slice starts at index l and ends at index r and I will write it as (l, r).
Initially, the slice is the whole array. In each iteration, the length of the slice is decreased by 1: either the left boundary index l increases or the right boundary index r decreases. When the slice length decreases to 1 (l == r), there are no pairs of different indexes inside the slice, so false is returned. This means that the algorithm halts for any input. The O(n) complexity is also immediately clear. The correctness remains to be proven.
We can assume there is a solution; if there is none, the analysis in the above paragraph applies and the branch returning true can never be executed.
The loop has an invariant (statement that holds true regardless of how many iterations have been done yet): When a solution exists, it is either (l, r) itself or its sub-slice. Mathematically, such an invariant is a lemma -- something that is not very useful by itself but makes a stepping stone in the overall proof. We get the overall correctness by initially making (l, r) the whole array and observing that as each iteration makes the slice shorter, the invariant ensures that we will eventually find the solution. Now, we just need to prove the invariant.
We will prove the invariant by induction. The induction base is trivial -- the initial slice (l, r) either is the solution, or contains it as a sub-slice. The hard part is the induction step, i.e. proving that when (l, r) contains the solution, either it is the solution itself or the slice for the next iteration contains the solution as a sub-slice.
When A[l] + A[r] == sum, (l, r) is the solution itself; the first condition in the loop is triggered, true is returned, and everyone is happy.
When A[l] + A[r] > sum, the slice for the next iteration is (l, r - 1), which still contains the solution. Let's prove that by contradiction, assuming (l, r - 1) does not contain the solution. How could that happen, when (l, r) contained the solution (by induction hypothesis)? The only way would be that the solution (i, j) has j == r (r is the only index we removed from the slice). Because by definition A[i] + A[j] == sum, we get A[i] + A[r] == sum < A[l] + A[r] in this branch. When we subtract A[r] from both sides of the inequality, we get A[i] < A[l]. But A[l] is the smallest value in the (l, r) slice (the array is sorted), so this is a contradiction.
When A[l] + A[r] < sum, the slice for the next iteration is (l + 1, r). The argument is symmetric to the previous case.
∎
The algorithm may easily be rewritten in recursive form, which simplifies the analysis at the expense of actual performance. This is the functional programming approach.
#define lli long long
// n is the size of the array A
bool f(lli sum) {
    return g(sum, 0, n - 1);
}

bool g(lli sum, int l, int r) {
    if (l >= r) return 0;
    else if (A[l] + A[r] == sum) return 1;
    else if (A[l] + A[r] > sum) return g(sum, l, r - 1);
    else return g(sum, l + 1, r);
}
The f function still contains the initialization, but it calls the new g function, which implements the original loop. Instead of keeping the state in local variables, it uses its parameters. Each call of the g function corresponds to a single iteration of the original loop.
The g function is a solution to a more general problem than the original one: Given a sorted array A, are there any two indices i and j such that i != j and A[i] + A[j] == sum and both i and j are between l and r (inclusive)?
This makes reading the analysis even simpler. The loop invariant is actually the proof of correctness of g and the structure of g guides the proof.

[Competitive Programming]: How do I optimise this brute force method? [duplicate]

If n numbers are given, how would I find the total number of possible triangles? Is there any method that does this in less than O(n^3) time?
I am considering a+b>c, b+c>a and a+c>b conditions for being a triangle.
Assume there are no equal numbers among the given n, and that it is allowed to use a number more than once. For example, given the numbers {1,2,3}, we can create 7 triangles:
1 1 1
1 2 2
1 3 3
2 2 2
2 2 3
2 3 3
3 3 3
If any of those assumptions doesn't hold, it's easy to modify the algorithm.
Here I present an algorithm which takes O(n^2) time in the worst case:
Sort the numbers (in ascending order).
We will take triples ai <= aj <= ak, such that i <= j <= k.
For each i, j you need to find the largest k that satisfies ak < ai + aj. Then every triple (ai, aj, al) with j <= l <= k is a triangle (because ak >= aj >= ai, the only condition that can be violated is al < ai + aj).
Consider two pairs (i, j1) and (i, j2) with j1 <= j2. It's easy to see that k2 (found for (i, j2)) >= k1 (found for (i, j1)). It means that as you iterate over j, you only need to check numbers starting from the previous k. This gives you O(n) time complexity for each particular i, which implies O(n^2) for the whole algorithm.
C++ source code:
int Solve(int* a, int n)
{
    int answer = 0;
    std::sort(a, a + n);
    for (int i = 0; i < n; ++i)
    {
        int k = i;
        for (int j = i; j < n; ++j)
        {
            while (k < n && a[i] + a[j] > a[k])
                ++k;
            answer += k - j;
        }
    }
    return answer;
}
Update for downvoters:
This definitely is O(n^2)! Please read carefully the chapter on amortized analysis in "Introduction to Algorithms" by Thomas H. Cormen et al. (17.2 in the second edition).
Finding complexity by counting nested loops is sometimes completely wrong.
Here I try to explain it as simply as I can. Let's fix the variable i. Then for that i we must iterate j from i to n (an O(n) operation), and the internal while loop iterates k from i to n in total (also an O(n) operation). Note: I don't restart the while loop from the beginning for each j. We also need to do this for each i from 0 to n. So it gives us n * (O(n) + O(n)) = O(n^2).
There is a simple algorithm in O(n^2*logn).
Assume you want all triangles as triples (a, b, c) where a <= b <= c.
There are 3 triangle inequalities but only a + b > c suffices (the others then hold trivially).
And now:
Sort the sequence in O(n * logn), e.g. by merge-sort.
For each pair (a, b), a <= b, the remaining value c needs to be at least b and less than a + b.
So you need to count the number of items in the interval [b, a+b).
This can be done simply by binary-searching for a + b (O(logn)) and counting the items in [b, a+b) from the positions found.
All together O(n * logn + n^2 * logn), which is O(n^2 * logn). Hope this helps.
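A compilable sketch of this counting scheme (my own illustration; it counts triples with indices i <= j <= l, matching the repetition-allowed convention from the question):

#include <algorithm>
#include <vector>

// For every pair (v[i], v[j]) with i <= j, count elements at or after
// position j that are < v[i] + v[j]; since the array is sorted, c >= v[j]
// holds automatically, so this counts exactly the interval [b, a+b).
long long countTriangles(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    long long total = 0;
    int n = v.size();
    for (int i = 0; i < n; ++i)
        for (int j = i; j < n; ++j) {
            auto hi = std::lower_bound(v.begin() + j, v.end(), v[i] + v[j]);
            total += hi - (v.begin() + j);   // candidates for the third side
        }
    return total;   // e.g. {1,2,3} -> 7, as in the question's example
}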
If you use a binary sort, that's O(n*log(n)), right? Keep your binary tree handy, and for each pair (a, b) where a < b, locate all elements c where c > b and c < (a + b).
Let a, b and c be the three sides. The conditions below must hold for a triangle (the sum of any two sides is greater than the third side):
i) a + b > c
ii) b + c > a
iii) a + c > b
Following are the steps to count triangles.
1. Sort the array in non-decreasing order.
2. Initialize two pointers i and j to the first and second elements respectively, and initialize the count of triangles as 0.
3. Fix i and j and find the rightmost index k (or largest arr[k]) such that arr[i] + arr[j] > arr[k]. The number of triangles that can be formed with arr[i] and arr[j] as two sides is k - j. Add k - j to the count of triangles.
Let us consider arr[i] as a, arr[j] as b, and all elements between arr[j+1] and arr[k] as c. The above-mentioned conditions (ii) and (iii) are satisfied because arr[i] < arr[j] < arr[k], and we check condition (i) when we pick k.
4. Increment j to fix the second element again.
Note that in step 3, we can use the previous value of k. The reason is simple: if we know that arr[i] + arr[j-1] is greater than arr[k], then arr[i] + arr[j] will also be greater than arr[k], because the array is sorted in increasing order.
5. If j has reached the end, then increment i, initialize j as i + 1 and k as i + 2, and repeat steps 3 and 4.
Time complexity: O(n^2).
The time complexity looks higher because of the 3 nested loops. If we take a closer look at the algorithm, we observe that k is initialized only once, in the outermost loop. The innermost loop executes at most O(n) times in total for every iteration of the outermost loop, because k starts from i+2 and goes up to n for all values of j. Therefore, the time complexity is O(n^2).
I have worked out an algorithm that runs in O(n^2 lgn) time. I think it's correct...
The code is written in C++...
// Returns the index of the element closest to n in array A[p..q]
int Search_Closest(int A[], int p, int q, int n)
{
    if (p < q)
    {
        int r = (p + q) / 2;
        if (n == A[r])
            return r;
        if (p == r)
            return r;
        if (n < A[r])
            return Search_Closest(A, p, r, n);
        else
            return Search_Closest(A, r, q, n);
    }
    else
        return p;
}

// Returns the number of triangles possible in A[p..q]
int no_of_triangles(int A[], int p, int q)
{
    int sum = 0;
    Quicksort(A, p, q); // sorts the array A[p..q] in O(n lgn) expected time
    for (int i = p; i <= q; i++)
        for (int j = i + 1; j <= q; j++)
        {
            int c = A[i] + A[j];
            int k = Search_Closest(A, j, q, c);
            /* The number of triangles formed with A[i] and A[j] as two sides
               is (k+1)-2 if A[k] is less than or equal to c, else it's (k+1)-3.
               As the index starts from zero, we need to add 1 to the value. */
            if (A[k] > c)
                sum += k - 2;
            else
                sum += k - 1;
        }
    return sum;
}
Hope it helps.
A possible improvement: we can use binary search to find the value of 'k' and hence improve the time complexity!
Given numbers N0, N1, N2, ..., Nn-1:
Sort them in descending order, giving X0, X1, X2, ..., Xn-1 with X0 >= X1 >= X2 >= ... >= Xn-1.
Choose X0 (and then X1, and so on, up to Xn-3) as the longest side, and choose the remaining two items from the rest.
For a candidate triple such as (X0, X1, X2), check X0 < X1 + X2.
If the check passes, we have found a triangle; continue. If it fails, skip the remaining choices for this longest side.
It seems there is no algorithm better than O(n^3). In the worst case, the result set itself has O(n^3) elements.
For example, if n equal numbers are given, the algorithm has to return n*(n-1)*(n-2) results.

What's the efficient way to sum up the elements of an array in the following way?

Suppose you are given an array A of size n and an integer k.
Now you have to follow this function:
long long sum(int k)
{
    long long sum = 0;
    for (int i = 0; i < n; i++) {
        sum += min(A[i], k);
    }
    return sum;
}
What is the most efficient way to find this sum?
EDIT: if I am given m (<= 100000) queries, each with a different k, it becomes very time-consuming.
If the underlying set of numbers changes with each query, then you can't do better than O(n). Your only options for optimization are to use multiple threads (each thread sums some region of the array), or at least to ensure that your loop is properly vectorized by the compiler (or to write a vectorized version manually using intrinsics).
But if the set of numbers is fixed and only k changes, then you may do it in O(log n) by using the following optimization.
Preprocess the array. This is done only once, for all ks:
Sort the elements
Make another array of the same length which contains partial sums
For example:
inputArray: 5 1 3 8 7
sortedArray: 1 3 5 7 8
partialSums: 1 4 9 16 24
Now, when a new k is given, you need to perform the following steps:
Binary-search for the given k in sortedArray -- this returns the index i of the maximal element <= k
The result is partialSums[i] + (partialSums.length - i - 1) * k
You can do way better than that if you can sort the array A[i] and have a secondary array prepared once.
The idea is:
Count how many items are greater than or equal to k, and compute their contribution with the formula: count * k
Prepare a helper array which will give you the sum of the items strictly below k directly
Preparation
Step 1: sort the array
std::sort(begin(A), end(A));
Step 2: prepare a helper array of prefix sums (p_sums[i] = sum of the i smallest elements)
std::vector<long long> p_sums(A.size() + 1, 0);
std::partial_sum(begin(A), end(A), begin(p_sums) + 1);
Query
long long query(int k) {
    // count the items whose value is strictly below k; they keep their own values
    auto it = std::lower_bound(begin(A), end(A), k);
    auto index = std::distance(begin(A), it);
    // those items contribute their sum; the remaining ones are clamped to k
    long long result = p_sums[index] + (long long)(A.size() - index) * k;
    return result;
}
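As a quick sanity check (my worked example, not from the original answer): with A = {1, 3, 5, 7, 8}, the helper array is p_sums = {0, 1, 4, 9, 16, 24}; query(6) finds index = 3 and returns 9 + 2 * 6 = 21, which matches 1 + 3 + 5 + 6 + 6.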
The complexity of the query is: O(log(N)) where N is the length of the array A.
The complexity of the preparation is: O(N*log(N)). We could go down to O(N) with a radix sort but I don't think it is useful in your case.
References
std::sort()
std::partial_sum()
std::lower_bound()
What you do seems absolutely fine, unless this is really time-critical (that is, customers complain that your app is too slow, you measured it, and this function is the problem), in which case you can try some non-portable vector instructions, for example.
Often you can do things more efficiently by looking at them from a higher level. For example, if I write
for (n = 0; n < 1000000; ++n)
    printf("%lld\n", sum(100));
then this will take an awfully long time (half a trillion additions) and could be done a lot quicker. The same applies if you change one element of the array A at a time and recalculate the sum each time.
Suppose x elements of array A are larger than k, and the set B contains those elements of A which are no larger than k.
Then the result of the function sum(k) equals
k * x + sum_b
, where sum_b is the sum of the elements belonging to B.
You can first sort the array A and calculate the array pre_A of prefix sums, where
pre_A[i] = pre_A[i - 1] + A[i] (i > 0),
or 0 (i = 0), with A indexed from 1;
Then for each query k, use binary search on A to find the largest element u which is no larger than k. Assume the index of u is index_u; then sum(k) equals
pre_A[index_u] + k * (n - index_u)
. The time complexity for each query is O(log n).
In case array A may be dynamically changed, you can use a BST to handle it.

Suffix Array Algorithm

After quite a bit of reading, I have figured out what a suffix array and an LCP array represent.
Suffix array: represents the lexicographic rank of each suffix of an array.
LCP array: contains the maximum-length prefix match between two consecutive suffixes, after they are sorted lexicographically.
I have been trying hard for a couple of days to understand how exactly the suffix array and LCP algorithm works.
Here is the code, which is taken from Codeforces:
/*
Suffix array O(n lg^2 n)
LCP table O(n)
*/
#include <cstdio>
#include <algorithm>
#include <cstring>

using namespace std;

#define REP(i, n) for (int i = 0; i < (int)(n); ++i)

namespace SuffixArray
{
    const int MAXN = 1 << 21;
    char *S;
    int N, gap;
    int sa[MAXN], pos[MAXN], tmp[MAXN], lcp[MAXN];

    bool sufCmp(int i, int j)
    {
        if (pos[i] != pos[j])
            return pos[i] < pos[j];
        i += gap;
        j += gap;
        return (i < N && j < N) ? pos[i] < pos[j] : i > j;
    }

    void buildSA()
    {
        N = strlen(S);
        REP(i, N) sa[i] = i, pos[i] = S[i];
        for (gap = 1;; gap *= 2)
        {
            sort(sa, sa + N, sufCmp);
            REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);
            REP(i, N) pos[sa[i]] = tmp[i];
            if (tmp[N - 1] == N - 1) break;
        }
    }

    void buildLCP()
    {
        for (int i = 0, k = 0; i < N; ++i) if (pos[i] != N - 1)
        {
            for (int j = sa[pos[i] + 1]; S[i + k] == S[j + k];)
                ++k;
            lcp[pos[i]] = k;
            if (k) --k;
        }
    }
} // end namespace SuffixArray
I cannot, just cannot get through how this algorithm works. I tried working through an example with pencil and paper and wrote down the steps involved, but lost the thread partway through, as it's too complicated, for me at least.
Any help regarding explanation, using an example maybe, is highly appreciated.
Overview
This is an O(n log n) algorithm for suffix array construction (or rather, it would be, if a 2-pass bucket sort had been used instead of std::sort).
It works by first sorting the 2-grams(*), then the 4-grams, then the 8-grams, and so forth, of the original string S, so in the i-th iteration we sort the 2^i-grams. There can obviously be no more than log2(n) such iterations, and the trick is that sorting the 2^i-grams in the i-th step is facilitated by making sure that each comparison of two 2^i-grams is done in O(1) time (rather than O(2^i) time).
How does it do this? Well, in the first iteration it sorts the 2-grams (aka bigrams), and then performs what is called lexicographic renaming. This means it creates a new array (of length n) that stores, for each bigram, its rank in the bigram sorting.
Example for lexicographic renaming: Say we have a sorted list of bigrams {'ab','ab','ca','cd','cd','ea'}. We assign ranks (i.e. lexicographic names) by going from left to right, starting with rank 0 and incrementing the rank whenever we encounter a new bigram. So the ranks we assign are as follows:
ab : 0
ab : 0 [no change to previous]
ca : 1 [increment because different from previous]
cd : 2 [increment because different from previous]
cd : 2 [no change to previous]
ea : 3 [increment because different from previous]
These ranks are known as lexicographic names.
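As a tiny illustration of this renaming step (my own sketch, mirroring the tmp[i + 1] = tmp[i] + sufCmp(...) line in the posted code):

#include <string>
#include <vector>

// Assign increasing integer ranks to an already lexicographically
// sorted list of bigrams; equal neighbors share a rank.
std::vector<int> renameRanks(const std::vector<std::string>& sortedBigrams) {
    std::vector<int> rank(sortedBigrams.size(), 0);
    for (size_t i = 1; i < sortedBigrams.size(); ++i)
        rank[i] = rank[i - 1] + (sortedBigrams[i] != sortedBigrams[i - 1]);
    return rank;   // {ab, ab, ca, cd, cd, ea} -> {0, 0, 1, 2, 2, 3}
}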
Now, in the next iteration, we sort 4-grams. This involves a lot of comparisons between different 4-grams. How do we compare two 4-grams? Well, we could compare them character by character. That would be up to 4 operations per comparison. But instead, we compare them by looking up the ranks of the two bigrams contained in them, using the rank table generated in the previous steps. That rank represents the lexicographic rank from the previous 2-gram sort, so if for any given 4-gram, its first 2-gram has a higher rank than the first 2-gram of another 4-gram, then it must be lexicographically greater somewhere in the first two characters. Hence, if for two 4-grams the rank of the first 2-gram is identical, they must be identical in the first two characters. In other words, two look-ups in the rank table are sufficient to compare all 4 characters of the two 4-grams.
After sorting, we create new lexicographic names again, this time for the 4-grams.
In the third iteration, we need to sort by 8-grams. Again, two look-ups in the lexicographic rank table from the previous step are sufficient to compare all 8 characters of two given 8-grams.
And so forth. Each iteration i has two steps:
Sorting by 2^i-grams, using the lexicographic names from the previous iteration to enable comparisons in 2 steps (i.e. O(1) time) each
Creating new lexicographic names
We repeat this until all 2^i-grams are different. If that happens, we are done. How do we know if all are different? Well, the lexicographic names are an increasing sequence of integers, starting with 0. So if the highest lexicographic name generated in an iteration is n-1, then each 2^i-gram must have been given its own, distinct lexicographic name.
Implementation
Now let's look at the code to confirm all of this. The variables used are as follows: sa[] is the suffix array we are building. pos[] is the rank lookup-table (i.e. it contains the lexicographic names), specifically, pos[k] contains the lexicographic name of the k-th m-gram of the previous step. tmp[] is an auxiliary array used to help create pos[].
I'll give further explanations between the code lines:
void buildSA()
{
    N = strlen(S);
    /* This is a loop that initializes sa[] and pos[].
       For sa[] we assume the order the suffixes have
       in the given string. For pos[] we set the lexicographic
       rank of each 1-gram using the characters themselves.
       That makes sense, right? */
    REP(i, N) sa[i] = i, pos[i] = S[i];
    /* Gap is the length of the m-gram in each step, divided by 2.
       We start with 2-grams, so gap is 1 initially. It then increases
       to 2, 4, 8 and so on. */
    for (gap = 1;; gap *= 2)
    {
        /* We sort by (gap*2)-grams: */
        sort(sa, sa + N, sufCmp);
        /* We compute the lexicographic rank of each m-gram
           that we have sorted above. Notice how the rank is computed
           by comparing each m-gram at position i with its
           neighbor at i+1. If they are identical, the comparison
           yields 0, so the rank does not increase. Otherwise the
           comparison yields 1, so the rank increases by 1. */
        REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);
        /* tmp contains the rank by sorted position. Now we map this
           into pos, so that in the next step we can look it
           up per m-gram, rather than by sorted position. */
        REP(i, N) pos[sa[i]] = tmp[i];
        /* If the largest lexicographic name generated is
           n-1, we are finished, because this means all
           m-grams must have been different. */
        if (tmp[N - 1] == N - 1) break;
    }
}
About the comparison function
The function sufCmp is used to compare two (2*gap)-grams lexicographically. So in the first iteration it compares bigrams, in the second iteration 4-grams, then 8-grams and so on. This is controlled by gap, which is a global variable.
A naive implementation of sufCmp would be this:
bool sufCmp(int i, int j)
{
    int pos_i = sa[i];
    int pos_j = sa[j];
    int end_i = pos_i + 2 * gap;
    int end_j = pos_j + 2 * gap;
    if (end_i > N)
        end_i = N;
    if (end_j > N)
        end_j = N;
    while (pos_i < end_i && pos_j < end_j)
    {
        if (S[pos_i] != S[pos_j])
            return S[pos_i] < S[pos_j];
        pos_i += 1;
        pos_j += 1;
    }
    return (pos_i < N && pos_j < N) ? S[pos_i] < S[pos_j] : pos_i > pos_j;
}
This would compare the (2*gap)-gram at the beginning of the i-th suffix pos_i:=sa[i] with the one found at the beginning of the j-th suffix pos_j:=sa[j]. And it would compare them character by character, i.e. comparing S[pos_i] with S[pos_j], then S[pos_i+1] with S[pos_j+1] and so on. It continues as long as the characters are identical. Once they differ, it returns 1 if the character in the i-th suffix is smaller than the one in the j-th suffix, 0 otherwise. (Note that return a<b in a function returning bool means you return 1 (true) if the condition holds, and 0 (false) otherwise.)
The complicated looking condition in the return-statement deals with the case that one of the (2*gap)-grams is located at the end of the string. In this case either pos_i or pos_j will reach N before all (2*gap) characters have been compared, even if all characters up to that point are identical. It will then return 1 if the i-th suffix is at the end, and 0 if the j-th suffix is at the end. This is correct because if all characters are identical, the shorter one is lexicographically smaller. If pos_i has reached the end, the i-th suffix must be shorter than the j-th suffix.
Clearly, this naive implementation is O(gap), i.e. its complexity is linear in the length of the (2*gap)-grams. The function used in your code, however, uses the lexicographic names to bring this down to O(1) (specifically, down to a maximum of two comparisons):
bool sufCmp(int i, int j)
{
    if (pos[i] != pos[j])
        return pos[i] < pos[j];
    i += gap;
    j += gap;
    return (i < N && j < N) ? pos[i] < pos[j] : i > j;
}
As you can see, instead of looking up individual characters S[i] and S[j], we check the lexicographic rank of the i-th and j-th suffix. Lexicographic ranks were computed in the previous iteration for gap-grams. So, if pos[i] < pos[j], then the i-th suffix sa[i] must start with a gap-gram that is lexicographically smaller than the gap-gram at the beginning of sa[j]. In other words, simply by looking up pos[i] and pos[j] and comparing them, we have compared the first gap characters of the two suffixes.
If the ranks are identical, we continue by comparing pos[i+gap] with pos[j+gap]. This is the same as comparing the next gap characters of the (2*gap)-grams, i.e. the second half. If the ranks are identical again, the two (2*gap)-grams are identical, so we return 0. Otherwise we return 1 if the i-th suffix is smaller than the j-th suffix, 0 otherwise.
Example
The following example illustrates how the algorithm operates, and demonstrates in particular the role of the lexicographic names in the sorting algorithm.
The string we want to sort is abcxabcd. It takes three iterations to generate the suffix array for this. In each iteration, I'll show S (the string), sa (the current state of the suffix array) and tmp and pos, which represent the lexicographic names.
First, we initialize:
S abcxabcd
sa 01234567
pos abcxabcd
Note how the lexicographic names, which initially represent the lexicographic rank of unigrams, are simply identical to the characters (i.e. the unigrams) themselves.
First iteration:
Sorting sa, using bigrams as sorting criterion:
sa 04156273
The first two suffixes are 0 and 4 because those are the positions of bigram 'ab'. Then 1 and 5 (positions of bigram 'bc'), then 6 (bigram 'cd'), then 2 (bigram 'cx'), then 7 (incomplete bigram 'd'), then 3 (bigram 'xa'). Clearly, the positions correspond to the order based solely on character bigrams.
Generating the lexicographic names:
tmp 00112345
As described, lexicographic names are assigned as increasing integers. The first two suffixes (both starting with bigram 'ab') get 0, the next two (both starting with bigram 'bc') get 1, then 2, 3, 4, 5 (each a different bigram).
Finally, we map this according to the positions in sa, to get pos:
sa 04156273
tmp 00112345
pos 01350124
(The way pos is generated is this: Go through sa from left to right, and use the entry to define the index in pos. Use the corresponding entry in tmp to define the value for that index. So pos[0]:=0, pos[4]:=0, pos[1]:=1, pos[5]:=1, pos[6]:=2, and so on. The index comes from sa, the value from tmp.)
Second iteration:
We sort sa again, and again we look at bigrams from pos (which each represents a sequence of two bigrams of the original string).
sa 04516273
Notice how the positions of 1 and 5 have switched compared to the previous version of sa. It used to be 15, now it is 51. This is because the bigram at pos[1] and the bigram at pos[5] used to be identical (both bc) during the previous iteration, but now the bigram at pos[5] is 12, while the bigram at pos[1] is 13. So position 5 comes before position 1. This is due to the fact that the lexicographic names now each represent bigrams of the original string: pos[5] represents bc and pos[6] represents cd, so together they represent bcd, while pos[1] represents bc and pos[2] represents cx, so together they represent bcx, which is indeed lexicographically greater than bcd.
Again, we generate lexicographic names by screening the current version of sa from left to right and comparing the corresponding bigrams in pos:
tmp 00123456
The first two entries are still identical (both 0), because the corresponding bigrams in pos are both 01. The rest is a strictly increasing sequence of integers, because all other bigrams in pos are each unique.
We perform the mapping to the new pos as before (taking indices from sa and values from tmp):
sa 04516273
tmp 00123456
pos 02460135
Third iteration:
We sort sa again, taking bigrams of pos (as always), which now each represents a sequence of 4 bigrams of the original string.
sa 40516273
You'll notice that now the first two entries have switched positions: 04 has become 40. This is because the bigram at pos[0] is 02 while the one at pos[4] is 01, the latter obviously being lexicographically smaller. The deep reason is that these two represent abcx and abcd, respectively.
Generating lexicographic names yields:
tmp 01234567
They are all different, i.e. the highest one is 7, which is n-1. So we are done, because our sorting is now based on m-grams that are all different. Even if we continued, the sorting order would not change.
Improvement suggestion
The algorithm used to sort the 2^i-grams in each iteration appears to be the built-in sort (i.e. std::sort). This means it's a comparison sort, which takes O(n log n) time in the worst case, in each iteration. Since there are log n iterations in the worst case, this makes it an O(n (log n)^2)-time algorithm. However, the sorting could be performed using two passes of bucket sort, since the keys we use for the sort comparison (i.e. the lexicographic names of the previous step) form an increasing integer sequence. So this could be improved to an actual O(n log n)-time algorithm for suffix sorting.
Remark
I believe this is the original algorithm for suffix array construction that was suggested in the 1992 paper by Manber and Myers (link on Google Scholar; it should be the first hit, and it may have a link to a PDF there). This (at the same time as, but independently of, a paper by Gonnet and Baeza-Yates) was what introduced suffix arrays (also known as PAT arrays at the time) as a data structure interesting for further study.
Modern algorithms for suffix array construction are O(n), so the above is no longer the best algorithm available (at least not in terms of theoretical, worst-case complexity).
Footnotes
(*) By 2-gram I mean a sequence of two consecutive characters of the original string. For example, when S=abcde is the string, then ab, bc, cd, de are the 2-grams of S. Similarly, abcd and bcde are the 4-grams. Generally, an m-gram (for a positive integer m) is a sequence of m consecutive characters. 1-grams are also called unigrams, 2-grams are called bigrams, 3-grams are called trigrams. Some people continue with tetragrams, pentagrams and so on.
Note that the suffix of S that starts at position i is an (n-i)-gram of S. Also, every m-gram (for any m) is a prefix of one of the suffixes of S. Therefore, sorting m-grams (for an m as large as possible) can be the first step towards sorting suffixes.

Finding the maximum weight subsequence of an array of positive integers?

I'm trying to find the maximum weight subsequence of an array of positive integers - the catch is that no adjacent members are allowed in the final subsequence.
The exact same question was asked here, and a recursive solution was given by MarkusQ thus:
function Max_route(A)
    if A's length = 1
        A[0]
    else
        maximum of
            A[0] + Max_route(A[2...])
            Max_route(A[1...])
He provides an explanation, but can anyone help me understand how he has expanded the function? Specifically what does he mean by
f []      :- [], 0
f [x]     :- [x], x
f [a,b]   :- if a > b then [a], a else [b], b
f [a,b|t] :-
    ft = f t
    fbt = f [b|t]
    if a + ft.sum > fbt.sum
        [a|ft.path], a + ft.sum
    else
        fbt
Why does he expand f[] to [],0? Also how does his solution take into consideration non-adjacent members?
I have some C++ code that is based on this algorithm, which I can post if anyone wants to see it, but I just can't for the life of me fathom why it works.
==========For anyone who's interested - the C++ code ==============
I should add that the array of integers is to be treated as a circular list, so any sequence containing the first element cannot contain the last.
int memo[55][55];
int a[55];

int solve(int s, int e)
{
    if (s > e) return 0;
    int &ret = memo[s][e];
    if (ret != -1)
    {
        return ret;
    }
    ret = max(solve(s + 1, e), solve(s + 2, e) + a[s]);
    return ret;
}

class Sequence
{
public:
    int maxSequence(vector<int> s)
    {
        memset(memo, -1, sizeof(memo));
        int n = s.size();
        for (int i = 0; i < n; i++)
            a[i] = s[i];
        return max(solve(0, n - 2), solve(1, n - 1));
    }
};
I don't really understand that pseudocode, so post the C++ code if this isn't helpful and I'll try to improve it.
I'm tring to find the maximum weight subsequence of an array of positive integers - the catch is that no adjacent members are allowed in the final subsequence.
Let a be your array of positive ints. Let f[i] = value of the maximum weight subsequence of the sequence a[0..i].
We have:
f[0] = a[0] because if there's only one element, we have to take it.
f[1] = max(a[0], a[1]) because you have the no adjacent elements restriction, so if you have two elements, you can only take one of them. It makes sense to take the largest one.
Now, generally you have:
f[i > 1] = max(
    f[i - 2] + a[i]   <- add a[i] to the largest subsequence of the sequence a[0..i - 2]. We cannot take a[0..i - 1] because otherwise we risk adding an adjacent element.
    f[i - 1]          <- don't add the current element to the maximum of a[0..i - 2]; instead take the maximum of a[0..i - 1], to which we cannot add a[i].
)
I think this way is easier to understand than what you have there. The approaches are equivalent; I just find this one clearer for this particular problem, since recursion makes things harder to follow in this case.
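Here is a direct sketch of that recurrence (my own, for the linear, non-circular case; the circular variant from the question would run it on a[0..n-2] and on a[1..n-1] and take the larger result, as the posted C++ does):

#include <algorithm>
#include <vector>

// f[i] = best achievable sum in a[0..i] with no two adjacent picks.
int maxNonAdjacentSum(const std::vector<int>& a) {
    int n = a.size();
    if (n == 0) return 0;
    if (n == 1) return a[0];
    std::vector<int> f(n);
    f[0] = a[0];
    f[1] = std::max(a[0], a[1]);
    for (int i = 2; i < n; ++i)
        f[i] = std::max(f[i - 2] + a[i],   // take a[i], so skip a[i-1]
                        f[i - 1]);         // leave a[i] out
    return f[n - 1];
}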
But what do you NOT understand? It seems quite clear to me:
we will build the maximal subsequence for every prefix of our given sequence
to calculate the maximal subsequence for the prefix of length i, we consider two possibilities: either the last element is, or isn't, in the maximal subsequence (clearly there are no other possibilities).
if it is there, we consider the value of the last element, plus the value of maximal subsequence of the prefix two elements shorter (because in this case, we know the last element cannot be present in the maximal subsequence because of the adjacent elements rule)
if it isn't, we take the value of the maximal sum of the prefix one element shorter (if the last element of the prefix is not in the maximal subsequence, the maximal subsequence has to be equal for this and the previous prefix)
we compare and take the maximum of the two
Plus: you need to remember actual subsequences; you need to avoid superfluous function invocations, hence the memoization.
Why does he expand f[] to [],0?
Because the first element of the returned pair is the current maximal subsequence, and the second is its value. The maximal subsequence of an empty sequence is empty and has value zero.