Finding the maximum weight subsequence of an array of positive integers? - c++

I'm trying to find the maximum weight subsequence of an array of positive integers - the catch is that no adjacent members are allowed in the final subsequence.
The exact same question was asked here, and a recursive solution was given by MarkusQ thus:
function Max_route(A)
    if A's length = 1
        A[0]
    else
        maximum of
            A[0] + Max_route(A[2...])
            Max_route(A[1...])
He provides an explanation, but can anyone help me understand how he has expanded the function? Specifically what does he mean by
f []      :- [],0
f [x]     :- [x],x
f [a,b]   :- if a > b then [a],a else [b],b
f [a,b|t] :-
    ft  = f t
    fbt = f [b|t]
    if a + ft.sum > fbt.sum
        [a|ft.path],a+ft.sum
    else
        fbt
Why does he expand f[] to [],0? Also how does his solution take into consideration non-adjacent members?
I have some C++ code that is based on this algorithm, which I can post if anyone wants to see it, but I just can't for the life of me fathom why it works.
==========For anyone who's interested - the C++ code ==============
I should add that the array of integers is to be treated as a circular list, so any sequence containing the first element cannot contain the last.
#include <algorithm>
#include <cstring>
#include <vector>
using namespace std;

int a[55];                            // input values (global; the snippet needs this)
int memo[55][55];

int solve(int s, int e)
{
    if (s > e) return 0;
    int &ret = memo[s][e];
    if (ret != -1)
    {
        return ret;
    }
    ret = max(solve(s + 1, e), solve(s + 2, e) + a[s]);
    return ret;
}

class Sequence
{
public:
    int maxSequence(vector<int> s)
    {
        memset(memo, -1, sizeof(memo));   // memset needs the size argument
        int n = s.size();
        for (int i = 0; i < n; i++)
            a[i] = s[i];
        return max(solve(0, n - 2), solve(1, n - 1));
    }
};

I don't really understand that pseudocode, so post the C++ code if this isn't helpful and I'll try to improve it.
I'm trying to find the maximum weight subsequence of an array of positive integers - the catch is that no adjacent members are allowed in the final subsequence.
Let a be your array of positive ints. Let f[i] = value of the maximum weight subsequence of the sequence a[0..i].
We have:
f[0] = a[0] because if there's only one element, we have to take it.
f[1] = max(a[0], a[1]) because you have the no adjacent elements restriction, so if you have two elements, you can only take one of them. It makes sense to take the largest one.
Now, generally you have:
f[i > 1] = max(
    f[i - 2] + a[i]   <= add a[i] to the best subsequence of a[0..i - 2]. We cannot extend the best subsequence of a[0..i - 1], because that one might already use a[i - 1], which is adjacent to a[i].
    f[i - 1]          <= don't use a[i] at all; the best we can do is the best subsequence of a[0..i - 1].
)
I think this way is easier to understand than what you have there. The two approaches are equivalent; I just find this one clearer for this particular problem, because the recursive formulation and its pseudocode notation make things harder to follow here.
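To make the recurrence concrete, here is a minimal iterative sketch of it (my own code, not part of the original answer). maxNonAdjacent solves the plain, non-circular problem, and main shows how the circular rule from the question can be reduced to two linear calls, mirroring max(solve(0,n-2), solve(1,n-1)) in the posted C++:

#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;

int maxNonAdjacent(const vector<int>& a)
{
    if (a.empty()) return 0;              // empty sequence has value 0
    int prev2 = 0;                        // f[i-2], best for a[0..i-2]
    int prev1 = 0;                        // f[i-1], best for a[0..i-1]
    for (int x : a)
    {
        int cur = max(prev1, prev2 + x);  // f[i] = max(f[i-1], f[i-2] + a[i])
        prev2 = prev1;
        prev1 = cur;
    }
    return prev1;
}

int main()
{
    vector<int> a = {7, 4, 4, 7};         // hypothetical circular input
    // Circular rule: a selection using a[0] cannot use a[n-1], so solve
    // once without the last element and once without the first.
    vector<int> withoutLast(a.begin(), a.end() - 1);
    vector<int> withoutFirst(a.begin() + 1, a.end());
    cout << max(maxNonAdjacent(withoutLast), maxNonAdjacent(withoutFirst)) << "\n";   // prints 11
}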

But what do you NOT understand? It seems quite clear to me:
we will build the maximal subsequence for every prefix of our given sequence
to calculate the maximal subsequence for the prefix of length i, we consider two possibilities: either the last element of the prefix is in the maximal subsequence, or it isn't (clearly there are no other possibilities).
if it is there, we take the value of the last element plus the value of the maximal subsequence of the prefix two elements shorter (in this case the immediately preceding element cannot be in the subsequence, because of the no-adjacent-elements rule)
if it isn't, we take the value of the maximal subsequence of the prefix one element shorter (if the last element of the prefix is not used, the maximal subsequence is the same as for the previous prefix)
we compare and take the maximum of the two
Plus: you need to remember actual subsequences; you need to avoid superfluous function invocations, hence the memoization.
Why does he expand f[] to [],0?
Because the first element of the returned pair is the current maximal subsequence, and the second is its value. The maximal subsequence of an empty sequence is empty and has value zero.

Related

Algorithm: Longest Approximate interval

I am trying to solve this question:
When Xellos was doing a practice course in university, he once had to
measure the intensity of an effect that slowly approached equilibrium.
A good way to determine the equilibrium intensity would be choosing a
sufficiently large number of consecutive data points that seems as
constant as possible and taking their average. Of course, with the
usual sizes of data, it's nothing challenging — but why not make a
similar programming contest problem while we're at it?
You're given a sequence of n data points a1, ..., an. There aren't any
big jumps between consecutive data points — for each 1 ≤ i < n, it's
guaranteed that |a_(i+1) - a_i| ≤ 1.
A range [l, r] of data points is said to be almost constant if the
difference between the largest and the smallest value in that range is
at most 1. Formally, let M be the maximum and m the minimum value of
ai for l ≤ i ≤ r; the range [l, r] is almost constant if M - m ≤ 1.
Find the length of the longest almost constant range.
Input The first line of the input contains a single integer n
(2 ≤ n ≤ 100 000) — the number of data points.
The second line contains n integers a1, a2, ..., an
(1 ≤ ai ≤ 100 000).
Output Print a single number — the maximum length of an almost
constant range of the given sequence.
https://codeforces.com/problemset/problem/602/B
I see a solution here but I don't understand the algorithm - specifically the body of the loop. I am familiar with C++ syntax and I understand what's happening. I just don't understand why the algorithm works.
#include <cstdio>
#include <algorithm>
using namespace std;

int a[1000005];

int main()
{
    int n, ans = 2, x;
    scanf("%d", &n);
    for (int i = 1; i <= n; i++)
    {
        scanf("%d", &x);
        a[x] = i;
        if (a[x-1] > a[x+1])
            ans = max(ans, i - max(a[x+1], a[x-2]));
        else
            ans = max(ans, i - max(a[x+2], a[x-1]));
    }
    printf("%d\n", ans);
    return 0;
}
Could someone explain it to me?
Array a stores, for each value x, the last position at which x has occurred so far.
Now let's calculate the left bound of the almost-constant range ending at each position i (with value x). If a[x - 1] > a[x + 1] (i.e. x - 1 was seen more recently than x + 1), the range ending at i uses the values x and x - 1, so the positions that break the almost-constant rule are a[x + 1] (x + 1 cannot coexist with x - 1) and a[x - 2] (x - 2 cannot coexist with x); the left bound is just past the later of the two.
Vice versa for the other case, where the range uses the values x and x + 1.
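If it helps to see the same fact ("an almost constant range can only contain two adjacent values") used differently, here is a sliding-window sketch of my own, not the posted solution: it maintains the exact value range [lo, hi] of the current window, which stays a contiguous set of integers because consecutive data points differ by at most 1.

#include <cstdio>
#include <algorithm>
using namespace std;

int a[100005];
int cnt[100005];                      // how many times each value occurs in the window

int main()
{
    int n;
    scanf("%d", &n);
    for (int i = 0; i < n; i++) scanf("%d", &a[i]);

    int ans = 1, l = 0, lo = a[0], hi = a[0];
    for (int r = 0; r < n; r++)
    {
        cnt[a[r]]++;
        lo = min(lo, a[r]);
        hi = max(hi, a[r]);
        while (hi - lo > 1)           // shrink from the left until almost constant
        {
            cnt[a[l]]--;
            if (cnt[lo] == 0) lo++;   // window values are contiguous,
            if (cnt[hi] == 0) hi--;   // so each boundary moves by at most 1
            l++;
        }
        ans = max(ans, r - l + 1);
    }
    printf("%d\n", ans);
    return 0;
}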

Nearest permutation to given array

Question
I have two arrays of integers A[] and B[]. Array B[] is fixed; I need to find the permutation of A[] which is lexicographically smaller than B[] and nearest to B[]. What I mean is:
for i in (0 <= i < n)
abs(B[i]-A[i]) is minimum and A[] should be smaller than B[] lexicographically.
For Example:
A[]={1,3,5,6,7}
B[]={7,3,2,4,6}
So, a possible nearest permutation of A[] to B[] is
A[]={7,3,1,6,5}
My Approach
Try all permutations of A[] and then compare each with B[]. But the time complexity would be O(n! * n).
So is there any way to optimize this?
EDIT
n can be as large as 10^5
First, build an ordered map of the counts of the distinct elements of A.
Then, iterate forward through array indices (0 to n−1), "withdrawing" elements from this map. At each point, there are three possibilities:
If i < n-1, and it's possible to choose A[i] == B[i], do so and continue iterating forward.
Otherwise, if it's possible to choose A[i] < B[i], choose the greatest possible value for A[i] < B[i]. Then proceed by choosing the largest available values for all subsequent array indices. (At this point you no longer need to worry about maintaining A[i] <= B[i], because we're already after an index where A[i] < B[i].) Return the result.
Otherwise, we need to backtrack to the last index where it was possible to choose A[i] < B[i], then use the approach in the previous bullet-point.
Note that, despite the need for backtracking, the very worst case here is three passes: one forward pass using the logic in the first bullet-point, one backward pass in backtracking to find the last index where A[i] < B[i] was possible, and then a final forward pass using the logic in the second bullet-point.
Because of the overhead of maintaining the ordered map, this requires O(n log m) time and O(m) extra space, where n is the total number of elements of A and m is the number of distinct elements. (Since m ≤ n, we can also express this as O(n log n) time and O(n) extra space.)
Note that if there's no solution, then the backtracking step will reach all the way to i == -1. You'll probably want to raise an exception if that happens.
Edited to add (2019-02-01):
In a now-deleted answer, גלעד ברקן summarizes the goal this way:
To be lexicographically smaller, the array must have an initial optional section from left to right where A[i] = B[i] that ends with an element A[j] < B[j]. To be closest to B, we want to maximise the length of that section, and then maximise the remaining part of the array.
So, with that summary in mind, another approach is to do two separate loops, where the first loop determines the length of the initial section, and the second loop actually populates A. This is equivalent to the above approach, but may make for cleaner code. So:
Build an ordered map of the counts of the distinct elements of A.
Initialize initial_section_length := -1.
Iterate through the array indices 0 to n−1, "withdrawing" elements from this map. For each index:
If it's possible to choose an as-yet-unused element of A that's less than the current element of B, set initial_section_length equal to the current array index. (Otherwise, don't.)
If it's not possible to choose an as-yet-unused element of A that's equal to the current element of B, break out of this loop. (Otherwise, continue looping.)
If initial_section_length == -1, then there's no solution; raise an exception.
Repeat step #1: re-build the ordered map.
Iterate through the array indices from 0 to initial_section_length-1, "withdrawing" elements from the map. For each index, choose an as-yet-unused element of A that's equal to the current element of B. (The existence of such an element is ensured by the first loop.)
For array index initial_section_length, choose the greatest as-yet-unused element of A that's less than the current element of B (and "withdraw" it from the map). (The existence of such an element is ensured by the first loop.)
Iterate through the array indices from initial_section_length+1 to n−1, continuing to "withdraw" elements from the map. For each index, choose the greatest element of A that hasn't been used yet.
This approach has the same time and space complexities as the backtracking-based approach.
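Here is a compact sketch of the two-loop version in C++ (my own illustration, not ruakh's code); a std::multiset plays the role of the "ordered map of counts" of the elements of A:

#include <iostream>
#include <iterator>
#include <set>
#include <stdexcept>
#include <vector>
using namespace std;

vector<int> nearest_smaller_permutation(const vector<int>& A, const vector<int>& B)
{
    int n = A.size();
    multiset<int> avail(A.begin(), A.end());

    // First loop: find the longest prefix that can match B exactly while
    // still leaving an element < B[i] available at the next position.
    int initial_section_length = -1;
    for (int i = 0; i < n; i++)
    {
        if (avail.lower_bound(B[i]) != avail.begin())   // some unused element < B[i] exists
            initial_section_length = i;
        auto eq = avail.find(B[i]);                     // can we keep matching B exactly?
        if (eq == avail.end()) break;
        avail.erase(eq);
    }
    if (initial_section_length == -1)
        throw runtime_error("no permutation of A is lexicographically smaller than B");

    // Second pass: rebuild the multiset and construct the answer.
    avail = multiset<int>(A.begin(), A.end());
    vector<int> result(n);
    for (int i = 0; i < initial_section_length; i++)
    {
        result[i] = B[i];
        avail.erase(avail.find(B[i]));
    }
    {
        int i = initial_section_length;
        auto it = avail.lower_bound(B[i]);   // first element >= B[i]
        --it;                                // greatest element < B[i]; exists by the first loop
        result[i] = *it;
        avail.erase(it);
    }
    for (int i = initial_section_length + 1; i < n; i++)
    {
        auto it = prev(avail.end());         // greatest remaining element
        result[i] = *it;
        avail.erase(it);
    }
    return result;
}

int main()
{
    vector<int> A = {1, 3, 5, 6, 7};
    vector<int> B = {7, 3, 2, 4, 6};
    for (int x : nearest_smaller_permutation(A, B)) cout << x << " ";   // 7 3 1 6 5
    cout << "\n";
}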
There are n! permutations of A[n] (fewer if there are repeating elements).
Use binary search over the range 0..n!-1 to determine the k-th lexicographic permutation of A[] (examples of how to compute the k-th permutation can be found elsewhere) which is the closest one below B[].
Perhaps in C++ you can exploit std::lower_bound
Based on the discussion in the comment section to your question, you seek an array made up entirely of elements of the vector A that is -- in lexicographic ordering -- closest to the vector B.
For this scenario, the algorithm becomes quite straightforward. The idea is the same as the one already mentioned in the answer of @ruakh (although his answer refers to an earlier and more complicated version of your question -- which is still displayed in the OP -- and is therefore more complicated):
Sort A
Loop over B and, for each B[i], select the largest element of A that is smaller than or equal to B[i] (i.e. the one closest to B[i] from below). Remove that element from the list.
If no element in A is smaller-or-equal than B[i], pick the largest element.
Here is the basic implementation:
#include <string>
#include <vector>
#include <algorithm>
#include <functional>   // for std::greater<>
#include <iostream>     // used by main below

auto get_closest_array(std::vector<int> A, std::vector<int> const& B)
{
    std::sort(std::begin(A), std::end(A), std::greater<>{});
    auto select_closest_and_remove = [&](int i)
    {
        // A is sorted in decreasing order, so this finds the largest element <= i.
        auto it = std::find_if(std::begin(A), std::end(A), [&](auto x) { return x <= i; });
        if (it == std::end(A))
        {
            it = std::max_element(std::begin(A), std::end(A));
        }
        auto ret = *it;
        A.erase(it);
        return ret;
    };
    std::vector<int> ret(B.size());
    for (int i = 0; i < (int)B.size(); ++i)
    {
        ret[i] = select_closest_and_remove(B[i]);
    }
    return ret;
}
Applied to the problem in the OP one gets:
int main()
{
    std::vector<int> A = {1,3,5,6,7};
    std::vector<int> B = {7,3,2,4,6};
    auto C = get_closest_array(A, B);
    for (auto i : C)
    {
        std::cout << i << " ";
    }
    std::cout << std::endl;
}
and it displays
7 3 1 6 5
which seems to be the desired result.

Given a string, find two identical subsequences with consecutive indexes C++

I need to construct an algorithm (not necessarily efficient) that, given a string, finds and prints two identical subsequences (by print I mean color them, for example). What's more, the union of the sets of indexes of these two subsequences has to be a set of consecutive natural numbers (a full segment of integers).
In mathematics, the thing what I am looking for is called "tight twins", if it helps anything. (E.g., see the paper (PDF) here.)
Let me give a few examples:
1) consider string 231213231
It has two subsequences I am looking for in the form of "123". To see it better look at this image:
The first subsequence is marked with underlines and the second with overlines. As you can see they have all the properties I need.
2) consider string 12341234
3) consider string 12132344.
Now it gets more complicated:
4) consider string: 13412342
It is also not that easy:
I think that these examples explain well enough what I meant.
I've been thinking a long time about an algorithm that could do that but without success.
For coloring, I wanted to use this piece of code:
#include <windows.h>   // GetStdHandle / SetConsoleTextAttribute
using namespace std;
HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE);
SetConsoleTextAttribute(hConsole, k);
where k is the color.
Any help, even hints, would be highly appreciated.
Here's a simple recursion that tests for tight twins. When there's a duplicate, it splits the decision tree in case the duplicate is still part of the first twin. You'd have to run it on each substring of even length. Other optimizations for longer substrings could include hashing tests for char counts, as well as matching the non-duplicate portions of the candidate twins (characters that only appear twice in the whole substring).
Explanation of the function:
First, a hash is created with each character as key and the indexes it appears in as values. Then we traverse the hash: if a character count is odd, the function returns false; and indexes of characters with a count greater than 2 are added to a list of duplicates - characters half of which belong in one twin but we don't know which.
The basic rule of the recursion is to only increase i when a match for it is found later in the string, while maintaining a record of chosen matches (js) that i must skip without looking for a match. It works because if we find n/2 matches, in order, by the time j reaches the end, that's basically just another way of saying the string is composed of tight twins.
JavaScript code:
function isTightTwins(s){
  var n = s.length,
      char_idxs = {};

  for (var i=0; i<n; i++){
    if (char_idxs[s[i]] == undefined){
      char_idxs[s[i]] = [i];
    } else {
      char_idxs[s[i]].push(i);
    }
  }

  var duplicates = new Set();
  for (var i in char_idxs){
    // character with odd count
    if (char_idxs[i].length & 1){
      return false;
    }
    if (char_idxs[i].length > 2){
      for (let j of char_idxs[i]){
        duplicates.add(j);
      }
    }
  }

  function f(i,j,js){
    // base case positive
    if (js.size == n/2 && j == n){
      return true;
    }
    // base case negative
    if (j > n || (n - j < n/2 - js.size)){
      return false;
    }
    // i is not less than j
    if (i >= j) {
      return f(i,j + 1,js);
    }
    // this i is in the list of js
    if (js.has(i)){
      return f(i + 1,j,js);
    // yet to find twin, no match
    } else if (s[i] != s[j]){
      return f(i,j + 1,js);
    } else {
      // maybe it's a twin and maybe it's a duplicate
      if (duplicates.has(j)) {
        var _js = new Set(js);
        _js.add(j);
        return f(i,j + 1,js) | f(i + 1,j + 1,_js);
      // it's a twin
      } else {
        js.add(j);
        return f(i + 1,j + 1,js);
      }
    }
  }

  return f(0,1,new Set());
}
console.log(isTightTwins("1213213515")); // true
console.log(isTightTwins("11222332")); // false
WARNING: Commenter גלעד ברקן points out that this algorithm gives the wrong answer of 6 (higher than should be possible!) for the string 1213213515. My implementation gets the same wrong answer, so there seems to be a serious problem with this algorithm. I'll try to figure out what the problem is, but in the meantime DO NOT TRUST THIS ALGORITHM!
I've thought of a solution that will take O(n^3) time and O(n^2) space, which should be usable on strings of up to length 1000 or so. It's based on a tweak to the usual notion of longest common subsequences (LCS). For simplicity I'll describe how to find a minimal-length substring with the "tight twin" property that starts at position 1 in the input string, which I assume has length 2n; just run this algorithm 2n times, each time starting at the next position in the input string.
"Self-avoiding" common subsequences
If the length-2n input string S has the "tight twin" (TT) property, then it has a common subsequence with itself (or equivalently, two copies of S have a common subsequence) that:
is of length n, and
obeys the additional constraint that no character position in the first copy of S is ever matched with the same character position in the second copy.
In fact we can safely tighten the latter constraint to no character position in the first copy of S is ever matched to an equal or lower character position in the second copy, due to the fact that we will be looking for TT substrings in increasing order of length, and (as the bottom section shows) in any minimal-length TT substring, it's always possible to assign characters to the two subsequences A and B so that for any matched pair (i, j) of positions in the substring with i < j, the character at position i is assigned to A. Let's call such a common subsequence a self-avoiding common subsequence (SACS).
The key thing that makes efficient computation possible is that no SACS of a length-2n string can have more than n characters (since clearly you can't cram more than 2 sets of n characters into a length-2n string), so if such a length-n SACS exists then it must be of maximum possible length. So to determine whether S is TT or not, it suffices to look for a maximum-length SACS between S and itself, and check whether this in fact has length n.
Computation by dynamic programming
Let's define f(i, j) to be the length of the longest self-avoiding common subsequence of the length-i prefix of S with the length-j prefix of S. To actually compute f(i, j), we can use a small modification of the usual LCS dynamic programming formula:
f(0, _) = 0
f(_, 0) = 0
f(i>0, j>0) = max(f(i-1, j), f(i, j-1), m(i, j))
m(i, j) = (if S[i] == S[j] && i < j then 1 else 0) + f(i-1, j-1)
As you can see, the only difference is the additional condition && i < j. As with the usual LCS DP, computing it takes O(n^2) time, since the 2 arguments each range between 0 and n, and the computation required outside of recursive steps is O(1). (Actually we need only compute the "upper triangle" of this DP matrix, since every cell (i, j) below the diagonal will be dominated by the corresponding cell (j, i) above it -- though that doesn't alter the asymptotic complexity.)
To determine whether the length-2j prefix of the string is TT, we need the maximum value of f(i, 2j) over all 0 <= i <= 2n -- that is, the largest value in column 2j of the DP matrix. This maximum can be computed in O(1) time per DP cell by recording the maximum value seen so far and updating as necessary as each DP cell in the column is calculated. Proceeding in increasing order of j from j=1 to j=2n lets us fill out the DP matrix one column at a time, always treating shorter prefixes of S before longer ones, so that when processing column 2j we can safely assume that no shorter prefix is TT (since if there had been, we would have found it earlier and already terminated).
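For readers who want to experiment with the recurrence, here is a direct C++ transcription of it (my own sketch). Keep the warning at the top of this answer in mind: it reproduces the DP exactly as described, so it inherits whatever flaw the commenter found; treat it as a way to explore the idea, not as a trusted solution. It checks a whole even-length string by testing whether the maximum SACS length equals half the string length.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

bool hasTightTwinsSACS(const string& S)
{
    int len = S.size();
    if (len % 2 != 0) return false;
    // f[i][j] = longest self-avoiding common subsequence of the length-i
    // prefix with the length-j prefix (prefix lengths, i.e. 1-based).
    vector<vector<int>> f(len + 1, vector<int>(len + 1, 0));
    int best = 0;
    for (int i = 1; i <= len; i++)
        for (int j = 1; j <= len; j++)
        {
            int m = (S[i - 1] == S[j - 1] && i < j ? 1 : 0) + f[i - 1][j - 1];
            f[i][j] = max({f[i - 1][j], f[i][j - 1], m});
            best = max(best, f[i][j]);
        }
    return best == len / 2;
}

int main()
{
    cout << boolalpha << hasTightTwinsSACS("12341234") << "\n";   // true: "1234" + "1234"
}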
Let the string length be N.
There are two approaches.
Approach 1. This approach is always exponential-time.
For each possible subsequence of length 1..N/2, list all occurrences of this subsequence. For each occurrence, list the positions of all characters.
For example, for 123123 it should be:
(1, ((1), (4)))
(2, ((2), (5)))
(3, ((3), (6)))
(12, ((1,2), (4,5)))
(13, ((1,3), (4,6)))
(23, ((2,3), (5,6)))
(123, ((1,2,3),(4,5,6)))
(231, ((2,3,4)))
(312, ((3,4,5)))
The latter two are not necessary, as they appear only once.
One way to do it is to start with subsequences of length 1 (i.e. characters), then proceed to subsequences of length 2, etc. At each step, drop all subsequences which appear only once, as you don't need them.
Another way to do it is to check all 2**N binary strings of length N. Whenever a binary string has not more than N/2 "1" digits, add it to the table. At the end drop all subsequences which appear only once.
Now you have a list of subsequences which appear more than 1 time. For each subsequence, check all the pairs, and check whether such a pair forms a tight twin.
Approach 2. Look for tight twins more directly. For each of the N*(N-1)/2 substrings, check whether the substring has even length and each character appears in it an even number of times; then, its length being L, check whether it splits into two tight twins of length L/2. There are 2**L ways to divide it; the simplest thing you can do is to check all of them, as sketched below. There are more interesting ways to look for tight twins, though.
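As a concrete illustration of the 2**L idea (my own sketch, only practical for short strings), the following brute force tries every substring and every way of assigning its positions to the first twin:

#include <iostream>
#include <string>
using namespace std;

// Does s[l..r] split into two identical subsequences covering all of its positions?
// If so, maskOut marks which positions belong to the first twin.
bool splitsIntoTwins(const string& s, int l, int r, unsigned& maskOut)
{
    int L = r - l + 1;
    if (L % 2 != 0 || L > 20) return false;        // keep the brute force tractable
    for (unsigned mask = 0; mask < (1u << L); ++mask)
    {
        if (__builtin_popcount(mask) != L / 2) continue;
        string a, b;
        for (int i = 0; i < L; ++i)
            (mask >> i & 1 ? a : b) += s[l + i];
        if (a == b) { maskOut = mask; return true; }
    }
    return false;
}

int main()
{
    string s = "231213231";                        // first example from the question
    for (int l = 0; l < (int)s.size(); ++l)
        for (int r = l + 1; r < (int)s.size(); ++r)
        {
            unsigned mask;
            if (splitsIntoTwins(s, l, r, mask))
            {
                cout << s.substr(l, r - l + 1) << "\n";
                for (int i = 0; i <= r - l; ++i)   // mark the first twin with '^'
                    cout << (mask >> i & 1 ? '^' : '-');
                cout << "\n";
                return 0;
            }
        }
    cout << "no tight twins found\n";
}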
I would like to approach this as a dynamic programming/pattern matching problem. We deal with characters one at a time, left to right, and we maintain a herd of Non-Deterministic Finite Automata / NDFA, which correspond to partial matches. We start off with a single null match, and with each character we extend each NDFA in every possible way, with each NDFA possibly giving rise to many children, and then de-duplicate the result - so we need to minimise the state held in the NDFA to put a bound on the size of the herd.
I think a NDFA needs to remember the following:
1) That it skipped a stretch of k characters before the match region.
2) A suffix which is a p-character string, representing characters not yet matched which will need to be matched by overlines.
I think that you can always assume that the p-character string needs to be matched with overlines because you can always swap overlines and underlines in an answer if you swap throughout the answer.
When you see a new character you can extend NDFAs in the following ways:
a) An NDFA with nothing except skips can add a skip.
b) An NDFA can always add the new character to its suffix, which may be null
c) An NDFA with a p character string whose first character matches the new character can turn into an NDFA with a p-1 character string which consists of the last p-1 characters of the old suffix. If the string is now of zero length then you have found a match, and you can work out what it was if you keep links back from each NDFA to its parent.
I thought I could use a neater encoding which would guarantee only a polynomial herd size, but I couldn't make that work, and I can't prove polynomial behaviour here, but I notice that some cases of degenerate behaviour are handled reasonably, because they lead to multiple ways to get to the same suffix.

Suffix Array Algorithm

After quite a bit of reading, I have figured out what a suffix array and LCP array represents.
Suffix array: Represents the lexicographic rank of each suffix of an array.
LCP array : Contains the maximum length prefix match between two consecutive suffixes, after they are sorted lexicographically.
I have been trying hard for a couple of days to understand how exactly the suffix array and LCP algorithms work.
Here is the code, which is taken from Codeforces:
/*
Suffix array O(n lg^2 n)
LCP table O(n)
*/
#include <cstdio>
#include <algorithm>
#include <cstring>

using namespace std;

#define REP(i, n) for (int i = 0; i < (int)(n); ++i)

namespace SuffixArray
{
    const int MAXN = 1 << 21;
    char * S;
    int N, gap;
    int sa[MAXN], pos[MAXN], tmp[MAXN], lcp[MAXN];

    bool sufCmp(int i, int j)
    {
        if (pos[i] != pos[j])
            return pos[i] < pos[j];
        i += gap;
        j += gap;
        return (i < N && j < N) ? pos[i] < pos[j] : i > j;
    }

    void buildSA()
    {
        N = strlen(S);
        REP(i, N) sa[i] = i, pos[i] = S[i];
        for (gap = 1;; gap *= 2)
        {
            sort(sa, sa + N, sufCmp);
            REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);
            REP(i, N) pos[sa[i]] = tmp[i];
            if (tmp[N - 1] == N - 1) break;
        }
    }

    void buildLCP()
    {
        for (int i = 0, k = 0; i < N; ++i) if (pos[i] != N - 1)
        {
            for (int j = sa[pos[i] + 1]; S[i + k] == S[j + k];)
                ++k;
            lcp[pos[i]] = k;
            if (k) --k;
        }
    }
} // end namespace SuffixArray
I cannot, just cannot get through how this algorithm works. I tried working on an example using pencil and paper, and wrote through the steps involved, but lost link in between as its too complicated, for me at least.
Any help regarding explanation, using an example maybe, is highly appreciated.
Overview
This is an O(n log n) algorithm for suffix array construction (or rather, it would be, if instead of ::sort a 2-pass bucket sort had been used).
It works by first sorting the 2-grams(*), then the 4-grams, then the 8-grams, and so forth, of the original string S, so in the i-th iteration, we sort the 2^i-grams. There can obviously be no more than log2(n) such iterations, and the trick is that sorting the 2^i-grams in the i-th step is facilitated by making sure that each comparison of two 2^i-grams is done in O(1) time (rather than O(2^i) time).
How does it do this? Well, in the first iteration it sorts the 2-grams (aka bigrams), and then performs what is called lexicographic renaming. This means it creates a new array (of length n) that stores, for each bigram, its rank in the bigram sorting.
Example for lexicographic renaming: Say we have a sorted list of some bigrams {'ab','ab','ca','cd','cd','ea'}. We then assign ranks (i.e. lexicographic names) by going from left to right, starting with rank 0 and incrementing the rank whenever the bigram changes. So the ranks we assign are as follows:
ab : 0
ab : 0 [no change to previous]
ca : 1 [increment because different from previous]
cd : 2 [increment because different from previous]
cd : 2 [no change to previous]
ea : 3 [increment because different from previous]
These ranks are known as lexicographic names.
Now, in the next iteration, we sort 4-grams. This involves a lot of comparisons between different 4-grams. How do we compare two 4-grams? Well, we could compare them character by character. That would be up to 4 operations per comparison. But instead, we compare them by looking up the ranks of the two bigrams contained in them, using the rank table generated in the previous step. That rank represents the lexicographic rank from the previous 2-gram sort, so if for any given 4-gram its first 2-gram has a higher rank than the first 2-gram of another 4-gram, then it must be lexicographically greater in its first two characters. Hence, if for two 4-grams the rank of the first 2-gram is identical, they must be identical in the first two characters, and the comparison is decided by the ranks of their second 2-grams. In other words, two look-ups in the rank table are sufficient to compare all 4 characters of the two 4-grams.
After sorting, we create new lexicographic names again, this time for the 4-grams.
In the third iteration, we need to sort by 8-grams. Again, two look-ups in the lexicographic rank table from the previous step are sufficient to compare all 8 characters of two given 8-grams.
And so forth. Each iteration i has two steps:
Sorting by 2^i-grams, using the lexicographic names from the previous iteration to enable comparisons in 2 steps (i.e. O(1) time) each
Creating new lexicographic names
We repeat this until all 2^i-grams are different. If that happens, we are done. How do we know if all are different? Well, the lexicographic names are an increasing sequence of integers, starting with 0. So if the highest lexicographic name generated in an iteration is the same as n-1, then each 2^i-gram must have been given its own, distinct lexicographic name.
Implementation
Now let's look at the code to confirm all of this. The variables used are as follows: sa[] is the suffix array we are building. pos[] is the rank lookup-table (i.e. it contains the lexicographic names), specifically, pos[k] contains the lexicographic name of the k-th m-gram of the previous step. tmp[] is an auxiliary array used to help create pos[].
I'll give further explanations between the code lines:
void buildSA()
{
    N = strlen(S);

    /* This is a loop that initializes sa[] and pos[].
       For sa[] we assume the order the suffixes have
       in the given string. For pos[] we set the lexicographic
       rank of each 1-gram using the characters themselves.
       That makes sense, right? */
    REP(i, N) sa[i] = i, pos[i] = S[i];

    /* Gap is the length of the m-gram in each step, divided by 2.
       We start with 2-grams, so gap is 1 initially. It then increases
       to 2, 4, 8 and so on. */
    for (gap = 1;; gap *= 2)
    {
        /* We sort by (gap*2)-grams: */
        sort(sa, sa + N, sufCmp);

        /* We compute the lexicographic rank of each m-gram
           that we have sorted above. Notice how the rank is computed
           by comparing each m-gram at position i with its
           neighbor at i+1. If they are identical, the comparison
           yields 0, so the rank does not increase. Otherwise the
           comparison yields 1, so the rank increases by 1. */
        REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);

        /* tmp contains the rank by position. Now we map this
           into pos, so that in the next step we can look it
           up per m-gram, rather than by position. */
        REP(i, N) pos[sa[i]] = tmp[i];

        /* If the largest lexicographic name generated is
           n-1, we are finished, because this means all
           m-grams must have been different. */
        if (tmp[N - 1] == N - 1) break;
    }
}
About the comparison function
The function sufCmp is used to compare two (2*gap)-grams lexicographically. So in the first iteration it compares bigrams, in the second iteration 4-grams, then 8-grams and so on. This is controlled by gap, which is a global variable.
A naive implementation of sufCmp would be this:
bool sufCmp(int i, int j)
{
    int pos_i = sa[i];
    int pos_j = sa[j];
    int end_i = pos_i + 2*gap;
    int end_j = pos_j + 2*gap;
    if (end_i > N)
        end_i = N;
    if (end_j > N)
        end_j = N;
    while (pos_i < end_i && pos_j < end_j)
    {
        if (S[pos_i] != S[pos_j])
            return S[pos_i] < S[pos_j];
        pos_i += 1;
        pos_j += 1;
    }
    return (pos_i < N && pos_j < N) ? S[pos_i] < S[pos_j] : pos_i > pos_j;
}
This would compare the (2*gap)-gram at the beginning of the i-th suffix pos_i:=sa[i] with the one found at the beginning of the j-th suffix pos_j:=sa[j]. And it would compare them character by character, i.e. comparing S[pos_i] with S[pos_j], then S[pos_i+1] with S[pos_j+1] and so on. It continues as long as the characters are identical. Once they differ, it returns 1 if the character in the i-th suffix is smaller than the one in the j-th suffix, 0 otherwise. (Note that return a<b in a function returning int means you return 1 if the condition is true, and 0 if it is false.)
The complicated looking condition in the return-statement deals with the case that one of the (2*gap)-grams is located at the end of the string. In this case either pos_i or pos_j will reach N before all (2*gap) characters have been compared, even if all characters up to that point are identical. It will then return 1 if the i-th suffix is at the end, and 0 if the j-th suffix is at the end. This is correct because if all characters are identical, the shorter one is lexicographically smaller. If pos_i has reached the end, the i-th suffix must be shorter than the j-th suffix.
Clearly, this naive implementation is O(gap), i.e. its complexity is linear in the length of the (2*gap)-grams. The function used in your code, however, uses the lexicographic names to bring this down to O(1) (specifically, down to a maximum of two comparisons):
bool sufCmp(int i, int j)
{
    if (pos[i] != pos[j])
        return pos[i] < pos[j];
    i += gap;
    j += gap;
    return (i < N && j < N) ? pos[i] < pos[j] : i > j;
}
As you can see, instead of looking up individual characters S[i] and S[j], we check the lexicographic rank of the i-th and j-th suffix. Lexicographic ranks were computed in the previous iteration for gap-grams. So, if pos[i] < pos[j], then the i-th suffix sa[i] must start with a gap-gram that is lexicographically smaller than the gap-gram at the beginning of sa[j]. In other words, simply by looking up pos[i] and pos[j] and comparing them, we have compared the first gap characters of the two suffixes.
If the ranks are identical, we continue by comparing pos[i+gap] with pos[j+gap]. This is the same as comparing the next gap characters of the (2*gap)-grams, i.e. the second half. If the ranks are identical again, the two (2*gap)-grams are identical, so we return 0. Otherwise we return 1 if the i-th suffix is smaller than the j-th suffix, 0 otherwise.
Example
The following example illustrates how the algorithm operates, and demonstrates in particular the role of the lexicographic names in the sorting algorithm.
The string we want to sort is abcxabcd. It takes three iterations to generate the suffix array for this. In each iteration, I'll show S (the string), sa (the current state of the suffix array) and tmp and pos, which represent the lexicographic names.
First, we initialize:
S abcxabcd
sa 01234567
pos abcxabcd
Note how the lexicographic names, which initially represent the lexicographic rank of unigrams, are simply identical to the characters (i.e. the unigrams) themselves.
First iteration:
Sorting sa, using bigrams as sorting criterion:
sa 04156273
The first two suffixes are 0 and 4 because those are the positions of bigram 'ab'. Then 1 and 5 (positions of bigram 'bc'), then 6 (bigram 'cd'), then 2 (bigram 'cx'), then 7 (incomplete bigram 'd'), then 3 (bigram 'xa'). Clearly, the positions correspond to the order, based solely on character bigrams.
Generating the lexicographic names:
tmp 00112345
As described, lexicographic names are assigned as increasing integers. The first two suffixes (both starting with bigram 'ab') get 0, the next two (both starting with bigram 'bc') get 1, then 2, 3, 4, 5 (each a different bigram).
Finally, we map this according to the positions in sa, to get pos:
sa 04156273
tmp 00112345
pos 01350124
(The way pos is generated is this: Go through sa from left to right, and use the entry to define the index in pos. Use the corresponding entry in tmp to define the value for that index. So pos[0]:=0, pos[4]:=0, pos[1]:=1, pos[5]:=1, pos[6]:=2, and so on. The index comes from sa, the value from tmp.)
Second iteration:
We sort sa again, and again we look at bigrams from pos (which each represents a sequence of two bigrams of the original string).
sa 04516273
Notice how the positions of 1 and 5 have switched compared to the previous version of sa. It used to be 15, now it is 51. This is because the bigram at pos[1] and the bigram at pos[5] used to be identical (both bc) during the previous iteration, but now the bigram at pos[5] is 12, while the bigram at pos[1] is 13. So position 5 comes before position 1. This is due to the fact that the lexicographic names now each represent bigrams of the original string: pos[5] represents bc and pos[6] represents cd. So, together they represent bcd, while pos[1] represents bc and pos[2] represents cx, so together they represent bcx, which is indeed lexicographically greater than bcd.
Again, we generate lexicographic names by screening the current version of sa from left to right and comparing the corresponding bigrams in pos:
tmp 00123456
The first two entries are still identical (both 0), because the corresponding bigrams in pos are both 01. The rest is a strictly increasing sequence of integers, because all other bigrams in pos are unique.
We perform the mapping to the new pos as before (taking indices from sa and values from tmp):
sa 04516273
tmp 00123456
pos 02460135
Third iteration:
We sort sa again, taking bigrams of pos (as always), which now each represent a sequence of 4 bigrams of the original string.
sa 40516273
You'll notice that now the first two entries have switched positions: 04 has become 40. This is because the bigram at pos[0] is 02 while the one at pos[4] is 01, the latter obviously being lexicographically smaller. The deep reason is that these two represent abcx and abcd, respectively.
Generating lexicographic names yields:
tmp 01234567
They are all different, i.e. the highest one is 7, which is n-1. So we are done, because our sorting is now based on m-grams that are all different. Even if we continued, the sorting order would not change.
Improvement suggestion
The algorithm used to sort the 2^i-grams in each iteration appears to be the built-in sort (or std::sort). This means it's a comparison sort, which takes O(n log n) time in the worst case, in each iteration. Since there are log n iterations in the worst case, this makes it an O(n (log n)^2)-time algorithm. However, the sorting could be performed using two passes of bucket sort, since the keys we use for the sort comparison (i.e. the lexicographic names of the previous step) form an increasing integer sequence. So this could be improved to an actual O(n log n)-time algorithm for suffix sorting.
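To make the suggestion concrete, here is a self-contained sketch (my own code, not part of the original answer) of the same prefix-doubling scheme with the per-iteration sort replaced by two stable counting-sort passes over the pair (pos[i], pos[i+gap]):

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>
using namespace std;

// Stable counting sort of the indices in 'in' by key[index]; a key of -1
// means "the suffix runs past the end" and sorts before everything else.
static void countingSortByKey(const vector<int>& in, vector<int>& out,
                              const vector<int>& key, int maxKey)
{
    vector<int> cnt(maxKey + 3, 0);
    for (int idx : in) cnt[key[idx] + 2]++;
    for (size_t k = 1; k < cnt.size(); ++k) cnt[k] += cnt[k - 1];
    for (int idx : in) out[cnt[key[idx] + 1]++] = idx;
}

vector<int> buildSuffixArrayRadix(const string& S)
{
    int N = S.size();
    vector<int> sa(N), pos(N), tmp(N), second(N), order(N);
    for (int i = 0; i < N; ++i) { sa[i] = i; pos[i] = (unsigned char)S[i]; }
    if (N <= 1) return sa;

    int maxKey = max(256, N);
    for (int gap = 1;; gap *= 2)
    {
        // Pass 1: sort by the rank of the second half of each (2*gap)-gram.
        for (int i = 0; i < N; ++i)
            second[i] = (i + gap < N) ? pos[i + gap] : -1;
        countingSortByKey(sa, order, second, maxKey);
        // Pass 2: stable sort by the rank of the first half; stability keeps
        // the order established by pass 1 among equal first halves.
        countingSortByKey(order, sa, pos, maxKey);

        // Re-rank, just like the tmp/pos loops in the original code.
        tmp[0] = 0;
        for (int i = 1; i < N; ++i)
        {
            int a = sa[i - 1], b = sa[i];
            tmp[i] = tmp[i - 1] + (pos[a] != pos[b] || second[a] != second[b] ? 1 : 0);
        }
        for (int i = 0; i < N; ++i) pos[sa[i]] = tmp[i];
        if (tmp[N - 1] == N - 1) break;
    }
    return sa;
}

int main()
{
    for (int i : buildSuffixArrayRadix("abcxabcd")) printf("%d ", i);   // 4 0 5 1 6 2 7 3
    printf("\n");
}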
Remark
I believe this is the original algorithm for suffix array construction that was suggested in the 1992 paper by Manber and Myers (link on Google Scholar; it should be the first hit, and it may have a link to a PDF there). This (published at the same time as, but independently of, a paper by Gonnet and Baeza-Yates) was what introduced suffix arrays (also known as pat arrays at the time) as a data structure interesting for further study.
Modern algorithms for suffix array construction are O(n), so the above is no longer the best algorithm available (at least not in terms of theoretical, worst-case complexity).
Footnotes
(*) By 2-gram I mean a sequence of two consecutive characters of the original string. For example, when S=abcde is the string, then ab, bc, cd, de are the 2-grams of S. Similarly, abcd and bcde are the 4-grams. Generally, an m-gram (for a positive integer m) is a sequence of m consecutive characters. 1-grams are also called unigrams, 2-grams are called bigrams, 3-grams are called trigrams. Some people continue with tetragrams, pentagrams and so on.
Note that the suffix of S that starts at position i is an (n-i)-gram of S. Also, every m-gram (for any m) is a prefix of one of the suffixes of S. Therefore, sorting m-grams (for an m as large as possible) can be the first step towards sorting suffixes.

Given an array of N numbers, find the number of sequences of all lengths having the range of R?

This is a follow-up question to Given a sequence of N numbers, extract number of sequences of length K having range less than R?
I basically need, as an answer, a vector V of size N such that V[i] denotes the number of sequences of length i which have range <= R.
Traditionally, in recursive solutions, you would compute the solution for K = 0, K = 1, and then find some kind of recurrence relation between subsequent elements to avoid recomputing the solution from scratch each time.
However here I believe that maybe attacking the problem from the other side would be interesting, because of the property of the spread:
Given a sequence of spread R (or less), any subsequence has a spread inferior to R as well
Therefore, I would first establish a list of the longest subsequences of spread R beginning at each index. Let's call this list M, and have M[i] = j where j is the highest index in S (the original sequence) for which S[j] - S[i] <= R. This is going to be O(N).
Now, for any i, the number of sequences of length K starting at i is either 0 or 1, and this depends on whether K is greater than M[i] - i or not. A simple linear pass over M (from 0 to N-K) gives us the answer. This is once again O(N).
So, if we call V the resulting vector, with V[k] denoting the number of subsequences of length K in S with spread inferior to R, then we can do it in a single iteration over M:
for i in [0, len(M)]:
    for k in [0, M[i] - i]:
        ++V[k]
The algorithm is simple, however the number of updates can be rather daunting. In the worst case, supposing that M[i] - i equals N - i, it is O(N*N) complexity. You would need a better data structure (probably an adaptation of a Fenwick Tree) to use this algorithm and lower the cost of computing those numbers.
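One cheap way around the O(N*N) update cost, for what it's worth, is a plain difference array rather than a Fenwick tree: each "add 1 to V[0..L]" becomes two O(1) endpoint updates, and a single prefix-sum pass at the end recovers V. A small sketch (the L values here are hypothetical, standing in for M[i] - i):

#include <cstdio>
#include <vector>
using namespace std;

int main()
{
    // Hypothetical values of L[i] = M[i] - i, assumed already computed in O(N).
    vector<int> L = {3, 2, 2, 1, 0};
    int N = L.size();

    vector<long long> diff(N + 2, 0);
    for (int i = 0; i < N; ++i)
    {
        diff[0] += 1;           // ++V[k] for every k in [0, L[i]] ...
        diff[L[i] + 1] -= 1;    // ... recorded as two endpoint updates
    }

    vector<long long> V(N + 1, 0);
    long long running = 0;
    for (int k = 0; k <= N; ++k)
    {
        running += diff[k];     // prefix sum turns the endpoints back into counts
        V[k] = running;
    }
    for (int k = 0; k <= N; ++k) printf("V[%d] = %lld\n", k, V[k]);
    return 0;
}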
If you are looking for contiguous sequences, try doing it recursively : The K-length subsequences set having a range inferior than R are included in the (K-1)-length subsequences set.
At K=0, you have N solutions.
Each time you increase K, you append (resp. prepend) the next (resp. previous) element, check if the range is inferior to R, and either store it in a set (look out for duplicates!) or discard it, depending on the result.
I think the complexity of this algorithm is O(n*n) in the worst-case scenario, though it may be better on average.
I think Matthieu has the right answer when looking for all sequences with spread R.
As you are only looking for sequences of length K, you can do a little better.
Instead of looking at the maximum sequence starting at i, just look at the sequence of length K starting at i, and see if it has range R or not. Do this for every i, and you have all sequences of length K with spread R.
You don't need to go through the whole list, as the latest start point for a sequence of length K is n-K+1. So the complexity is something like (n-K+1)*K = n*K - K*K + K. For K=1 this is n, and for K=n it is n. For K=n/2 it is n*n/2 - n*n/4 + n/2 = n*n/4 + n/2, which I think is the maximum. So while this is still O(n*n), for most values of K you get a little better.
Start with a simpler problem: for each starting index, count the maximal length of the sequence beginning there whose range does not exceed R.
To do this, let the first pointer point to the first element of the array. Increase the second pointer (also starting from the first element of the array) while the sequence between the pointers has a range less than or equal to R. Push every array element passed by the second pointer onto a min-max-queue, made of a pair of min-max-stacks, as described in this answer. When the difference between the max and min values reported by the min-max-queue exceeds R, stop increasing the second pointer, increment V[ptr2-ptr1], increment the first pointer (removing the element pointed to by it from the min-max-queue), and continue increasing the second pointer (keeping the range under control).
When the second pointer leaves the bounds of the array, increment V[N-ptr1] for all remaining ptr1 (the corresponding ranges may be less than or equal to R). To add all other ranges that are less than R, compute the cumulative sum of the array V[], starting from its end.
Both time and space complexities are O(N).
Pseudo-code:
p1 = p2 = 0;
do {
    do {
        min_max_queue.push(a[p2]);
        ++p2;
    } while (p2 < N && min_max_queue.range() <= R);
    if (p2 < N) {
        ++v[p2 - p1 - 1];
        min_max_queue.pop();
        ++p1;
    }
} while (p2 < N);

for (i = 1; i <= N-p1; ++i) {
    ++v[i];
}

sum = 0;
for (j = N; j > 0; --j) {
    value = v[j];
    v[j] += sum;
    sum += value;
}