I have an array of sets of permutations, and I want to remove isomorphic sets.
We have S sets of permutations, where each set contains K permutations, and each permutation is represented as an array of N elements. I'm currently storing it as an array int pset[S][K][N], where S, K and N are fixed, and N is larger than K.
Two sets of permutations, A and B, are isomorphic if there exists a permutation P that converts the elements of A to the elements of B (that is, if a is an element of set A, then P(a) is an element of set B). In this case we say that P makes A and B isomorphic.
My current algorithm is:
We choose all pairs s1 = pset[i] and s2 = pset[j], such that i < j.
The elements of the chosen sets (s1 and s2) are numbered from 1 to K, so each element can be written as s1[i] or s2[i], where 1 <= i <= K.
For every permutation T of K elements, we do the following:
Find the permutation R such that R(s1[1]) = s2[1].
Check whether R is a permutation that makes s1 and T(s2) isomorphic, where T(s2) is a rearrangement of the elements (permutations) of the set s2; basically, we just check whether R(s1[i]) = s2[T[i]] for all 1 <= i <= K.
If not, we go on to the next permutation T.
This algorithm is really slow: O(S^2) for the first step, O(K!) to loop through every permutation T, O(N^2) to find R, and O(K*N) to check whether R is the permutation that makes s1 and s2 isomorphic, so the total is O(S^2 * K! * N^2).
Question: Can we make it faster?
You can sort and compare:
// 1 - sort each set of permutations
for i = 0 to S-1
    sort(pset[i])
// 2 - sort the array of sets itself
sort(pset)
// 3 - compare adjacent sets
for i = 1 to S-1 {
    if (areEqual(pset[i], pset[i-1]))
        // pset[i] and pset[i-1] are isomorphic
}
A concrete example:
0: [[1,2,3],[3,2,1]]
1: [[2,3,1],[1,3,2]]
2: [[1,2,3],[2,3,1]]
3: [[3,2,1],[1,2,3]]
After 1:
0: [[1,2,3],[3,2,1]]
1: [[1,3,2],[2,3,1]] // order changed
2: [[1,2,3],[2,3,1]]
3: [[1,2,3],[3,2,1]] // order changed
After 2:
2: [[1,2,3],[2,3,1]]
0: [[1,2,3],[3,2,1]]
3: [[1,2,3],[3,2,1]]
1: [[1,3,2],[2,3,1]]
After 3:
(2, 0) not isomorphic
(0, 3) isomorphic
(3, 1) not isomorphic
What about the complexity?
Step 1 is O(S * (K * N) * log(K * N))
Step 2 is O(S * K * N * log(S * K * N))
Step 3 is O(S * K * N)
So the overall complexity is O(S * K * N * log(S * K * N))
There is a very simple solution for this: transposition.
If two sets are isomorphic, it means a one-to-one mapping exists, where the set of all the numbers at index i in set S1 equals the set of all the numbers at some index k in set S2. My conjecture is that no two non-isomorphic sets have this property.
(1) Jean Logeart's example:
0: [[1,2,3],[3,2,1]]
1: [[2,3,1],[1,3,2]]
2: [[1,2,3],[2,3,1]]
3: [[3,2,1],[1,2,3]]
Perform ONE pass:
Transpose, O(n):
0: [[1,3],[2,2],[3,1]]
Sort both in and between groups, O(something log something):
0: [[1,3],[1,3],[2,2]]
Hash:
"131322" -> 0
...
"121233" -> 1
"121323" -> 2
"131322" -> already hashed.
0 and 3 are isomorphic.
(2) vsoftco's counter-example in his comment to Jean Logeart's answer:
A = [ [0, 1, 2], [2, 0, 1] ]
B = [ [1, 0, 2], [0, 2, 1] ]
"010212" -> A
"010212" -> already hashed.
A and B are isomorphic.
You can turn each set into a transposed-and-sorted string, hash, or other compressed object for linear-time comparison. Note that this algorithm considers all three sets A, B and C isomorphic even if one p converts A to B and another p converts A to C. Clearly, in this case, there are ps to convert any one of these three sets to the others, since all we are doing is moving each i in one set to a specific k in the other. If, as you stated, your goal is to "remove isomorphic permutations," you will still get a list of sets to remove.
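A sketch of that pass in C++ (hypothetical helper name canonicalKey; it serializes each transposed-and-sorted set into a string used as the hash key):

#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>
using namespace std;

// Canonical key of one set: transpose the K x N set into N columns of
// K numbers, sort inside each column, then sort the columns themselves.
string canonicalKey(const vector<vector<int> >& perms) {
    size_t K = perms.size(), N = perms[0].size();
    vector<vector<int> > cols(N, vector<int>(K));
    for (size_t j = 0; j < N; ++j)              // transpose, O(K*N)
        for (size_t i = 0; i < K; ++i)
            cols[j][i] = perms[i][j];
    for (vector<int>& c : cols)                 // sort in groups...
        sort(c.begin(), c.end());
    sort(cols.begin(), cols.end());             // ...and between groups
    string key;                                 // serialize ("hash")
    for (const vector<int>& c : cols)
        for (int v : c) { key += to_string(v); key += ','; }
    return key;
}

// Usage: keep a set of seen keys; a repeated key marks an isomorphic set.
//   unordered_set<string> seen;
//   if (!seen.insert(canonicalKey(pset[i])).second) { /* remove pset[i] */ }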
Explanation:
Assume that along with our sorted hash, we kept a record of which permutation each i came from. vsoftco's counter-example:
010212 // hash for A and B
100110 // origin permutation, set A
100110 // origin permutation, set B
In order to confirm isomorphism, we need to show that the i's grouped in each index from the first set moved to some index in the second set, which index does not matter. Sorting the groups of i's does not invalidate the solution, rather it serves to confirm movement/permutation between sets.
Now by definition, each number in a hash and each number in each group in the hash is represented in an origin permutation exactly one time for each set. However we choose to arrange the numbers in each group of i's in the hash, we are guaranteed that each number in that group is representing a different permutation in the set; and the moment we theoretically assign that number, we are guaranteed it is "reserved" for that permutation and index only. For a given number, say 2, in the two hashes, we are guaranteed that it comes from one index and permutation in set A, and in the second hash corresponds to one index and permutation in set B. That is all we really need to show - that the number in one index for each permutation in one set (a group of distinct i's) went to one index only in the other set (a group of distinct k's). Which permutation and index the number belongs to is irrelevant.
Remember that any set S2, isomorphic to set S1, can be derived from S1 using one permutation function or various combinations of different permutation functions applied to S1's members. What the sorting, or reordering, of our numbers and groups actually represents is the permutation we are choosing to assign as the solution to the isomorphism rather than an actual assignment of which number came from which index and permutation. Here is vsoftco's counter-example again, this time we will add the origin indexes of our hashes:
110022 // origin index set A
001122 // origin index set B
Therefore our permutation, a solution to the isomorphism, is 1 -> 0, 0 -> 1, 2 -> 2.
Or, in order: [1, 0, 2].
(Notice that in Jean Logeart's example there is more than one solution to the isomorphism.)
Suppose that two elements s1, s2 in S are isomorphic. Then if p1 and p2 are permutations, s1 is isomorphic to s2 iff p1(s1) is isomorphic to p2(s2), where pi(si) is the set of permutations obtained by applying pi to every element in si.
For each i in 1...s and j in 1...k, choose the j-th member of si and find the permutation that changes it to the identity. Apply that permutation to all the elements of si. Hash each of the k resulting permutations to a number, obtaining k numbers for each choice of i and j, at cost nk.
Comparing the hashed sets for two different values of i and j costs k^2 < nk. Thus you can find the set of candidate matches at cost s^2 * k^3 * n. If the actual number of matches is low, the overall complexity is far beneath what you specified in your question.
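A sketch of this normalization in C++ (hypothetical names, and one reading of "changes it to unity": compose every member with the inverse of the chosen member, then sort instead of hashing, for simplicity, to get an order-independent representative):

#include <algorithm>
#include <vector>
using namespace std;
typedef vector<int> Perm;   // 0-based permutation of n elements

// Normalize set s against its j-th member: s[j] itself becomes the identity.
vector<Perm> normalize(const vector<Perm>& s, size_t j) {
    size_t n = s[j].size();
    Perm inv(n);
    for (size_t x = 0; x < n; ++x) inv[s[j][x]] = x;     // inverse of s[j]
    vector<Perm> out;
    for (const Perm& p : s) {
        Perm q(n);
        for (size_t x = 0; x < n; ++x) q[x] = p[inv[x]]; // q = p composed with inv
        out.push_back(q);
    }
    sort(out.begin(), out.end());  // representative independent of element order
    return out;
}
// Two sets are candidate matches iff normalize(s1, 0) == normalize(s2, j)
// for some j; hashing each normalized set avoids repeated O(k*n) compares.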
Take a0 in A. Then find its inverse (fast, O(N)) and call it a0inv. Then choose some i in B, define P_i = b_i * a0inv, and check that P_i * a generates B when varying a over A. Do this for every i in B. If you don't find any i for which the relation holds, then the sets are not isomorphic. If you find such an i, then the sets are isomorphic. The runtime is O(K^2) for each pair of sets it checks, and you need to check O(S^2) pairs of sets, so you end up with O(S^2 * K^2 * N).
PS: I assumed here that by "maps A to B" you mean mapping under permutation composition, so P(a) is actually the permutation P composed with the permutation a, and I've used the fact that if P is such a permutation, then there must exist an i for which P*a = b_i for some a.
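A sketch of this check in C++, under the composition convention from the PS (hypothetical names; compose(p, q) maps x to p[q[x]]):

#include <algorithm>
#include <set>
#include <vector>
using namespace std;
typedef vector<int> Perm;   // 0-based permutation of N elements

Perm inverse(const Perm& p) {
    Perm inv(p.size());
    for (size_t x = 0; x < p.size(); ++x) inv[p[x]] = x;
    return inv;
}

Perm compose(const Perm& p, const Perm& q) {    // (p*q)(x) = p[q[x]]
    Perm r(p.size());
    for (size_t x = 0; x < p.size(); ++x) r[x] = p[q[x]];
    return r;
}

bool isomorphic(const vector<Perm>& A, const vector<Perm>& B) {
    Perm a0inv = inverse(A[0]);
    set<Perm> bset(B.begin(), B.end());
    for (const Perm& bi : B) {              // candidate P_i = b_i * a0inv
        Perm P = compose(bi, a0inv);
        bool ok = true;
        for (const Perm& a : A)             // does P * a stay inside B?
            if (!bset.count(compose(P, a))) { ok = false; break; }
        if (ok) return true;                // P_i maps A onto B (equal sizes)
    }
    return false;
}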
EDIT: I decided to undelete my answer, as I am not convinced that the previous one (@Jean Logeart), based on searching, is correct. If it is, I'll gladly delete mine, as it performs worse, but I think I have a counterexample; see the comments below Jean's answer.
To check if two sets S₁ and S₂ are isomorphic you can do a much shorter search.
If they are isomorphic then there is a permutation t that maps each element of S₁ to an element of S₂; to find t you can just pick any fixed element p of S₁ and consider the permutations
t₁ = (1/p) q₁
t₂ = (1/p) q₂
t₃ = (1/p) q₃
...
for all elements q of S₂. For, if a valid t exists then it must map the element p to an element of S₂, so only permutations mapping p to an element of S₂ are possible candidates.
Moreover, given a candidate t, to check whether the two sets of permutations S₁t and S₂ are equal, you could use a hash computed as the XOR of a hash code for each element, doing the full check of all the permutations only if the hashes match.
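A sketch of that filter in C++ (hypothetical names; any per-permutation hash works, and XOR makes the set hash independent of element order):

#include <cstdint>
#include <vector>
using namespace std;
typedef vector<int> Perm;

uint64_t permHash(const Perm& p) {          // simple FNV-style element mix
    uint64_t h = 1469598103934665603ULL;
    for (int v : p) { h ^= (uint64_t)v; h *= 1099511628211ULL; }
    return h;
}

uint64_t setHash(const vector<Perm>& s) {   // XOR: order-independent
    uint64_t h = 0;
    for (const Perm& p : s) h ^= permHash(p);
    return h;
}

// Candidates: t = p^{-1} * q for one fixed p in S1 and every q in S2.
// Only when setHash({a*t : a in S1}) == setHash(S2) run the exact check.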
Related
You are given a permutation p1,p2,...,pn of numbers from 1 to n.
A permutation is a sequence of integers from 1 to n of length n containing each number exactly once.
You are given q queries, where each query consists of two integers a and b. In response to each query you need to return the number of inversions of the permutation after swapping the elements at indices a and b. Every query is independent, i.e. after each query the permutation is restored to its initial state.
An inversion in a permutation p is a pair of indices (i, j) such that i > j and pi < pj. For example, a permutation [4, 1, 3, 2] contains 4 inversions: (2, 1), (3, 1), (4, 1), (4, 3).
Input: The first line contains n,q.
The second line contains the space-separated permutation p1,p2,...,pn.
Each line of the next q lines contains two integers a,b.
Output: For each query, print an integer denoting the number of inversions on a new line.
Sample input:
5 5
1 2 3 4 5
1 2
1 3
2 5
2 4
3 3
Output:
1
3
5
3
0
Constraints:
2<=n<=1000
1<=q<=200000
My approach: I am counting the number of inversions using a BIT (https://www.geeksforgeeks.org/count-inversions-array-set-3-using-bit/) for each query after swapping the elements at positions a and b, and then swapping them back so that my array remains unchanged. But this solution gives TLE for large test cases. Is there a better approach for this problem?
You are getting TLE probably because the number of computations in this approach is q * (n * log(n)) = 2 * 10^5 * 10^3 * log(1000) ≈ 10^9, which is more than the generally accepted ~10^8.
I can think of the following solution. Please note that I have not coded / verified it:
Denote by ri the number of indices j such that i > j && pi < pj. E.g. for [2, 3, 1, 4], r3 = 2. Basically, it is the number of inversions whose farther index is i. (Please note that I am using 1-based indexing as per the question. Also, a < b as per the question.)
Thus we have: Sum of ri == #invs (number of inversions)
We can calculate initial total #invs in O(n^2)
When a and b are swapped, we can observe that:
a) ri remains constant, where i < a .
b) ri remains constant, where i > b.
Only ri changes where a <= i <= b, and only under the following conditions. I am considering the case pa < pb; the exact opposite case needs to be considered when pa > pb.
a) Since pa < pb, this swap causes #invs = #invs + 1.
b) If (pi < pa && pi < pb) || (pi > pa && pi > pb), the swap does not change ri. E.g. in [2,....10,....5], swapping 2 and 5 does not change the r value for 10.
c) If pa < pi < pb, the swap increments ri by 1, and the new rb by 1. E.g. in [2,....3,.....4], when 2 and 4 are swapped, we get [4,....3,....2]: the r value of 3 increases by 1 (because of 4), and the r value of 2 also increases by 1 (because of 3). Please note that the increment caused by 4 > 2 was already counted in step (a), and needs to be counted once only.
d) We need to find all indices i with pa < pi < pb, as started above. Let us call that count f(a, b). Then the total change is delta = (2 * f(a, b)) + 1, and the answer is #original_invs + delta.
As I mentioned, the exact opposite steps need to be done for the case pa > pb; the delta will be negative in that case.
Now, the only thing that remains is to solve: given a and b, find f(a, b) efficiently. For this, we can pre-process and store it for all pairs of indices. This will take O(N^2) space and O(N^2 * log(N)) time, using a balanced binary search tree (BST). Again, I am showing the pre-processing steps for the case pa < pb only; another set of pre-processing steps needs to be done for the other case:
We will use self-balancing BST, in which each node also contains the following fields:
a) field_1: This denotes the size of the left sub-tree. This value will be updated on every insert operation, if size of left-sub-tree changes.
b) field_2: This denotes the number of elements < node.value that this tree has. This value is initialized once when the node is inserted and does not change thereafter. I have added a small explanation of how it will be achieved in Addendum-A. This field is basically our pre-processing, which will determine f(a, b).
With all of this now, for each index i, where 0 <= i < n, do the following: create a new tree, and insert the values pj into the tree one by one, where (i < j < n) && (pi < pj). (Please note we are not inserting values where pi > pj.) The method given in Addendum A will make sure we find f(i, j) while inserting.
There will be n such pre-processed trees, one for every index. For finding f(a, b): We need to look into ath tree, and search node.value = pb. This node's field_2 = f(a, b).
The complexity of insertion is O(log N), so the total pre-processing computation is O(N * N * log N). Search is O(log N), so the query complexity is O(q * log N). Total complexity = O(N^2) + O(N * N * log N) + O(q * log N), which will turn out to be ~10^7.
==============================================================================
Addendum A: How to populate field_2 while inserting a node:
i) Insert the node and balance the tree, updating field_1 as required.
ii) Initialize ans = 0. Traverse the BST from the root, searching for your node.
iii) do {
    if (node.value < search_key_b) ans += node.left_subtree_size + 1
} while (!node.found)
iv) ans -= 1
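As a simpler alternative to the BST preprocessing above (not the author's method): since n <= 1000, an O(n^2) prefix-count table also gives f(a, b) in O(1). A sketch:

#include <algorithm>
#include <vector>
using namespace std;

static int cnt[1001][1001];   // cnt[i][v] = how many of p[1..i] are <= v

void build(const vector<int>& p, int n) {        // p is 1-based: p[1..n]
    for (int i = 1; i <= n; ++i)
        for (int v = 1; v <= n; ++v)
            cnt[i][v] = cnt[i - 1][v] + (p[i] <= v ? 1 : 0);
}

// f(a, b) = #(a < i < b with min(pa,pb) < p[i] < max(pa,pb)); assumes a < b
int f(const vector<int>& p, int a, int b) {
    int lo = min(p[a], p[b]), hi = max(p[a], p[b]);
    return (cnt[b - 1][hi - 1] - cnt[a][hi - 1])   // values < hi inside (a, b)
         - (cnt[b - 1][lo] - cnt[a][lo]);          // minus values <= lo
}

This costs O(n^2) time and memory for the table (about 4 MB of ints here) and O(1) per query, so the total is O(n^2 + q).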
We can solve this in O(n log n) space and O(n log n + Q * log^2(n)) time with a merge-sort tree. The merge-sort tree allows us to find the number of elements inside a subarray that are greater than or lower than an input number in O(log^2(n)) time and O(n log n) space.
First we record the total number of inversions in O(n log n) time, for which there are known methods. To query the effect of a swap bounded by left and right, consider the subarray between them:
subtract the number of elements greater than right in the subarray (those will no longer be inversions)
subtract the number of elements smaller than left in the subarray (those will no longer be inversions)
add the number of elements greater than left in the subarray (those will be new inversions)
add the number of elements smaller than right in the subarray (those will be new inversions)
if right > left, add 1
if left > right, subtract 1
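A sketch of the merge-sort tree described above (a standard construction; the names are mine): each segment-tree node stores a sorted copy of its segment, so counting elements < x in any subarray takes O(log^2 n), and "greater than x" follows by symmetry.

#include <algorithm>
#include <iterator>
#include <vector>
using namespace std;

struct MergeSortTree {
    int n;
    vector<vector<int> > t;     // t[node] = sorted copy of its segment
    MergeSortTree(const vector<int>& a) : n((int)a.size()), t(4 * a.size() + 4) {
        if (n) build(1, 0, n - 1, a);
    }
    void build(int node, int lo, int hi, const vector<int>& a) {
        if (lo == hi) { t[node].push_back(a[lo]); return; }
        int mid = (lo + hi) / 2;
        build(2 * node, lo, mid, a);
        build(2 * node + 1, mid + 1, hi, a);
        merge(t[2 * node].begin(), t[2 * node].end(),    // O(n log n) total
              t[2 * node + 1].begin(), t[2 * node + 1].end(),
              back_inserter(t[node]));
    }
    // number of elements < x in a[l..r], O(log^2 n)
    int countLess(int l, int r, int x) { return query(1, 0, n - 1, l, r, x); }
    int query(int node, int lo, int hi, int l, int r, int x) {
        if (r < lo || hi < l) return 0;
        if (l <= lo && hi <= r)
            return (int)(lower_bound(t[node].begin(), t[node].end(), x) - t[node].begin());
        int mid = (lo + hi) / 2;
        return query(2 * node, lo, mid, l, r, x)
             + query(2 * node + 1, mid + 1, hi, l, r, x);
    }
};

Each of the four adjustments listed above is then one countLess-style query on the subarray strictly between the two swap positions.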
I'm given 2 lists, a and b. Both contain only integers. min(a) > 0, max(a) can be up to 1e10, and max(abs(b)) can be up to 1e5. I need to find the number of tuples (x, y, z), where x is in a and y, z are in b, such that x = -yz. The number of elements in a and b can be up to 1e5.
My attempt:
I was able to come up with a naive n^2 algorithm. But since the size can be up to 1e5, I need to come up with an n log n solution (at most) instead. What I did was:
Split b into bp and bn, where the first one contains all the positive numbers and the second one contains all the negative numbers, and create maps of them.
Then:
2.1 I iterate over a to get x's.
2.2 I iterate over the shorter one of bn and bp, and check if the current element y divides x. If yes, I use map.find() to see whether z = -x/y is present or not.
What could be an efficient way to do this?
There's no known O(n*log n), because z = -x/y <=> log|z| = log|x| - log|y|, which makes this an instance of 3SUM.
As kaya3 (https://stackoverflow.com/users/12299000/kaya3) has mentioned, it is 3SUM (the "3 different arrays" variant). According to Wikipedia:
Kane, Lovett, and Moran showed that the 6-linear decision tree complexity of 3SUM is O(n*log^2n)
Step 1: Sort the elements in list b (call it bsorted).
Step 2: For each value x in a and every value y in bsorted, binary search bsorted for (-x/y) to find z.
Complexity: with |a| = m and |b| = n, the complexity is O(m*n*log n).
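A sketch of these two steps in C++ (long long throughout, since max(a) can be 1e10; equal_range counts duplicate z values in b, and the divisibility test skips the cases where z would not be an integer):

#include <algorithm>
#include <vector>
using namespace std;

long long countTuples(const vector<long long>& a, vector<long long>& b) {
    sort(b.begin(), b.end());                        // Step 1: bsorted
    long long total = 0;
    for (long long x : a) {                          // Step 2
        for (long long y : b) {
            if (y == 0 || (-x) % y != 0) continue;   // z must be an integer
            long long z = (-x) / y;                  // x = -y*z  =>  z = -x/y
            auto r = equal_range(b.begin(), b.end(), z);  // binary search
            total += r.second - r.first;             // count each copy of z
        }
    }
    return total;                                    // O(m * n * log n)
}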
Here's an untested idea. Create a trie from the elements of b, where the "characters" are ordered prime numbers. For each element in a, walk all valid paths in the trie (DFS or BFS, where the test is being able to divide further by the current node), and for each leaf reached, check if the remaining element (after dividing at each node) exists in b. (We may need to handle duplicates by storing counts of each "word" and using simple combinatorics.)
http://www.spoj.com/problems/LSORT/ is a problem on SPOJ.
It states that:
You are given a permutation of n numbers that are between 1 to n and having no duplicates.
The task is to sort that permutation in ascending order. There is another array Q into which we insert elements from the given permutation P.
You have to implement N steps to sort P. In the i-th step, P has N-i+1 remaining elements, Q has i-1 elements and you have to choose some x-th element (from the N-i+1 available elements) of P and put it to the left or to the right of Q. The cost of this step is equal to x * i. The total cost is the sum of costs of individual steps. After N steps, Q must be an ascending sequence. Your task is to minimize the total cost.
Input
The first line of the input file is T (T ≤ 10), the number of test cases. Then descriptions of T test cases follow. The description of each test case consists of two lines. The first line contains a single integer N (1 ≤ N ≤ 1000). The second line contains N distinct integers from the set {1, 2, .., N}, the N-element permutation P.
Output
For each test case your program should write one line, containing a single integer - the minimum total cost of sorting.
Now I have figured out the DP.
My recurrence relation states that, to get the optimal value for the elements with values i to j, I have to insert either i at the front or j at the back.
Cost of inserting i at the front = dp[i+1][j] + cost of adding element i at the front
Cost of inserting j at the back = dp[i][j-1] + cost of adding element j at the back
and I have to take the minimum of these. The answer would be dp[1][n].
for (l = 1; l <= n; l++)          // length of current permutation Q
{
    for (i = 1; i <= n-l+1; i++)  // starting value of permutation Q
    {
        j = i+l-1;                // ending value of permutation Q
        dp[i][j] = min(dp[i+1][j] + l*xi, dp[i][j-1] + l*xj);  // choose whether to insert i at the start or j at the end
    }
}
Here xi = index of element i from the start of permutation P,
and xj = index of element j from the start of permutation P.
The answer would be dp[1][n].
But I am unable to figure out how to compute xi and xj.
Please help
You can try re-thinking your DP state.
For me, I would use dp[startQ][endQ], where dp[startQ][endQ] means the cost I have incurred so far to 'sort' values startQ to endQ in the array Q.
If you know what is in the array Q (the integers startQ to endQ inclusive), you can easily reconstruct the array P by just removing/ignoring all the integers within startQ and endQ.
For each state, dp[startQ][endQ], since one can only add to the front or the back of Q,
dp[startQ][endQ] can only be:
dp[startQ][endQ-1] + cost of adding endQ
dp[startQ-1][endQ] + cost of adding startQ
with the base cases being
dp[i][i] = 0;
These states can be computed, and the answer can be found at dp[1][n] (assuming it is one-indexed).
However, I haven't thought of an efficient way to compute x if it were to be coded in a top-down manner, whereas the whole computation can be performed in O(N^2 log N) using bottom-up DP with a data structure to compute x at every state.
I will leave the final details for you to code out :) but I can help more if required.
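Not the O(N^2 log N) BST route, but one concrete bottom-up way to get x in O(1) (my own variant, so treat it as a sketch): precompute C[v][w] = how many values u <= w appear before value v in P. At state (i, j), inserting i in front costs x_i = pos[i] - (C[i][j] - C[i][i]) and inserting j in back costs x_j = pos[j] - (C[j][j-1] - C[j][i-1]), because the elements already moved to Q must be discounted from the raw index. That gives an O(N^2) solution overall:

#include <algorithm>
#include <vector>
using namespace std;

int pos[1001];              // pos[v] = 1-based index of value v in P
int C[1001][1001];          // C[v][w] = #{u <= w : pos[u] < pos[v]}
long long dp[1001][1001];   // dp[i][j] = min cost to have Q = {i..j}

long long solve(const vector<int>& P, int n) {      // P is 1-based: P[1..n]
    for (int idx = 1; idx <= n; ++idx) pos[P[idx]] = idx;
    for (int v = 1; v <= n; ++v)
        for (int w = 1; w <= n; ++w)
            C[v][w] = C[v][w - 1] + (pos[w] < pos[v] ? 1 : 0);
    for (int l = 1; l <= n; ++l)                    // Q's size after this step
        for (int i = 1; i + l - 1 <= n; ++i) {
            int j = i + l - 1;
            // x = rank of the chosen value among the not-yet-removed elements
            long long xi = pos[i] - (C[i][j] - C[i][i]);         // i in front
            long long xj = pos[j] - (C[j][j - 1] - C[j][i - 1]); // j in back
            if (l == 1) { dp[i][j] = xi; continue; }             // first step
            dp[i][j] = min(dp[i + 1][j] + l * xi, dp[i][j - 1] + l * xj);
        }
    return dp[1][n];
}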
After quite a bit of reading, I have figured out what a suffix array and LCP array represents.
Suffix array: represents the lexicographic rank of each suffix of an array.
LCP array : Contains the maximum length prefix match between two consecutive suffixes, after they are sorted lexicographically.
I have been trying hard for a couple of days to understand exactly how the suffix array and LCP algorithms work.
Here is the code, which is taken from Codeforces:
/*
Suffix array O(n lg^2 n)
LCP table O(n)
*/
#include <cstdio>
#include <algorithm>
#include <cstring>

using namespace std;

#define REP(i, n) for (int i = 0; i < (int)(n); ++i)

namespace SuffixArray
{
    const int MAXN = 1 << 21;
    char * S;
    int N, gap;
    int sa[MAXN], pos[MAXN], tmp[MAXN], lcp[MAXN];

    bool sufCmp(int i, int j)
    {
        if (pos[i] != pos[j])
            return pos[i] < pos[j];
        i += gap;
        j += gap;
        return (i < N && j < N) ? pos[i] < pos[j] : i > j;
    }

    void buildSA()
    {
        N = strlen(S);
        REP(i, N) sa[i] = i, pos[i] = S[i];
        for (gap = 1;; gap *= 2)
        {
            sort(sa, sa + N, sufCmp);
            REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);
            REP(i, N) pos[sa[i]] = tmp[i];
            if (tmp[N - 1] == N - 1) break;
        }
    }

    // Kasai's algorithm: lcp[r] = longest common prefix of sa[r] and sa[r+1]
    void buildLCP()
    {
        for (int i = 0, k = 0; i < N; ++i) if (pos[i] != N - 1)
        {
            for (int j = sa[pos[i] + 1]; S[i + k] == S[j + k];)
                ++k;
            lcp[pos[i]] = k;
            if (k) --k;  // moving from suffix i to i+1 shortens the LCP by at most 1
        }
    }
} // end namespace SuffixArray
I just cannot get through how this algorithm works. I tried working through an example using pencil and paper and wrote down the steps involved, but lost the thread in between, as it's too complicated, for me at least.
Any help with an explanation, perhaps using an example, is highly appreciated.
Overview
This is an O(n log n) algorithm for suffix array construction (or rather, it would be if, instead of ::sort, a 2-pass bucket sort had been used).
It works by first sorting the 2-grams(*), then the 4-grams, then the 8-grams, and so forth, of the original string S, so in the i-th iteration we sort the 2^i-grams. There can obviously be no more than log2(n) such iterations, and the trick is that sorting the 2^i-grams in the i-th step is facilitated by making sure that each comparison of two 2^i-grams is done in O(1) time (rather than O(2^i) time).
How does it do this? Well, in the first iteration it sorts the 2-grams (aka bigrams), and then performs what is called lexicographic renaming. This means it creates a new array (of length n) that stores, for each bigram, its rank in the bigram sorting.
Example for lexicographic renaming: Say we have a sorted list of some bigrams {'ab','ab','ca','cd','cd','ea'}. We then assign ranks (i.e. lexicographic names) by going from left to right, starting with rank 0 and incrementing the rank whenever we encounter a new bigram. So the ranks we assign are as follows:
ab : 0
ab : 0 [no change to previous]
ca : 1 [increment because different from previous]
cd : 2 [increment because different from previous]
cd : 2 [no change to previous]
ea : 3 [increment because different from previous]
These ranks are known as lexicographic names.
Now, in the next iteration, we sort 4-grams. This involves a lot of comparisons between different 4-grams. How do we compare two 4-grams? Well, we could compare them character by character. That would be up to 4 operations per comparison. Instead, we compare them by looking up the ranks of the two bigrams contained in them, using the rank table generated in the previous step. That rank represents the lexicographic rank from the previous 2-gram sort, so if for any given 4-gram its first 2-gram has a higher rank than the first 2-gram of another 4-gram, then it must be lexicographically greater in the first two characters. Hence, if for two 4-grams the rank of the first 2-gram is identical, they must be identical in the first two characters. In other words, two look-ups in the rank table are sufficient to compare all 4 characters of the two 4-grams.
After sorting, we create new lexicographic names again, this time for the 4-grams.
In the third iteration, we need to sort by 8-grams. Again, two look-ups in the lexicographic rank table from the previous step are sufficient to compare all 8 characters of two given 8-grams.
And so forth. Each iteration i has two steps:
Sorting by 2^i-grams, using the lexicographic names from the previous iteration to enable comparisons in 2 steps (i.e. O(1) time) each
Creating new lexicographic names
We repeat this until all 2^i-grams are different. If that happens, we are done. How do we know if all are different? Well, the lexicographic names are an increasing sequence of integers, starting with 0. So if the highest lexicographic name generated in an iteration is n-1, then each 2^i-gram must have been given its own, distinct lexicographic name.
Implementation
Now let's look at the code to confirm all of this. The variables used are as follows: sa[] is the suffix array we are building. pos[] is the rank lookup-table (i.e. it contains the lexicographic names), specifically, pos[k] contains the lexicographic name of the k-th m-gram of the previous step. tmp[] is an auxiliary array used to help create pos[].
I'll give further explanations between the code lines:
void buildSA()
{
    N = strlen(S);

    /* This is a loop that initializes sa[] and pos[].
       For sa[] we assume the order the suffixes have
       in the given string. For pos[] we set the lexicographic
       rank of each 1-gram using the characters themselves.
       That makes sense, right? */
    REP(i, N) sa[i] = i, pos[i] = S[i];

    /* gap is the length of the m-gram in each step, divided by 2.
       We start with 2-grams, so gap is 1 initially. It then increases
       to 2, 4, 8 and so on. */
    for (gap = 1;; gap *= 2)
    {
        /* We sort by (gap*2)-grams: */
        sort(sa, sa + N, sufCmp);

        /* We compute the lexicographic rank of each m-gram
           that we have sorted above. Notice how the rank is computed
           by comparing each m-gram at position i with its
           neighbor at i+1. If they are identical, the comparison
           yields 0, so the rank does not increase. Otherwise the
           comparison yields 1, so the rank increases by 1. */
        REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);

        /* tmp contains the rank by sorted position. Now we map this
           into pos, so that in the next step we can look it
           up per m-gram, rather than by position. */
        REP(i, N) pos[sa[i]] = tmp[i];

        /* If the largest lexicographic name generated is
           n-1, we are finished, because this means all
           m-grams must have been different. */
        if (tmp[N - 1] == N - 1) break;
    }
}
About the comparison function
The function sufCmp is used to compare two (2*gap)-grams lexicographically. So in the first iteration it compares bigrams, in the second iteration 4-grams, then 8-grams and so on. This is controlled by gap, which is a global variable.
A naive implementation of sufCmp would be this:
bool sufCmp(int i, int j)
{
    int pos_i = sa[i];
    int pos_j = sa[j];
    int end_i = pos_i + 2*gap;
    int end_j = pos_j + 2*gap;
    if (end_i > N)
        end_i = N;
    if (end_j > N)
        end_j = N;
    while (pos_i < end_i && pos_j < end_j)
    {
        if (S[pos_i] != S[pos_j])
            return S[pos_i] < S[pos_j];
        pos_i += 1;
        pos_j += 1;
    }
    return (pos_i < N && pos_j < N) ? S[pos_i] < S[pos_j] : pos_i > pos_j;
}
This would compare the (2*gap)-gram at the beginning of the i-th suffix pos_i:=sa[i] with the one found at the beginning of the j-th suffix pos_j:=sa[j]. And it would compare them character by character, i.e. comparing S[pos_i] with S[pos_j], then S[pos_i+1] with S[pos_j+1], and so on. It continues as long as the characters are identical. Once they differ, it returns 1 if the character in the i-th suffix is smaller than the one in the j-th suffix, and 0 otherwise. (Note that return a<b in a function returning bool means you return 1 if the condition is true, and 0 if it is false.)
The complicated-looking condition in the return statement deals with the case where one of the (2*gap)-grams is located at the end of the string. In this case either pos_i or pos_j will reach N before all 2*gap characters have been compared, even if all characters up to that point are identical. It will then return 1 if the i-th suffix is at the end, and 0 if the j-th suffix is at the end. This is correct because if all characters are identical, the shorter one is lexicographically smaller. If pos_i has reached the end, the i-th suffix must be shorter than the j-th suffix.
Clearly, this naive implementation is O(gap), i.e. its complexity is linear in the length of the (2*gap)-grams. The function used in your code, however, uses the lexicographic names to bring this down to O(1) (specifically, down to a maximum of two comparisons):
bool sufCmp(int i, int j)
{
    if (pos[i] != pos[j])
        return pos[i] < pos[j];
    i += gap;
    j += gap;
    return (i < N && j < N) ? pos[i] < pos[j] : i > j;
}
As you can see, instead of looking up individual characters S[i] and S[j], we check the lexicographic rank of the i-th and j-th suffix. Lexicographic ranks were computed in the previous iteration for gap-grams. So, if pos[i] < pos[j], then the i-th suffix sa[i] must start with a gap-gram that is lexicographically smaller than the gap-gram at the beginning of sa[j]. In other words, simply by looking up pos[i] and pos[j] and comparing them, we have compared the first gap characters of the two suffixes.
If the ranks are identical, we continue by comparing pos[i+gap] with pos[j+gap]. This is the same as comparing the next gap characters of the (2*gap)-grams, i.e. the second half. If the ranks are identical again, the two (2*gap)-grams are identical, so we return 0. Otherwise we return 1 if the i-th suffix is smaller than the j-th suffix, and 0 otherwise.
Example
The following example illustrates how the algorithm operates, and demonstrates in particular the role of the lexicographic names in the sorting algorithm.
The string we want to sort is abcxabcd. It takes three iterations to generate the suffix array for this. In each iteration, I'll show S (the string), sa (the current state of the suffix array) and tmp and pos, which represent the lexicographic names.
First, we initialize:
S abcxabcd
sa 01234567
pos abcxabcd
Note how the lexicographic names, which initially represent the lexicographic rank of unigrams, are simply identical to the characters (i.e. the unigrams) themselves.
First iteration:
Sorting sa, using bigrams as sorting criterion:
sa 04156273
The first two suffixes are 0 and 4 because those are the positions of bigram 'ab'. Then 1 and 5 (positions of bigram 'bc'), then 6 (bigram 'cd'), then 2 (bigram 'cx'), then 7 (incomplete bigram 'd'), then 3 (bigram 'xa'). Clearly, the positions correspond to the order, based solely on character bigrams.
Generating the lexicographic names:
tmp 00112345
As described, lexicographic names are assigned as increasing integers. The first two suffixes (both starting with bigram 'ab') get 0, the next two (both starting with bigram 'bc') get 1, then 2, 3, 4, 5 (each a different bigram).
Finally, we map this according to the positions in sa, to get pos:
sa 04156273
tmp 00112345
pos 01350124
(The way pos is generated is this: Go through sa from left to right, and use the entry to define the index in pos. Use the corresponding entry in tmp to define the value for that index. So pos[0]:=0, pos[4]:=0, pos[1]:=1, pos[5]:=1, pos[6]:=2, and so on. The index comes from sa, the value from tmp.)
Second iteration:
We sort sa again, and again we look at bigrams from pos (which each represents a sequence of two bigrams of the original string).
sa 04516273
Notice how the positions of 1 and 5 have switched compared to the previous version of sa. It used to be 15, now it is 51. This is because the bigram at pos[1] and the bigram at pos[5] were identical (both 'bc') during the previous iteration, but now the bigram at pos[5] is 12, while the bigram at pos[1] is 13. So position 5 comes before position 1. This is due to the fact that the lexicographic names now each represent bigrams of the original string: pos[5] represents 'bc' and pos[6] represents 'cd', so together they represent bcd, while pos[1] represents 'bc' and pos[2] represents 'cx', so together they represent bcx, which is indeed lexicographically greater than bcd.
Again, we generate lexicographic names by screening the current version of sa from left to right and comparing the corresponding bigrams in pos:
tmp 00123456
The first two entries are still identical (both 0), because the corresponding bigrams in pos are both 01. The rest is a strictly increasing sequence of integers, because all other bigrams in pos are each unique.
We perform the mapping to the new pos as before (taking indices from sa and values from tmp):
sa 04516273
tmp 00123456
pos 02460135
Third iteration:
We sort sa again, taking bigrams of pos (as always), which now each represents a sequence of 4 bigrams of the original string.
sa 40516273
You'll notice that now the first two entries have switched positions: 04 has become 40. This is because the bigram at pos[0] is 02 while the one at pos[4] is 01, the latter obviously being lexicographically smaller. The deep reason is that these two represent abcx and abcd, respectively.
Generating lexicographic names yields:
tmp 01234567
They are all different, i.e. the highest one is 7, which is n-1. So we are done, because our sorting is now based on m-grams that are all different. Even if we continued, the sorting order would not change.
Improvement suggestion
The algorithm used to sort the 2^i-grams in each iteration appears to be the built-in sort (or std::sort). This means it's a comparison sort, which takes O(n log n) time in the worst case, in each iteration. Since there are log n iterations in the worst case, this makes it an O(n (log n)^2)-time algorithm. However, the sorting could be performed using two passes of bucket sort, since the keys we use for the sort comparison (i.e. the lexicographic names of the previous step) form an increasing integer sequence. So this could be improved to an actual O(n log n)-time algorithm for suffix sorting.
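A sketch of that replacement (not part of the original code, and it additionally requires sa[] to be seeded with a counting sort by character before the first iteration): one radix pass arranges the suffixes by the rank of their second half, then runs a stable counting sort on the rank of their first half. Using the sa/pos/gap/N globals from above:

// Replaces sort(sa, sa + N, sufCmp) inside buildSA's loop.
// Precondition: sa[] is sorted by pos[] from the previous iteration.
void radixPass()
{
    static int cnt[MAXN], sa2[MAXN];
    int m = 0;
    // Order by second key pos[i + gap]: suffixes whose second half is
    // empty (i >= N - gap) are smallest and come first; the rest follow
    // in the order sa[] already gives for their second halves.
    for (int i = max(N - gap, 0); i < N; ++i) sa2[m++] = i;
    for (int k = 0; k < N; ++k) if (sa[k] >= gap) sa2[m++] = sa[k] - gap;
    // Stable counting sort by first key pos[i]; in the very first
    // iteration the "ranks" are raw characters, hence the 256 floor.
    int lim = max(N, 256);
    fill(cnt, cnt + lim, 0);
    for (int i = 0; i < N; ++i) ++cnt[pos[i]];
    for (int v = 1; v < lim; ++v) cnt[v] += cnt[v - 1];
    for (int k = N - 1; k >= 0; --k) sa[--cnt[pos[sa2[k]]]] = sa2[k];
}

Each pass is O(n), so with O(log n) iterations the construction becomes O(n log n) overall.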
Remark
I believe this is the original algorithm for suffix array construction that was suggested in the 1990 paper by Manber and Myers (link on Google Scholar; it should be the first hit, and it may have a link to a PDF there). This (at the same time as, but independently of, a paper by Gonnet and Baeza-Yates) was what introduced suffix arrays (also known as PAT arrays at the time) as a data structure interesting for further study.
Modern algorithms for suffix array construction are O(n), so the above is no longer the best algorithm available (at least not in terms of theoretical, worst-case complexity).
Footnotes
(*) By 2-gram I mean a sequence of two consecutive characters of the original string. For example, when S=abcde is the string, then ab, bc, cd, de are the 2-grams of S. Similarly, abcd and bcde are the 4-grams. Generally, an m-gram (for a positive integer m) is a sequence of m consecutive characters. 1-grams are also called unigrams, 2-grams are called bigrams, 3-grams are called trigrams. Some people continue with tetragrams, pentagrams and so on.
Note that the suffix of S that starts at position i is an (n-i)-gram of S. Also, every m-gram (for any m) is a prefix of one of the suffixes of S. Therefore, sorting m-grams (for an m as large as possible) can be the first step towards sorting suffixes.
Given two sorted lists, each containing n real numbers, is there an O(log n) time algorithm to compute the element of rank i (where i corresponds to the index in increasing order) in the union of the two lists, assuming the elements of the two lists are distinct?
EDIT:
@Ben: This is what I have been doing, but I am still not getting it.
Here is an example:
List A : 1, 3, 5, 7
List B : 2, 4, 6, 8
Find rank(i) = 4.
First step: i/2 = 2;
List A now contains: 1, 3
List B now contains: 2, 4
Compare A[2] to B[2], i.e. 3 to 4:
A[2] is less;
So the lists now become :
A: 3
B: 2,4
Second step:
i/2 = 1
List A now contains: 3
List B now contains: 2
Now I have lost the value 4, which is actually the result...
I know I am missing something, but even after close to a day of thinking I can't figure this one out...
Yes:
You know the element lies within either index [0,i] of the first list or [0,i] of the second list. Take element i/2 from each list and compare. Proceed by bisection.
I'm not including any code because this problem sounds a lot like homework.
EDIT: Bisection is the method behind binary search. It works like this:
Assume i = 10; (zero-based indexing, we're looking for the 11th element overall).
On the first step, you know the answer is either in list1(0...10) or list2(0...10). Take a = list1(5) and b = list2(5).
If a > b, then there are 5 elements in list1 which come before a, and at least 6 elements in list2 which come before a. So a is an upper bound on the result. Likewise, there are 5 elements in list2 which come before b, and fewer than 6 elements in list1 which come before b. So b is a lower bound on the result. Now we know that the result is either in list1(0..5) or list2(5..10). If a < b, then the result is either in list1(5..10) or list2(0..5). And if a == b we have our answer (but the problem said the elements were distinct, therefore a != b).
We just repeat this process, cutting the size of the search space in half at each step. Bisection refers to the fact that we choose the middle element (bisector) out of the range we know includes the result.
So the only difference between this and binary search is that in binary search we compare to a value we're looking for, but here we compare to a value from the other list.
NOTE: this is actually O(log i), which is at least as good as O(log n). Furthermore, for small i (perhaps i < 100), it would actually take fewer operations to merge the first i elements (linear search instead of bisection) because that is so much simpler. When you add in cache behavior and data locality, the linear search may well be faster for i up to several thousand.
Also, if i > n, then rely on the fact that the result has to be toward the end of either list; your initial candidate range in each list is (i-n)..n.
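A sketch of this bisection in C++ (1-based rank k; assumes distinct elements and O(1) random access, as in the question; for the question's 0-based rank i, call it with k = i + 1):

#include <algorithm>
#include <vector>
using namespace std;

// Element of rank k (1-based) in the union of sorted tails a[i..] and b[j..].
int kth(const vector<int>& a, size_t i, const vector<int>& b, size_t j, size_t k) {
    if (i == a.size()) return b[j + k - 1];          // a exhausted
    if (j == b.size()) return a[i + k - 1];          // b exhausted
    if (k == 1) return min(a[i], b[j]);
    size_t pa = min(k / 2, a.size() - i);            // probe ~k/2 into a...
    size_t pb = k - pa;                              // ...and the rest into b
    if (pb > b.size() - j) { pb = b.size() - j; pa = k - pb; }
    if (a[i + pa - 1] < b[j + pb - 1])
        return kth(a, i + pa, b, j, k - pa);  // a's first pa are among the k smallest
    return kth(a, i, b, j + pb, k - pb);      // otherwise discard b's first pb
}

For the worked example above, kth(A, 0, B, 0, 4) with A = {1, 3, 5, 7} and B = {2, 4, 6, 8} returns 4: it discards 1 and 3 after comparing 3 with 4, then discards 2, and finally returns min(5, 4).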
Here is how you do it.
Let the first list be ListX and the second list be ListY. We need to find the right combination of ListX[x] and ListY[y] where x + y = i. Since x, y, i are natural numbers we can immediately constrain our problem domain to x*y. And by using the equations max(x) = len(ListX) and max(y) = len(ListY) we now have a subset of x*y elements in the form [x, y] that we need to search.
What you will do is order those elements like so: [i - max(y), max(y)], [i - max(y) + 1, max(y) - 1], ..., [max(x), i - max(x)]. You will then bisect this list by choosing the middle [x, y] combination. Since the lists are ordered and distinct, you can test ListX[x] < ListY[y]. If true, then we bisect the upper half of our [x, y] combinations; if false, then we bisect the lower half. You will keep bisecting until you find the right combination.
There are a lot of details I left out, but that is the general gist of it. It is indeed O(log(n))!
Edit: As Ben pointed out, this is actually O(log(i)). If we let n = len(ListX) + len(ListY), then we know that i <= n.
When merging two lists, you're going to have to touch every element in both lists. If you don't touch every element, some elements will be left behind. Thus your theoretical lower bound is O(n). So you can't do it that way.
You don't have to sort, since you have two lists that are already sorted, and you can maintain that ordering as part of the merge.
edit: oops, I misread the question. I thought that, given a value, you want to find its rank, not the other way around. If you want to find the rank given a value, then this is how to do it in O(log N):
Yes, you can do this in O(log N), if the list allows O(1) random access (i.e. it's an array and not a linked list).
Binary search on L1
Binary search on L2
Sum the indices
You'd have to work out the math (+1, -1, what to do if the element isn't found, etc.), but that's the idea.
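A sketch of the rank-given-value direction in C++ (distinct elements; lower_bound counts the elements smaller than v in each list, and the sum of the two indices is v's 0-based rank in the union):

#include <algorithm>
#include <vector>

// 0-based rank of value v in the union of two sorted lists.
std::size_t rankInUnion(const std::vector<int>& l1,
                        const std::vector<int>& l2, int v) {
    return (std::lower_bound(l1.begin(), l1.end(), v) - l1.begin())
         + (std::lower_bound(l2.begin(), l2.end(), v) - l2.begin());
}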