Implementation of suffix array in C++

#include <iostream>
#include <string>
#include <algorithm>
using namespace std;

struct xx
{
    string x;      // the suffix itself
    short int d;   // 1-based starting position of the suffix
    int lcp;       // LCP with the previous suffix in sorted order
};

bool compare(const xx a, const xx b)
{
    return a.x < b.x;
}

int findlcp(string a, string b)
{
    int i = 0, j = 0, k = 0;
    while (i < a.length() && j < b.length())
    {
        if (a[i] == b[j])
        {
            k++;
            i++;
            j++;
        }
        else
        {
            break;
        }
    }
    return k;
}

int main()
{
    string a = "banana";
    xx b[100];
    a = a + '$';                  // append the sentinel character
    int len = a.length();
    for (int i = 0; i < len; i++)
    {
        b[i].x = a.substr(i);     // i-th suffix
        b[i].d = i + 1;
    }
    sort(b, b + len, compare);
    for (int i = 0; i < len; i++)
        cout << b[i].x << " " << b[i].d << endl;
    b[0].lcp = 0;                 // no previous suffix
    for (int i = 1; i < len; i++)
    {
        b[i].lcp = findlcp(b[i].x, b[i - 1].x);
    }
    for (int i = 0; i < len; i++)
        cout << b[i].d << " " << b[i].lcp << endl;
}
This is an implementation of a suffix array. My question: in the Wikipedia article, the construction is given as O(n) in the worst case, but in my construction:
I am sorting all the suffixes of the string using STL sort. This is at least O(n log n) in the worst case, so here I am violating the O(n) construction.
Second, the construction of the longest common prefix (LCP) array is given as O(n), but I think my implementation is O(n^2).
So, for the first one, i.e. the sorting: if I use counting sort I may decrease it to O(n). Is that correct? Is my understanding correct? Let me know if my understanding is wrong.
And is there any way to find LCP in O(n) time?

First, regarding your two statements:
1) I am sorting all the suffixes of the string using STL sort. This is at least O(n log n) in the worst case, so here I am violating the O(n) construction.
The complexity of std::sort here is worse than O(n log n). The reason is that O(n log n) assumes that there are O(n log n) individual comparisons, and that each comparison is performed in O(1) time. The latter assumption is wrong, because you are sorting strings, not atomic items (like characters or integers).
Since the length of the string items, being substrings of the main string, is O(n), it would be safe to say that the worst-case complexity of your sorting algorithm is O(n^2 log n).
2) The construction of the longest common prefix array is given as O(n), but I think my implementation is O(n^2).
Yes, your construction of the LCP array is O(n^2), because you are running your lcp function n == len times, and your lcp function requires O(min(len(x),len(y))) time for a pair of strings x, y.
Next, regarding your questions:
If I use counting sort I may decrease it to O(n). Is that correct? Is my understanding correct? Let me know if my understanding is wrong.
Unfortunately, your understanding is incorrect. Counting sort is only linear if you can, in O(1) time, get access to an atomic key for each item you want to sort. Again, the items are strings O(n) characters in length, so this won't work.
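To make the contrast concrete, counting sort over an atomic key such as the first character of each suffix is linear, but it only buckets the suffixes by that one character instead of ordering them fully. A minimal sketch (the helper name is illustrative):

#include <string>
#include <vector>

// Counting sort of suffix start positions by their first character only.
// The key is atomic and O(1) to access, so this pass is O(n + alphabet),
// but it yields only a partial order of the suffixes.
std::vector<int> sort_by_first_char(const std::string& s) {
    const int ALPHA = 256;
    std::vector<int> cnt(ALPHA + 1, 0), order(s.size());
    for (unsigned char c : s) ++cnt[c + 1];
    for (int c = 0; c < ALPHA; ++c) cnt[c + 1] += cnt[c];      // prefix sums -> bucket starts
    for (int i = 0; i < (int)s.size(); ++i)
        order[cnt[(unsigned char)s[i]]++] = i;                  // stable placement by first character
    return order;
}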
And is there any way to find LCP in O(n) time?
Yes. Recent algorithms for suffix array computation, including the DC algorithm (aka Skew algorithm), provide for methods to calculate the LCP array along with the suffix array, and do so in O(n) time.
The reference for the DC algorithm is Juha Kärkkäinen, Peter Sanders: Simple Linear Work Suffix Array Construction, Automata, Languages and Programming, Lecture Notes in Computer Science, Volume 2719, 2003, pp. 943-955 (DOI 10.1007/3-540-45061-0_73). (But this is not the only algorithm that allows you to do this in linear time.)
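Separately from the DC algorithm, a widely used way to obtain the LCP array in O(n) once the suffix array is already known is Kasai's algorithm. A minimal sketch, assuming sa is a valid suffix array of s (identifiers are illustrative):

#include <string>
#include <vector>

// Kasai's algorithm: lcp[r] = LCP of the suffixes at ranks r and r+1, in O(n) total.
std::vector<int> kasai(const std::string& s, const std::vector<int>& sa) {
    int n = s.size();
    std::vector<int> rank(n), lcp(n > 0 ? n - 1 : 0);
    for (int i = 0; i < n; ++i) rank[sa[i]] = i;
    int k = 0;                                  // current match length; drops by at most 1 per step
    for (int i = 0; i < n; ++i) {
        if (rank[i] == n - 1) { k = 0; continue; }
        int j = sa[rank[i] + 1];                // suffix that follows suffix i in sorted order
        while (i + k < n && j + k < n && s[i + k] == s[j + k]) ++k;
        lcp[rank[i]] = k;
        if (k) --k;
    }
    return lcp;
}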
You may also want to take a look at the open-source implementations mentioned in this SO post: What's the current state-of-the-art suffix array construction algorithm?. Many of the algorithms used there enable linear-time LCP array construction in addition to the suffix-array construction (but not all of the implementations there may actually include an implementation of that; I am not sure).
If you are ok with examples in Java, you may also want to look at the code for jSuffixArrays. It includes, among other algorithms, an implementation of the DC algorithm along with LCP array construction in linear time.

jogojapan has comprehensively answered your question. Just to mention an optimized C++ implementation, you might want to take a look here.
Posting the code here in case GitHub goes down.
#include <algorithm>
#include <string>
using namespace std;

const int N = 1000 * 100 + 5; // max string length

namespace Suffix {
    int sa[N], rank[N], tmp[N], lcp[N], gap, S;

    // Compare two suffixes by their first 2*gap characters,
    // using the ranks of their first `gap` characters.
    bool cmp(int x, int y) {
        if (rank[x] != rank[y])
            return rank[x] < rank[y];
        x += gap, y += gap;
        return (x < S && y < S) ? rank[x] < rank[y] : x > y;
    }

    // O(n log^2 n) suffix array construction (prefix doubling + std::sort).
    void Sa_build(const string &s) {
        S = s.size();
        for (int i = 0; i < S; ++i)
            rank[i] = s[i], sa[i] = i;
        for (gap = 1;; gap <<= 1) {
            sort(sa, sa + S, cmp);
            tmp[0] = 0;
            for (int i = 1; i < S; ++i)
                tmp[i] = tmp[i - 1] + cmp(sa[i - 1], sa[i]);
            for (int i = 0; i < S; ++i)
                rank[sa[i]] = tmp[i];
            if (tmp[S - 1] == S - 1)
                break;
        }
    }

    // Kasai-style LCP construction in O(n); lcp[r] is the LCP of the
    // suffixes at ranks r and r+1.
    void Lcp_build(const string &s) {
        for (int i = 0, k = 0; i < S; ++i, --k)
            if (rank[i] != S - 1) {
                k = max(k, 0);
                int j = sa[rank[i] + 1];
                while (i + k < S && j + k < S && s[i + k] == s[j + k])
                    ++k;
                lcp[rank[i]] = k;
            } else
                k = 0;
    }
}
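A minimal driver for the snippet above might look like this (a sketch; it assumes the Lcp_build signature shown above, and "banana" is just an example input):

#include <iostream>
#include <string>

int main() {
    std::string s = "banana";
    Suffix::Sa_build(s);
    Suffix::Lcp_build(s);
    for (int i = 0; i < Suffix::S; ++i)
        std::cout << Suffix::sa[i] << ' ' << Suffix::lcp[i] << '\n';  // suffix start at rank i, LCP with rank i+1
    return 0;
}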

Related

Sort array of n elements which has k sorted sections

What is the best way to sort an section-wise sorted array as depicted in the second image?
The problem is performing a quick-sort using the Message Passing Interface (MPI). The solution performs quick-sort on array sections obtained with MPI_Scatter() and then joins the sorted pieces using MPI_Gather().
Problem is that the array as a whole is unsorted but sections of it are.
Merging the sub-sections similarly to this solution seems like the best way of sorting the array, but considering that the sub-arrays are already within a single array other sorting algorithms may prove better.
The inputs for a sort function would be the array, its length and the number of equally sized, sorted sub-sections.
A signature would look something like int* sort(int* array, int length, int sections);
The sections parameter can have any value between 1 and 25. The length parameter value is greater than 0, a multiple of sections and smaller than 2^32.
This is what I am currently using:
#include <climits>

int* merge(int* input, int length, int sections)
{
    int* sub_sections_indices = new int[sections];
    int* result = new int[length];
    int section_size = length / sections;
    for (int i = 0; i < sections; i++) // initialisation
    {
        sub_sections_indices[i] = 0;
    }
    int min, min_index, current_index;
    for (int i = 0; i < length; i++) // merging
    {
        min_index = 0;
        min = INT_MAX;
        for (int j = 0; j < sections; j++)
        {
            if (sub_sections_indices[j] < section_size)
            {
                current_index = j * section_size + sub_sections_indices[j];
                if (input[current_index] < min)
                {
                    min = input[current_index];
                    min_index = j;
                }
            }
        }
        sub_sections_indices[min_index]++;
        result[i] = min;
    }
    delete[] sub_sections_indices; // avoid leaking the index array
    return result;
}
Optimizing for performance
I think this answer that maintains a min-heap of the smallest item of each sub-array is the best way to handle arbitrary input. However, for small values of k, think somewhere between 10 and 100, it might be faster to implement the more naive solutions given in the question you linked to; while maintaining the min-heap is only O(log k) per step, it might have a higher overhead for small values of k than the simple linear scan from the naive solutions.
All these solutions create a copy of the input, and they maintain O(k) state.
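For reference, a sketch of that min-heap merge using std::priority_queue; the signature mirrors the merge() in the question, but this is an illustrative variant, not a drop-in tested replacement:

#include <functional>
#include <queue>
#include <utility>
#include <vector>

// k-way merge of `sections` equally sized, individually sorted sub-arrays.
// Each heap entry is (value, section index): O(length * log(sections)) time,
// O(sections) extra state plus the output copy.
int* merge_heap(int* input, int length, int sections)
{
    int section_size = length / sections;
    int* result = new int[length];
    std::vector<int> next(sections, 0);                  // next unconsumed offset in each section

    using Entry = std::pair<int, int>;                   // (value, section index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (int j = 0; j < sections; j++)
        heap.push({input[j * section_size], j});

    for (int i = 0; i < length; i++) {
        auto [value, j] = heap.top();
        heap.pop();
        result[i] = value;
        if (++next[j] < section_size)
            heap.push({input[j * section_size + next[j]], j});
    }
    return result;
}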
Optimizing for space
The only way to save space I see is to sort in-place. This will be a problem for the algorithms mentioned above. An in-place algorithm will have to swap elements, but any swap will likely destroy the property that each sub-array is sorted, unless the larger of the swapped pair is re-sorted into the sub-array it is being swapped to, which results in an O(n²) algorithm. So if you really do need to conserve memory, I think a regular in-place sorting algorithm would have to be used, which defeats your purpose.

how to find distinct substrings?

Given a string, and a fixed length l, how can I count the number of distinct substrings whose length is l?
The size of character set is also known. (denote it as s)
For example, given a string "PccjcjcZ", s = 4, l = 3,
then there are 5 distinct substrings:
“Pcc”; “ccj”; “cjc”; “jcj”; “jcZ”
I tried to use a hash table, but it is still slow.
In fact I don't know how to use the character set size.
I have done something like this:
int diffPatterns(const string& src, int len, int setSize) {
    int cnt = 0;
    node* table[1 << 15];
    int tableSize = 1 << 15;
    for (int i = 0; i < tableSize; ++i) {
        table[i] = NULL;
    }
    unsigned int hashValue = 0;
    int end = (int)src.size() - len;
    for (int i = 0; i <= end; ++i) {
        hashValue = hashF(src, i, len);
        if (table[hashValue] == NULL) {
            table[hashValue] = new node(i);
            cnt++;
        } else {
            if (!compList(src, i, table[hashValue], len)) {
                cnt++;
            }
        }
    }
    for (int i = 0; i < tableSize; ++i) {
        deleteList(table[i]);
    }
    return cnt;
}
Hashtables are fine and practical, but keep in mind that if the length of substrings is L, and the whole string length is N, then the algorithm is Theta((N+1-L)*L), which is Theta(NL) for most L. Remember, just computing the hash takes Theta(L) time. Plus there might be collisions.
Suffix trees can be used, and provide a guaranteed O(N) time algorithm (count number of paths at depth L or greater), but the implementation is complicated. Saving grace is you can probably find off the shelf implementations in the language of your choice.
The idea of using a hashtable is good. It should work well.
The idea of implementing your own hashtable as an array of length 2^15 is bad. See Hashtable in C++? instead.
You can use an unordered_set and insert the strings into the set and then get the size of the set. Since the values in a set are unique, it will take care of not including substrings that are the same as ones previously found. This should give you close to O(StringSize - SubstringSize) complexity.
#include <iostream>
#include <string>
#include <unordered_set>

int main()
{
    std::string test = "PccjcjcZ";
    std::unordered_set<std::string> counter;
    size_t substringSize = 3;
    for (size_t i = 0; i < test.size() - substringSize + 1; ++i)
    {
        counter.insert(test.substr(i, substringSize));
    }
    std::cout << counter.size();
    std::cin.get();
    return 0;
}
Veronica Kham answered the question well, but we can improve this method to expected O(n) and still use a simple hash table rather than a suffix tree or any other advanced data structure.
Hash function
Let X and Y be two adjacent substrings of length L, more precisely:
X = A[i, i + L - 1]
Y = A[i + 1, i + L]
Let's assign to each letter of our alphabet a single non-negative integer, for example a := 1, b := 2 and so on.
Let's define a hash function h now:
h(A[i, j]) := (P^(L-1) * A[i] + P^(L-2) * A[i + 1] + ... + A[j]) % M
where P is a prime number ideally greater than the alphabet size and M is a very big number denoting the number of different possible hashes, for example you can set M to maximum available unsigned long long int in your system.
Algorithm
The crucial observation is the following:
If you have a hash computed for X, you can compute a hash for Y in
O(1) time.
Let's assume that we have computed h(X), which obviously can be done in O(L) time. We want to compute h(Y). Notice that X and Y differ only in the first character of X and the last character of Y, so we can do that easily using addition and multiplication:
h(Y) = ((h(X) - P^(L-1) * A[i]) * P + A[j + 1]) % M
Basically, we are subtracting the letter A[i] multiplied by its coefficient in h(X), multiplying the result by P in order to get the proper coefficients for the rest of the letters, and at the end we are adding the last letter A[j + 1].
Notice that we can precompute powers of P at the beginning and we can do it modulo M.
Since our hashing functions returns integers, we can use any hash table to store them. Remember to make all computations modulo M and avoid integer overflow.
Collisions
Of course, a collision might occur, but since P is prime and M is really huge, it is a rare situation.
If you want to lower the probability of a collision, you can use two different hashing functions, for example by using a different modulus in each of them. If the probability of a collision is p using one such function, then for two functions it is p^2, and we can make it arbitrarily small by this trick.
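Putting this together, a sketch of counting distinct length-L substrings with such a rolling hash; P and the use of unsigned 64-bit wraparound as M are illustrative choices, not prescribed above:

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_set>

// Count distinct substrings of length L via a polynomial rolling hash.
// M is effectively 2^64 (unsigned wraparound); collisions are possible but rare.
std::size_t count_distinct(const std::string& s, std::size_t L) {
    if (L == 0 || s.size() < L) return 0;
    const uint64_t P = 257;                   // prime larger than the byte alphabet
    uint64_t pL = 1;                          // P^(L-1): coefficient of the character leaving the window
    for (std::size_t k = 1; k < L; ++k) pL *= P;

    uint64_t h = 0;
    for (std::size_t k = 0; k < L; ++k) h = h * P + (unsigned char)s[k];

    std::unordered_set<uint64_t> seen;
    seen.insert(h);
    for (std::size_t i = 1; i + L <= s.size(); ++i) {
        h = (h - pL * (unsigned char)s[i - 1]) * P + (unsigned char)s[i + L - 1];
        seen.insert(h);
    }
    return seen.size();                       // distinct hashes == distinct substrings, barring collisions
}

int main() {
    std::cout << count_distinct("PccjcjcZ", 3) << "\n";   // 5 for the example in the question
}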
Use Rolling hashes.
This will make the runtime expected O(n).
This might be repeating pkacprzak's answer, except that it gives the technique a name for easier reference.
A suffix automaton can also do it in O(N).
It's easy to code, but hard to understand.
Here are papers about it http://dl.acm.org/citation.cfm?doid=375360.375365
http://www.sciencedirect.com/science/article/pii/S0304397509002370

Implementation of string pattern matching using Suffix Array and LCP(-LR)

During the last weeks I tried to figure out how to efficiently find a string pattern within another string.
I found out that for a long time, the most efficient way would have been using a suffix tree. However, since this data structure is very expensive in space, I studied the use of suffix arrays further (which use far less space). Different papers, such as "Suffix Arrays: A new method for on-line string searches" (Manber & Myers, 1993), state that searching for a substring can be realised in O(P+log(N)) (where P is the length of the pattern and N is the length of the string) by using binary search and suffix arrays along with LCP arrays.
I especially studied the latter paper to understand the search algorithm. This answer did a great job in helping me understand the algorithm (and incidentally made it into the LCP Wikipedia Page).
But I am still looking for a way to implement this algorithm. Especially the construction of the mentioned LCP-LR arrays seems very complicated.
References:
Manber & Myers, 1993: Manber, Udi ; Myers, Gene, SIAM Journal on Computing, 1993, Vol.22(5), pp.935-948, http://epubs.siam.org/doi/pdf/10.1137/0222058
UPDATE 1
Just to emphasize what I am interested in: I understood LCP arrays and I found ways to implement them. However, the "plain" LCP array would not be appropriate for efficient pattern matching (as described in the reference). Thus I am interested in implementing LCP-LR arrays, which seems much more complicated than just implementing an LCP array.
UPDATE 2
Added link to referenced paper
The term that can help you: enhanced suffix array, which is used to describe a suffix array combined with various other arrays (LCP, child) in order to replace a suffix tree.
These can be some of the examples:
https://code.google.com/p/esaxx/ ESAXX
http://bibiserv.techfak.uni-bielefeld.de/mkesa/ MKESA
The esaxx one seems to be doing what you want; plus, it has an example, enumSubstring.cpp, showing how to use it.
If you take a look at the referenced paper, it mentions a useful property (4.2). Since SO does not support math, there is no point in copying it here.
I've done a quick implementation; it uses a segment tree:
// note that arrSize is O(n)
// int arrSize = 2 * 2 ^ (log(N) + 1) + 1; // start from 1
// LCP = new int[N];
// fill the LCP...
// LCP_LR = new int[arrSize];
// memset(LCP_LR, maxValueOfInteger, arrSize);
//
// init: buildLCP_LR(1, 1, N);
// LCP_LR[1] == [1..N]
// LCP_LR[2] == [1..N/2]
// LCP_LR[3] == [N/2+1 .. N]
// rangeI = LCP_LR[i]
// rangeILeft = LCP_LR[2 * i]
// rangeIRight = LCP_LR[2 * i + 1]
// ..etc
void buildLCP_LR(int index, int low, int high)
{
    if (low == high)
    {
        LCP_LR[index] = LCP[low];
        return;
    }
    int mid = (low + high) / 2;
    buildLCP_LR(2 * index, low, mid);
    buildLCP_LR(2 * index + 1, mid + 1, high);
    LCP_LR[index] = min(LCP_LR[2 * index], LCP_LR[2 * index + 1]);
}
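For completeness, a matching range-minimum query over the same tree might look like the following sketch (maxValueOfInteger is the same sentinel used in the comments above):

// Minimum of LCP[qlow..qhigh]; call as queryLCP_LR(1, 1, N, qlow, qhigh).
int queryLCP_LR(int index, int low, int high, int qlow, int qhigh)
{
    if (qhigh < low || high < qlow)
        return maxValueOfInteger;             // node range is disjoint from the query
    if (qlow <= low && high <= qhigh)
        return LCP_LR[index];                 // node range is fully contained in the query
    int mid = (low + high) / 2;
    return min(queryLCP_LR(2 * index, low, mid, qlow, qhigh),
               queryLCP_LR(2 * index + 1, mid + 1, high, qlow, qhigh));
}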
Here is a fairly simple implementation in C++, though the build() procedure builds the suffix array in O(N lg^2 N) time. The lcp_compute() procedure has linear complexity. I have used this code in many programming contests, and it has never let me down :)
#include <stdio.h>
#include <string.h>
#include <algorithm>
using namespace std;
const int MAX = 200005;
char str[MAX];
int N, h, sa[MAX], pos[MAX], tmp[MAX], lcp[MAX];
bool compare(int i, int j) {
    if (pos[i] != pos[j]) return pos[i] < pos[j];            // compare by the first h chars
    i += h, j += h;                                          // if the previous comparison tied, use 2*h chars
    return (i < N && j < N) ? pos[i] < pos[j] : i > j;       // return results
}

void build() {
    N = strlen(str);
    for (int i = 0; i < N; ++i) sa[i] = i, pos[i] = str[i];  // initialize variables
    for (h = 1;; h <<= 1) {
        sort(sa, sa + N, compare);                           // sort suffixes
        for (int i = 0; i < N - 1; ++i) tmp[i + 1] = tmp[i] + compare(sa[i], sa[i + 1]); // bucket suffixes
        for (int i = 0; i < N; ++i) pos[sa[i]] = tmp[i];     // update pos (reverse mapping of suffix array)
        if (tmp[N - 1] == N - 1) break;                      // check if done
    }
}

void lcp_compute() {
    for (int i = 0, k = 0; i < N; ++i)
        if (pos[i] != N - 1) {
            for (int j = sa[pos[i] + 1]; str[i + k] == str[j + k];) k++;
            lcp[pos[i]] = k;
            if (k) k--;
        }
}

int main() {
    scanf("%s", str);
    build();
    for (int i = 0; i < N; ++i) printf("%d\n", sa[i]);
    return 0;
}
Note: If you want the complexity of the build() procedure to become O(N lg N), you can replace the STL sort with radix sort, but this is going to complicate the code.
Edit: Sorry, I misunderstood your question. Although I haven't implemented string matching with a suffix array, I think I can describe a simple, non-standard, but fairly efficient algorithm for string matching. You are given two strings, the text and the pattern. Given these strings, you create a new one, let's call it concat, which is the concatenation of the two given strings (first the text, then the pattern). You run the suffix array construction algorithm on concat, and you build the normal LCP array. Then, you search for a suffix of length pattern.size() in the suffix array you just built. Let's call its position in the suffix array pos. You then need two pointers lo and hi. At the start, lo = hi = pos. You decrease lo while lcp(lo, pos) = pattern.size() and you increase hi while lcp(hi, pos) = pattern.size(). Then you search for a suffix of length at least 2*pattern.size() in the range [lo, hi]. If you find it, you found a match. Otherwise, no match exists.
Edit[2]: I will be back with an implementation as soon as I have one...
Edit[3]:
Here it is:
// It works assuming you have builded the concatenated string and
// computed the suffix and the lcp arrays
// text.length() ---> tlen
// pattern.length() ---> plen
// concatenated string: str
bool match(int tlen, int plen) {
    int total = tlen + plen;
    int pos = -1;
    for (int i = 0; i < total; ++i)
        if (total - sa[i] == plen)
            { pos = i; break; }
    if (pos == -1) return false;
    int lo, hi;
    lo = hi = pos;
    while (lo - 1 >= 0 && lcp[lo - 1] >= plen) lo--;
    while (hi + 1 < N && lcp[hi] >= plen) hi++;
    for (int i = lo; i <= hi; ++i)
        if (total - sa[i] >= 2 * plen)
            return true;
    return false;
}
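A hypothetical driver for the routines above (a sketch; it reuses the global str, build(), lcp_compute() and match() from the earlier snippets and assumes the concatenation fits into str):

#include <cstring>
#include <string>

bool contains(const std::string& text, const std::string& pattern)
{
    std::string concat = text + pattern;      // first the text, then the pattern
    strcpy(str, concat.c_str());              // str and MAX come from the build() snippet
    build();
    lcp_compute();
    return match((int)text.size(), (int)pattern.size());
}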
Here is a nice post including some code to help you better understand LCP array and comparison implementation.
I understand your desire is the code, rather than implementing your own.
Although written in Java this is an implementation of Suffix Array with LCP by Sedgewick and Wayne from their Algorithms booksite. It should save you some time and should not be tremendously hard to port to C/C++.
LCP array construction in pseudo for those who might want more information about the algorithm.
I think @Erti-Chris Eelmaa's algorithm is wrong.
L ... 'M ... M ... M' ... R
|-----|-----|
The left sub-range and the right sub-range should both contain M. Therefore we cannot do a normal segment tree partition for the LCP-LR array.
Code should look like
def lcp_from_i_j(i, j):  # means [i, j], not [i, j)
    if j - i < 1: return lcp_2_elem(i, j)
    return lcp_merge(lcp_from_i_j(i, (i+j)/2), lcp_from_i_j((i+j)/2, j))
The left and the right sub-ranges overlap. The segment tree supports range-min queries. However, the range min over [a,b] is not equal to the lcp over [a,b]. The LCP array is continuous; a simple range-min would not work!

Efficient way to count number of swaps to insertion sort an array of integers in increasing order

Given an array of values of length n, is there a way to count the number of swaps that would be performed by insertion sort to sort that array in time better than O(n^2)?
For example :
arr[]={2 ,1, 3, 1, 2}; // Answer is 4.
Algorithm:
for i <- 2 to N
    j <- i
    while j > 1 and a[j] < a[j - 1]
        swap a[j] and a[j - 1]   // I want to count these swaps
        j <- j - 1
If you want to count the number of swaps needed in insertion sort, then you want to find the following number: for each element, how many previous elements in the array are greater than it? The sum of these values is then the total number of swaps performed.
To find this number, you can use an order statistic tree, a balanced binary search tree that can efficiently tell you how many elements in the tree are smaller than some given element (and therefore, by subtraction, how many are greater). Specifically, an order statistic tree supports O(log n) insertion, deletion, lookup, and counting of how many elements in the tree are less than some value. You can then count how many swaps will be performed as follows:
Initialize a new, empty order statistic tree.
Set count = 0
For each array element, in order:
Add the element to the order statistic tree.
Add to count the number of elements already in the tree that are greater than the value added.
Return count.
This does O(n) iterations of a loop that takes O(log n) time, so the total work done is O(n log n), which is faster than the brute-force approach.
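As one concrete stand-in for the order statistic tree (a choice of mine, not prescribed by the approach above), the GNU pb_ds tree supports order_of_key in O(log n); storing (value, index) pairs keeps duplicate values distinct:

#include <climits>
#include <functional>
#include <utility>
#include <vector>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

using ordered_set = __gnu_pbds::tree<std::pair<int, int>, __gnu_pbds::null_type,
                                     std::less<std::pair<int, int>>,
                                     __gnu_pbds::rb_tree_tag,
                                     __gnu_pbds::tree_order_statistics_node_update>;

// Insertion-sort swaps = sum over i of (earlier elements greater than a[i]).
long long count_insertion_sort_swaps(const std::vector<int>& a)
{
    ordered_set seen;
    long long count = 0;
    for (int i = 0; i < (int)a.size(); ++i) {
        // order_of_key counts stored pairs < (a[i], INT_MAX), i.e. earlier elements
        // with value <= a[i]; the remaining i - that many are strictly greater.
        count += i - (long long)seen.order_of_key({a[i], INT_MAX});
        seen.insert({a[i], i});
    }
    return count;   // e.g. {2, 1, 3, 1, 2} gives 4, matching the example above
}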
If you want to count the number of swaps in selection sort, then you can use the fact that selection sort will only perform a swap on the kth pass if, after processing the first k-1 elements of the list, the element in position k is not the kth smallest element. If you can check this efficiently, then we have the following basic sketch of an algorithm:
Set total = 0
For k = 1 to n:
If the element at index k isn't the kth smallest element:
Swap it with the kth smallest element.
Increment total
Return total
So how do we implement this efficiently? We need to be able to efficiently check whether the element at a given index is the correct element, and also to efficiently find the position of the element that really does belong at a given index otherwise. To do this, begin by creating a balanced binary search tree that maps each element to its position in the original array. This takes time O(n log n). Now that you have the balanced tree, we can augment the structure by assigning to each element in the tree the position in the sorted sequence to which this element belongs. One way to do this is with an order statistic tree, and another would be to iterate over the tree with an inorder traversal, annotating each value in the tree with its position.
Using this structure, we can check in O(log n) time whether or not an element is in the right position by looking the element up in the tree (time O(log n)), then looking at the position in the sorted sequence at which it should be and at which position it's currently located (remember that we set this up when creating the tree). If it disagrees with our expected position, then it's in the wrong place, and otherwise it's in the right place. Also, we can efficiently simulate a swap of two elements by looking up those two elements in the tree (O(log n) time total) and then swapping their positions in O(1).
As a result, we can implement the above algorithm in time O(n log n) - O(n log n) time to build the tree, then n iterations of doing O(log n) work to determine whether or not to swap.
Hope this helps!
The number of interchanges of consecutive elements necessary to arrange them in their natural order is equal to the number of inversions in the given permutation.
So the solution to this problem is to find the number of inversions in the given array of numbers.
This can be solved in O(n log n) using merge sort.
In the merge step, if you copy an element from the right array, increment a global counter (that counts inversions) by the number of items remaining in the left array. This is done because the element from the right array that just got copied is involved in an inversion with all the elements still present in the left array.
I'm not sure, but I suspect finding the minimum number is a difficult problem. Unless there's a shortcut, you'll just be searching for optimal sorting networks, which you should be able to find good resources on with your favorite search engine (or Wikipedia).
If you only care about the big-O complexity, the answer is O(n log n), and you can probably get more concrete bounds (some actual constants in there) if you look at the analysis of some efficient in-place sorting algorithms like heapsort or smoothsort.
package insertoinSortAnalysis;

import java.util.Scanner;

public class Solution {
    private int[] originalArray;
    private long count = 0;

    public static void main(String[] args) {
        try {
            Scanner sc = new Scanner(System.in);
            int testCases = sc.nextInt();
            for (int i = 0; i < testCases; i++) {
                int sizeofarray = sc.nextInt();
                Solution s = new Solution();
                s.originalArray = new int[sizeofarray];
                for (int j = 0; j < sizeofarray; j++)
                    s.originalArray[j] = sc.nextInt();
                s.devide(s.originalArray, 0, sizeofarray - 1);
                System.out.println(s.count);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public int[] devide(int[] originalArray, int low, int high) {
        if (low < high) {
            int mid = (low + high) / 2;
            int[] result1 = devide(originalArray, low, mid);
            int[] result2 = devide(originalArray, mid + 1, high);
            return merge(result1, result2);
        }
        int[] result = { originalArray[low] };
        return result;
    }

    private int[] merge(int[] array1, int[] array2) {
        int lowIndex1 = 0;
        int lowIndex2 = 0;
        int highIndex1 = array1.length - 1;
        int highIndex2 = array2.length - 1;
        int[] result = new int[array1.length + array2.length];
        int i = 0;
        while (lowIndex2 <= highIndex2 && lowIndex1 <= highIndex1) {
            int element = array1[lowIndex1];
            while (lowIndex2 <= highIndex2 && element > array2[lowIndex2]) {
                result[i++] = array2[lowIndex2++];
                // every element still left in array1 forms an inversion with the copied element
                count += ((highIndex1 - lowIndex1) + 1);
            }
            result[i++] = element;
            lowIndex1++;
        }
        while (lowIndex2 <= highIndex2 && lowIndex1 > highIndex1) {
            result[i++] = array2[lowIndex2++];
        }
        while (lowIndex1 <= highIndex1 && lowIndex2 > highIndex2) {
            result[i++] = array1[lowIndex1++];
        }
        return result;
    }
}
Each swap in the insertion sort moves two adjacent elements - one up by one, one down by one - and `corrects' a single crossing by doing so. So:
Annotate each item, X, with its initial array index, Xi.
Sort the items using a stable sort (you can use quicksort if you treat the `initial position' annotation as a minor key)
Return half the sum of the absolute differences between each element's annotated initial position and its final position (i.e. just loop through the annotations summing abs(Xi - i)).
Just like most of the other answers, this is O(n) space and O(n*log n) time. If an in-place merge could be modified to count the crossings, that'd be better. I'm not sure it can though.
#include <stdio.h>
#include <algorithm>
using namespace std;

int a[200001];
int te[200001];

unsigned long long merge(int arr[], int temp[], int left, int mid, int right)
{
    int i = left;
    int j = mid;
    int k = left;
    unsigned long long int icount = 0;
    while ((i <= mid - 1) && (j <= right))
    {
        if (arr[i] <= arr[j])
            temp[k++] = arr[i++];
        else
        {
            temp[k++] = arr[j++];
            icount += (mid - i); // all remaining elements of the left half are inversions
        }
    }
    while (i <= mid - 1)
        temp[k++] = arr[i++];
    while (j <= right)
        temp[k++] = arr[j++];
    for (int i = left; i <= right; i++)
        arr[i] = temp[i];
    return icount;
}

unsigned long long int mergesort(int arr[], int temp[], int left, int right)
{
    unsigned long long int i = 0;
    if (right > left) {
        int mid = (left + right) / 2;
        i = mergesort(arr, temp, left, mid);
        i += mergesort(arr, temp, mid + 1, right);
        i += merge(arr, temp, left, mid + 1, right);
    }
    return i;
}

int main()
{
    int t, n;
    scanf("%d", &t);
    while (t--) {
        scanf("%d", &n);
        for (int i = 0; i < n; i++) {
            scanf("%d", &a[i]);
        }
        printf("%llu\n", mergesort(a, te, 0, n - 1));
    }
    return 0;
}

Finding repeating signed integers with O(n) in time and O(1) in space

(This is a generalization of: Finding duplicates in O(n) time and O(1) space)
Problem: Write a C++ or C function with time and space complexities of O(n) and O(1) respectively that finds the repeating integers in a given array without altering it.
Example: Given {1, 0, -2, 4, 4, 1, 3, 1, -2} function must print 1, -2, and 4 once (in any order).
EDIT: The following solution requires a duo-bit (to represent 0, 1, and 2) for each integer in the range of the minimum to the maximum of the array. The number of necessary bytes (regardless of array size) never exceeds (INT_MAX - INT_MIN)/4 + 1.
#include <stdio.h>
#include <stdlib.h>

void set_min_max(int a[], long long unsigned size, int* min_addr, int* max_addr)
{
    long long unsigned i;
    if (!size) return;
    *min_addr = *max_addr = a[0];
    for (i = 1; i < size; ++i)
    {
        if (a[i] < *min_addr) *min_addr = a[i];
        if (a[i] > *max_addr) *max_addr = a[i];
    }
}

void print_repeats(int a[], long long unsigned size)
{
    long long unsigned i;
    int min, max;
    long long diff, q, r;
    char* duos;
    set_min_max(a, size, &min, &max);
    diff = (long long)max - (long long)min;
    duos = calloc(diff / 4 + 1, 1);
    for (i = 0; i < size; ++i)
    {
        diff = (long long)a[i] - (long long)min; /* index of the duo-bit corresponding
                                                    to a[i] in the sequence of duo-bits */
        q = diff / 4;                            /* index of the byte containing the duo-bit in "duos" */
        r = diff % 4;                            /* offset of the duo-bit */
        switch ((duos[q] >> (6 - 2*r)) & 3)
        {
            case 0: duos[q] += (1 << (6 - 2*r));
                    break;
            case 1: duos[q] += (1 << (6 - 2*r));
                    printf("%d ", a[i]);
        }
    }
    putchar('\n');
    free(duos);
}

int main(void)
{
    int a[] = {1, 0, -2, 4, 4, 1, 3, 1, -2};
    print_repeats(a, sizeof(a)/sizeof(int));
    return 0;
}
The definition of big-O notation is that its argument is a function f(x) such that, as the variable x tends to infinity, there exists a constant K for which the objective cost function is smaller than K*f(x). Typically f is chosen to be the smallest simple function for which the condition is satisfied. (It's pretty obvious how to lift the above to multiple variables.)
This matters because that K — which you aren't required to specify — allows a whole multitude of complex behavior to be hidden out of sight. For example, if the core of the algorithm is O(n^2), it allows all sorts of other O(1), O(log n), O(n), O(n log n), O(n^(3/2)), etc. supporting bits to be hidden, even if for realistic input data those parts are what actually dominate. That's right, it can be completely misleading! (Some of the fancier bignum algorithms have this property for real. Lying with mathematics is a wonderful thing.)
So where is this going? Well, you can assume that int is a fixed size easily enough (e.g., 32-bit) and use that information to skip a lot of trouble and allocate fixed size arrays of flag bits to hold all the information that you really need. Indeed, by using two bits per potential value (one bit to say whether you've seen the value at all, another to say whether you've printed it) then you can handle the code with fixed chunk of memory of 1GB in size. That will then give you enough flag information to cope with as many 32-bit integers as you might ever wish to handle. (Heck that's even practical on 64-bit machines.) Yes, it's going to take some time to set that memory block up, but it's constant so it's formally O(1) and so drops out of the analysis. Given that, you then have constant (but whopping) memory consumption and linear time (you've got to look at each value to see whether it's new, seen once, etc.) which is exactly what was asked for.
It's a dirty trick though. You could also try scanning the input list to work out the range allowing less memory to be used in the normal case; again, that adds only linear time and you can strictly bound the memory required as above so that's constant. Yet more trickiness, but formally legal.
[EDIT] Sample C code (this is not C++, but I'm not good at C++; the main difference would be in how the flag arrays are allocated and managed):
#include <stdio.h>
#include <stdlib.h>

// Bit fiddling magic
int is(int *ary, unsigned int value) {
    return ary[value >> 5] & (1 << (value & 31));
}

void set(int *ary, unsigned int value) {
    ary[value >> 5] |= 1 << (value & 31);
}

// Main loop
void print_repeats(int a[], unsigned size) {
    int *seen, *done;
    unsigned i;
    seen = calloc(134217728, sizeof(int));
    done = calloc(134217728, sizeof(int));
    for (i = 0; i < size; i++) {
        if (is(done, (unsigned) a[i]))
            continue;
        if (is(seen, (unsigned) a[i])) {
            set(done, (unsigned) a[i]);
            printf("%d ", a[i]);
        } else
            set(seen, (unsigned) a[i]);
    }
    printf("\n");
    free(done);
    free(seen);
}

int main(void) {
    int a[] = {1, 0, -2, 4, 4, 1, 3, 1, -2};
    print_repeats(a, sizeof(a)/sizeof(int));
    return 0;
}
Since you have an array of integers you can use the straightforward solution with sorting the array (you didn't say it can't be modified) and printing duplicates. Integer arrays can be sorted with O(n) and O(1) time and space complexities using Radix sort. Although, in general it might require O(n) space, the in-place binary MSD radix sort can be trivially implemented using O(1) space (look here for more details).
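A sketch of the in-place binary MSD radix sort referred to above (an illustration, not a prescribed implementation; it assumes 32-bit int, and the sign-bit flip maps signed order onto unsigned order):

#include <algorithm>
#include <cstdint>

// Partition by one bit at a time from the most significant bit down, quicksort-style.
// Extra space is only the recursion stack, whose depth is bounded by the word size.
static void msd_sort(uint32_t* a, int lo, int hi, int bit)
{
    if (hi - lo <= 1 || bit < 0) return;
    int i = lo, j = hi;                                  // invariant: [lo,i) bit clear, [j,hi) bit set
    while (i < j) {
        if ((a[i] >> bit) & 1u) std::swap(a[i], a[--j]);
        else ++i;
    }
    msd_sort(a, lo, i, bit - 1);
    msd_sort(a, i, hi, bit - 1);
}

void radix_sort_signed(int* v, int n)
{
    uint32_t* a = reinterpret_cast<uint32_t*>(v);        // assumes 32-bit int
    for (int k = 0; k < n; ++k) a[k] ^= 0x80000000u;     // flip sign bit: signed order -> unsigned order
    msd_sort(a, 0, n, 31);
    for (int k = 0; k < n; ++k) a[k] ^= 0x80000000u;     // flip back
}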
The O(1) space constraint is intractable.
The very fact of printing the array itself requires O(N) storage, by definition.
Now, feeling generous, I'll give you that you can have O(1) storage for a buffer within your program and consider that the space taken outside the program is of no concern to you, and thus that the output is not an issue...
Still, the O(1) space constraint feels intractable, because of the immutability constraint on the input array. It might not be, but it feels so.
And your solution overflows, because you try to store O(N) information in a finite datatype.
There is a tricky problem with definitions here. What does O(n) mean?
Konstantin's answer claims that the radix sort time complexity is O(n). In fact it is O(n log M), where the base of the logarithm is the radix chosen, and M is the range of values that the array elements can have. So, for instance, a binary radix sort of 32-bit integers will have log M = 32.
So this is still, in a sense, O(n), because log M is a constant independent of n. But if we allow this, then there is a much simpler solution: for each integer in the range (all 4294967296 of them), go through the array to see if it occurs more than once. This is also, in a sense, O(n), because 4294967296 is also a constant independent of n.
I don't think my simple solution would count as an answer. But if not, then we shouldn't allow the radix sort, either.
I doubt this is possible. Assuming there is a solution, let's see how it works. I'll try to be as general as I can and show that it can't work... So, how does it work?
Without loss of generality, we could say we process the array k times, where k is fixed. The solution should also work when there are m duplicates, with m >> k. Thus, in at least one of the passes, we should be able to output x duplicates, where x grows when m grows. To do so, some useful information has to have been computed in a previous pass and stored in the O(1) storage. (The array itself can't be used; this would give O(n) storage.)
The problem: we have O(1) of information, and when we walk over the array we have to identify x numbers (to output them). We need O(1) storage that can tell us, in O(1) time, whether an element is in it. Or, said in a different way, we need a data structure to store n booleans (of which x are true) that uses O(1) space and takes O(1) time to query.
Does this data structure exist? If not, then we can't find all duplicates in an array with O(n) time and O(1) space (or there is some fancy algorithm that works in a completely different manner???).
I really don't see how you can have only O(1) space and not modify the initial array. My guess is that you need an additional data structure. For example, what is the range of the integers? If it's 0..N like in the other question you linked, you can have an additional count array of size N. Then in O(N) traverse the original array and increment the counter at the position of the current element. Then traverse the count array and print the numbers with count >= 2. Something like:
int* counts = new int[N]();   // value-initialised to zero
for (int i = 0; i < N; i++) {
    counts[input[i]]++;
}
for (int i = 0; i < N; i++) {
    if (counts[i] >= 2) cout << i << " ";
}
delete[] counts;
Say you can use the fact that you are not using all the space you have. You only need one more bit per possible value, and you have lots of unused bits in your 32-bit int values.
This has serious limitations, but works in this case. Numbers have to be between -n/2 and n/2 and if they repeat m times, they will be printed m/2 times.
#include <stdio.h>

void print_repeats(long a[], unsigned size) {
    long i, val, pos, topbit = 1 << 31, mask = ~topbit;
    for (i = 0; i < size; i++)
        a[i] &= mask;
    for (i = 0; i < size; i++) {
        val = a[i] & mask;
        if (val <= mask/2) {
            pos = val;
        } else {
            val += topbit;
            pos = size + val;
        }
        if (a[pos] < 0) {
            printf("%ld\n", val);
            a[pos] &= mask;
        } else {
            a[pos] |= topbit;
        }
    }
}

int main(void) {
    long a[] = {1, 0, -2, 4, 4, 1, 3, 1, -2};
    print_repeats(a, sizeof (a) / sizeof (long));
    return 0;
}
prints
4
1
-2