Suffix Array Algorithm Implementation


I am trying to implement a generalized suffix array according to the algorithm described in this paper.
However, I am stuck on the implementation of the sorting algorithm in chapter 2.
My current C++ code looks roughly like this (my alphabet is lowercase English letters):
std::vector<std::pair<int,int>> suffix_array(const std::vector<std::string>& ss) {
    std::vector<std::vector<std::pair<int,int>>> tmp(26);
    size_t n = 0;
    for (size_t i = 0; i < ss.size(); i++) {
        if (ss[i].length() > n) {
            n = ss[i].length();
        }
        for (size_t j = 0; j < ss[i].length(); j++) {
            tmp[ss[i][j] - 'a'].push_back(std::make_pair(i,j));
        }
    }
    // initialize pos
    std::vector<std::pair<int,int>> pos;
    std::vector<bool> bh;
    for (auto &v1 : tmp) {
        bool b = true;
        for (auto &p: v1) {
            pos.push_back(p);
            bh.push_back(b);
            b = false;
        }
    }
    // initialize inv_pos
    std::map<std::pair<int,int>,int> inv_pos;
    for (size_t i = 0; i < pos.size(); i++) {
        inv_pos[pos[i]] = i;
    }
    int H = 1;
    while (H <= n) {
        std::vector<int> count(pos.size(), 0);
        std::vector<bool> b2h(bh);
        for (size_t i = 0, j = 0; i < pos.size(); i++) {
            if (bh[i]) {
                j = i;
            }
            inv_pos[pos[i]] = j;
        }
        int k = 0;
        int i = 0;
        while (i < pos.size()) {
            int j = k;
            i = j;
            do {
                auto t = std::make_pair(pos[i].first, pos[i].second - H);
                if (t.second >= 0) {
                    int q = inv_pos[t];
                    count[q] += 1;
                    inv_pos[t] += (count[q] - 1);
                    b2h[inv_pos[t]] = true;
                }
                i++;
            } while (i < pos.size() && !bh[i]);
            k = i;
            i = j;
            do {
                auto t = std::make_pair(pos[i].first, pos[i].second - H);
                if (t.second >= 0) {
                    int q = inv_pos[t];
                    if ((j <= q) && (q < k) && (j <= (q+1)) &&
                        ((q+1) < k) && b2h[q+1]) {
                        b2h[q+1] = false;
                    }
                }
                i++;
            } while (i < k);
        }
        bh = b2h;
        for (auto &x : inv_pos) {
            pos[x.second] = x.first;
        }
        H *= 2;
    }
    return pos;
}
At the moment my implementation produces garbage results, and I don't quite understand from the algorithm description in the paper how inv_pos is correctly updated after each stage.
If someone can spot what is wrong with my implementation and point me in the right direction with a brief explanation, I would be really grateful.
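When debugging an implementation like this, it can help to have a slow but obviously correct reference to compare against. The sketch below is not the paper's algorithm: it simply sorts every (string index, offset) pair by the suffix it denotes, using std::string::compare. The function name naive_suffix_array and the tie-breaking rule for equal suffixes from different strings (lower string index first) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Brute-force reference (hypothetical helper, not from the paper):
// returns all (string index, offset) pairs ordered by the suffix they denote.
std::vector<std::pair<int, int>> naive_suffix_array(const std::vector<std::string>& ss) {
    std::vector<std::pair<int, int>> pos;
    for (std::size_t i = 0; i < ss.size(); ++i) {
        for (std::size_t j = 0; j < ss[i].size(); ++j) {
            pos.emplace_back(static_cast<int>(i), static_cast<int>(j));
        }
    }
    std::sort(pos.begin(), pos.end(),
              [&ss](const std::pair<int, int>& a, const std::pair<int, int>& b) {
                  // Compare the suffix of ss[a.first] starting at a.second
                  // with the suffix of ss[b.first] starting at b.second.
                  const int c = ss[a.first].compare(
                      static_cast<std::size_t>(a.second), std::string::npos,
                      ss[b.first],
                      static_cast<std::size_t>(b.second), std::string::npos);
                  if (c != 0) {
                      return c < 0;
                  }
                  return a < b;  // assumed tie-break for identical suffixes
              });
    return pos;
}
```

Running both functions on small random inputs and diffing the results should show at which doubling stage the faster version first diverges.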

Related

Why does this time out? Is the time complexity not O(n^2logn)?

This question comes from a LeetCode problem called "3Sum". I will include the problem statement, as it might be helpful in answering my questions.
Problem Statement: Given an array nums of n integers, are there elements a, b, c in nums such that a + b + c = 0? Find all unique triplets in the array that sum to zero. The solution set must not contain duplicate triplets.
I am getting Time Limit Exceeded on the LeetCode platform, but I have noticed several other O(n^2logn) solutions being accepted. Am I correct in believing that the time complexity of my solution is O(n^2logn)? Is my binary search implemented correctly? My only guess as to what causes my solution to time out, when other similarly implemented O(n^2logn) solutions are accepted, is some sort of error in my binary search loop.
Please help me get to the bottom of this. I have posted the code below.
NOTE: my code times out on a test case consisting of all 0's but passes the other ones.
class Solution {
public:
    vector<vector<int>> threeSum(vector<int>& nums) {
        sort(nums.begin(), nums.end());
        vector<vector<int>> ans;
        vector<vector<int>> ans2;
        unordered_set<string> myset;
        unordered_set<string>::iterator itr;
        int L, M, R, target, temp;
        string s = "";
        for (int i = 0; i < nums.size()-2; i++)
            for (int j = i + 1; j < nums.size(); j++) {
                L = j + 1;
                R = nums.size() - 1;
                target = (nums[i] + nums[j]) * -1;
                while (R >= L) {
                    M = L + (R - L) / 2;
                    if (nums[M] == target) {
                        ans.push_back({nums[i], nums[j], nums[M]});
                    }
                    if (nums[M] >= target) {
                        R = M - 1;
                    } else {
                        L = M + 1;
                    }
                }
            }
        for (int i = 0; i < ans.size(); i++) {
            s = to_string(ans[i][0]) + to_string(ans[i][1]) + to_string(ans[i][2]);
            itr = myset.find(s);
            if (itr == myset.end()) {
                ans2.push_back(ans[i]);
                myset.insert(s);
            }
        }
        return ans2;
    }
};

I guess your bug is explained in the first comment, or the code is probably getting stuck in the binary search while loop.
This will pass:
// Most headers are already included on LeetCode;
// these can be removed there.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// This block might trivially optimize the execution time;
// it can be removed.
static const auto __optimize__ = []() {
    std::ios::sync_with_stdio(false);
    std::cin.tie(NULL);
    return 0;
}();

struct Solution {
    using SizeType = std::int_fast32_t;
    static const std::vector<std::vector<int>> threeSum(
        std::vector<int>& nums
    ) {
        std::vector<std::vector<int>> triplets;
        std::sort(std::begin(nums), std::end(nums));
        const SizeType size = static_cast<SizeType>(nums.size());
        for (SizeType index = 0; index < size; index++) {
            const SizeType target = -nums[index];
            SizeType lo = index + 1;
            SizeType hi = size - 1;
            // nums is sorted, so a positive first element cannot start a zero-sum triplet
            if (target < 0) {
                break;
            }
            while (lo < hi) {
                const SizeType sum = nums[lo] + nums[hi];
                if (sum < target) {
                    lo++;
                } else if (sum > target) {
                    hi--;
                } else {
                    std::vector<int> triplet(3, 0);
                    triplet[0] = nums[index];
                    triplet[1] = nums[lo];
                    triplet[2] = nums[hi];
                    triplets.emplace_back(triplet);
                    // skip duplicates of the second element
                    while (lo < hi && nums[lo] == triplet[1]) {
                        lo++;
                    }
                    // skip duplicates of the third element
                    while (lo < hi && nums[hi] == triplet[2]) {
                        hi--;
                    }
                }
            }
            // skip duplicates of the first element
            while (index + 1 < size && nums[index + 1] == nums[index]) {
                index++;
            }
        }
        return triplets;
    }
};
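For comparison, here is a minimal sketch of how the binary-search approach from the question could be kept at O(n^2 log n) while avoiding the blow-up on all-zero input. In the question's loop every probe matches when all values are 0, so roughly log n triplets are pushed per (i, j) pair and all of them are later converted to strings, which is a likely cause of the timeout. The sketch skips repeated values for the first two elements and only asks whether the target exists, so each distinct triplet is produced once. The name three_sum_bsearch is hypothetical, and the sketch assumes nums[i] + nums[j] does not overflow int.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch only: binary-search 3Sum that emits each distinct triplet once.
std::vector<std::vector<int>> three_sum_bsearch(std::vector<int> nums) {
    std::sort(nums.begin(), nums.end());
    std::vector<std::vector<int>> out;
    const std::size_t n = nums.size();
    for (std::size_t i = 0; i + 2 < n; ++i) {
        if (i > 0 && nums[i] == nums[i - 1]) {
            continue;  // same first element as before: would repeat triplets
        }
        for (std::size_t j = i + 1; j + 1 < n; ++j) {
            if (j > i + 1 && nums[j] == nums[j - 1]) {
                continue;  // same second element as before
            }
            const int target = -(nums[i] + nums[j]);
            // Search only to the right of j so the three indices stay distinct.
            const auto first = nums.begin() + static_cast<std::ptrdiff_t>(j + 1);
            if (std::binary_search(first, nums.end(), target)) {
                out.push_back({nums[i], nums[j], target});
            }
        }
    }
    return out;
}
```

Because duplicates are skipped while iterating, no string-keyed set is needed afterwards.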
Here are some more efficient solutions from LeetCode:
Similar to yours:
class Solution {
public:
    vector<vector<int>> threeSum(vector<int>& nums) {
        sort(begin(nums), end(nums));
        vector<vector<int>> res;
        for (int i = 0; i < nums.size() && nums[i] <= 0; ++i)
            if (i == 0 || nums[i - 1] != nums[i]) {
                twoSumII(nums, i, res);
            }
        return res;
    }
    void twoSumII(vector<int>& nums, int i, vector<vector<int>> &res) {
        int lo = i + 1, hi = nums.size() - 1;
        while (lo < hi) {
            int sum = nums[i] + nums[lo] + nums[hi];
            if (sum < 0) {
                ++lo;
            } else if (sum > 0) {
                --hi;
            } else {
                res.push_back({ nums[i], nums[lo++], nums[hi--] });
                while (lo < hi && nums[lo] == nums[lo - 1])
                    ++lo;
            }
        }
    }
};
Using twoSum helper:
class Solution {
public:
    vector<vector<int>> threeSum(vector<int>& nums) {
        sort(begin(nums), end(nums));
        vector<vector<int>> res;
        for (int i = 0; i < nums.size() && nums[i] <= 0; ++i)
            if (i == 0 || nums[i - 1] != nums[i]) {
                twoSum(nums, i, res);
            }
        return res;
    }
    void twoSum(vector<int>& nums, int i, vector<vector<int>> &res) {
        unordered_set<int> seen;
        for (int j = i + 1; j < nums.size(); ++j) {
            int complement = -nums[i] - nums[j];
            if (seen.count(complement)) {
                res.push_back({nums[i], complement, nums[j]});
                while (j + 1 < nums.size() && nums[j] == nums[j + 1]) {
                    ++j;
                }
            }
            seen.insert(nums[j]);
        }
    }
};
I guess this would be a better solution, since it doesn't sort the input:
Time: O(N ^ 2)
Space: O(N)
class Solution {
public:
    vector<vector<int>> threeSum(vector<int>& nums) {
        set<vector<int>> res;
        unordered_set<int> dups;
        unordered_map<int, int> seen;
        for (int i = 0; i < nums.size(); ++i)
            if (dups.insert(nums[i]).second) {
                for (int j = i + 1; j < nums.size(); ++j) {
                    int complement = -nums[i] - nums[j];
                    auto it = seen.find(complement);
                    if (it != end(seen) && it->second == i) {
                        vector<int> triplet = {nums[i], nums[j], complement};
                        sort(begin(triplet), end(triplet));
                        res.insert(triplet);
                    }
                    seen[nums[j]] = i;
                }
            }
        return vector<vector<int>>(begin(res), end(res));
    }
};
References
For additional details, please see the Discussion Board, where you can find plenty of well-explained accepted solutions in a variety of languages, including efficient algorithms and asymptotic time/space complexity analysis.

Implementation of string::find_first_of

My confusion comes from a solution to LeetCode problem 792, Number of Matching Subsequences. The naive solution is to scan the characters of S for each word being searched.
As outlined in the official answer, its time complexity is too high to pass all test cases.
Yet the first code listing below beats 100% of submissions merely by calling string::find_first_of, while my hand-written code, which (I believe) follows exactly the same routine, gets TLE as expected.
So why is the first one so fast?
// comes from the second solution post in leetcode-cn
class Solution {
public:
    int numMatchingSubseq(string S, vector<string>& words) {
        int res = 0, j;
        for (int i = 0; i < words.size(); i ++) {
            int position = -1;
            for (j = 0; j < words[i].size(); j ++) {
                position = S.find_first_of(words[i][j], position + 1);
                if (position == -1) break;
            }
            if (j == words[i].length()) res ++;
        }
        return res;
    }
};
Version 2, hand-written, gets Time Limit Exceeded:
static inline int find_first_of(const string & s, char & c, int st) {
    int n = s.size() - st;
    const char * data = s.data(), * cur = data + st;
    for (std::size_t i = 0; i < n; ++i) {
        if (cur[i] == c)
            return cur + i - data;
    }
    return -1;
}
class Solution {
public:
    int numMatchingSubseq(string S, vector<string>& words) {
        int res = 0, j;
        for (int i = 0; i < words.size(); i ++) {
            int position = -1;
            for (j = 0; j < words[i].size(); j ++) {
                position = find_first_of(S, words[i][j], position + 1);
                if (position == -1) break;
            }
            if (j == words[i].length()) res ++;
        }
        return res;
    }
};
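One plausible explanation, which nothing in this thread confirms: both versions do the same amount of work asymptotically, but standard library implementations commonly route a single-character search through a memchr-style routine that is heavily optimized, while a byte-by-byte loop may not be vectorized by the compiler. The sketch below shows what a hand-written helper built on std::memchr could look like; the name find_char_from is hypothetical, and whether it closes the gap on LeetCode is an assumption that would need measuring.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Sketch: index of the first occurrence of c in s at or after start, or -1.
int find_char_from(const std::string& s, char c, std::size_t start) {
    if (start >= s.size()) {
        return -1;  // nothing left to search
    }
    // std::memchr compares bytes as unsigned char and returns a pointer
    // to the first match, or a null pointer if there is none.
    const void* hit = std::memchr(s.data() + start,
                                  static_cast<unsigned char>(c),
                                  s.size() - start);
    if (hit == nullptr) {
        return -1;
    }
    return static_cast<int>(static_cast<const char*>(hit) - s.data());
}
```

It can replace the hand-written find_first_of call in version 2 as find_char_from(S, words[i][j], position + 1).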

failure in QVector<T>::operator[]: "index out of range" is thrown unexpectedly

I have two sorting methods: insertion sort and shell sort. I adapted both working functions to C++ from plain C. The problem is that ins_sort works fine while shell_sort fails. What could be the reason?
bool less(QVector<int> &arr, int a, int b)
{
    return arr[a] < arr[b];
}
// Performs swap on elements at a and b in QVector<int> arr
void qswap(QVector<int> &arr, int a, int b)
{
    int temp = arr[a];
    arr[a] = arr[b];
    arr[b] = temp;
}
/* Failure is thrown in this method */
void shell_sort(GraphicsView &window, SwapManager &manager)
{
    auto list = window.items();
    QVector<int> arr;
    for (auto item : list)
        arr.push_back(static_cast<QGraphicsRectWidget*>(item)->m_number);
    int N = arr.size();
    int h = 1;
    while (h < N/3) h = 3*h + 1;
    while (h >= 1)
    {
        for (int i = h; i < N; ++i)
        {
            for (int j = i; less(arr, j, j-h) && j >= h; j -= h)
            {
                qswap(arr, j, j-h);
                manager.addPair(j, j - h);
            }
        }
        h /= 3;
    }
}
And this one works well:
/* This method works just fine */
void ins_sort(GraphicsView &window, SwapManager &manager)
{
    auto list = window.items();
    int i, j;
    QVector<int> arr;
    for (auto item : list)
    {
        arr.push_back(static_cast<QGraphicsRectWidget*>(item)->m_number);
    }
    int N = arr.size();
    for (i = 1; i < N; ++i)
    {
        for (j = i - 1; j != -1 && less(arr, j + 1, j); --j)
        {
            qswap(arr, j, j + 1);
            manager.addPair(j, j + 1);
        }
    }
}
The debugger points to this piece of code in "qvector.h":
Q_ASSERT_X(i >= 0 && i < d->size, "QVector<T>::operator[]", "index out of range");
return data()[i]; }

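In the for-loop condition, it makes sense to check the value of j before comparing the items. Because && evaluates left to right and stops at the first false operand, less(arr, j, j-h) is then never called with a negative index: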
for (int j = i; j >= h && less(arr, j, j-h); j -= h)
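As a self-contained illustration of that ordering, here is a minimal sketch of the same shell sort on a plain std::vector<int>. The container choice is an assumption (it stands in for QVector<int>, and the GraphicsView and SwapManager parts are left out); the gap sequence and loop structure follow the question's code.

```cpp
#include <utility>
#include <vector>

// Sketch: shell sort with the bounds check placed before the comparison.
void shell_sort(std::vector<int>& arr) {
    const int n = static_cast<int>(arr.size());
    int h = 1;
    while (h < n / 3) {
        h = 3 * h + 1;  // gap sequence 1, 4, 13, ...
    }
    while (h >= 1) {
        for (int i = h; i < n; ++i) {
            // j >= h is tested first, so arr[j - h] is only read when
            // j - h is a valid index.
            for (int j = i; j >= h && arr[j] < arr[j - h]; j -= h) {
                std::swap(arr[j], arr[j - h]);
            }
        }
        h /= 3;
    }
}
```

With the two conditions swapped, the loop would evaluate arr[j - h] once with j < h before the guard could stop it, which is the out-of-range access the Qt assertion reports.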

Getting cygwin_exception::open_stackdumpfile working with vectors

I get cygwin_exception::open_stackdumpfile when trying to run my code. I guess it is a memory error. I am almost 100% sure I get it because I am not properly creating, passing to a function, or otherwise handling the maximalSets and/or compsub vectors.
Please help me solve this error, and thank you for all your answers!
Update: I fixed the problem pointed out by Paul Evans. Now the exception only occurs from time to time, but it still occurs.
class Graph {
private:
    int size; // the number of nodes
    vector<vector<bool> > connected;
    vector<vector<int> > bkv2(vector<int> oldSet, int ne, int ce,
                              vector<int> &compsub, vector<vector<int> > maximalSets);
public:
    Graph(int size);
    vector<vector<int> > findMaximalSets();
};
Graph::Graph(int n) {
    int i, j;
    srand((unsigned int) time( NULL));
    size = n;
    connected.resize(size);
    for (int i=0; i<size; i++) {
        connected[i].resize(size);
    }
    for (i = 0; i < size; i++) { // the graph is randomly generated
        connected[i][i] = 1;
        for (j = i + 1; j < size; j++) {
            if (rand() % 2 == 1) {
                connected[i][j] = 1;
                connected[j][i] = 1;
            } else {
                connected[i][j] = 0;
                connected[j][i] = 0;
            }
        }
    }
}
vector<vector<int> > Graph::findMaximalSets() {
    int i;
    vector<int> all(size);
    vector<int> compsub;
    vector<vector<int> > maximalSets;
    for (i = 0; i < size; i++) {
        all[i] = i;
    }
    return bkv2(all, 0, size, compsub, maximalSets);
}
vector<vector<int> > Graph::bkv2(vector<int> oldSet, int ne, int ce,
                                 vector<int> &compsub, vector<vector<int> > maximalSets) {
    vector<int> newSet(ce);
    int nod, fixp;
    int newne, newce, i, j, count, pos, p, s, sel, minnod;
    minnod = ce;
    nod = 0;
    for (i = 0; i < ce && minnod != 0; i++) {
        p = oldSet[i];
        count = 0;
        for (j = ne; j < ce && count < minnod; j++)
            if (connected[p][oldSet[j]]) {
                count++;
                pos = j;
            }
        if (count < minnod) {
            fixp = p;
            minnod = count;
            if (i < ne) {
                s = pos;
            } else {
                s = i;
                nod = 1;
            }
        }
    }
    for (nod = minnod + nod; nod >= 1; nod--) {
        p = oldSet[s];
        oldSet[s] = oldSet[ne];
        sel = oldSet[ne] = p;
        newne = 0;
        for (i = 0; i < ne; i++) {
            if (!connected[sel][oldSet[i]]) {
                newSet[newne++] = oldSet[i];
            }
        }
        newce = newne;
        for (i = ne + 1; i < ce; i++) {
            if (!connected[sel][oldSet[i]]) {
                newSet[newce++] = oldSet[i];
            }
        }
        compsub.push_back(sel);
        if (newce == 0) {
            vector<int> copy = compsub;
            maximalSets.push_back(copy);
        } else if (newne < newce) {
            maximalSets = bkv2(newSet, newne, newce, compsub, maximalSets);
        }
        compsub.pop_back();
        ne++;
        if (nod > 1) {
            for (s = ne; !connected[fixp][oldSet[s]]; s++) {
            }
        }
    }
    return maximalSets;
}

It's because you're not initializing size anywhere.
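Since the suspicion is a memory error, one general way to narrow it down (an illustration of the technique, not a diagnosis of the code above) is bounds-checked access: std::vector::at throws std::out_of_range where operator[] would silently read or write outside the vector. The helper name is_out_of_range below is hypothetical and only demonstrates the behaviour.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sketch: reports whether v.at(index) would throw, i.e. whether the
// equivalent v[index] would be an out-of-range access.
bool is_out_of_range(const std::vector<int>& v, std::size_t index) {
    try {
        (void)v.at(index);  // at() performs the bounds check
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

Temporarily replacing accesses such as oldSet[s] with oldSet.at(s) would turn an intermittent crash into an exception raised at the exact access that goes out of range.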

C++ variable number of nested loops

I want to make a function that, depending on the depth of nested loop, does this:
if depth = 1:
    for(i = 0; i < max; i++){
        pot[a++] = wyb[i];
    }
if depth = 2:
    for(i = 0; i < max; i++){
        for( j = i+1; j < max; j++){
            pot[a++] = wyb[i] + wyb[j];
        }
    }
if depth = 3:
    for(i = 0; i < max; i++){
        for( j = i+1; j < max; j++){
            for( k = j+1; k < max; k++){
                pot[a++] = wyb[i] + wyb[j] + wyb[k];
            }
        }
    }
and so on.
So the result would be:
depth = 1
pot[0] = wyb[0]
pot[1] = wyb[1]
...
pot[max-1] = wyb[max-1]
depth = 2, max = 4
pot[0] = wyb[0] + wyb[1]
pot[1] = wyb[0] + wyb[2]
pot[2] = wyb[0] + wyb[3]
pot[3] = wyb[1] + wyb[2]
pot[4] = wyb[1] + wyb[3]
pot[5] = wyb[2] + wyb[3]
I think you get the idea. I can't think of a way to do this neatly.
Could someone show an easy way to achieve this, with recursion or without, and point me in the right direction, keeping in mind that I'm still a beginner in C++?
Thank you for your time.

You may use std::next_permutation to manage the combinations:
#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<int> compute(const std::vector<int>& v, std::size_t depth)
{
    if (depth == 0 || v.size() < depth) {
        throw "depth is out of range";
    }
    std::vector<int> res;
    std::vector<int> coeffs(depth, 1);
    coeffs.resize(v.size(), 0); // coeffs is now {1, .., 1, 0, .., 0}
    do {
        // each permutation of coeffs selects exactly depth elements of v
        int sum = 0;
        for (std::size_t i = 0; i != v.size(); ++i) {
            sum += v[i] * coeffs[i];
        }
        res.push_back(sum);
    } while (std::next_permutation(coeffs.rbegin(), coeffs.rend()));
    return res;
}
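The version above visits every combination, but not in the order listed in the question. A minimal variant, assuming that exact order matters, walks the selection flags with std::prev_permutation from {1, .., 1, 0, .., 0} downwards, which yields the index sets {0,1}, {0,2}, {0,3}, {1,2}, ... in the same order as the nested loops. The names combination_sums and pick are hypothetical, and std::invalid_argument replaces the thrown string literal purely as a stylistic choice.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sketch: sums of all depth-element combinations of v, in nested-loop order.
std::vector<int> combination_sums(const std::vector<int>& v, std::size_t depth) {
    if (depth == 0 || depth > v.size()) {
        throw std::invalid_argument("depth is out of range");
    }
    std::vector<int> res;
    std::vector<char> pick(v.size(), 0);
    // pick starts as {1, .., 1, 0, .., 0}: the lexicographically largest arrangement.
    std::fill(pick.begin(), pick.begin() + static_cast<std::ptrdiff_t>(depth), 1);
    do {
        int sum = 0;
        for (std::size_t i = 0; i < v.size(); ++i) {
            if (pick[i]) {
                sum += v[i];
            }
        }
        res.push_back(sum);
        // prev_permutation steps to the next smaller arrangement and
        // returns false once it wraps around to the largest one.
    } while (std::prev_permutation(pick.begin(), pick.end()));
    return res;
}
```

For v = {1, 10, 100, 1000} and depth = 2 this produces 11, 101, 1001, 110, 1010, 1100, matching pot[0] through pot[5] in the question.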

Simplified recursive version:
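// Writes the sums of all depth-element combinations of wyb[0..max) to pot
// and returns a pointer one past the last sum written.
int *sums_recursive(int *pot, int *wyb, int max, int depth) {
    if (depth == 1) {
        while (max--)
            *pot++ = *wyb++;
        return pot;
    }
    // wyb[i - 1] is the first element of the combination; the remaining
    // depth - 1 elements are chosen from the elements after it.
    for (int i = 1; i <= max - depth + 1; ++i) {
        int *pot2 = sums_recursive(pot, wyb + i, max - i, depth - 1);
        for (int *p = pot; p < pot2; ++p) *p += wyb[i - 1];
        pot = pot2;
    }
    return pot;
}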
Iterative version:
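void sums(int *pot, int *wyb, int max, int depth) {
    int maxi = 1;
    int o = 0;
    for (int d = 0; d < depth; ++d) { maxi *= max; }
    // i enumerates every depth-digit number in base max; only digit
    // sequences that are strictly increasing are kept as combinations
    for (int i = 0; i < maxi; ++i) {
        int i_div = i;
        int idx = -1;
        int sum = 0; // accumulated locally so pot is written only for valid combinations
        int d;
        for (d = 0; d < depth; ++d) {
            int new_idx = i_div % max;
            if (new_idx <= idx) break;
            sum += wyb[new_idx];
            idx = new_idx;
            i_div /= max;
        }
        if (d == depth) pot[o++] = sum;
    }
}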
