C++ Longest Common Subsequence Implementation errors O(n*m)

I'm going through some dynamic programming articles on geeksforgeeks and ran across the Longest Common Subsequence problem. I did not come up with the exponential naive solution on my own; however, after working out some examples of the problem on paper, I came up with what I thought was a correct O(n*m) implementation. An online judge proved me wrong: my algorithm fails with the input strings:
"LRBBMQBHCDARZOWKKYHIDDQSCDXRJMOWFRXSJYBLDBEFSARCBYNECDYGGXXPKLORELLNMPAPQFWKHOPKMCO"
"QHNWNKUEWHSQMGBBUQCLJJIVSWMDKQTBXIXMVTRRBLJPTNSNFWZQFJMAFADRRWSOFSBCNUVQHFFBSAQXWPQCAC"
My thought process for the algorithm is as follows. I want to maintain a DP array whose length is the length of string a where a is the smaller of the input strings. dpA[i] would be the Longest Common Subsequence ending in a[i]. To do this I need to iterate through string a from index 0 => length-1 and see if a[i] exists in b. If a[i] exists in b it will be at position pos.
First, set dpA[i] to 1 if dpA[i] was 0.
To know that a[i] is an extension of an existing subsequence we must go through a and find the first character behind i that matches a value in b behind pos. Let's call the indices of these matching values j and k respectively. This value is guaranteed to be a value we've seen before since we've covered all of a[0...i-1] and have filled out dpA[0...i-1]. When we find the first match, dpA[i] = dpA[j]+1 because we're extending the previous subsequence that ends in a[j]. Rinse repeat.
Obviously this method is not perfect or I wouldn't be asking this question, but I can't quite seem to see the problem with the algorithm. I've been looking at it so long I can hardly think about it anymore but any ideas on how to fix it would be greatly appreciated!
int longestCommonSubsequenceString(const string& x, const string& y) {
string a = (x.length() < y.length()) ? x : y;
string b = (x.length() >= y.length()) ? x : y;
vector<int> dpA(a.length(), 0);
int pos;
bool breakFlag = false;
for (int i = 0; i < a.length(); ++i) {
pos = b.find_last_of(a[i]);
if (pos != string::npos) {
if (!dpA[i]) dpA[i] = 1;
for (int j = i-1; j >= 0; --j) {
for (int k = pos-1; k >= 0; --k) {
if (a[j] == b[k]) {
dpA[i] = dpA[j]+1;
breakFlag = true;
break;
}
if (breakFlag) break;
}
}
}
breakFlag = false;
}
return *max_element(dpA.begin(), dpA.end());
}
EDIT
I think the complexity might actually be O(n*n*m)
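For comparison, here is a minimal sketch of the textbook O(n*m) tabulation, not a repair of the code above. One likely reason the attempt fails is that `find_last_of` only ever considers the last occurrence of a[i] in b, and the inner loops accept the first earlier match rather than the best one. The function name `lcsLength` is made up for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Textbook LCS length via tabulation: O(n*m) time, O(min(n, m)) extra space.
// lcsLength is a hypothetical name, not something from the question.
int lcsLength(const std::string& x, const std::string& y) {
    // Keep the shorter string on the inner dimension to minimise memory.
    const std::string& a = (x.size() < y.size()) ? x : y;
    const std::string& b = (x.size() < y.size()) ? y : x;

    // prev[j] = LCS length of the first i-1 chars of b and the first j chars of a.
    // cur[j]  = the same for the first i chars of b.
    std::vector<int> prev(a.size() + 1, 0), cur(a.size() + 1, 0);
    for (std::size_t i = 1; i <= b.size(); ++i) {
        for (std::size_t j = 1; j <= a.size(); ++j) {
            if (b[i - 1] == a[j - 1])
                cur[j] = prev[j - 1] + 1;                // extend a common subsequence
            else
                cur[j] = std::max(prev[j], cur[j - 1]);  // drop one character
        }
        std::swap(prev, cur);
    }
    return prev[a.size()];
}
```

The key difference is that every cell looks at all three neighbouring subproblems, so no candidate match is skipped.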

Related

Runtime of KMP algorithm and LPS table construction

I recently came across the KMP algorithm, and I have spent a lot of time trying to understand why it works. While I do understand the basic functionality now, I simply fail to understand the runtime computations.
I have taken the below code from the geeksForGeeks site: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
This site claims that if the text size is O(n) and pattern size is O(m), then KMP computes a match in max O(n) time. It also states that the LPS array can be computed in O(m) time.
// C++ program for implementation of KMP pattern searching
// algorithm
#include <bits/stdc++.h>
void computeLPSArray(char* pat, int M, int* lps);
// Prints occurrences of pat[] in txt[]
void KMPSearch(char* pat, char* txt)
{
int M = strlen(pat);
int N = strlen(txt);
// create lps[] that will hold the longest prefix suffix
// values for pattern
int lps[M];
// Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps);
int i = 0; // index for txt[]
int j = 0; // index for pat[]
while (i < N) {
if (pat[j] == txt[i]) {
j++;
i++;
}
if (j == M) {
printf("Found pattern at index %d ", i - j);
j = lps[j - 1];
}
// mismatch after j matches
else if (i < N && pat[j] != txt[i]) {
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
j = lps[j - 1];
else
i = i + 1;
}
}
}
// Fills lps[] for given pattern pat[0..M-1]
void computeLPSArray(char* pat, int M, int* lps)
{
// length of the previous longest prefix suffix
int len = 0;
lps[0] = 0; // lps[0] is always 0
// the loop calculates lps[i] for i = 1 to M-1
int i = 1;
while (i < M) {
if (pat[i] == pat[len]) {
len++;
lps[i] = len;
i++;
}
else // (pat[i] != pat[len])
{
// This is tricky. Consider the example.
// AAACAAAA and i = 7. The idea is similar
// to search step.
if (len != 0) {
len = lps[len - 1];
// Also, note that we do not increment
// i here
}
else // if (len == 0)
{
lps[i] = 0;
i++;
}
}
}
}
// Driver program to test above function
int main()
{
char txt[] = "ABABDABACDABABCABAB";
char pat[] = "ABABCABAB";
KMPSearch(pat, txt);
return 0;
}
I am really confused why that is the case.
For LPS computation, consider: aaaaacaaac
In this case, when we try to compute LPS for the first c, we keep going back until we hit LPS[0], which is 0, and stop. So, essentially, we travel back at least the length of the pattern up to that point. If this happens multiple times, how can the time complexity be O(m)?
I have similar confusion on runtime of KMP to be O(n).
I have read other threads on Stack Overflow before posting, and also various other sites on the topic. I am still very confused. I would really appreciate it if someone could help me understand the best- and worst-case scenarios for these algorithms and how their runtime is computed, using some examples. Again, please don't suggest I google this; I have done it, spent a whole week trying to gain any insight, and failed.

One way to establish an upper bound on the runtime for construction of the LPS array is to consider a pathological case - how can we maximize the number of times we have to execute len = lps[len - 1]? Consider the following string, ignoring spaces: x1 x2 x1x3 x1x2x1x4 x1x2x1x3x1x2x1x5 ...
The second term needs to be compared to the first term as if it ended in 1 instead of 2, it would match the first term. Similarly the third term needs to be compared to the first two terms as if it ended in 1 or 2 instead of 3, it would match those partial terms. And so forth.
In the example string, it is clear that only one character in every 2^n can fall back n times, so the total runtime will be m+m/2+m/4+..=2m=O(m), where m is the length of the pattern string. I suspect it's impossible to construct a string with worse runtime than the example string and this can probably be formally proven.
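The bound can also be argued by amortization: len grows by at most 1 each time i advances, and every execution of len = lps[len - 1] strictly decreases len, so the fallback step can run at most m - 1 times in total. Below is a small sketch that counts the fallbacks so this can be checked on any pattern. `LpsStats` and `computeLpsCounting` are names invented for this illustration; the loop itself mirrors computeLPSArray from the question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct LpsStats {
    std::vector<int> lps;  // lps[i] = longest proper prefix of pat[0..i] that is also a suffix
    int fallbacks;         // how many times len = lps[len - 1] was executed
};

// Same loop as computeLPSArray, instrumented with a fallback counter.
LpsStats computeLpsCounting(const std::string& pat) {
    LpsStats out{std::vector<int>(pat.size(), 0), 0};
    int len = 0;
    std::size_t i = 1;
    while (i < pat.size()) {
        if (pat[i] == pat[len]) {
            out.lps[i++] = ++len;     // len rises by one, i advances
        } else if (len != 0) {
            len = out.lps[len - 1];   // len strictly drops, i stays put
            ++out.fallbacks;
        } else {
            out.lps[i++] = 0;         // no border at all, i advances
        }
    }
    return out;
}
```

For the pattern aaaaacaaac from the question, the first c causes 4 fallbacks and the second c causes 3, giving 7 in total for m = 10. The same counting argument applies to j in KMPSearch, which is why the search is O(n).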

Knapsack Backtracking Using only O(W) space

So I have this code that I have written that correctly finds the optimal value for the knapsack problem.
int mat[2][size + 1];
memset(mat, 0, sizeof(mat));
int i = 0;
while(i < nItems)
{
int j = 0;
if(i % 2 != 0)
{
while(++j <= size)
{
if(weights[i] <= j) mat[1][j] = max(values[i] + mat[0][j - weights[i]], mat[0][j]);
else mat[1][j] = mat[0][j];
}
}
else
{
while(++j <= size)
{
if(weights[i] <= j) mat[0][j] = max(values[i] + mat[1][j - weights[i]], mat[1][j]);
else mat[0][j] = mat[1][j];
}
}
i++;
}
int val = (nItems % 2 != 0)? mat[0][size] : mat[1][size];
cout << val << endl;
return 0;
This part I understand. However, I am trying to keep the same memory usage, i.e. O(W), while also computing the optimal solution using backtracking. This is where I am running into trouble. The hint I have been given is this:
Now suppose that we also want the optimal set of items. Recall that the goal
in finding the optimal solution in part 1 is to find the optimal path from
entry K(0,0) to entry K(W,n). The optimal path must pass through an
intermediate node (k,n/2) for some k; this k corresponds to the remaining
capacity in the knapsack of the optimal solution after items n/2 + 1,...n
have been considered
The question asked is this:
Implement a modified version of the algorithm from part 2 that returns not
only the optimal value, but also the remaining capacity of the optimal
solution after the last half of items have been considered
Any help to get me started would be appreciated. Thanks
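One way to read the hint, sketched under assumptions: alongside each DP row, carry a second O(W) row that remembers at which capacity the best path to that cell crossed row n/2. The names `KnapsackMid` and `knapsackWithMid` are invented here, item weights are assumed non-negative, and the first n/2 items are treated as the "first half".

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct KnapsackMid {
    int value;        // optimal total value
    int midCapacity;  // capacity left for items 0..n/2-1 on one optimal path
};

// 0/1 knapsack in O(W) space that also reports where an optimal path
// crosses row n/2 (hypothetical helper, not part of the original code).
KnapsackMid knapsackWithMid(const std::vector<int>& weights,
                            const std::vector<int>& values, int W) {
    const std::size_t n = weights.size(), mid = n / 2;
    std::vector<int> prev(W + 1, 0), cur(W + 1, 0);
    std::vector<int> prevMid(W + 1), curMid(W + 1);
    for (int j = 0; j <= W; ++j) prevMid[j] = j;  // also covers mid == 0

    for (std::size_t i = 1; i <= n; ++i) {
        const int w = weights[i - 1], v = values[i - 1];
        for (int j = 0; j <= W; ++j) {
            int from = j;                 // column we came from in row i-1
            cur[j] = prev[j];             // skip item i
            if (w <= j && prev[j - w] + v > cur[j]) {
                cur[j] = prev[j - w] + v; // take item i
                from = j - w;
            }
            // Up to row mid the crossing point is the cell itself;
            // afterwards it is inherited from the predecessor cell.
            curMid[j] = (i <= mid) ? j : prevMid[from];
        }
        std::swap(prev, cur);
        std::swap(prevMid, curMid);
    }
    return {prev[W], prevMid[W]};
}
```

With the returned capacity k, the problem can be split into the first half with capacity k and the second half with capacity W - k and solved recursively (the same idea as Hirschberg's algorithm), which recovers the item set without ever storing the full table.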

Find which two values in an array maximize a given expression?

I met a very simple interview question, but my solution is incorrect. Any help with this? 1) Are there any bugs in my solution? 2) Is there a good idea for an O(n) time solution?
Question:
Given an int array A[], define X=A[i]+A[j]+(j-i), j>=i. Find max value of X?
My solution is:
int solution(vector<int> &A){
if(A.empty())
return -1;
long long max_dis=-2000000000, cur_dis;
int size = A.size();
for(int i=0;i<size;i++){
for(int j=i;j<size;j++){
cur_dis=A[j]+A[i]+(j-i);
if(cur_dis > max_dis)
max_dis=cur_dis;
}
}
return max_dis;
}

The crucial insight is that it can be done in O(n) only if you track where potentially useful values are even before you're certain they'll prove usable.
Start with best_i = best_j = max_i = 0. The first two track the i and j values to use in the solution. The next one will record the index with the highest contributing factor for i, i.e. where A[i] - i is highest.
Let's call the value of X for some values of i and j "Xi,j", and start by recording our best solution so far ala Xbest = X0,0
Increment n along the array...
whenever the value at [n] gives a better "i" contribution for A[i] - i than max_i, update max_i.
whenever using n as the "j" index yields Xmax_i,n greater than Xbest, best_i = max_i, best_j = n.
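The steps above can be sketched in C++ roughly as follows. `BestPair` and `maxPair` are names made up for this illustration, the input is assumed to be non-empty, and arithmetic is done in long long to avoid int overflow.

```cpp
#include <cstddef>
#include <vector>

struct BestPair {
    long long value;   // maximum of A[i] + A[j] + (j - i) over j >= i
    std::size_t i, j;  // one pair of indices achieving it
};

// Single pass; precondition (assumed): A is non-empty.
BestPair maxPair(const std::vector<int>& A) {
    // X for a given pair of indices.
    auto x = [&A](std::size_t i, std::size_t j) {
        return static_cast<long long>(A[i]) + A[j] +
               static_cast<long long>(j) - static_cast<long long>(i);
    };
    // The "i" contribution A[i] - i.
    auto ci = [&A](std::size_t i) {
        return static_cast<long long>(A[i]) - static_cast<long long>(i);
    };

    BestPair best{x(0, 0), 0, 0};  // Xbest starts as X0,0
    std::size_t maxI = 0;          // index with the highest A[i] - i so far
    for (std::size_t n = 1; n < A.size(); ++n) {
        if (ci(n) > ci(maxI)) maxI = n;  // better "i" candidate, maybe used later
        const long long cand = x(maxI, n);
        if (cand > best.value) best = BestPair{cand, maxI, n};
    }
    return best;
}
```

For A = {4, 0, 0, 3} this returns value 10 with i = 0 and j = 3: index 0 stays the best "i" candidate until a good enough "j" arrives at the end.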
Discussion - why/how it works
j_random_hacker's comment suggests I sketch a proof, but honestly I've no idea where to start. I'll try to explain as best I can - if someone else has a better explanation please chip in....
Restating the problem: greatest Xi,j where j >= i. Given we can set an initial Xbest of X0,0, the problem is knowing when to update it and to what. As we contemplate successive indices in the array as potential values for j, we want to generate Xi,j=n for some i (discussed next) to compare with Xbest. But what i value to use? Well, given any index from 0 to n is <= j, the j >= i constraint isn't relevant if we pick the best i value from the indices we've already visited. We work out the best i value by separating the i-related contribution to X (A[i] - i) from the j-related contribution (A[j] + j), so in preparation for considering whether we've a new best solution with j=n we must maintain the max_i variable too as we go.
A way to approach the problem
For whatever it's worth - when I was groping around for a solution, I wrote down on paper some imaginary i and j contributions that I could see covered the interesting cases... where Ci and Cj are the contributions related to n's use as i and j respectively, something like
n 0 1 2 3 4
Ci 4 2 8 3 1
Cj 12 4 3 5 9
You'll notice I didn't bother picking values where Ci could be A[i] - i while Cj was A[j] + j... I could see the emerging solution should work for any formulas, and that would have just made it harder to capture the interesting cases. So - what's the interesting case? When n = 2 the Ci value is higher than anything we've seen in earlier elements, but given only knowledge of those earlier elements we can't yet see a way to use it. That scenario is the single "great" complication of the problem. What's needed is a Cj value of at least 9 so Xbest is improved, which happens to come along when n = 4. If we'd found an even better Ci at [3] then we'd of course want to use that. max_i tracks the index of that value that is still waiting on a good enough Cj.

Longer version of my comment: what about iterating the array from both ends, trying to find the highest number while decreasing it by the distance from the appropriate end? Would that find the correct indexes (and thus the correct X)?
#include <vector>
#include <algorithm>
#include <iostream>
#include <random>
#include <climits>
long long brutal(const std::vector<int>& a) {
long long x = LLONG_MIN;
for(int i=0; i < a.size(); i++)
for(int j=i; j < a.size(); j++)
x = std::max(x, (long long)a[i] + a[j] + j-i);
return x;
}
long long smart(const std::vector<int>& a) {
if(a.size() == 0) return LLONG_MIN;
long long x = LLONG_MIN, y = x;
for(int i = 0; i < a.size(); i++)
x = std::max(x, (long long)a[i]-i);
for(int j = 0; j < a.size(); j++)
y = std::max(y, (long long)a[j]+j);
return x + y;
}
int main() {
std::random_device rd;
std::uniform_int_distribution<int> rlen(0, 1000);
std::uniform_int_distribution<int> rnum(INT_MIN,INT_MAX);
std::vector<int> v;
for(int loop = 0; loop < 10000; loop++) {
v.resize(rlen(rd));
for(int i = 0; i < v.size(); i++)
v[i] = rnum(rd);
if(brutal(v) != smart(v)) {
std::cout << "bad" << std::endl;
return -1;
}
}
std::cout << "good" << std::endl;
}

I'll write this in pseudocode because I don't have much time. It is a simple recursive formulation (note that, without memoization, it revisits the same ranges and takes exponential time):
compare(array, left, right)
val = array[left] + array[right] + (right - left);
if (right - left) > 0
val1 = compare(array, left, right-1);
val2 = compare(array, left+1, right);
val = Max(Max(val1,val2),val);
end if
return val
and then you simply call
compare(array, 0, array.length - 1);
I think I found a much faster solution, but you need to check it:
you need to rewrite your array as follows
Array[i] = array[i] + (MOD((array.length / 2) - i));
Then you just find the 2 highest values of the array and sum them; that should be your solution, almost O(n)
wait, maybe I'm missing something... I have to check.
OK, you get the 2 highest values from this new array, and save the positions i and j. Then you need to calculate your result from the original array.
------------ EDIT
This should be an implementation of the method suggested by Tony D (in c#) that I tested.
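int best_i, best_j, max_i, currentMax;
best_i = 0;
best_j = 0;
max_i = 0;
// Start below any possible X so that X(0,0) is accepted even when it is negative
currentMax = int.MinValue;
for (int n = 0; n < array.Count; n++)
{
    // Track the index with the highest "i" contribution, array[i] - i
    if (array[n] - n > array[max_i] - max_i) max_i = n;
    // X = A[i] + A[j] + (j - i), with i = max_i and j = n
    if (array[n] + array[max_i] + (n - max_i) > currentMax)
    {
        best_i = max_i;
        best_j = n;
        currentMax = array[n] + array[max_i] + (n - max_i);
    }
}
return currentMax;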

Question:
Given an int array A[], define X=A[i]+A[j]+(j-i), j>=i. Find max value of X?
Answer O(n):
lets rewrite the formula: X = A[i]-i + A[j]+j
we can track the highest A[i]-i we got and the highest A[j]+j we got. We loop over the array once and update both of our max values. After looping once we return the sum of A[i]-i + A[j]+j, which equals X.
We absolutely don't care about the j>=i constraint, because it is always true when we maximize both A[i]-i and A[j]+j
Code:
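long long solution(vector<int> &A){
    if(A.empty()) return -1;
    // LLONG_MIN needs <climits>; it is below every possible contribution
    long long max_Ai_part = LLONG_MIN;
    long long max_Aj_part = LLONG_MIN;
    int size = A.size();
    for(int i=0;i<size;i++){
        // best "i" contribution A[i] - i seen so far
        if(max_Ai_part < (long long)A[i] - i)
            max_Ai_part = (long long)A[i] - i;
        // best "j" contribution A[j] + j; the same loop index serves as j
        if(max_Aj_part < (long long)A[i] + i)
            max_Aj_part = (long long)A[i] + i;
    }
    return max_Ai_part + max_Aj_part;
}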
Bonus:
most people get confused with the j>=i constraint. If you have a feeling for numbers, you should be able to see that i should tend to be lower than j.
Assume we have our formula, it is maximized and i > j. (this is impossible, but lets check it out)
we define x1 := j-i and x2 = i-j
A[i]+A[j]+j-i = A[i]+A[j] + x1, x1 < 0
we could then swap i with j and end up with this:
A[j]+A[i]+i-j = A[i]+A[j] + x2, x2 > 0
it is basically the same formula, but now because i > j the second formula will be greater than the first. In other words we could increase the maximum by swapping i and j which can't be true if we already had the maximum.
If we ever find a maximum, i cannot be greater than j.

Dynamic Programming for analyzing a Vector of Vector of Bools

Here's the problem I'm trying to solve.
Given a square of bools, I want to find the size of the largest subsquare entirely full of trues (1's). I am allowed O(n^2) memory, and the run time must also be O(n^2). The header of the function will look like the following:
unsigned int largestCluster(const vector<vector<bool>> &map);
Some other things to note: there will always be at least one 1 (a 1 x 1 subsquare), and the input will always be a square.
Now for my attempts at the problem:
This is based on the concept of dynamic programming, which, to my limited understanding, stores information found earlier for later use. So if my understanding is correct, Prim's algorithm would be an example of a dynamic algorithm because it remembers which vertices we've visited, the smallest distance to a vertex, and the parent that enables that smallest distance.
I tried analyzing the map and keeping track of the number of true neighbors a true location has. I was thinking that if a spot had 4 true neighbors then it is a potential subsquare. However, this didn't help with subsquares of size 4 or less.
I tried to include a lot of detail in this question for help as I'm trying to game plan a way to tackle this problem because I don't believe it's going to require writing a lengthy function. Thanks for any help

Here's my nomination. Dynamic programming, O(n^2) complexity. I realize that I probably just did somebody's homework, but it looked like an intriguing little problem.
int largestCluster(const std::vector<std::vector<bool> > &a)
{
const int n = a.size();
std::vector<std::vector<short> > s;
s.resize(n);
for (int i = 0; i < n; ++i)
{
s[i].resize(n);
}
s[0][0] = a[0][0] ? 1 : 0;
int maxSize = s[0][0];
for (int k = 1; k < n; ++k)
{
s[k][0] = a[k][0] ? 1 : 0;
if (s[k][0] > maxSize) maxSize = s[k][0]; // a border cell may be the only true
for (int j = 1; j < k; ++j)
{
if (a[k][j])
{
int m = s[k - 1][j - 1];
if (s[k][j - 1] < m)
{
m = s[k][j - 1];
}
if (s[k - 1][j] < m)
{
m = s[k - 1][j];
}
s[k][j] = ++m;
if (m > maxSize)
{
maxSize = m;
}
}
else
{
s[k][j] = 0;
}
}
s[0][k] = a[0][k] ? 1 : 0;
if (s[0][k] > maxSize) maxSize = s[0][k]; // a border cell may be the only true
for (int i = 1; i <= k; ++i)
{
if (a[i][k])
{
int m = s[i - 1][k - 1];
if (s[i - 1][k] < m)
{
m = s[i - 1][k];
}
if (s[i][k - 1] < m)
{
m = s[i][k - 1];
}
s[i][k] = ++m;
if (m > maxSize)
{
maxSize = m;
}
}
else
{
s[i][k] = 0;
}
}
}
return maxSize;
}

If you want a dynamic programming approach, one strategy I could think of would be to treat each box (base case: 1 entry) as a potential upper-left corner of a larger box. Start at the bottom-right corner of your large square; for each box you then only need to look at the boxes to its right, below it, and diagonally below-right of it, using the information already stored for them about the largest cluster so far.
By saving information about each entry we would respect the O(n^2) memory bound (though not o(n^2)); for the run time, however, you need to work out the details of the approach to get to O(n^2).
This is just a rough draft idea as I don't have much time, and I would appreciate any more hints/comments about this myself.
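For reference, here is a compact sketch of the usual recurrence for this problem: the largest all-true square whose bottom-right corner is at (i, j) is one larger than the smallest of its upper, left, and upper-left neighbours. The name `largestSquareSketch` and the parameter name `grid` are invented for this illustration; the input is assumed to be square, as stated in the question.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// O(n^2) time and memory. side[i][j] is the side length of the largest
// all-true square whose bottom-right corner is at (i, j).
unsigned int largestSquareSketch(const std::vector<std::vector<bool>>& grid) {
    const std::size_t n = grid.size();
    std::vector<std::vector<unsigned int>> side(n, std::vector<unsigned int>(n, 0));
    unsigned int best = 0;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (!grid[i][j]) continue;  // a false cell ends every square here
            if (i == 0 || j == 0)
                side[i][j] = 1;         // border cells can only be 1 x 1
            else
                side[i][j] = 1 + std::min({side[i - 1][j],
                                           side[i][j - 1],
                                           side[i - 1][j - 1]});
            best = std::max(best, side[i][j]);
        }
    }
    return best;
}
```

Because each cell is filled from three already-computed neighbours, every cell costs constant work, which gives the required O(n^2) run time.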

What is the fastest way to find longest 'consecutive numbers' streak in vector ?

I have a sorted std::vector<int> and I would like to find the longest 'streak of consecutive numbers' in this vector and then return both the length of it and the smallest number in the streak.
To visualize it for you :
suppose we have :
1 3 4 5 6 8 9
I would like it to return: maxStreakLength = 4 and streakBase = 3
There might be occasion where there will be 2 streaks and we have to choose which one is longer.
What is the best (fastest) way to do this ? I have tried to implement this but I have problems with coping with more than one streak in the vector. Should I use temporary vectors and then compare their lengths?

No, you can do this in one pass through the vector, storing only the start point and length of the longest streak found so far. You also need far fewer than 'N' comparisons. *
hint: If you already have, say, a 4-long match ending at the 5th position (=6), which position do you have to check next?
[*] left as exercise to the reader to work out what's the likely O( ) complexity ;-)
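One possible reading of the hint, as a sketch: once a streak of length L is known, a longer one starting at position s must reach s + L, so probe that position first and skip over everything that cannot beat the record. This assumes the vector is sorted and contains no duplicates, which the question does not state explicitly. `longestStreak` is a made-up name.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Returns {length, base} of the longest run of consecutive integers.
// Assumes v is sorted in strictly increasing order; returns {0, 0} if empty.
std::pair<std::size_t, int> longestStreak(const std::vector<int>& v) {
    const std::size_t n = v.size();
    if (n == 0) return {0, 0};
    // True when v[b] is exactly one more than v[a] (computed without overflow).
    auto adjacent = [&v](std::size_t a, std::size_t b) {
        return static_cast<long long>(v[b]) - v[a] == 1;
    };

    std::size_t bestLen = 1, bestStart = 0, s = 0;
    while (s + bestLen < n) {
        const std::size_t probe = s + bestLen;
        if (static_cast<long long>(v[probe]) - v[s] ==
            static_cast<long long>(bestLen)) {
            // v[s..probe] is consecutive: a new record. Extend it to its end.
            std::size_t e = probe;
            while (e + 1 < n && adjacent(e, e + 1)) ++e;
            bestLen = e - s + 1;
            bestStart = s;
            s = e + 1;
        } else {
            // There is a gap inside v[s..probe]. Walk back from the probe
            // to the last gap; no start before it can beat the record.
            std::size_t p = probe;
            while (p > s && adjacent(p - 1, p)) --p;
            s = p;
        }
    }
    return {bestLen, v[bestStart]};
}
```

For 1 3 4 5 6 8 9 this yields length 4 and base 3. Once the record is long, whole stretches of the vector are skipped without being inspected, although the worst case remains linear.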

It would be interesting to see if the fact that the array is sorted can be exploited somehow to improve the algorithm. The first thing that comes to mind is this: if you know that all numbers in the input array are unique, then for a range of elements [i, j] in the array, you can immediately tell whether elements in that range are consecutive or not, without actually looking through the range. If this relation holds
array[j] - array[i] == j - i
then you can immediately say that elements in that range are consecutive. This criterion, obviously, uses the fact that the array is sorted and that the numbers don't repeat.
Now, we just need to develop an algorithm which will take advantage of that criterion. Here's one possible recursive approach:
1. Input of recursive step is the range of elements [i, j]. Initially it is [0, n-1] - the whole array.
2. Apply the above criterion to range [i, j]. If the range turns out to be consecutive, there's no need to subdivide it further. Send the range to output (see below for further details).
3. Otherwise (if the range is not consecutive), divide it into two equal parts [i, m] and [m+1, j].
4. Recursively invoke the algorithm on the lower part ([i, m]) and then on the upper part ([m+1, j]).
The above algorithm will perform binary partition of the array and recursive descent of the partition tree using the left-first approach. This means that this algorithm will find adjacent subranges with consecutive elements in left-to-right order. All you need to do is to join the adjacent subranges together. When you receive a subrange [i, j] that was "sent to output" at step 2, you have to concatenate it with previously received subranges, if they are indeed consecutive. Or you have to start a new range, if they are not consecutive. All the while you have to keep track of the "longest consecutive range" found so far.
That's it.
The benefit of this algorithm is that it detects subranges of consecutive elements "early", without looking inside these subranges. Obviously, its worst-case performance (if there are no consecutive subranges at all) is still O(n). In the best case, when the entire input array is consecutive, this algorithm will detect it instantly. (I'm still working on a meaningful O estimation for this algorithm.)
The usability of this algorithm is, again, undermined by the uniqueness requirement. I don't know whether it is something that is "given" in your case.
Anyway, here's a possible C++ implementation
#include <cassert>
#include <utility>
#include <vector>
typedef std::vector<int> vint;
typedef std::pair<vint::size_type, vint::size_type> range;
class longest_sequence
{
public:
const range& operator ()(const vint &v)
{
current = max = range(0, 0);
process_subrange(v, 0, v.size() - 1);
check_record();
return max;
}
private:
range current, max;
void process_subrange(const vint &v, vint::size_type i, vint::size_type j);
void check_record();
};
void longest_sequence::process_subrange(const vint &v,
vint::size_type i, vint::size_type j)
{
assert(i <= j && v[i] <= v[j]);
assert(i == 0 || i == current.second + 1);
if (v[j] - v[i] == j - i)
{ // Consecutive subrange found
assert(v[current.second] <= v[i]);
if (i == 0 || v[i] == v[current.second] + 1)
// Append to the current range
current.second = j;
else
{ // Range finished
// Check against the record
check_record();
// Start a new range
current = range(i, j);
}
}
else
{ // Subdivision and recursive calls
assert(i < j);
vint::size_type m = (i + j) / 2;
process_subrange(v, i, m);
process_subrange(v, m + 1, j);
}
}
void longest_sequence::check_record()
{
assert(current.second >= current.first);
if (current.second - current.first > max.second - max.first)
// We have a new record
max = current;
}
int main()
{
int a[] = { 1, 3, 4, 5, 6, 8, 9 };
std::vector<int> v(a, a + sizeof a / sizeof *a);
range r = longest_sequence()(v);
return 0;
}

I believe that this should do it?
size_t beginStreak = 0;
size_t streakLen = 1;
size_t longest = 0;
size_t longestStart = 0;
for (size_t i=1; i < vec.size(); i++) {
if (vec[i] == vec[i-1] + 1) {
streakLen++;
}
else {
if (streakLen > longest) {
longest = streakLen;
longestStart = beginStreak;
}
beginStreak = i;
streakLen = 1;
}
}
if (streakLen > longest) {
longest = streakLen;
longestStart = beginStreak;
}

You can't solve this problem in less than O(N) time. Imagine your list is the first N-1 even numbers, plus a single odd number (chosen from among the first N-1 odd numbers). Then there is a single streak of length 3 somewhere in the list, but worst case you need to scan the entire list to find it. Even on average you'll need to examine at least half of the list to find it.

Similar to Rodrigo's solutions but solving your example as well:
#include <vector>
#include <cstdio>
#define len(x) sizeof(x) / sizeof(x[0])
using namespace std;
int nums[] = {1,3,4,5,6,8,9};
int streakBase = nums[0];
int maxStreakLength = 1;
void updateStreak(int currentStreakLength, int currentStreakBase) {
if (currentStreakLength > maxStreakLength) {
maxStreakLength = currentStreakLength;
streakBase = currentStreakBase;
}
}
int main(void) {
vector<int> v;
for(size_t i=0; i < len(nums); ++i)
v.push_back(nums[i]);
int lastBase = v[0], currentStreakBase = v[0], currentStreakLength = 1;
for(size_t i=1; i < v.size(); ++i) {
if (v[i] == lastBase + 1) {
currentStreakLength++;
lastBase = v[i];
} else {
updateStreak(currentStreakLength, currentStreakBase);
currentStreakBase = v[i];
lastBase = v[i];
currentStreakLength = 1;
}
}
updateStreak(currentStreakLength, currentStreakBase);
printf("maxStreakLength = %d and streakBase = %d\n", maxStreakLength, streakBase);
return 0;
}
