Calculating the big-O complexity of this string match function? - c++

can anyone help me calculate the complexity of the following?
I've written a strStr function for homework, and although it's not part of my homework, I want to figure out the complexity of it.
basically it takes a string, finds 1st occurence of substring, returns it's index,
I believe it O(n), because although it's double loop'd at most it'll run only n times, where n is the length of s1, am I correct?
int strStr( char s1[] , char s2[] ){
int haystackInd, needleInd;
bool found = false;
needleInd = haystackInd = 0;
while ((s1[haystackInd] != '\0') && (!found)){
while ( (s1[haystackInd] == s2[needleInd]) && (s2[needleInd] != '\0') ){
needleInd++;
haystackInd++;
}
if (s2[needleInd] == '\0'){
found = true;
}else{
if (needleInd != 0){
needleInd = 0;
}
else{
haystackInd++;
}
}
}
if (found){
return haystackInd - needleInd;
}
else{
return -1;
}
}

It is indeed O(n), but it is also not functioning properly. Consider finding "nand" in "nanand"
There is an O(n) solution to the problem though.

Actually, the outer loop could run 2n times (each iteration increments haystackInd at least once OR it sets needleInd to 0, but never sets needleInd to 0 in 2 successive iterations), but you end up w/ the same O(n) complexity.

Your algorithm isn't correct. The indices, haystackInd, in your solution are incorrect. But your conclusion based on your wrong algorithm was right. It is O(n), but just it can't find the first occurrence of the substring. The most trivial solution is like yours, compare string S2 to substrings starting from S1[0], S1[1],...And the running time is O(n^2). If you want O(n) one, you should check out KMP algorithm as templatetypedef mentioned above.

Related

reduce time complexity in checking if a substring of a string is palindromic

this is a simple program to check for a substring which is a palindrome.
it works fine for string of length 1000 but gives TLE error on SPOJ for a length of 100000. how shall i optimize this code. saving all the substrings will not work for such large inputs. the time limit is 1 sec so we can do at most 10^6-10^7 iterations. is there any other way i can do it.
#include<bits/stdc++.h>
int main()
{
int t;
std::cin>>t;
if(t<1||t>10)
return 0;
while(t--)
{
std::string s;
std::cin>>s;
//std::cout<<s.substr(0,1);
//std::vector<std::string>s1;
int n=s.length();
if(n<1||n>100000)
return 0;
int len,mid,k=0,i=0;
for(i=0;i<n-1;i++)
{
for(int j=2;j<=n-i;j++)
{
std::string ss=s.substr(i,j);
//s1.push_back(ss);
len=ss.length();
mid=len/2;
while(k<=mid&&(len-1-k)>=mid&&len>1)
{
if(ss[k]!=ss[len-1-k])
break;
k++;
}
if(k>mid||(len-1-k)<mid)
{
std::cout<<"YES"<<std::endl;
break;
}
}
if(k>mid||(len-1-k)<mid)
break;
}
if(i==n-1)
std::cout<<"NO"<<std::endl;
//for(i=0;i<m;i++)
// std::cout<<s1[i]<<std::endl
}
return 0;
}
Your assumption that saving all sub-strings in another vector and checking them later with same O(N^2) approach will not help you to reduce time complexity of your algorithm. Instead, it will increases your memory complexity too. Holding all possible sub-strings in another vector will take lots of memory.
Since the size of string could be maximum of 10^5. To check if there exist any palindromic sub-string should be done either in O(NlogN) or O(N) time complexity to pass within time limit. For this, I suggest you two algorithms:
1.) Suffix array : Link here
2.) Manacher’s Algorithm: Link here
I'm not completely sure what your function is trying to accomplish... are you finding t palindromic substrings?
To save on memory, rather than store every substring in a vector and then iterating over the vector to check for palindromes, why not just check if the substring is a palindrome as you generate them?
std::string ss = s.substr(i,j);
// s1.push_back(ss); // Don't store the substrings
if (palindromic(ss)) {
std::cout << "YES" << std::endl;
break;
}
This saves some time as most cases, as you no longer always generate every possible substring. However, it is not guaranteed to be much faster in the worst case.

C++ if statement order

A portion of a program needs to check if two c-strings are identical while searching though an ordered list (e.g.{"AAA", "AAB", "ABA", "CLL", "CLZ"}). It is feasible that the list could get quite large, so small improvements in speed are worth degradation of readability. Assume that you are restricted to C++ (please don't suggest switching to assembly). How can this be improved?
typedef char StringC[5];
void compare (const StringC stringX, const StringC stringY)
{
// use a variable so compareResult won't have to be computed twice
int compareResult = strcmp(stringX, stringY);
if (compareResult < 0) // roughly 50% chance of being true, so check this first
{
// no match. repeat with a 'lower' value string
compare(stringX, getLowerString() );
}
else if (compareResult > 0) // roughly 49% chance of being true, so check this next
{
// no match. repeat with a 'higher' value string
compare(stringX, getHigherString() );
}
else // roughly 1% chance of being true, so check this last
{
// match
reportMatch(stringY);
}
}
You can assume that stringX and stringY are always the same length and you won't get any invalid data input.
From what I understand, a compiler will make the code so that the CPU will check the first if-statement and jump if it's false, so it would be best if that first statement is the most likely to be true, as jumps interfere with the pipeline. I have also heard that when doing a compare, a[n Intel] CPU will do a subtraction and look at the status of flags without saving the subtraction's result. Would there be a way to do the strcmp once, without saving the result into a variable, but still being able to check that result during the both of the first two if-statements?
std::binary_search may help:
bool cstring_less(const char (&lhs)[4], const char (&rhs)[4])
{
return std::lexicographical_compare(std::begin(lhs), std::end(lhs),
std::begin(rhs), std::end(rhs));
}
int main(int, char**)
{
const char cstrings[][4] = {"AAA", "AAB", "ABA", "CLL", "CLZ"};
const char lookFor[][4] = {"BBB", "ABA", "CLS"};
for (const auto& s : lookFor)
{
if (std::binary_search(std::begin(cstrings), std::end(cstrings),
s, cstring_less))
{
std::cout << s << " Found.\n";
}
}
}
Demo
I think using hash tables can improve the speed of comparison drastically. Also, if your program is multithreaded, you can find some useful hash tables in intel thread building blocks library. For example, tbb::concurrent_unordered_map has the same api as std::unordered_map
I hope it helps you.
If you try to compare all the strings to each other you'll get in a O(N*(N-1)) problem. The best thing, as you have stated the lists can grow large, is to sort them (qsort algorithm has O(N*log(N))) and then compare each element with the next one in the list, which adds a new O(N) giving up to O(N*log(N)) total complexity. As you have the list already ordered, you can just traverse it (making the thing O(N)), comparing each element with the next. An example, valid in C and C++ follows:
for(i = 0; i < N-1; i++) /* one comparison less than the number of elements */
if (strcmp(array[i], array[i+1]) == 0)
break;
if (i < N-1) { /* this is a premature exit from the loop, so we found a match */
/* found a match, array[i] equals array[i+1] */
} else { /* we exhausted al comparisons and got out normally from the loop */
/* no match found */
}

Need suggestion to improve speed for word break (dynamic programming)

The problem is: Given a string s and a dictionary of words dict, determine if s can be segmented into a space-separated sequence of one or more dictionary words.
For example, given
s = "hithere",
dict = ["hi", "there"].
Return true because "hithere" can be segmented as "leet code".
My implementation is as below. This code is ok for normal cases. However, it suffers a lot for input like:
s = "aaaaaaaaaaaaaaaaaaaaaaab", dict = {"aa", "aaaaaa", "aaaaaaaa"}.
I want to memorize the processed substrings, however, I cannot done it right. Any suggestion on how to improve? Thanks a lot!
class Solution {
public:
bool wordBreak(string s, unordered_set<string>& wordDict) {
int len = s.size();
if(len<1) return true;
for(int i(0); i<len; i++) {
string tmp = s.substr(0, i+1);
if((wordDict.find(tmp)!=wordDict.end())
&& (wordBreak(s.substr(i+1), wordDict)) )
return true;
}
return false;
}
};
It's logically a two-step process. Find all dictionary words within the input, consider the found positions (begin/end pairs), and then see if those words cover the whole input.
So you'd get for your example
aa: {0,2}, {1,3}, {2,4}, ... {20,22}
aaaaaa: {0,6}, {1,7}, ... {16,22}
aaaaaaaa: {0,8}, {1,9} ... {14,22}
This is a graph, with nodes 0-23 and a bunch of edges. But node 23 b is entirely unreachable - no incoming edge. This is now a simple graph theory problem
Finding all places where dictionary words occur is pretty easy, if your dictionary is organized as a trie. But even an std::map is usable, thanks to its equal_range method. You have what appears to be an O(N*N) nested loop for begin and end positions, with O(log N) lookup of each word. But you can quickly determine if s.substr(begin,end) is a still a viable prefix, and what dictionary words remain with that prefix.
Also note that you can build the graph lazily. Staring at begin=0 you find edges {0,2}, {0,6} and {0,8}. (And no others). You can now search nodes 2, 6 and 8. You even have a good algorithm - A* - that suggests you try node 8 first (reachable in just 1 edge). Thus, you'll find nodes {8,10}, {8,14} and {8,16} etc. As you see, you'll never need to build the part of the graph that contains {1,3} as it's simply unreachable.
Using graph theory, it's easy to see why your brute-force method breaks down. You arrive at node 8 (aaaaaaaa.aaaaaaaaaaaaaab) repeatedly, and each time search the subgraph from there on.
A further optimization is to run bidirectional A*. This would give you a very fast solution. At the second half of the first step, you look for edges leading to 23, b. As none exist, you immediately know that node {23} is isolated.
In your code, you are not using dynamic programming because you are not remembering the subproblems that you have already solved.
You can enable this remembering, for example, by storing the results based on the starting position of the string s within the original string, or even based on its length (because anyway the strings you are working with are suffixes of the original string, and therefore its length uniquely identifies it). Then, in the beginning of your wordBreak function, just check whether such length has already been processed and, if it has, do not rerun the computations, just return the stored value. Otherwise, run computations and store the result.
Note also that your approach with unordered_set will not allow you to obtain the fastest solution. The fastest solution that I can think of is O(N^2) by storing all the words in a trie (not in a map!) and following this trie as you walk along the given string. This achieves O(1) per loop iteration not counting the recursion call.
Thanks for all the comments. I changed my previous solution to the implementation below. At this point, I didn't explore to optimize on the dictionary, but those insights are very valuable and are very much appreciated.
For the current implementation, do you think it can be further improved? Thanks!
class Solution {
public:
bool wordBreak(string s, unordered_set<string>& wordDict) {
int len = s.size();
if(len<1) return true;
if(wordDict.size()==0) return false;
vector<bool> dq (len+1,false);
dq[0] = true;
for(int i(0); i<len; i++) {// start point
if(dq[i]) {
for(int j(1); j<=len-i; j++) {// length of substring, 1:len
if(!dq[i+j]) {
auto pos = wordDict.find(s.substr(i, j));
dq[i+j] = dq[i+j] || (pos!=wordDict.end());
}
}
}
if(dq[len]) return true;
}
return false;
}
};
Try the following:
class Solution {
public:
bool wordBreak(string s, unordered_set<string>& wordDict)
{
for (auto w : wordDict)
{
auto pos = s.find(w);
if (pos != string::npos)
{
if (wordBreak(s.substr(0, pos), wordDict) &&
wordBreak(s.substr(pos + w.size()), wordDict))
return true;
}
}
return false;
}
};
Essentially one you find a match remove the matching part from the input string and so continue testing on a smaller input.

Finding all occurrences using rfind, flow challenges?

Following a c++ tutorial and teaching about find() the following code was implemented to search for all the "cat" occurrences in a string:
std::string input;
std::size_t i = 0, x_appearances = 0;
std::getline(std::cin,input);
for(i = input.find("cat",0); i != std::string::npos; i=input.find("cat", i))
{
++x_appearances;
++i; //Move past the last discovered instance to avoid finding the same string
}
Then the tutorial challenges the apprentice to change find() for rfind(), and that's where the problems came in, first I tried what seemed to be the obvious approach:
for(i = input.rfind("cat",input.length()); i != std::string::npos; i=input.rfind("cat", i))
{
++x_appearances;
--i; //Move past the last discovered instance to avoid finding the same string
}
but with this solution I fell into an infinite loop. Then I discovered that it was happening because the increment is performed before the condition check, and that the increment rfind() was always finding a match even with i==std::string::npos (if the match is on the beginning of the string, for example "cats"). My final solution came to be:
int n=input.length();
for(i = input.rfind("cat",input.length()); n>0 && i!=std::string::npos; i=input.rfind("cat", i))
{
++x_appearances;
n=i;
--i; //Move past the last discovered instance to avoid finding the same string
}
With n I can keep the track of the position in the string, and with it exit the for loop when the entire string had been searched.
So my question is: Is my approach correct? Did I need an extra variable or is there any other simpler way of doing this?
for(i = input.rfind("cat",input.length()); i != std::string::npos; i=input.rfind("cat", i))
{
++x_appearances;
--i; //Move past the last discovered instance to avoid finding the same string
}
The problem with the above is the --i inside the loop. Suppose the input string starts with "cat". Your algorithm will eventually find that "cat" with i being 0. Since you've declared i as a std::size_t, subtracting 1 from 0 results in the largest possible std::size_t. There's no warning, no overflow, no undefined behavior. This is exactly how unsigned integers must work, per the standard.
Somehow you need to handle this special case. You could use an auxiliary variable and a more convoluted test in your loop. An alternative is to keep your code simple and at the same time make it blatantly obvious you are explicitly handling this special case:
for (i = input.rfind("cat"); i != std::string::npos; i=input.rfind("cat", i-1))
{
++x_appearances;
// Finding "cat" at the start means we're done.
if (i == 0) {
break;
}
}
Note also that I've changed the loop statement a bit. The default value for pos is std::string::npos, which means search from the end of the string. There's no need for that second argument with the initializer. I also moved the --i into the update part of the for loop, changing input.rfind("cat",i) to input.rfind("cat",i-1). Since i is always positive at this point, there's no danger in subtracting one.

Why is my binary search so insanely slow compared to an iterative?

I'm writing an autocomplete program that finds all possible matches to a letter or set of characters given a dictionary file and input file. I just finished a version that implements a binary search over an iterative search and thought I could boost the overall performance of the program.
Thing is, the binary search is almost 9 times slower than an iterative search. What gives? I thought I was improving performance by using a binary search over iterative.
Run time(bin search to the left)[Larger]:
Here is the important part of each version, full code can be built and run at my github with cmake.
Binary Search Function(called while looping through input given)
bool search(std::vector<std::string>& dict, std::string in,
std::queue<std::string>& out)
{
//tick makes sure the loop found at least one thing. if not then break the function
bool tick = false;
bool running = true;
while(running) {
//for each element in the input vector
//find all possible word matches and push onto the queue
int first=0, last= dict.size() -1;
while(first <= last)
{
tick = false;
int middle = (first+last)/2;
std::string sub = (dict.at(middle)).substr(0,in.length());
int comp = in.compare(sub);
//if comp returns 0(found word matching case)
if(comp == 0) {
tick = true;
out.push(dict.at(middle));
dict.erase(dict.begin() + middle);
}
//if not, take top half
else if (comp > 0)
first = middle + 1;
//else go with the lower half
else
last = middle - 1;
}
if(tick==false)
running = false;
}
return true;
}
Iterative Search(included in main loop):
for(int k = 0; k < input.size(); ++k) {
int len = (input.at(k)).length();
// truth false variable to end out while loop
bool found = false;
// create an iterator pointing to the first element of the dictionary
vecIter i = dictionary.begin();
// this while loop is not complete, a condition needs to be made
while(!found && i != dictionary.end()) {
// take a substring the dictionary word(the length is dependent on
// the input value) and compare
if( (*i).substr(0,len) == input.at(k) ) {
// so a word is found! push onto the queue
matchingCase.push(*i);
}
// move iterator to next element of data
++i;
}
}
example input file:
z
be
int
nor
tes
terr
on
Instead of erasing elements in the middle of the vector (which is quite expensive), and then starting your search over, just compare the elements before and after the found item (because they should all be adjacent to eachother) until you find the all the items which match.
Or use std::equal_range, which does exactly that.
This will be the culprit:
dict.erase(dict.begin() + middle);
You are repeatedly removing items from your dictionary to naively use binary search to find all valid prefixes. This adds huge complexity, and is unnecessary.
Instead, once you have found a match, step backwards until you find the first match, then step forwards, adding all matches to your queue. Remember that because your dictionary is sorted and you are using only the prefixes, all valid matches will appear consecutively.
dict.erase operation is linear in the size of dict: it copies the entire array from middle to end into the beginning of the array. This makes the "binary search" algorithm possible quadratic in the length of dict, with O(N^2) expensive memory copy operations.