The problem is: Given a string s and a dictionary of words dict, determine if s can be segmented into a space-separated sequence of one or more dictionary words.
For example, given
s = "hithere",
dict = ["hi", "there"].
Return true because "hithere" can be segmented as "hi there".
My implementation is below. This code is fine for normal cases; however, it suffers badly for input like:
s = "aaaaaaaaaaaaaaaaaaaaaaab", dict = {"aa", "aaaaaa", "aaaaaaaa"}.
I want to memoize the processed substrings, but I cannot get it right. Any suggestions on how to improve? Thanks a lot!
class Solution {
public:
    bool wordBreak(string s, unordered_set<string>& wordDict) {
        int len = s.size();
        if (len < 1) return true;
        for (int i(0); i < len; i++) {
            string tmp = s.substr(0, i + 1);
            if ((wordDict.find(tmp) != wordDict.end())
                && (wordBreak(s.substr(i + 1), wordDict)))
                return true;
        }
        return false;
    }
};
It's logically a two-step process. Find all dictionary words within the input, consider the found positions (begin/end pairs), and then see if those words cover the whole input.
So you'd get for your example
aa: {0,2}, {1,3}, {2,4}, ... {20,22}
aaaaaa: {0,6}, {1,7}, ... {16,22}
aaaaaaaa: {0,8}, {1,9} ... {14,22}
This is a graph, with nodes 0-23 and a bunch of edges. But node 23 - the end position, just past the b - has no incoming edge, so it is entirely unreachable. This is now a simple graph theory problem.
Finding all places where dictionary words occur is pretty easy, if your dictionary is organized as a trie. But even an std::map is usable, thanks to its equal_range method. You have what appears to be an O(N*N) nested loop for begin and end positions, with O(log N) lookup of each word. But you can quickly determine whether the substring from begin to end is still a viable prefix, and which dictionary words remain with that prefix.
Also note that you can build the graph lazily. Starting at begin=0 you find edges {0,2}, {0,6} and {0,8}. (And no others.) You can now search nodes 2, 6 and 8. You even have a good algorithm - A* - that suggests you try node 8 first (reachable in just 1 edge). Thus, you'll find edges {8,10}, {8,14} and {8,16} etc. As you see, you'll never need to build the part of the graph that contains {1,3} as it's simply unreachable.
Using graph theory, it's easy to see why your brute-force method breaks down. You arrive at node 8 (aaaaaaaa.aaaaaaaaaaaaaab) repeatedly, and each time search the subgraph from there on.
A further optimization is to run a bidirectional A*. This would give you a very fast solution. In the backward half of the first step, you look for edges leading into node 23 (the end, just past the b). As none exist, you immediately know that node 23 is isolated.
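To make the graph view concrete, here is a rough sketch that treats positions 0..len as nodes and dictionary matches as edges, using plain BFS instead of A* just to keep it short; the function name reachable is mine, not from the original post:
#include <queue>
#include <string>
#include <unordered_set>
#include <vector>

bool reachable(const std::string& s, const std::unordered_set<std::string>& dict) {
    const int len = s.size();
    std::vector<bool> visited(len + 1, false);
    std::queue<int> frontier;
    frontier.push(0);
    visited[0] = true;
    while (!frontier.empty()) {
        int pos = frontier.front();
        frontier.pop();
        if (pos == len) return true;               // reached the end node
        for (const auto& w : dict) {               // build outgoing edges lazily
            int next = pos + (int)w.size();
            if (next <= len && !visited[next] && s.compare(pos, w.size(), w) == 0) {
                visited[next] = true;
                frontier.push(next);
            }
        }
    }
    return false;
}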
In your code, you are not using dynamic programming because you are not remembering the subproblems that you have already solved.
You can enable this memoization by, for example, storing results keyed by the starting position of the current suffix within the original string, or equivalently by its length (the strings you work with are all suffixes of the original string, so the length uniquely identifies each one). Then, at the beginning of your wordBreak function, check whether that length has already been processed; if it has, do not rerun the computation, just return the stored value. Otherwise, run the computation and store the result.
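As a concrete illustration, here is a minimal memoized sketch along those lines. The helper name wordBreakMemo and the cache encoding (-1 unknown, 0 false, 1 true) are my own choices; results are cached by the length of the remaining suffix, as described above:
#include <string>
#include <unordered_set>
#include <vector>

bool wordBreakMemo(const std::string& s, const std::unordered_set<std::string>& dict,
                   std::vector<int>& cache, int start) {
    int len = (int)s.size() - start;       // length of the remaining suffix
    if (len == 0) return true;
    if (cache[len] != -1) return cache[len] == 1;
    for (int i = start; i < (int)s.size(); ++i) {
        if (dict.count(s.substr(start, i - start + 1)) &&
            wordBreakMemo(s, dict, cache, i + 1)) {
            cache[len] = 1;
            return true;
        }
    }
    cache[len] = 0;
    return false;
}

bool wordBreak(const std::string& s, const std::unordered_set<std::string>& dict) {
    std::vector<int> cache(s.size() + 1, -1);   // -1 = not computed yet
    return wordBreakMemo(s, dict, cache, 0);
}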
Note also that your approach with unordered_set will not give you the fastest solution. The fastest solution that I can think of is O(N^2): store all the words in a trie (not in a map!) and follow this trie as you walk along the given string. This achieves O(1) per inner-loop iteration, not counting the recursive call.
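For completeness, a rough sketch of that trie idea might look like the following, assuming lowercase a-z input; TrieNode, insert and wordBreakTrie are names I made up:
#include <memory>
#include <string>
#include <unordered_set>
#include <vector>

struct TrieNode {
    bool isWord = false;
    std::unique_ptr<TrieNode> child[26];   // assuming lowercase a-z input
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        auto& next = node->child[c - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        node = next.get();
    }
    node->isWord = true;
}

bool wordBreakTrie(const std::string& s, const std::unordered_set<std::string>& dict) {
    TrieNode root;
    for (const auto& w : dict) insert(root, w);

    const int len = s.size();
    std::vector<bool> reachable(len + 1, false);
    reachable[0] = true;
    for (int i = 0; i < len; ++i) {
        if (!reachable[i]) continue;
        // Walk the trie along s starting at i; every word end marks a new
        // reachable position. No substrings are created, so each inner step is O(1).
        const TrieNode* node = &root;
        for (int j = i; j < len; ++j) {
            node = node->child[s[j] - 'a'].get();
            if (!node) break;
            if (node->isWord) reachable[j + 1] = true;
        }
    }
    return reachable[len];
}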
Thanks for all the comments. I changed my previous solution to the implementation below. At this point, I haven't explored optimizing the dictionary, but those insights are very valuable and much appreciated.
For the current implementation, do you think it can be further improved? Thanks!
class Solution {
public:
    bool wordBreak(string s, unordered_set<string>& wordDict) {
        int len = s.size();
        if (len < 1) return true;
        if (wordDict.size() == 0) return false;
        vector<bool> dq(len + 1, false);
        dq[0] = true;
        for (int i(0); i < len; i++) {              // start point
            if (dq[i]) {
                for (int j(1); j <= len - i; j++) { // length of substring, 1:len
                    if (!dq[i + j]) {
                        auto pos = wordDict.find(s.substr(i, j));
                        dq[i + j] = dq[i + j] || (pos != wordDict.end());
                    }
                }
            }
            if (dq[len]) return true;
        }
        return false;
    }
};
Try the following:
class Solution {
public:
    bool wordBreak(string s, unordered_set<string>& wordDict)
    {
        if (s.empty()) return true;   // an empty remainder is fully segmented
        for (auto w : wordDict)
        {
            auto pos = s.find(w);
            if (pos != string::npos)
            {
                if (wordBreak(s.substr(0, pos), wordDict) &&
                    wordBreak(s.substr(pos + w.size()), wordDict))
                    return true;
            }
        }
        return false;
    }
};
Essentially, once you find a match, remove the matching part from the input string and continue testing on the smaller remaining pieces.
Related
I have some parsed text from the Vision API, and I'm filtering it using keywords, like so:
if (finalTextRaw.find("File") != finalTextRaw.npos)
{
LogMsg("Found Menubar");
}
E.g., if the keyword "File" is found anywhere within the string finalTextRaw, then the function is interrupted and a log message is printed.
This method is very reliable. But I've inefficiently just made a bunch of if-else-if statements in this fashion, and as I'm finding more words that need filtering, I'd rather be a little more efficient. Instead, I'm now getting a string from a config file, and then parsing that string into an array:
string filterWords = GetApp()->GetFilter();
std::replace(filterWords.begin(), filterWords.end(), ',', ' '); ///replace ',' with ' '
vector<string> array;
stringstream ss(filterWords);
string temp;
while (ss >> temp)
    array.push_back(temp); ///create an array of filtered words
And I'd like to have just one if statement for checking that string against the array, instead of many of them for checking the string against each keyword I'm having to manually specify in the code. Something like this:
if (finalTextRaw.find(array) != finalTextRaw.npos)
{
LogMsg("Found filtered word");
}
Of course, that syntax doesn't work, and it's surely more complicated than that, but hopefully you get the idea: if any words from my array appear anywhere in that string, that string should be ignored and a log message printed instead.
Any ideas how I might fashion such a function? I'm guessing it's going to necessitate some kind of loop.
Borrowing from Thomas's answer, a range-based for loop offers a neat solution:
for (const auto &word : words)
{
    if (finalTextRaw.find(word) != std::string::npos)
    {
        // word is found.
        // do stuff here or call a function.
        break; // stop the loop.
    }
}
As pointed out by Thomas, the most efficient way is to split both texts into lists of words. Then use std::set_intersection to find the words occurring in both lists. You can use std::vector as long as it is sorted. You end up with O(n*log(n)) (with n being the larger word count), rather than O(n*m).
Split sentences into words:
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

auto split(std::string_view sentence) {
    auto result = std::vector<std::string>{};
    auto stream = std::istringstream{std::string{sentence}};
    std::copy(std::istream_iterator<std::string>(stream),
              std::istream_iterator<std::string>(), std::back_inserter(result));
    return result;
}
Find words existing in both lists. This only works for sorted lists (like sets or manually sorted vectors).
auto intersect(std::vector<std::string> a, std::vector<std::string> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    auto result = std::vector<std::string>{};
    std::set_intersection(std::move_iterator{a.begin()},
                          std::move_iterator{a.end()},
                          b.cbegin(), b.cend(),
                          std::back_inserter(result));
    return result;
}
Example of how to use.
int main() {
    const auto result = intersect(split("hello my name is mister raw"),
                                  split("this is the final raw text"));
    for (const auto& word : result) {
        // do something with word
    }
}
Note that this makes sense when working with large or undefined number of words. If you know the limits, you might want to use easier solutions (provided by other answers).
You could use a fundamental, brute force, loop:
unsigned int quantity_words = array.size();
for (unsigned int i = 0; i < quantity_words; ++i)
{
    std::string word = array[i];
    if (finalTextRaw.find(word) != std::string::npos)
    {
        // word is found.
        // do stuff here or call a function.
        break; // stop the loop.
    }
}
The above loop takes each word in the array and searches the finalTextRaw for the word.
There are better methods using some std algorithms. I'll leave that for other answers.
Edit 1: maps and association
The above code is bothering me because there are too many passes through the finalTextRaw string.
Here's another idea:
Create a std::set using the words in finalTextRaw.
For each word in your array, check for existence in the set.
This reduces the quantity of searches (it's like searching a tree).
You should also investigate creating a set of the words in array and finding the intersection between the two sets.
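A sketch of the first idea, assuming finalTextRaw is whitespace-separated and array holds the filter words (containsFilteredWord is a name I made up), might look like this. Note that, unlike the substring find() in the question, this matches whole words only:
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>

bool containsFilteredWord(const std::string& finalTextRaw,
                          const std::vector<std::string>& array) {
    std::istringstream stream(finalTextRaw);
    std::set<std::string> textWords{std::istream_iterator<std::string>(stream),
                                    std::istream_iterator<std::string>()};
    for (const auto& word : array)
        if (textWords.count(word) != 0)
            return true;                 // found a filtered word
    return false;
}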
I am solving a question on Codeforces where the problem statement asks me to find the set of all distinct strings obtainable from a given string by cyclic shifts.
For example:
Given string: "abcd"
the output should be 4 ("dabc", "cdab", "bcda", "abcd") [note: "abcd" is also counted]
So, to rotate the string by one position, I do:
t = s[l-1];
for (i = l-1; i > 0; i--)
{
    s[i] = s[i-1];
}
s[0] = t;
I applied the above method length - 1 times to generate all possible rotations, but I am unable to count only the distinct ones.
is there any STL function to do this?
You may use the following:
std::set<std::string>
retrieve_unique_rotations(std::string s)
{
    std::set<std::string> res;
    res.insert(s);
    if (s.empty()) {
        return res;
    }
    for (std::size_t i = 0, size = s.size() - 1; i != size; ++i) {
        std::rotate(s.begin(), s.begin() + 1, s.end());
        res.insert(s);
    }
    return res;
}
Not sure about STL-specific functions; however, a general solution could be to store all shifted strings in a list, sort the list, and then iterate over its elements, incrementing a counter whenever the current element differs from the previous one.
There is probably a solution that is less memory intensive. For short strings this solution should be sufficient.
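A short sketch of that sort-then-scan approach applied to the rotations (countDistinctRotations is my own name for it) could be:
#include <algorithm>
#include <string>
#include <vector>

std::size_t countDistinctRotations(std::string s) {
    std::vector<std::string> rotations;
    for (std::size_t i = 0; i < s.size(); ++i) {
        rotations.push_back(s);
        std::rotate(s.begin(), s.begin() + 1, s.end());
    }
    std::sort(rotations.begin(), rotations.end());
    std::size_t count = rotations.empty() ? 0 : 1;
    for (std::size_t i = 1; i < rotations.size(); ++i)
        if (rotations[i] != rotations[i - 1])
            ++count;                     // current element differs from previous
    return count;
}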
You can use a vector to build the list after rotating, using vec.push_back(str). Before each push, you can check whether it already exists, using something like:
if (std::find(vec.begin(), vec.end(), str) == vec.end())
{
    increment++;
    vec.push_back(str);
}
Or else you can count the elements at the end with vec.size() and remove increment++.
Hope this helps
A portion of a program needs to check if two c-strings are identical while searching through an ordered list (e.g. {"AAA", "AAB", "ABA", "CLL", "CLZ"}). It is feasible that the list could get quite large, so small improvements in speed are worth the degradation of readability. Assume that you are restricted to C++ (please don't suggest switching to assembly). How can this be improved?
typedef char StringC[5];

void compare (const StringC stringX, const StringC stringY)
{
    // use a variable so compareResult won't have to be computed twice
    int compareResult = strcmp(stringX, stringY);
    if (compareResult < 0) // roughly 50% chance of being true, so check this first
    {
        // no match. repeat with a 'lower' value string
        compare(stringX, getLowerString() );
    }
    else if (compareResult > 0) // roughly 49% chance of being true, so check this next
    {
        // no match. repeat with a 'higher' value string
        compare(stringX, getHigherString() );
    }
    else // roughly 1% chance of being true, so check this last
    {
        // match
        reportMatch(stringY);
    }
}
You can assume that stringX and stringY are always the same length and you won't get any invalid data input.
From what I understand, a compiler will arrange the code so that the CPU checks the first if-statement and jumps if it's false, so it would be best if that first statement is the most likely to be true, as jumps interfere with the pipeline. I have also heard that when doing a compare, an [Intel] CPU will do a subtraction and look at the status flags without saving the subtraction's result. Would there be a way to do the strcmp once, without saving the result into a variable, but still being able to check that result during both of the first two if-statements?
std::binary_search may help:
#include <algorithm>
#include <iostream>
#include <iterator>

bool cstring_less(const char (&lhs)[4], const char (&rhs)[4])
{
    return std::lexicographical_compare(std::begin(lhs), std::end(lhs),
                                        std::begin(rhs), std::end(rhs));
}

int main(int, char**)
{
    const char cstrings[][4] = {"AAA", "AAB", "ABA", "CLL", "CLZ"};
    const char lookFor[][4] = {"BBB", "ABA", "CLS"};
    for (const auto& s : lookFor)
    {
        if (std::binary_search(std::begin(cstrings), std::end(cstrings),
                               s, cstring_less))
        {
            std::cout << s << " Found.\n";
        }
    }
}
I think using hash tables can improve the speed of comparison drastically. Also, if your program is multithreaded, you can find some useful hash tables in the Intel Threading Building Blocks library. For example, tbb::concurrent_unordered_map has the same API as std::unordered_map.
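A minimal single-threaded sketch of that hash-table idea, using the standard std::unordered_set rather than the TBB container, and reusing the example strings from the question:
#include <string>
#include <unordered_set>

int main() {
    const std::unordered_set<std::string> list{"AAA", "AAB", "ABA", "CLL", "CLZ"};
    const std::string stringX = "CLL";
    bool match = list.find(stringX) != list.end();   // average O(1) lookup
    return match ? 0 : 1;
}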
I hope it helps you.
If you try to compare all the strings to each other, you end up with an O(N*(N-1)) problem. Since, as you state, the lists can grow large, the best thing is to sort them (quicksort is O(N*log(N))) and then compare each element with the next one in the list, which adds another O(N), giving O(N*log(N)) total complexity. As your list is already ordered, you can simply traverse it (making the whole thing O(N)), comparing each element with the next. An example, valid in both C and C++, follows:
int i;
for (i = 0; i < N-1; i++) /* one comparison less than the number of elements */
    if (strcmp(array[i], array[i+1]) == 0)
        break;
if (i < N-1) { /* this is a premature exit from the loop, so we found a match */
    /* found a match, array[i] equals array[i+1] */
} else { /* we exhausted all comparisons and got out normally from the loop */
    /* no match found */
}
I'm writing an autocomplete program that finds all possible matches to a letter or set of characters given a dictionary file and input file. I just finished a version that implements a binary search over an iterative search and thought I could boost the overall performance of the program.
Thing is, the binary search is almost 9 times slower than an iterative search. What gives? I thought I was improving performance by using a binary search instead of an iterative one.
(Run time screenshot omitted; the binary search timings are shown on the left.)
Here is the important part of each version; the full code can be built and run from my GitHub with CMake.
Binary search function (called while looping through the given input):
bool search(std::vector<std::string>& dict, std::string in,
            std::queue<std::string>& out)
{
    //tick makes sure the loop found at least one thing. if not then break the function
    bool tick = false;
    bool running = true;
    while (running) {
        //for each element in the input vector
        //find all possible word matches and push onto the queue
        int first = 0, last = dict.size() - 1;
        while (first <= last)
        {
            tick = false;
            int middle = (first + last) / 2;
            std::string sub = (dict.at(middle)).substr(0, in.length());
            int comp = in.compare(sub);
            //if comp returns 0(found word matching case)
            if (comp == 0) {
                tick = true;
                out.push(dict.at(middle));
                dict.erase(dict.begin() + middle);
            }
            //if not, take top half
            else if (comp > 0)
                first = middle + 1;
            //else go with the lower half
            else
                last = middle - 1;
        }
        if (tick == false)
            running = false;
    }
    return true;
}
Iterative search (included in the main loop):
for (int k = 0; k < input.size(); ++k) {
    int len = (input.at(k)).length();
    // true/false variable to end our while loop
    bool found = false;
    // create an iterator pointing to the first element of the dictionary
    vecIter i = dictionary.begin();
    // this while loop is not complete, a condition needs to be made
    while (!found && i != dictionary.end()) {
        // take a substring of the dictionary word (the length is dependent on
        // the input value) and compare
        if ( (*i).substr(0, len) == input.at(k) ) {
            // so a word is found! push onto the queue
            matchingCase.push(*i);
        }
        // move iterator to next element of data
        ++i;
    }
}
example input file:
z
be
int
nor
tes
terr
on
Instead of erasing elements in the middle of the vector (which is quite expensive) and then starting your search over, just compare the elements before and after the found item (they should all be adjacent to each other) until you have found all the items that match.
Or use std::equal_range, which does exactly that.
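A sketch of what that could look like with std::equal_range, assuming dict stays sorted and using a comparator that only looks at the first in.length() characters of each word (findMatches and prefixLess are names I made up):
#include <algorithm>
#include <queue>
#include <string>
#include <vector>

void findMatches(const std::vector<std::string>& dict, const std::string& in,
                 std::queue<std::string>& out) {
    auto prefixLess = [&](const std::string& a, const std::string& b) {
        // compare only the leading in.length() characters of each side
        return a.compare(0, in.length(), b, 0, in.length()) < 0;
    };
    auto range = std::equal_range(dict.begin(), dict.end(), in, prefixLess);
    for (auto it = range.first; it != range.second; ++it)
        out.push(*it);                   // every word sharing the prefix
}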
This will be the culprit:
dict.erase(dict.begin() + middle);
You are repeatedly removing items from your dictionary to naively use binary search to find all valid prefixes. This adds huge complexity, and is unnecessary.
Instead, once you have found a match, step backwards until you find the first match, then step forwards, adding all matches to your queue. Remember that because your dictionary is sorted and you are using only the prefixes, all valid matches will appear consecutively.
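A small sketch of that expand-around-the-hit idea; collectAround is a name I made up, and middle is assumed to be the index the binary search landed on:
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

void collectAround(const std::vector<std::string>& dict, const std::string& in,
                   std::size_t middle, std::queue<std::string>& out) {
    auto matches = [&](const std::string& word) {
        return word.compare(0, in.length(), in) == 0;   // shares the prefix
    };
    std::size_t first = middle;
    while (first > 0 && matches(dict[first - 1])) --first;     // step backwards
    for (std::size_t i = first; i < dict.size() && matches(dict[i]); ++i)
        out.push(dict[i]);                                      // step forwards
}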
The dict.erase operation is linear in the size of dict: it shifts every element after middle one position toward the front of the array. This makes the "binary search" algorithm potentially quadratic in the length of dict, with O(N^2) expensive memory copy operations.
I have a global unique path table which can be thought of as a directed un-weighted graph. Each node represents either a piece of physical hardware which is being controlled, or a unique location in the system. The table contains the following for each node:
A unique path ID (int)
Type of component (char - 'A' or 'L')
String which contains a comma separated list of path ID's which that node is connected to (char[])
I need to create a function which, given a starting and ending node, finds the shortest path between the two nodes. Normally this is a pretty simple problem, but here is the issue I am having. I have a very limited amount of memory/resources, so I cannot use any dynamic memory allocation (i.e. a queue/linked list). It would also be nice if it weren't recursive (but that wouldn't be too big of an issue, as the table/graph itself is really small; currently it has 26 nodes, 8 of which will never be hit, and at worst case there would be about 40 nodes total).
I started putting something together, but it doesn't always find the shortest path. The pseudo code is below:
bool shortestPath(int start, int end)
    if start == end
        if pathTable[start].nodeType == 'A'
            Turn on part
        end if
        return true
    else
        mark the current node
        bool val
        for each node in connectedNodes
            if node is not marked
                val = shortestPath(node.PathID, end)
            end if
        end for
        if val == true
            if pathTable[start].nodeType == 'A'
                turn on part
            end if
            return true
        end if
    end if
    return false
end function
Does anyone have any ideas on how to fix this code, or know of something else that I could use to make it work?
----------------- EDIT -----------------
Taking Aasmund's advice, I looked into implementing a breadth-first search. Below is some C# code which I quickly threw together using some pseudo code I found online.
pseudo code found online:
Input: A graph G and a root v of G
procedure BFS(G, v):
    create a queue Q
    enqueue v onto Q
    mark v
    while Q is not empty:
        t ← Q.dequeue()
        if t is what we are looking for:
            return t
        for all edges e in G.adjacentEdges(t) do
            u ← G.adjacentVertex(t, e)
            if u is not marked:
                mark u
                enqueue u onto Q
    return none
C# code which I wrote using this pseudo code:
public static bool newCheckPath(int source, int dest)
{
    Queue<PathRecord> Q = new Queue<PathRecord>();
    Q.Enqueue(pathTable[source]);
    pathTable[source].markVisited();
    while (Q.Count != 0)
    {
        PathRecord t = Q.Dequeue();
        if (t.pathID == pathTable[dest].pathID)
        {
            return true;
        }
        else
        {
            string connectedPaths = pathTable[t.pathID].connectedPathID;
            for (int x = 0; x < connectedPaths.Length && connectedPaths != "00"; x = x + 3)
            {
                int nextNode = Convert.ToInt32(connectedPaths.Substring(x, 2));
                PathRecord u = pathTable[nextNode];
                if (!u.wasVisited())
                {
                    u.markVisited();
                    Q.Enqueue(u);
                }
            }
        }
    }
    return false;
}
This code runs just fine; however, it only tells me whether a path exists. That doesn't really work for me. Ideally, in the block "if (t.pathID == pathTable[dest].pathID)", I would like to have either a list or some other way to see which nodes I had to pass through to get from the source to the destination, so that I can process those nodes there rather than returning a list to process elsewhere. Any ideas on how I could make that change?
The most effective solution, if you're willing to use static memory allocation (or automatic, as I seem to recall the C++ term is), is to declare a fixed-size int array (of size 41, if you're absolutely certain that the number of nodes will never exceed 40). By using two indices to indicate the start and end of the queue, you can use this array as a ring buffer, which can act as the queue in a breadth-first search.
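Here is a C++ sketch of that ring-buffer BFS under the no-dynamic-allocation constraint, with a predecessor array added so the actual path can be recovered. The adjacency-matrix parameter and all the names are my own simplification of the path table described in the question:
#include <cstring>

const int MAX_NODES = 41;   // safe upper bound for ~40 nodes, per the suggestion above

// adj[u][v] is true when node u has an edge to node v (built once from the
// path table's comma-separated connection lists). pathOut receives the path
// from end back to start; pathLen is its length.
bool shortestPath(const bool adj[MAX_NODES][MAX_NODES], int nodeCount,
                  int start, int end, int pathOut[MAX_NODES], int& pathLen)
{
    int queue[MAX_NODES];                   // fixed array used as a ring buffer
    int head = 0, tail = 0;
    bool visited[MAX_NODES];
    int pred[MAX_NODES];                    // predecessor of each visited node
    std::memset(visited, 0, sizeof(visited));
    for (int i = 0; i < MAX_NODES; ++i) pred[i] = -1;

    queue[tail] = start; tail = (tail + 1) % MAX_NODES;
    visited[start] = true;

    while (head != tail)
    {
        int node = queue[head]; head = (head + 1) % MAX_NODES;
        if (node == end)
        {
            // Walk the predecessor chain back to recover the path (end..start).
            pathLen = 0;
            for (int n = end; n != -1; n = pred[n]) pathOut[pathLen++] = n;
            return true;
        }
        for (int next = 0; next < nodeCount; ++next)
        {
            if (adj[node][next] && !visited[next])
            {
                visited[next] = true;
                pred[next] = node;
                queue[tail] = next; tail = (tail + 1) % MAX_NODES;
            }
        }
    }
    return false;
}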
Alternatively: Since the number of nodes is so small, Bellman-Ford may be fast enough. The algorithm is simple to implement, does not use recursion, and the required extra memory is only a distance (int, or even byte in your case) and a predecessor id (int) per node. The running time is O(VE), alternatively O(V^3), where V is the number of nodes and E is the number of edges.
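And a compact Bellman-Ford sketch under the same constraints, treating every edge as weight 1 so dist[] ends up holding hop counts; it reuses MAX_NODES and the adjacency-matrix convention from the sketch above, and the names are again mine:
#include <climits>

// After the call, dist[v] holds the hop count from start to v (INT_MAX when
// unreachable) and pred[v] the previous node on one shortest path.
void bellmanFord(const bool adj[MAX_NODES][MAX_NODES], int nodeCount, int start,
                 int dist[MAX_NODES], int pred[MAX_NODES])
{
    for (int i = 0; i < nodeCount; ++i) { dist[i] = INT_MAX; pred[i] = -1; }
    dist[start] = 0;
    for (int pass = 0; pass < nodeCount - 1; ++pass)      // V-1 relaxation passes
    {
        for (int u = 0; u < nodeCount; ++u)
        {
            if (dist[u] == INT_MAX) continue;
            for (int v = 0; v < nodeCount; ++v)
            {
                if (adj[u][v] && dist[u] + 1 < dist[v])   // relax edge u -> v (weight 1)
                {
                    dist[v] = dist[u] + 1;
                    pred[v] = u;
                }
            }
        }
    }
}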