Generalised suffix tree traversal to find longest common substring - c++

I'm working with suffix trees. As far as I can tell, I have Ukkonen's algorithm running correctly to build a generalised suffix tree from an arbitrary number of strings. I'm now trying to implement a find_longest_common_substring() method to do exactly that. For this to work, I understand that I need to find the deepest shared edge (with depth in terms of characters, rather than edges) between all strings in the tree, and I've been struggling for a few days to get the traversal right.
Right now I have the following in C++. I'll spare you all my code, but for context, I'm keeping the edges of each node in an unordered_map called outgoing_edges, and each edge has a vector of ints recorded_strings containing integers identifying the added strings. The child field of an edge is the node it is going to, and l and r identify its left and rightmost indices, respectively. Finally, current_string_number is the current number of strings in the tree.
SuffixTree::Edge * SuffixTree::find_deepest_shared_edge(SuffixTree::Node * start, int current_length, int &longest) {
    Edge * deepest_shared_edge = new Edge;
    auto it = start->outgoing_edges.begin();
    while (it != start->outgoing_edges.end()) {
        if (it->second->recorded_strings.size() == current_string_number + 1) {
            int edge_length = it->second->r - it->second->l + 1;
            int path_length = current_length + edge_length;
            find_deepest_shared_edge(it->second->child, path_length, longest);
            if (path_length > longest) {
                longest = path_length;
                deepest_shared_edge = it->second;
            }
        }
        it++;
    }
    return deepest_shared_edge;
}
When trying to debug, as best I can tell, the traversal runs mostly fine, and correctly records the path length and sets longest. However, for reasons I don't quite understand, in the innermost conditional, deepest_shared_edge sometimes seems to get updated to a mistaken edge. I suspect I maybe don't quite understand how it->second is updated throughout the recursion. Yet I'm not quite sure how to go about fixing this.
I'm aware of this similar question, but the approach seems sufficiently different that I'm not quite sure how it applies here.
I'm mainly doing this for fun and learning, so I don't necessarily need working code to replace the above - pseudocode or just an explanation of where I'm confused would serve just as well.

Your handling of deepest_shared_edge is wrong. First, the allocation you do at the start of the function is a memory leak, since you never free the memory. Secondly, the result of the recursive call is ignored, so whatever deepest edge it finds is lost (although you update the depth, you don't keep track of the deepest edge).
To fix this, you should either pass deepest_shared_edge as a reference parameter (like you do for longest), or you can initialize it to nullptr, then check the return from your recursive call for nullptr and update it appropriately.

Binary search of range in std::map slower than map::find() search of whole map

Background: I'm new to C++. I have a std::map and am trying to search for elements by key.
Problem: Performance. The map::find() function slows down when the map gets big.
Preferred approach: I often know roughly where in the map the element should be; I can provide a [first,last) range to search in. This range is always small w.r.t. the number of elements in the map. I'm interested in writing a short binary search utility function with boundary hinting.
Attempt: I stole the below function from https://en.cppreference.com/w/cpp/algorithm/lower_bound and did some rough benchmarks. This function seems to be much slower than map::find() for maps large and small, regardless of the size or position of the range hint provided. I replaced the comparison statements (it->first < value) with a comparison of random ints and the slowdown appeared to resolve, so I think the slowdown may be caused by the dereferencing of it->first.
Question: Is the dereferencing the issue? Or is there some kind of unnecessary copy/move action going on? I think I remember reading that maps don't store their element nodes sequentially in memory, so am I just getting a bunch of cache misses? What is the likely cause of the slowdown, and how would I go about fixing it?
/* @param first Iterator pointing to the first element of the map to search.
 * @param distance Number of map elements in the range to search.
 * @param value Map key to search for. NOTE: Type validation is not a concern just yet.
 */
template<class ForwardIt, class T>
ForwardIt binary_search_map (ForwardIt& first, const int distance, const T& value) {
    ForwardIt it = first;
    typename std::iterator_traits<ForwardIt>::difference_type count, step;
    count = distance;
    while (count > 0) {
        it = first;
        step = count/2;
        std::advance(it, step);
        if (it->first < value) {
            first = ++it;
            count -= step + 1;
        }
        else if (it->first > value)
            count = step;
        else {
            first = it;
            break;
        }
    }
    return first;
}
There is a reason that std::map::find() exists: it already does a binary search, since std::map is implemented as a balanced binary tree.
Your hand-rolled binary search is much slower because it can't take advantage of that tree structure.
When you jump to the middle of the map, std::advance starts at the first node (a leaf of the tree) and follows several pointers to reach what you consider to be the middle. Afterwards, it again has to walk from one of these leaf nodes to the next, again chasing a lot of pointers.
The result: on top of a lot more looping, you get a lot of cache misses, especially when the map is large.
If you want to improve the lookups in your map, I would recommend using a different structure. When ordering isn't important, you could use std::unordered_map. When order is important, you could use a sorted std::vector<std::pair<Key, Value>>. In case you have boost available, this already exists in a class called boost::container::flat_map.
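To illustrate the flat-map idea, here is a hedged sketch of a sorted-vector lookup using std::lower_bound; FlatMap and flat_find are made-up names for this example, not part of any library:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// A tiny "flat map": key/value pairs kept sorted by key in a contiguous
// vector, so binary search touches cache-friendly memory instead of
// chasing tree pointers.
using FlatMap = std::vector<std::pair<int, std::string>>;

// Returns a pointer to the mapped value, or nullptr if the key is absent.
const std::string *flat_find(const FlatMap &m, int key) {
    auto it = std::lower_bound(m.begin(), m.end(), key,
                               [](const auto &p, int k) { return p.first < k; });
    if (it != m.end() && it->first == key)
        return &it->second;
    return nullptr;
}
```

The vector must be kept sorted on insertion, so this trades slower insertion for faster, cache-friendly lookup - the same trade-off boost::container::flat_map makes.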

How do I calculate the time complexity of the following function?

Here is a recursive function that traverses a map of strings (multimap<string, string> graph). For each entry it checks itr->second (s_tmp); if s_tmp is equal to the desired string (Exp), it prints itr->first and runs the function again on itr->first.
string findOriginalExp(string Exp){
    cout<<"*****findOriginalExp Function*****"<<endl;
    string str;
    if(graph.empty()){
        str ="map is empty";
    }else{
        for(auto itr=graph.begin();itr!=graph.end();itr++){
            string s_tmp = itr->second;
            string f_tmp = itr->first;
            string nll = "null";
            //s_tmp.compare(Exp) == 0
            if(s_tmp == Exp){
                if(f_tmp.compare(nll) == 0){
                    cout<< Exp <<" :is original experience.";
                    return Exp;
                }else{
                    return findOriginalExp(itr->first);
                }
            }else{
                str="No element is equal to Exp.";
            }
        }
    }
    return str;
}
There are no rules for stopping and it seems to be completely random. How is the time complexity of this function calculated?
I am not going to analyse your function but instead try to answer in a more general way. It seems like you are looking for a simple expression such as O(n) or O(n^2) for the complexity of your function. However, complexity is not always that simple to estimate.
In your case it strongly depends on the contents of graph and on what the user passes as the parameter.
As an analogy consider this function:
int foo(int x){
if (x == 0) return x;
if (x == 42) return foo(42);
if (x > 0) return foo(x-1);
return foo(x/2);
}
In the worst case it never returns to the caller. If we ignore x >= 42 then worst case complexity is O(n). This alone isn't that useful as information for the user. What I really need to know as user is:
Don't ever call it with x >= 42.
O(1) if x==0
O(x) if x>0
O(ln(x)) if x < 0
Now try to make similar considerations for your function. The easy case is when Exp is not in graph; in that case there is no recursion. I am almost sure that for the "right" input your function can be made to never return. Find out what cases those are and document them. In between you have cases that return after a finite number of steps. If you have no clue at all how to get your hands on them analytically, you can always set up a benchmark and measure. Measuring the runtime for input sizes 10, 50, 100, 1000, ... should be sufficient to distinguish between linear, quadratic and logarithmic dependence.
PS: Just a tip: Don't forget what the code is actually supposed to do and what time complexity is needed to solve that problem (often it is easier to discuss that in an abstract way rather than diving too deep into code). In the silly example above the whole function can be replaced by its equivalent int foo(int){ return 0; } which obviously has constant complexity and does not need to be any more complex than that.
This function takes a directed graph and a vertex in that graph and chases edges going into it backwards to find a vertex with no edge pointing into it. The operation of finding the vertex "behind" any given vertex takes O(n) string comparisons in n the number of k/v pairs in the graph (this is the for loop). It does this m times, where m is the length of the path it must follow (which it does through the recursion). Therefore, it has time complexity O(m * n) string comparisons in n the number of k/v pairs and m the length of the path.
Note that there's generally no such thing as "the" time complexity for just some function you see written in code. You have to define what variables you want to describe the time in terms of, and also the operations with which you want to measure the time. E.g. if we want to write this purely in terms of n the number of k/v pairs, you run into a problem, because if the graph contains a suitably placed cycle, the function doesn't terminate! If you further constrain the graph to be acyclic, then the maximum length of any path is constrained by m < n, and then you can also get that this function does O(n^2) string comparisons for an acyclic graph with n edges.
You should approximate the control flow of the recursive calls by using a recurrence relation. It's been about 30 years since I took college classes in discrete math, but generally you write something like pseudocode, just enough to see how many calls there are. In some cases just counting the calls on the most expensive branch of the right-hand side is useful, but you generally need to plug one expansion back in and derive a polynomial or power relationship from that.
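For the acyclic case analysed above (each pass over the graph costs n string comparisons, and the followed path has length m), the expansion is short:

T(m) = n + T(m - 1)
     = n + n + T(m - 2)
     = ...
     = m*n + T(0)  which is in  O(m*n)

and with m < n for an acyclic graph, this recovers the O(n^2) bound mentioned earlier.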

Divide function returns wrong value

I have a function which is supposed to check whether there is an index i such that i is equal to v[i]. v is a strictly ascending sorted vector. It needs to be done in O(log n), so I thought of divide and conquer. I was never really familiar with recursion. I wrote this code and I don't really know why it won't work; something's missing. If I put cout << mid instead of the return it will show the right value, but I guess that's not the proper way to do it. As it stands, the mid value returned is 7, and I don't know why.
Here's the code.
int customDivide(vector <int>& v, int left, int right)
{
    if(left <= right)
    {
        int mid = (left+right)/2;
        if(mid == v[mid]){
            //cout<<mid<<" ";
            return mid;
        }
        customDivide(v,left,mid-1);
        customDivide(v,mid+1,right);
    }
}
Two problems exist here, and I will attempt to explain them with an analogy of finding a lost dog in your neighborhood.
You are not returning a value from the function, unless you found the correct element immediately. You promise to return a value (an int) but you don't always do it.
This is like promising to send the dog owner a letter to indicate where their lost dog can be found. You check your garden and if you find the dog, you send a letter - that works. If you didn't find the dog, you go to your neighbors and have them promise to send you a letter if they find the dog, using the same method as you did (recursion). The problem is that in this case you are not reading or forwarding their letters (the return values from your two recursive function calls at the end) - you are just throwing these letters away. Worse, you are not actually sending back the letter to the guy looking for his dog if you didn't find it yourself (no return after the calls). Your code seems to assume that the neighbors will automatically send the letter to the dog owner - that is not how return works, it just sends the letter to the previous person in the chain (in code terms, the call stack), so if that person throws it into the trash right away, the system won't work.
You cannot get O(log(n)) performance if you unconditionally recurse to both sides.
If you always ask all your neighbors (and they ask all of theirs), you will have literally every person in the neighborhood looking for the dog. That's O(n). You must identify which of your two neighbors should look for the dog (e.g. by looking at the trail the dog left) and only ask that one. This way you halve the number of people that might have to search for the dog at each step, giving you O(log(n)) performance.
This "trail" is something you need to know beforehand. In your case, it is not clear what that could be - the dog could be anywhere (all elements could have random values) and you have no idea where to go looking. You need to figure out this detail of the task to get to your O(log(n)) time. It could be that vector elements are strictly increasing (see @Jarod42's comment), i.e. there are no duplicate elements and each one is larger than the previous one. In that case you can decide that only one of the two remaining halves can possibly contain what you are looking for, and recurse only there.
(Yes I know, the analogy breaks down unless your neighborhood is shaped like a binary tree with you at the top and a non-reciprocal definition of "neighbors".)
As a result of all the help involved here, I finally understood what the problem was.
First of all, it didn't take advantage of the fact that the vector was strictly sorted.
Second, every function in the call stack returns something now.
int customDivide(vector <int>& v, int left, int right)
{
    if(left <= right)
    {
        int mid = (left+right)/2;
        if(mid == v[mid])
            return mid;
        else if(v[mid] < mid)
            return customDivide(v,mid+1,right);
        else
            return customDivide(v,left, mid-1);
    }
    return -1;
}
Thanks a lot for all your help!

Efficient intersection of two sets

I have two sets (or maps) and need to efficiently handle their intersection.
I know that there are two ways of doing this:
iterate over both maps as in std::set_intersection: O(n1+n2)
iterating over one map and finding elements in the other: O(n1*log(n2))
Depending on the sizes, either of these two solutions is significantly better (I have timed it), so I need to either switch between the algorithms based on the sizes (which is a bit messy) - or find a solution outperforming both, e.g. using some variant of map.find() taking the previous iterator as a hint (similar to map.emplace_hint(...)) - but I could not find such a function.
Question: Is it possible to combine the performance characteristics of the two solutions directly using STL - or some compatible library?
Note that the performance requirement makes this different from earlier questions such as
Efficient intersection of sets?
In almost every case std::set_intersection will be the best choice.
The other solution may be better only if the sets contain a very small number of elements.
This is due to the nature of the base-two logarithm, which scales as:
n = 2, log(n)= 1
n = 4, log(n)= 2
n = 8, log(n)= 3
.....
n = 1024 log(n) = 10
O(n1*log(n2)) is significantly more expensive than O(n1 + n2) once the sets contain more than roughly 5-10 elements.
There is a reason such a function was added to the STL and implemented like that. It also makes the code more readable.
Selection sort is faster than merge or quick sort for collections with length less than 20 but is rarely used.
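For reference, the linear merge this answer recommends is essentially a one-liner over any two sorted ranges; a small self-contained sketch:

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Both inputs are already sorted (a std::set invariant), so this is a
// single O(n1 + n2) merge pass.
std::vector<int> intersect(const std::set<int> &a, const std::set<int> &b) {
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```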
For sets that are implemented as binary trees, there actually is an algorithm that combines the benefits of both the procedures you mention. Essentially, you do a merge like std::set_intersection, but while iterating in one tree, you skip any branches that are all less than the current value in the other.
The resulting intersection takes O(min(n1 log n2, n2 log n1, n1 + n2)), which is just what you want.
Unfortunately, I'm pretty sure std::set doesn't provide interfaces that could support this operation.
I've done it a few times in the past though, when working on joining inverted indexes and similar things. Usually I make iterators with a skipTo(x) operation that will advance to the next element >= x. To meet my promised complexity it has to be able to skip N elements in log(N) amortized time. Then an intersection looks like this:
template <class T>
void get_intersection(vector<T> *dest, const set<T> &set1, const set<T> &set2)
{
    auto it1 = set1.begin();
    auto end1 = set1.end();
    if (it1 == end1)
        return;
    auto it2 = set2.begin();
    auto end2 = set2.end();
    if (it2 == end2)
        return;
    for (;;)
    {
        it1.skipTo(*it2);
        if (it1 == end1)
            break;
        if (*it1 == *it2)
        {
            dest->push_back(*it1);
            if (++it1 == end1)
                break;
        }
        it2.skipTo(*it1);
        if (it2 == end2)
            break;
        if (*it2 == *it1)
        {
            dest->push_back(*it2);
            if (++it2 == end2)
                break;
        }
    }
}
It easily extends to an arbitrary number of sets using a vector of iterators, and pretty much any ordered collection can be extended to provide the iterators required -- sorted arrays, binary trees, b-trees, skip lists, etc.
I don't know how to do this using the standard library, but if you wrote your own balanced binary search tree, here is how to implement a limited "find with hint". (Depending on your other requirements, a BST reimplementation could also leave out the parent pointers, which could be a performance win over the STL.)
Assume that the hint value is less than the value to be found and that we know the stack of ancestors of the hint node to whose left sub-tree the hint node belongs. First search normally in the right sub-tree of the hint node, pushing nodes onto the stack as warranted (to prepare the hint for next time). If this doesn't work, then while the stack's top node has a value that is less than the query value, pop the stack. Search from the last node popped (if any), pushing as warranted.
I claim that, when using this mechanism to search successively for values in ascending order, (1) each tree edge is traversed at most once, and (2) each find traverses the edges of at most two descending paths. Given 2*n1 descending paths in a binary tree with n2 nodes, the cost of the edges is O(n1 log n2). It's also O(n2), because each edge is traversed once.
With regard to the performance requirement, O(n1 + n2) is in most circumstances a very good complexity, so this is only worth optimizing if you're doing this calculation in a tight loop.
If you really do need it, the combination approach isn't too bad, perhaps something like?
Pseudocode:
x' = set_with_min_length([x, y])
y' = set_with_max_length([x, y])
if (x'.length * log(y'.length)) <= (x'.length + y'.length):
    return iterate_over_map_find_elements_in_other(y', x')
return std::set_intersection(x, y)
I don't think you'll find an algorithm that will beat either of these complexities but happy to be proven wrong.
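The pseudocode above translates fairly directly to C++; this is just a sketch of the size-based switch, where the log2 threshold is a rough heuristic rather than a tuned constant:

```cpp
#include <algorithm>
#include <cmath>
#include <iterator>
#include <set>
#include <vector>

// Pick the asymptotically cheaper of the two strategies from the question:
// a linear merge, or probing the larger set for each element of the smaller.
std::vector<int> intersect_adaptive(const std::set<int> &a, const std::set<int> &b) {
    const std::set<int> &small = a.size() <= b.size() ? a : b;
    const std::set<int> &big   = a.size() <= b.size() ? b : a;
    std::vector<int> out;
    double n1 = static_cast<double>(small.size());
    double n2 = static_cast<double>(big.size());
    if (n1 * std::log2(n2 + 1) < n1 + n2) {
        for (int x : small)          // O(n1 * log n2): probe the larger set
            if (big.count(x))
                out.push_back(x);
    } else {
        std::set_intersection(small.begin(), small.end(),
                              big.begin(), big.end(),
                              std::back_inserter(out)); // O(n1 + n2) merge
    }
    return out;
}
```

Both branches produce the elements in ascending order, so callers see the same result either way.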

Recursive Backtracking Sudoku Solver Problems, c++

It's my first time dealing with recursion as an assignment in a low level course. I've looked around the internet and I can't seem to find anybody using a method similar to the one I've come up with (which probably says something about why this isn't working). The error is a segmentation fault in std::__copy_move... which I'm assuming is something in the c++ STL.
Anywho, my code is as follows:
bool sudoku::valid(int x, int y, int value)
{
    if (x < 0) {cerr << "No valid values exist.\n";}
    if (binary_search(row(x).begin(), row(x).end(), value))
        {return false;} //if found in row x, exit, otherwise:
    else if (binary_search(col(y).begin(), col(y).end(), value))
        {return false;} //if found in col y, exit, otherwise:
    else if (binary_search(box((x/3), (y/3)).begin(), box((x/3), (y/3)).end(), value))
        {return false;} //if found in box x,y, exit, otherwise:
    else
        {return true;} //the value is valid at this index
}
int sudoku::setval(int x, int y, int val)
{
    if (y < 0 && x > 0) {x--; y = 9;} //if y gets decremented past 0 go to previous row.
    if (y > 8) {y %= 9; x++;}         //if y gets incremented past 8 go to next row.
    if (x == 9) {return 0;}           //base case, puzzle done.
    else {
        if (valid(x,y,val)){    //if the input is valid
            matrix[x][y] = val; //set the element equal to val
            setval(x,y++,val);  //go to next element
        }
        else {
            setval(x,y,val++);  //otherwise increment val
            if(val > 9) {val = value(x,y--); setval(x,y--,val++); }
        } //if val gets above 9, set val to prev element,
    }     //and increment the last element until valid and start over
}
I've been trying to wrap my head around this thing for a while and I can't seem to figure out what's going wrong. Any suggestions are highly appreciated! :)
sudoku::setval is supposed to return an int but there are at least two paths where it returns nothing at all. You should figure out what it needs to return in those other paths because otherwise you'll be getting random undefined behavior.
Without more information, it's impossible to tell. Things like the data structures involved, and what row and col return, for example.
Still, there are a number of obvious problems:
In sudoku::valid, you check for what is apparently an error condition (x < 0), but you don't return; you still continue your tests, using the negative value of x.
Also in sudoku::valid: do row and col really return references to sorted values? If the values aren't sorted, then binary_search will have undefined behavior (and if they are, the names are somewhat misleading). And if they return values (copies of something), rather than a reference to the same object, then the begin() and end() functions will refer to different objects - again, undefined behavior.
Finally, I don't see any backtracking in your algorithm, and I don't see how it progresses to a solution.
FWIW: when I wrote something similar, I used a simple array of 81 elements for the board, then created static arrays which mapped the index (0-80) to the appropriate row, column and box. And for each of the nine rows, columns and boxes, I kept a set of used values (a bitmap); this made checking for legality very trivial, and it meant that I could increment to the next square to test just by incrementing the index. The resulting code was extremely simple.
Independently of the data representation used, you'll need: some "global" (probably a member of sudoku) means of knowing whether you've found the solution or not; a loop somewhere trying each of the nine possible values for a square (stopping when the solution has been found); and the recursion. If you're not using a simple array for the board, as I did, I'd suggest a class or a struct for the index, with a function which takes care of the incrementation once and for all.
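The bitmap bookkeeping described above can be sketched roughly like this (Board, place, remove and legal are hypothetical names for illustration, not from the question's sudoku class):

```cpp
#include <array>

// One 9-bit mask per row, column and 3x3 box records which digits (1-9)
// are already used, so a legality check is three bit tests.
struct Board {
    std::array<int, 9> row_used{}, col_used{}, box_used{};

    static int box_of(int r, int c) { return (r / 3) * 3 + c / 3; }

    bool legal(int r, int c, int digit) const {
        int bit = 1 << (digit - 1);
        return !(row_used[r] & bit) && !(col_used[c] & bit) &&
               !(box_used[box_of(r, c)] & bit);
    }
    void place(int r, int c, int digit) {
        int bit = 1 << (digit - 1);
        row_used[r] |= bit; col_used[c] |= bit; box_used[box_of(r, c)] |= bit;
    }
    void remove(int r, int c, int digit) { // undo a placement when backtracking
        int mask = ~(1 << (digit - 1));
        row_used[r] &= mask; col_used[c] &= mask; box_used[box_of(r, c)] &= mask;
    }
};
```

The remove operation is what makes backtracking cheap: undoing a trial placement is three bit clears rather than rescanning the board.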
All of the following is for Unix not Windows.
std::__copy_move... is STL alright. But STL doesn't do anything by itself, some function call from your code would've invoked it with wrong arguments or in wrong state. You need to figure that out.
If you have a core dump from the seg-fault then just do a pstack <core file name>; you will see the full call stack of the crash. Then just see which part of your code was involved in it and start debugging (add traces/couts/...) from there.
Usually you'll get this stack with nice readable names, but in case you don't, you can use nm or c++filt etc. to demangle the names.
Finally, pstack is just a small cmd line utility, you can always load the binary (that produced the core) and the core file into a debugger like gdb, Sun Studio or debugger built into your IDE and see the same thing along with lots of other info and options.
HTH
It seems like your algorithm is a bit "brute forcy". This is generally not a good tactic with Constraint Satisfaction Problems (CSPs). I wrote a sudoku solver a while back (wish I still had the source code, it was before I discovered github) and the fastest algorithm that I could find was Simulated Annealing:
http://en.wikipedia.org/wiki/Simulated_annealing
It's probabilistic, but it was generally orders of magnitude faster than other methods for this problem IIRC.
HTH!
A segmentation fault may (and eventually will) happen if you enter a function recursively too many times.
I noted one scenario which leads to it, but I'm pretty sure there are more.
Tip: write down, in your own words, the purpose of each function - if it is too complicated to write down, the function should probably be split...