How to find the indexes of all matching substrings using a suffix tree? - C++

I created a suffix tree from this amazing answer. It works like a charm!
For now, if I look for "cat" in "This cat is a pretty cat", it returns 5, the starting index of the first occurrence of "cat".
But I can't find a way to keep track of all the suffixes while the tree is being built. So basically, I can find the index of the first match, but not all the different occurrences.
For now, I have:
#include <map>
class Node; // forward declaration: Edge and Node refer to each other
class Edge
{
    int textIndexFrom;
    Node* nodefrom;
    Node* nodeTo;
    int textIndexTo;
};
class Node
{
    std::map<char,Edge*> m_childEdges;
    Edge* m_pParentEdge;
    Node* m_pLinkedNode;
};
I just put the relevant variables in the code above. To store the different starting positions, I imagine a std::vector is needed in Edge, but I don't see when to add a new index. We might use the active point but with the suffix links, it becomes tricky.
Could someone explain?

I assume you constructed a suffix tree for the string S$ where $ is some special character not present in S. The $ char ensures that each suffix has its own leaf in the tree. The number of occurrences of word w in S is the number of leaves in the subtree of w.
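I think that storing all starting positions in each edge/node could require quadratic memory. For example, if T happens to be a perfectly balanced binary tree, then on level 0 (the root) you have n entries, on level 1 you have 2 * n/2 entries, and so on: n entries on each of about log n levels, which sums to n log n. But a suffix tree can be far from balanced: for S = aaa...a the tree has depth about n, and the internal node at depth k has about n - k leaves below it, so the sum is on the order of n^2. It requires proving, so please correct me if I'm wrong.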
Instead, I think it's better to keep all the leaves in a list, in the order they appear in a DFS traversal of the tree (left to right if you draw a picture of the tree). In every node v, keep 2 pointers to elements of that list: the first should point to the first leaf of v's subtree and the second to the last leaf of v's subtree. All of that can be computed by a simple DFS procedure. Now if, for example, 'cat' happens to end in the middle of some edge, follow that edge down to its node v and get the leaves from that node. In addition, in every leaf you should store the length of the path from the root to that leaf. It will help you find the position of that particular 'cat' occurrence.

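Walk the entire subtree below "cat". Each leaf in that subtree corresponds to a suffix that begins with "cat". If you track the length of the string matched so far (the depth of the current position in the tree) and you know the length of the whole text, then each time you reach a leaf you can subtract the leaf's depth from the text length to get the index of the corresponding occurrence of "cat". Count the terminator character in both lengths if you use one.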
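The subtree walk described above can be sketched as follows. This is only an illustration under loud assumptions: instead of the asker's Edge/Node classes and a linear-time construction, it uses a hypothetical uncompressed suffix trie (SuffixTrie, buildSuffixTrie and findAll are made-up names) built naively in quadratic time, which is fine for short texts only. The part that carries over to a real suffix tree is collectLeaves: each leaf at depth d contributes the start index (text length including '$') - d.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the suffix tree: an uncompressed suffix trie.
struct SuffixTrie {
    std::vector<std::map<char, int>> children; // children[v][c] = id of child node
    std::size_t length = 0;                    // length of the text including '$'
};

// Naive O(n^2) construction: insert every suffix of text + '$'.
SuffixTrie buildSuffixTrie(const std::string& text) {
    SuffixTrie trie;
    const std::string s = text + '$';
    trie.length = s.size();
    trie.children.emplace_back(); // node 0 is the root
    for (std::size_t start = 0; start < s.size(); ++start) {
        int node = 0;
        for (std::size_t i = start; i < s.size(); ++i) {
            auto it = trie.children[node].find(s[i]);
            if (it != trie.children[node].end()) {
                node = it->second;
                continue;
            }
            const int fresh = static_cast<int>(trie.children.size());
            trie.children[node][s[i]] = fresh;
            trie.children.emplace_back();
            node = fresh;
        }
    }
    return trie;
}

// Walk the subtree below node; depth is the number of characters
// on the path from the root to node.
void collectLeaves(const SuffixTrie& trie, int node, std::size_t depth,
                   std::vector<std::size_t>& out) {
    if (trie.children[node].empty()) {
        out.push_back(trie.length - depth); // start index of this suffix
        return;
    }
    for (const auto& entry : trie.children[node])
        collectLeaves(trie, entry.second, depth + 1, out);
}

// Returns the sorted start indexes of every occurrence of pattern.
std::vector<std::size_t> findAll(const SuffixTrie& trie, const std::string& pattern) {
    int node = 0;
    for (char c : pattern) {
        auto it = trie.children[node].find(c);
        if (it == trie.children[node].end()) return {};
        node = it->second;
    }
    std::vector<std::size_t> result;
    collectLeaves(trie, node, pattern.size(), result);
    std::sort(result.begin(), result.end());
    return result;
}
```

For the text from the question, findAll(buildSuffixTrie("This cat is a pretty cat"), "cat") should give the two start indexes 5 and 21.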

Related

How to Solve this Modified Word Ladder Problem?

Here is the word ladder problem:
Given two words (beginWord and endWord), and a dictionary's word list, find the length of the shortest transformation sequence from beginWord to endWord, such that:
Only one letter can be changed at a time.
Each transformed word must exist in the word list. Note that beginWord is not a transformed word.
Now along with the modification, we are allowed to delete or add an element.
We have to find minimum steps if possible to convert string1 to string2.
This problem has a nice BFS structure. Let's illustrate this using the example in the problem statement.
beginWord = "hit",
endWord = "cog",
wordList = "hot","dot","dog","lot","log","cog"
Since only one letter can be changed at a time, if we start from "hit", we can only change to those words which have exactly one letter different from it (in this case, "hot"). Putting in graph-theoretic terms, "hot" is a neighbor of "hit". The idea is simply to start from the beginWord, then visit its neighbors, then the non-visited neighbors of its neighbors until we arrive at the endWord. This is a typical BFS structure.
But now that we are also allowed to add or delete a letter, how should I proceed?

Checking if there exists a string of directions that always leads to same vertex

There is a graph with n vertices. Each one has 4 outgoing directed edges, one for each compass direction (N, E, S, W). I have to write a program that checks if there exists a string of directions that always leads to the same vertex, no matter where you start.
For example:
example 1
If you go S W S, you will always end up in vertex 3.
example 2
Such a string of directions doesn't exist.
I have an idea how to do it, but I would need a bool array of size 2^n. However, the program has to work for n up to 1000.
What's the best way to do it?
The sort of string you're describing is called a synchronizing word. If you are just trying to test whether such a word exists, there's a polynomial-time algorithm described in these lecture slides. Intuitively, for each pair of nodes u and v, you build a new graph where the start node is the pair {u, v}. Each node has a transition defined on each character c to the set {t(c, u), t(c, v)}, where t(c, u) represents the node transitioned to by reading character c in state u. You can expand out this graph using DFS or BFS. The original graph has a synchronizing word if and only if for each pair of nodes {u, v}, the above process produces a graph that has a path from {u, v} to some singleton node.
If you search online, you can find all sorts of other readings on this topic. Hopefully the terminology and the above links can help get you started!
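A minimal sketch of the pair-merging test, under stated assumptions: the function name hasSynchronizingWord and the encoding next[v][d] (vertex reached from v by direction d, with d in 0..3) are made up for illustration. Rather than running one forward search per pair as described above, it runs a single backward BFS from the singleton pairs {v, v} over the reversed pair graph. That marks exactly the pairs that can be driven to a single vertex, and it touches each pair a constant number of times, which matters for n up to 1000.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// next[v][d] = vertex reached from v by direction d (assumed encoding, d in 0..3).
bool hasSynchronizingWord(const std::vector<std::array<int, 4>>& next) {
    const std::size_t n = next.size();
    if (n <= 1) return true;

    // rev[d][w] = all vertices v with next[v][d] == w
    std::vector<std::vector<std::vector<int>>> rev(
        4, std::vector<std::vector<int>>(n));
    for (std::size_t v = 0; v < n; ++v)
        for (int d = 0; d < 4; ++d)
            rev[d][next[v][d]].push_back(static_cast<int>(v));

    // mergeable[u * n + v] != 0 means some word sends u and v to the same vertex.
    std::vector<char> mergeable(n * n, 0);
    std::queue<std::pair<int, int>> pending;
    for (std::size_t v = 0; v < n; ++v) {
        mergeable[v * n + v] = 1;
        pending.push(std::make_pair(static_cast<int>(v), static_cast<int>(v)));
    }

    // Backward BFS: any pair that reaches a mergeable pair in one step is mergeable.
    while (!pending.empty()) {
        const std::pair<int, int> cur = pending.front();
        pending.pop();
        for (int d = 0; d < 4; ++d) {
            for (int u : rev[d][cur.first]) {
                for (int v : rev[d][cur.second]) {
                    if (mergeable[u * n + v]) continue;
                    mergeable[u * n + v] = 1;
                    mergeable[v * n + u] = 1;
                    pending.push(std::make_pair(u, v));
                }
            }
        }
    }

    // A synchronizing word exists if and only if every pair is mergeable.
    return std::all_of(mergeable.begin(), mergeable.end(),
                       [](char flag) { return flag != 0; });
}
```

The table of pairs needs n^2 bytes (about 1 MB for n = 1000) instead of the 2^n entries the subset approach would need.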

Maximum Bipartite Matching C++

I'm solving a matching problem with two vectors of a class
class matching
{
public:
int n;
char match;
};
This is the algorithm I'm trying to implement:
int augment(vector<matching> &left, vector<matching> &right)
{
while(there's no augmenting path)
if(condition for matching)
<augment>
return "number of matching";
}
For the rough matching, if left[i] matches with right[j], then left[i].n = j, left[i].match ='M' , right[j].n = i and right[j].match = 'M' and the unmatched ones have members n = -1 and match = 'U'
While finding the augmenting paths, if one exists for another (i, j), then we change the member match of the one being unmatched from 'M' to 'U' and its n = -1 and the two matched with the augmenting path have their members match changed to 'A' while we change their members n according to their indices.
I don't know if this is the right approach to solving this, this is my first attempt on maximum matching and I've read a lot of articles and watched tutorials online and I can't get my 'code' to function appropriately.
I do not need code; I can write it myself. I just want to understand this algorithm step by step. If someone can give me an algorithm like the one I was trying above, I would appreciate it. Also, if I have been going in the wrong direction, please correct me.
I am not sure if you are finding the augmenting paths correctly. I suggest the following approach.
1. Find an initial matching in a greedy way. To obtain this, we travel through every vertex on the left side and greedily try to match it with some free (unmatched) vertex on the right side.
2. Try to find an augmenting path P in the graph. For this we need to do a breadth-first search starting from all the free vertices on the left side and alternating through unmatched and matched edges in the search (i.e. the second level contains all the right side vertices adjacent to level-1 vertices, the third level contains all the left side vertices that are matched to level-2 vertices, the fourth level contains all the right side vertices adjacent to level-3 vertices, etc.). We stop the search when we visit a free vertex in any later level and compute the augmenting path P using the breadth-first search tree computed so far.
3. If we can find an augmenting path P in the previous step: change the matched and unmatched edges in P to unmatched and matched edges respectively and go to step 2.
   Else: the resulting matching is maximum.
This algorithm requires a breadth-first search for every augmentation, so its worst-case complexity is O(nm). The Hopcroft-Karp algorithm can perform multiple augmentations for each breadth-first search and has a better worst-case complexity, but it seems (from the Wikipedia article) that it isn't faster in practice.
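The steps above can be sketched in C++ roughly as follows. The representation is an assumption made for illustration: adj[u] lists the right-side vertices adjacent to left vertex u, and the hypothetical function maximumMatching stores partners in plain index arrays (-1 meaning free) rather than in the asker's matching class.

```cpp
#include <queue>
#include <vector>

// adj[u] = right-side vertices adjacent to left vertex u (assumed representation).
// Returns the size of a maximum matching; matchLeft[u] is u's partner or -1.
int maximumMatching(const std::vector<std::vector<int>>& adj, int rightCount,
                    std::vector<int>& matchLeft) {
    const int leftCount = static_cast<int>(adj.size());
    matchLeft.assign(leftCount, -1);
    std::vector<int> matchRight(rightCount, -1);
    int size = 0;

    // Step 1: greedy initial matching.
    for (int u = 0; u < leftCount; ++u) {
        for (int v : adj[u]) {
            if (matchRight[v] == -1) {
                matchLeft[u] = v;
                matchRight[v] = u;
                ++size;
                break;
            }
        }
    }

    // Steps 2 and 3: one BFS per augmentation, until no augmenting path exists.
    while (true) {
        std::vector<int> parent(rightCount, -1); // left vertex that discovered v
        std::vector<char> seen(rightCount, 0);
        std::queue<int> frontier;                // holds left-side vertices only
        for (int u = 0; u < leftCount; ++u)
            if (matchLeft[u] == -1) frontier.push(u);

        int freeRight = -1;
        while (!frontier.empty() && freeRight == -1) {
            const int u = frontier.front();
            frontier.pop();
            for (int v : adj[u]) {
                if (seen[v]) continue;
                seen[v] = 1;
                parent[v] = u;
                if (matchRight[v] == -1) { // free right vertex: path found
                    freeRight = v;
                    break;
                }
                frontier.push(matchRight[v]); // follow the matched edge back left
            }
        }
        if (freeRight == -1) break; // no augmenting path: matching is maximum

        // Flip matched and unmatched edges along the path, from the end backwards.
        for (int v = freeRight; v != -1;) {
            const int u = parent[v];
            const int previous = matchLeft[u]; // -1 once we reach the free left vertex
            matchLeft[u] = v;
            matchRight[v] = u;
            v = previous;
        }
        ++size;
    }
    return size;
}
```

Each augmentation increases the matching by exactly one edge, which is why the loop simply increments size after flipping the path.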

Algorithm to print to screen path(s) through a text maze

For my C++ assignment, I'm basically trying to search through a chunk of text in a text file (that's streamed to my vector vec) beginning at the second top character on the left. It's for a text maze, where my program in the end is supposed to print out the characters for a path through it.
An example of a maze would be like:
###############
Sbcde####efebyj
####hijk#m#####
#######lmi#####
###############
###############
###############
###############
###############
###############
###############
###############
###############
###############
###############
Where '#' is an unwalkable wall and you always begin on the left at the second top character. Alphabetical characters represent walkable squares. Exit(s) are ALWAYS on the right. The maze is always a 15x15 size in a maze.text file. Alphabetical characters repeat within the same maze, but not directly beside each other.
What I'm trying to do here is: if a square next to the current one has an alphabetical character, add it to the vector vec, and repeat this process until I get to the end of the maze. Eventually I am supposed to make this more complicated by printing to the screen multiple paths that exist in some mazes.
So far I have this for the algorithm itself, which I know is wrong:
void pathcheck()
{
if (isalpha(vec.at(x)) && !(find(visited.begin(), visited.end(), (vec.at(x))) != visited.end()) )
{
path.push_back(vec.at(x));
visited.push_back(vec.at(x));
pathcheck(vec.at(x++));
pathcheck(vec.at(x--));
pathcheck(vec.at(x + 16));
pathcheck(vec.at(x - 16));
}
}
visited is my vector keeping track of the visited squares.
How would I update this so it actually works, and eventually so I can manage more than one path (i.e. if there were 2 paths, the program would print to the screen both of them)? I recall being told that I may need another vector/array that keeps track of squares that I've already visited/checked, but then how would I implement that here exactly?
You're on the right track. When it comes to mazes, the typical method of solving is through either a depth-first search (the most efficient solution for finding some path) or a breadth-first search (less efficient, but guaranteed to find the shortest path). Since you seem to want to do an exhaustive search, these choices are basically interchangeable. I suggest you read up on them:
http://en.wikipedia.org/wiki/Depth-first_search
http://en.wikipedia.org/wiki/Breadth-first_search
Basically, you will need to parse your maze and represent it as a graph (where each non "#" is a node and each link is a walkable path). Then, you keep a list of partial paths (i.e. a list of nodes, in the order you visited them, for example, [S, b, c] is the partial path starting from S and ending at c). The main idea of DFS and BFS is that you have a list of partial paths, and one by one you remove items from the list, generate all possible partial paths leading from that partial path, then place them in the list and repeat. The main difference between DFS and BFS is that DFS implements this list as a stack (i.e. new items have greatest priority) and BFS uses a queue (i.e. new items have lowest priority).
So, for your maze using DFS it would work like this:
1. Initial node is S, so your initial path is just [S]. Push [S] into your stack ([ [S] ]).
2. Pop the first item (in this case, [S]).
3. Make a list of all possible nodes you can reach in 1 move from the current node (in your case, just b).
4. For each node from step 3, remove any nodes that are part of your current partial path. This will prevent loops. (i.e. for partial path [S, b], from b we can travel to c and to S, but S is already part of our partial path so returning is pointless)
5. If one of the nodes from step 4 is the goal node, add it to your partial path to create a completed path. Print the path.
6. For each node from step 4 that IS NOT the goal node, generate a new partial path and push it into the stack (i.e. for [S], we generate [S, b] and push it into the stack, which now should look like [ [S, b] ])
7. Repeat steps 2 through 6 until the stack is empty, meaning you have traversed every possible path from the starting node.
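A minimal C++ sketch of these steps, under a few assumptions that are not in the question's code: the maze is held as a vector of strings (one per row), the start is row 1, column 0, and an exit is any letter in the last column of its row. Squares are identified by their (row, column) position, so repeated letters do not get confused. The name findAllPaths is made up for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <stack>
#include <string>
#include <utility>
#include <vector>

using Cell = std::pair<int, int>; // (row, column)
using Path = std::vector<Cell>;

// Returns the letters along every loop-free path from the start to an exit.
std::vector<std::string> findAllPaths(const std::vector<std::string>& maze) {
    std::vector<std::string> results;
    if (maze.size() < 2 || maze[1].empty()) return results;
    const int rows = static_cast<int>(maze.size());
    const int dr[] = {0, 0, 1, -1};
    const int dc[] = {1, -1, 0, 0};

    std::stack<Path> pending;           // the list of partial paths
    pending.push(Path{Cell{1, 0}});     // step 1: the initial path is [S]
    while (!pending.empty()) {          // step 7
        const Path path = pending.top(); // step 2
        pending.pop();
        const Cell cur = path.back();
        for (int k = 0; k < 4; ++k) {   // step 3: squares reachable in one move
            const Cell next{cur.first + dr[k], cur.second + dc[k]};
            if (next.first < 0 || next.first >= rows) continue;
            const int width = static_cast<int>(maze[next.first].size());
            if (next.second < 0 || next.second >= width) continue;
            const char ch = maze[next.first][next.second];
            if (!std::isalpha(static_cast<unsigned char>(ch))) continue;
            // Step 4: skip squares already on this partial path.
            if (std::find(path.begin(), path.end(), next) != path.end()) continue;

            Path extended = path;
            extended.push_back(next);
            if (next.second == width - 1) { // step 5: reached an exit
                std::string letters;
                for (const Cell& c : extended) letters += maze[c.first][c.second];
                results.push_back(letters);
            } else {                        // step 6: keep exploring
                pending.push(extended);
            }
        }
    }
    return results;
}
```

Copying the whole partial path on every extension is wasteful but keeps the code close to the description; for a 15x15 maze this is unlikely to matter.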
NOTE: in your example there are duplicate letters (for example, three "e"s). For your case, maybe make a simple "Node" class that includes a variable to hold the letter. That way each "e" will have its own instance and the pointers will be different values, letting you easily tell them apart. I don't know C++ exactly, but in pseudocode:
class Node:
    method Constructor(label):
        myLabel = label
        links = list()
    method addLink(node):
        links.add(node)
You could read every character in the file and if it is not "#", create a new instance of Node for that character and add all the adjacent nodes.
EDIT: I've spent the last 3 years as a Python developer and I've gotten a bit spoiled. Look at the following code.
s = "foo"
s == "foo"
In Python, that comparison is true: "==" compares the strings' contents. What I forgot from my days as a Java developer is that in some languages "==" applied to strings compares references or pointers instead. That is the case for String objects in Java and for C-style char* strings in C++ (std::string's == compares contents), so two equal strings stored at different addresses compare as unequal there.
My point is that because == then compares addresses rather than contents, you COULD forgo making a Node class and just compare pointers to the characters (using ==, NOT strcmp()!), BUT this code could be a bit confusing to read and must be documented.
Overall, I'd use some sort of Node class just because it's fairly simple to implement and results in more readable code AND only requires parsing your maze once!
Good Luck

number of paths in graph

How can the number of paths in a directed graph be calculated? Are there any algorithms for this purpose?
Best wishes
EDIT: The graph is not a tree.
Let A be the adjacency matrix of a graph G. Then A^n (i.e. A multiplied n times with itself) has the following interesting property:
The entry at position (i,j) of A^n equals the number of different paths of length n from vertex i to vertex j.
Hence:
represent the graph as an adjacency matrix A
multiply A with itself repeatedly until you get bored
in each step: compute the sum of all matrix elements and add it to the result, starting at 0
It might be wise to first check whether G contains a cycle, because in this case it contains infinitely many paths. In order to detect cycles, set all edge weights to -1 and use Bellman-Ford.
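A small sketch of the matrix-power idea, assuming the graph is already known to be acyclic. In that case no path uses more than (number of vertices - 1) edges, so the loop can stop there instead of "until you get bored". The names Matrix, multiply and countPathsByPowers are made up for illustration, and overflow of long long is not handled.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;

// Plain cubic-time matrix product of two square matrices of equal size.
Matrix multiply(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    Matrix c(n, std::vector<long long>(n, 0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Sums all entries of A^1, A^2, ..., A^(n-1).
// Assumes the graph is acyclic; counts paths with at least one edge.
long long countPathsByPowers(const Matrix& adjacency) {
    const std::size_t n = adjacency.size();
    Matrix power = adjacency;
    long long total = 0;
    for (std::size_t length = 1; length < n; ++length) {
        for (const auto& row : power)
            for (long long entry : row) total += entry;
        power = multiply(power, adjacency);
    }
    return total;
}
```

For the chain 0 -> 1 -> 2 this counts the two single edges plus the one two-edge path.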
All the search hits I see are for the number of paths from a given node to another given node. But here's an algorithm that should find the total number of paths anywhere in the graph, for any acyclic digraph. (If there are cycles, there are an infinite number of paths unless you specify that certain repetitive paths are excluded.)
Label each node with the number of paths which end at that node:
    While not all nodes are labeled:
        Choose an unlabeled node with no unlabeled ancestors.
            (An implementation might here choose any node, and recursively
            process any unlabeled ancestors of that node first.)
        Label the node with one plus the sum of the labels on all its direct predecessors (parents).
            (If a node has no predecessors, its label is simply 1.)
Now just add the labels on all nodes.
If you don't want to count "length zero" paths, subtract the number of nodes.
You can use depth-first search. However, you don't terminate the search when you find a path from start to destination, the way depth-first search normally does. Instead, you just add to the count of paths and return from that node as if it were a dead end. This is probably not the fastest method, but it should work.
You could also potentially use breadth-first search, but then you need to work out a way to pass information on path counts forward (or backwards) through the tree as you search it. If you could do that, it'd probably be much faster.
Assuming the graph is acyclic (a DAG), you can make a topological sorting of the vertices and then do dynamic programming to compute the number of distinct paths. If you want to print all the paths, there is not much use in discussing big O notation, since the number of paths can be exponential in the number of vertices.
Pseudo-code:
paths := 0
dp[i] := 0, for all 0 <= i < n
compute topological sorting and store on ts
for i from n - 1 to 0
    for all edges (ts[i], v) // outbound edges from ts[i]
        dp[ts[i]] := 1 + dp[ts[i]] + dp[v]
    paths := paths + dp[ts[i]]
print paths
Edit: Bug on the code
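The pseudo-code above could look like this in C++. This is a sketch under assumptions: the graph is given as adjacency lists adj[u] of successors, the topological order is computed with Kahn's algorithm (the answer does not say which method to use), and the hypothetical function countPathsDag returns -1 when it detects a cycle.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// adj[u] = successors of u. Returns the number of paths with at least one
// edge, or -1 if the graph contains a cycle (no topological order exists).
long long countPathsDag(const std::vector<std::vector<int>>& adj) {
    const std::size_t n = adj.size();

    // Kahn's algorithm: repeatedly remove vertices with no incoming edges.
    std::vector<int> indegree(n, 0);
    for (const auto& successors : adj)
        for (int v : successors) ++indegree[v];
    std::queue<int> ready;
    for (std::size_t u = 0; u < n; ++u)
        if (indegree[u] == 0) ready.push(static_cast<int>(u));
    std::vector<int> ts; // topological order
    while (!ready.empty()) {
        const int u = ready.front();
        ready.pop();
        ts.push_back(u);
        for (int v : adj[u])
            if (--indegree[v] == 0) ready.push(v);
    }
    if (ts.size() != n) return -1; // some vertices lie on a cycle

    // dp[u] = number of paths with at least one edge that start at u.
    std::vector<long long> dp(n, 0);
    long long paths = 0;
    for (std::size_t i = n; i-- > 0;) {
        const int u = ts[i];
        for (int v : adj[u])
            dp[u] += 1 + dp[v]; // the edge itself, plus every path continuing from v
        paths += dp[u];
    }
    return paths;
}
```

Processing the order from back to front guarantees that dp[v] is final for every successor v before dp[u] is computed.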
I don't believe there's anything faster than traversing the graph, starting at the root.
In pseudo-code -
visit(node) {
    node.visited = true;
    for(int i = 0; i < node.paths.length; ++i) {
        ++pathCount;
        if (!node.paths[i].visited)
            visit(node.paths[i]);
    }
}
If it is really a tree, the number of paths equals the number of nodes - 1 if you count paths to internal nodes. If you only count paths to leaves, the number of paths equals the number of leaves. So the fact that we're talking about trees simplifies matters to just counting nodes or leaves. A simple BFS or DFS algorithm will suffice.
admat gives the length 1 paths between vertices;
admat^2 gives the length 2 paths between vertices;
admat^3 gives the length 3 paths between vertices;
Spot the pattern yet ?
If the graph contains a cycle, there are infinitely many paths: walk the loop any number of times.