Suffix tree vs suffix array for LCS - C++

I'm working on a program to find the longest common substring between multiple strings. I've narrowed my approach down to either using suffix arrays or a suffix tree. I want to see which is the better approach (if there is one) and why. Also, for suffix arrays I've seen a few algorithms for two strings but none for more than two strings. Any solid examples would be appreciated, thanks again for the advice!
Note: I didn't see any other questions that specifically addressed this issue, but if they exist please point me in that direction!

If you have a substring that occurs in all sequences, then in a suffix array, the pointers to each occurrence of that substring must sort close together. So you could attempt to find them by moving a window along the suffix array, where the window is just large enough to contain at least one occurrence of each sequence. You could do this in linear time by maintaining a table that tells you, for each sequence, how many times that sequence occurs within the window. Then, when you move the rear end of the window forward, decrement the count for the sequence associated with the pointer you have just skipped over and, if necessary, move the forward end of the window just far enough to pick up a new occurrence of this sequence, updating the table.
Now you need to be able to find the length of the common prefix shared by all substrings starting at the pointers in the window. This is the minimum LCP value occurring between the pointers in the window. If you use a red-black tree, such as a Java TreeSet, with a key whose most significant component is the LCP value and whose less significant component is some tie-breaker such as the pointer itself, then you can maintain the minimum LCP value within the window at a cost of roughly O(log w) per window adjustment, where w is the window size.
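In C++, std::multiset can play the TreeSet role. Below is a minimal, naive sketch of the whole scheme, assuming at least two input strings and that the separator bytes (char codes 1..k) never occur in the inputs; the suffix-array and LCP construction are deliberately quadratic for brevity, and all names are illustrative:

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Sketch: longest common substring of k strings via a sliding window
// over a (naively built) suffix array, tracking the minimum adjacent
// LCP inside the window with a std::multiset.
int longestCommonSubstring(const std::vector<std::string>& strs) {
    int k = (int)strs.size();
    if (k < 2) return k == 1 ? (int)strs[0].size() : 0;
    std::string text;
    std::vector<int> owner;                 // owner[i]: which input string position i came from
    for (int s = 0; s < k; ++s) {
        for (char c : strs[s]) { text += c; owner.push_back(s); }
        text += char(1 + s);                // unique separator, sorts below letters
        owner.push_back(-1);
    }
    int n = (int)text.size();
    std::vector<int> sa(n);
    for (int i = 0; i < n; ++i) sa[i] = i;
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return text.compare(a, n - a, text, b, n - b) < 0;
    });
    std::vector<int> lcp(n, 0);             // lcp[i] = LCP(suffix sa[i-1], suffix sa[i])
    for (int i = 1; i < n; ++i) {
        int a = sa[i - 1], b = sa[i];
        while (a < n && b < n && text[a] == text[b]) { ++a; ++b; ++lcp[i]; }
    }
    std::vector<int> cnt(k, 0);
    std::multiset<int> window;              // adjacent LCPs inside the window; min = common prefix of whole window
    int covered = 0, best = 0, lo = 0;
    for (int hi = 0; hi < n; ++hi) {
        int o = owner[sa[hi]];
        if (o >= 0 && cnt[o]++ == 0) ++covered;
        if (hi > lo) window.insert(lcp[hi]);
        while (covered == k) {              // window covers every string: record, then shrink from the rear
            best = std::max(best, *window.begin());
            int r = owner[sa[lo]];
            if (r >= 0 && --cnt[r] == 0) --covered;
            ++lo;
            window.erase(window.find(lcp[lo]));
        }
    }
    return best;
}
```

A production version would replace the quadratic construction with an O(n log n) suffix-array builder and Kasai's algorithm for the LCP table; the windowing part stays the same.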

Quick method for searching a value in a (sorted) circular data structure

I'm looking for an algorithm similar to binary search but which works with data structures that are circular in nature, like a circular buffer for example. I'm working on a problem which is quite complicated, but I was able to strip it down so it's easier to describe (and, I hope, easier to find a solution for).
Let's say we have an array of numbers with both its ends connected, and a view window which can move forward and backward and which can get a value from the array (it's something like a C++ iterator that can go forward and backward). One of the values in the array is zero, which is our "sweet point" we want to find.
What we know about values in the array are:
they are sorted, which means when we move our window forward, the numbers grow (and vice versa),
they are not evenly spaced: if, for example, we read "16", it doesn't mean that going 16 elements backward brings us to zero,
last but not least: there is a point in the array up to which values are positive, but after which they are "flipped over" and start at a negative value (much like adding ones to an integer variable until the counter wraps around)
The last one is where my first approach to the problem with binary search fails. Also, if I may add, reading a value is an expensive operation, so the less often it is done the better.
PS: I'm looking for C++ code, but if you know C#, Java, JavaScript or Python and would like to write the algorithm in one of those languages, that's no problem :).
If I understand correctly, you have an array with random access (if only sequential is allowed, the problem is trivial; that "window" concept does not seem relevant), holding a sequence of positive then negative numbers with a zero in between, but this sequence is rotated arbitrarily. (Seeing the array as a ring buffer just obscures the reasoning.)
Hence you have three sections, with signs +-+ or -+-, and by looking at the extreme elements, you can tell which of the two patterns holds.
Now the bad news: no dichotomic search can work, because whatever the order in which you sample the array, you can always hit elements of the same sign, except in the end (in the extreme case of a single element of opposite sign).
This contrasts with a standard dichotomic case that would correspond to a +- or -+ pattern: hitting two elements of the same sign allows you to discard the whole section in-between.
If the positive and negative subsequences are known to have length at least M, then by sampling every M/2-th element you will certainly find a change of sign and can start two dichotomies.
You can solve your problem using a galloping (exponential) search.
For simplicity I assume there are no duplicate items.
Start from the back and progress to the left, in the direction of smaller values. You begin with a jump of one index to the left, and each subsequent jump is exponentially bigger. With each jump to the left you should find a smaller value. If you encounter a greater value, that means zero is somewhere between the last two visited indexes. The only case in which you never encounter a greater value is when the zero is exactly at the beginning of the array.
After the jump from index i to i-j that jumped over zero, you've got a range in which zero resides. Since the jump was too far, try jumping from i to i-j/2. If that's still too far (overjumped zero) you try i-j/4 and so on. So this time each jump tried is exponentially smaller. With each step you divide the possible range where zero resides by half. On the other hand, if i-j is too far, but i-j/2 is too near (not reached zero yet), you try i-j/2-j/4. I hope you get the idea now.
This has O(log n) complexity.
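Taken in isolation, the gallop-then-halve idea is the standard exponential search; a hedged sketch on a plain (non-circular) sorted array, which would still need adapting to the wrapped layout described above:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Galloping (exponential) search: double the jump until the target is
// bracketed, then binary-search the bracketed range.
// Returns the index of `target`, or -1 if absent.
int gallopingSearch(const std::vector<int>& a, int target) {
    if (a.empty()) return -1;
    // Phase 1: gallop from the front, doubling the jump until we overshoot.
    std::size_t bound = 1;
    while (bound < a.size() && a[bound] < target) bound *= 2;
    // Phase 2: ordinary binary search inside [bound/2, min(bound, n-1)].
    std::size_t lo = bound / 2, hi = std::min(bound, a.size() - 1);
    while (lo <= hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (a[mid] == target) return (int)mid;
        if (a[mid] < target) lo = mid + 1;
        else { if (mid == 0) break; hi = mid - 1; }
    }
    return -1;
}
```

Note the O(log n) bound of the answer above comes from both phases: the gallop takes O(log i) jumps to bracket position i, and the halving phase takes O(log i) more.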

Matches overlapping lookahead on LZ77/LZSS with suffix trees

Background: I have an implementation of a generic LZSS backend in C++ (available here). The matching algorithm I use in this version is exceedingly simple, because it was originally meant to compress relatively small files (at most 64 kB) for relatively ancient hardware (specifically, the Mega Drive/Sega Genesis, where 64 kB is the entirety of main RAM).
Nevertheless, some files take far too long to compress with my implementation, on the order of minutes. The reason is twofold: the naïve matching algorithm takes most of the time, but this happens specifically because I construct a compression graph from the file to achieve optimal compression. Looking at the profiler, most of the time is spent looking for matches, dwarfing even the quadratic size of the resulting graph.
For some time, I have been studying several potential replacements; one that drew my attention was dictionary-symbolwise flexible parsing using multilayer suffix trees. The multilayer part is important because one of the variants of LZSS I am interested in uses variable size encodings for (position, length).
My current implementation allows matches in the sliding window to overlap the look-ahead buffer, so that this input:
aaaaaaaaaaaaaaaa
can be directly encoded as
(0,'a')(1,0,15)
instead of
(0,'a')(1,0,1)(1,0,2)(1,0,4)(1,0,8)
Here, (0,'a') means encoding character 'a' as a literal, while (1,n,m) means 'copy m characters from position n'.
The question: Having said all that, here is my problem: every resource I have found on suffix trees seems to imply that they can't handle the overlapping case and instead only lets you find non-overlapping matches. When suffix trees were involved, research papers, books and even some implementations gave compression examples without the overlap as if that were perfect compression (I would link to some of these but my reputation does not allow it). Some of them even mentioned that overlaps could be useful when describing the base compression schemes, but were strangely silent on the matter when discussing suffix trees.
Since the suffix tree needs to be augmented to store offset information anyway, this seems like a property that could be checked while looking for a match — you would filter out any matches that start on the look-ahead buffer. And the way the tree is constructed/updated would mean that if an edge takes you to a node that corresponds to a match starting on the look-ahead, you return the previous node instead as any further descendants will also be on the look-ahead buffer.
Is my approach wrong or unworkable? Is there an implementation or discussion of LZ77/LZSS with suffix trees that mentions matches overlapping the look-ahead buffer?
As I understand it, given a suffix tree, we are interested (roughly) in computing, for each suffix S, which longer suffix has the longest common prefix with S.
Add a reference from each tree node to the descendant leaf with the longest suffix (linear time with DFS). From each leaf, walk rootward, examining the new references, stopping if a longer suffix is found. The running time of the latter step is linear, because each tree edge is examined exactly once.
Life with a bounded window is unfortunately more difficult. Instead of propagating one reference, we propagate several. To compute the set of suffixes referenced from a node, we merge them in sorted order by length. Then whenever we have suffixes of lengths x > y > z, if x - z < 8192 (e.g.), then we can drop y, because all three are equally good for all suffixes with which the current node is the leafmost common ancestor, and if y is in window, then either x or z is. Since the window is a large fraction of the total file, each node will have at most a handful of references.
If you want to look back the shortest distance possible, then there's an O(n log^2 n)-time algorithm (probably improvable to O(n log n) with various hard-to-implement magic). In the course of the algorithm, we construct, for each node, a binary search tree of the descendant suffixes by length, then do next-longest lookups. To construct a node's tree from its children's, start with the largest child tree and insert the elements from the others. By a heavy-path argument, each length is inserted O(log n) times.

Finding path in a char array

I am working on a project and I've found myself in a position where I have an n x n char array containing the characters a, b or c. I have to check whether there is a path of b's between the first and the last row.
Example YES input:
I am stuck at this point. Should I adapt some well-known graph-search algorithm, or is there a better way of solving this problem? Should I add a bool array to mark which cells I have visited?
Thanks in advance for your time!
Yes, you should adapt a graph algorithm for finding a path from a source to a target. In your case you have multiple sources (all the 'b's in the first row) and multiple targets (the 'b's in the last row).
Shortest path on an unweighted graph can be solved pretty efficiently by the easily implemented BFS. The only difference needed to handle multiple sources is to initialize the queue with all the 'b's on the first line (rather than a single node).
In your graph every 'b' cell is a node, there is an edge between every two adjacent 'b' cells.
Note that BFS is complete (always finds a solution if one exists) and optimal (finds shortest path).
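A minimal sketch of that multi-source BFS (the function name and grid representation are invented for illustration):

```cpp
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Seed the queue with every 'b' in the first row, then BFS through
// adjacent 'b' cells; return true if the search reaches the last row.
bool pathOfBs(const std::vector<std::string>& grid) {
    int n = (int)grid.size();
    if (n == 0) return false;
    int m = (int)grid[0].size();
    std::vector<std::vector<bool>> visited(n, std::vector<bool>(m, false));
    std::queue<std::pair<int, int>> q;
    for (int c = 0; c < m; ++c)
        if (grid[0][c] == 'b') { visited[0][c] = true; q.push({0, c}); }
    const int dr[] = {1, -1, 0, 0}, dc[] = {0, 0, 1, -1};
    while (!q.empty()) {
        auto [r, c] = q.front(); q.pop();
        if (r == n - 1) return true;        // reached the last row
        for (int d = 0; d < 4; ++d) {
            int nr = r + dr[d], nc = c + dc[d];
            if (nr >= 0 && nr < n && nc >= 0 && nc < m &&
                !visited[nr][nc] && grid[nr][nc] == 'b') {
                visited[nr][nc] = true;
                q.push({nr, nc});
            }
        }
    }
    return false;
}
```

The `visited` array is exactly the bool array the question asks about; without it, cells would be re-enqueued and the search could loop forever.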
The easiest way to do this is to allocate an equally sized, zero filled 2D array, mark the start points, and do a flood fill using the char array as a guide. When the flood fill terminates, you can easily check whether an end point has been marked.
A flood fill may be implemented in several ways, how you do it doesn't really matter as long as your problem size is small.
Generally, the easiest way is to do it in a recursive fashion. The only problem with a recursive flood fill is the huge recursion depth that can result, so it really depends on the problem size whether a recursive version is applicable.
If time is not important, you may simply do it iteratively, going through the entire array several times, marking points that have marked neighbours and are 'b's, until an iteration does not mark any point.
If you need to handle huge arrays efficiently, you should go for a breadth-first flood fill, keeping a queue of frontier pixels which you process in a first-in-first-out manner.
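The "sweep until nothing changes" variant above can be sketched as follows (function name invented; this is the slow but simple option):

```cpp
#include <string>
#include <vector>

// Naive iterative flood fill: repeatedly sweep the grid, marking any 'b'
// that touches a marked cell, until a full sweep marks nothing new.
bool iterativeFill(const std::vector<std::string>& grid) {
    int n = (int)grid.size();
    if (n == 0) return false;
    int m = (int)grid[0].size();
    std::vector<std::vector<bool>> mark(n, std::vector<bool>(m, false));
    for (int c = 0; c < m; ++c) mark[0][c] = (grid[0][c] == 'b');
    bool changed = true;
    while (changed) {
        changed = false;
        for (int r = 0; r < n; ++r)
            for (int c = 0; c < m; ++c) {
                if (mark[r][c] || grid[r][c] != 'b') continue;
                bool nb = (r > 0 && mark[r - 1][c]) || (r + 1 < n && mark[r + 1][c]) ||
                          (c > 0 && mark[r][c - 1]) || (c + 1 < m && mark[r][c + 1]);
                if (nb) { mark[r][c] = true; changed = true; }
            }
    }
    for (int c = 0; c < m; ++c)
        if (mark[n - 1][c]) return true;
    return false;
}
```

Each sweep is O(n*m) and in the worst case (a long snaking path) you may need many sweeps, which is why the queue-based fill is preferable for large inputs.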

string cursors/markers in C++ strings

I am working with some big (megabytes) strings, and I need to modify them by inserting and removing characters at different locations.
To make it more efficient, instead of searching insertion/deletion points every time, I would like to have "cursors" or "tags" which are still valid if text is inserted (i.e. they are moved accordingly), and are still valid if the removed text is not "enclosing" the cursor position (i.e. a cursor becomes invalid only if it was in the removed substring, other cursors are moved accordingly).
I do not need to operate on the string concurrently, but insertion/deletion operations happen always one at time.
Do you know how this can be done with standard C++, boost or a portable, lightweight library?
If the number of insertion points will be relatively small, why not just keep a list (or array) of your insertion points, ordered by their offset into the data string?
Then, any time you insert/remove some text, simply pass through that list and adjust any insertion points that are past the offset of the modification, either up or down by the size of the insertion/removal.
Of course, you'd have to decide what it meant to have a modification "hit" one of your insertion points (e.g. what to do if a deleted range includes one or more insertion points), but that'd depend on what you're trying to maintain those markers for.
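A minimal sketch of that marker list as a thin wrapper around std::string (all names are invented; here a deletion that covers a marker simply invalidates it, matching the behaviour the question asks for):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// A string plus a list of offsets ("markers") that are shifted on every
// insert/erase. Markers inside an erased range become kInvalid.
class MarkedString {
public:
    static constexpr std::size_t kInvalid = static_cast<std::size_t>(-1);

    explicit MarkedString(std::string s) : text_(std::move(s)) {}

    // Registers a marker at `pos` and returns its id.
    std::size_t addMarker(std::size_t pos) {
        markers_.push_back(pos);
        return markers_.size() - 1;
    }

    std::size_t markerPos(std::size_t id) const { return markers_[id]; }

    void insert(std::size_t pos, const std::string& s) {
        text_.insert(pos, s);
        for (auto& m : markers_)
            if (m != kInvalid && m >= pos) m += s.size();
    }

    void erase(std::size_t pos, std::size_t len) {
        text_.erase(pos, len);
        for (auto& m : markers_) {
            if (m == kInvalid) continue;
            if (m >= pos + len) m -= len;       // past the removed range: shift left
            else if (m >= pos) m = kInvalid;    // inside the removed range: invalidate
        }
    }

    const std::string& text() const { return text_; }

private:
    std::string text_;
    std::vector<std::size_t> markers_;
};
```

Each edit costs O(number of markers); if that ever becomes a bottleneck, the marker list can be kept sorted so only the suffix past the edit point needs adjusting.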
Maybe you can use special keywords in your text, which you match and replace using regular expressions (regex): http://www.cplusplus.com/reference/regex/
Be careful which keywords you use, because they could potentially occur naturally in the string.

Balancing KD Tree

So when balancing a KD tree you're supposed to find the median and then put all the elements that are less on the left subtree and those greater on the right. But what happens if you have multiple elements with the same value as the median? Do they go in the left subtree, the right or do you discard them?
I ask because I've tried doing multiple things and it affects the results of my nearest-neighbor search algorithm, and there are some cases where all the elements for a given section of the tree have exactly the same value, so I don't know how to split them up in that case.
It does not really matter where you put them. Preferably, keep your tree balanced, so place as many on the left as needed to maintain the optimal balance!
If your current search radius touches the median, you will have to check the other part, that's all you need to handle tied objects on the other side. This is usually cheaper than some complex handling of attaching multiple elements anywhere.
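For the "as many on the left as needed" split, std::nth_element already does the right thing with duplicates; a sketch of one node's split step (one-dimensional keys for brevity, function name invented):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Partition one kd-tree node's points around the median index. Ties get
// no special handling: nth_element leaves exactly mid elements on the
// left regardless of duplicates, which keeps the tree balanced.
int splitAtMedian(std::vector<int>& points) {
    std::size_t mid = points.size() / 2;
    std::nth_element(points.begin(), points.begin() + mid, points.end());
    // Now points[0..mid) <= points[mid] <= points(mid..end), duplicates included.
    return points[mid];
}
```

In a real kd-tree build, the comparator would compare along the node's split axis, and the two halves would be recursed on.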
When implementing a search-style algorithm, it is often a good idea to put elements equal to your median on both sides of the median.
One method is to put median-equaling elements on the "same side" as where they were before you did your partition. Another method is to put the first one on the left, the second one on the right, and so on.
Another solution is to have a clumping data structure that just "counts" things that are equal instead of storing each one individually. (if they have extra state, then you can store that extra state instead of just a count)
I don't know which is appropriate in your situation.
That depends on your purpose.
For problems such as exact matching or range search, allowing repetitions of the same value on both sides will complicate the query, and repeating the same value in both leaves adds to the time complexity.
A solution is to store all of the medians (the values equal to the median) in the node itself, neither left nor right. Most variants of kd-trees store the median on the internal nodes. If there happen to be many, you may consider utilizing another (k-1)-d tree for the medians.