Matches overlapping lookahead on LZ77/LZSS with suffix trees - c++

Background: I have an implementation of a generic LZSS backend in C++ (available here). The matching algorithm I use in this version is exceedingly simple, because it was originally meant to compress relatively small files (at most 64 kB) for relatively ancient hardware (specifically, the Mega Drive/Sega Genesis, where 64 kB is the entirety of main RAM).
Nevertheless, some files take far too long to compress with my implementation, on the order of minutes. The reason is twofold: the naïve matching algorithm takes most of the time, but this happens specifically because I construct a compression graph from the file to achieve optimal compression. Looking at the profiler, most of the time is spent looking for matches, dwarfing even the quadratic size of the resulting graph.
For some time, I have been studying several potential replacements; one that drew my attention was dictionary-symbolwise flexible parsing using multilayer suffix trees. The multilayer part is important because one of the variants of LZSS I am interested in uses variable size encodings for (position, length).
My current implementation allows matches in the sliding window to overlap the look-ahead buffer, so that this input:
aaaaaaaaaaaaaaaa
can be directly encoded as
(0,'a')(1,0,15)
instead of
(0,'a')(1,0,1)(1,0,2)(1,0,4)(1,0,8)
Here, (0,'a') means encoding character 'a' as a literal, while (1,n,m) means 'copy m characters from position n'.
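To make the overlap concrete, here is a minimal decoder sketch for the (1,n,m) copy command; the key point is that the copy proceeds byte by byte, so the source may run into bytes that the same copy has just produced. The names and the output-vector representation are mine, not taken from the linked implementation:

#include <cstddef>
#include <vector>

// Copy 'length' bytes starting at absolute position 'pos' of the output
// produced so far. Because the copy is performed one byte at a time,
// 'pos + i' may point into bytes written by this very copy, which is
// exactly the overlapping case: (0,'a')(1,0,15) expands "a" into
// "aaaaaaaaaaaaaaaa".
void copy_match(std::vector<unsigned char>& out, std::size_t pos, std::size_t length) {
    for (std::size_t i = 0; i < length; ++i) {
        unsigned char b = out[pos + i];   // may read a byte this copy just wrote
        out.push_back(b);
    }
}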
The question: Having said all that, here is my problem: every resource I have found on suffix trees seems to imply that they can't handle the overlapping case, and instead only allow you to find non-overlapping matches. When suffix trees were involved, research papers, books and even some implementations gave compression examples without the overlap as if they were optimal compression (I would link to some of these but my reputation does not allow it). Some of them even mentioned that overlaps could be useful when describing the base compression schemes, but were strangely silent on the matter when discussing suffix trees.
Since the suffix tree needs to be augmented to store offset information anyway, this seems like a property that could be checked while looking for a match — you would filter out any matches that start on the look-ahead buffer. And the way the tree is constructed/updated would mean that if an edge takes you to a node that corresponds to a match starting on the look-ahead, you return the previous node instead as any further descendants will also be on the look-ahead buffer.
Is my approach wrong? Is there an implementation or discussion of LZ77/LZSS with suffix trees that mentions matches overlapping the look-ahead buffer?

As I understand it, given a suffix tree, we are interested (roughly) in computing, for each suffix S, which longer suffix has the longest common prefix with S.
Add a reference from each tree node to the descendant leaf with the longest suffix (linear time with DFS). From each leaf, walk rootward, examining the new references, stopping if a longer suffix is found. The running time of the latter step is linear, because each tree edge is examined exactly once.
Life with a bounded window is unfortunately more difficult. Instead of propagating one reference, we propagate several. To compute the set of suffixes referenced from a node, we merge them in sorted order by length. Then whenever we have suffixes of lengths x > y > z, if x - z < 8192 (e.g.), then we can drop y, because all three are equally good for all suffixes with which the current node is the leafmost common ancestor, and if y is in window, then either x or z is. Since the window is a large fraction of the total file, each node will have at most a handful of references.
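A sketch of that pruning rule in isolation, assuming each node's references are kept as suffix lengths and W is the window size; this is my own illustration of the rule above, not code from the parsing papers:

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Merge the children's reference lists (suffix lengths, kept in decreasing
// order) and drop any middle element y whose neighbours x > y > z satisfy
// x - z < W: if y is inside the window, then so is x or z.
std::vector<std::size_t> merge_refs(const std::vector<std::vector<std::size_t>>& children,
                                    std::size_t W) {
    std::vector<std::size_t> merged;
    for (const auto& c : children) merged.insert(merged.end(), c.begin(), c.end());
    std::sort(merged.begin(), merged.end(), std::greater<std::size_t>());

    std::vector<std::size_t> pruned;
    for (std::size_t len : merged) {
        while (pruned.size() >= 2 && pruned[pruned.size() - 2] - len < W) {
            pruned.pop_back();   // the middle reference is redundant
        }
        pruned.push_back(len);
    }
    return pruned;
}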
If you want to look back the shortest distance possible, then there's an O(n log^2 n)-time algorithm (probably improvable to O(n log n) with various hard-to-implement magic). In the course of the algorithm, we construct, for each node, a binary search tree of the descendant suffixes by length, then do next-longest lookups. To construct a node's tree from its children's, start with the largest child tree and insert the elements from the others. By a heavy path argument, each length is inserted O(log n) times.

Related

The fastest C++ algorithm for string testing against a list of predefined seeds (case insensitive)

I have a list of seed strings, about 100 predefined strings. All strings contain only ASCII characters.
std::list<std::wstring> seeds{ L"google", L"yahoo", L"stackoverflow"};
My app constantly receives a lot of strings which can contain any characters. I need to check each received line and decide whether it contains any of the seeds or not. The comparison must be case insensitive.
I need the fastest possible algorithm to test received string.
Right now my app uses this algo:
std::wstring testedStr;
for (auto & seed : seeds)
{
    if (boost::icontains(testedStr, seed))
    {
        return true;
    }
}
return false;
It works well, but I'm not sure that this is the most efficient way.
How is it possible to implement the algorithm in order to achieve better performance?
This is a Windows app. App receives valid std::wstring strings.
Update
For this task I implemented Aho-Corasick algo. If someone could review my code it would be great - I do not have big experience with such algorithms. Link to implementation: gist.github.com
If there is a finite number of matching strings, you can construct a tree such that, read from root to leaves, similar strings occupy similar branches.
This is also known as a trie, or Radix Tree.
For example, we might have the strings cat, coach, con, conch as well as dark, dad, dank, do. Their trie might look like this:
A search for one of the words walks the tree, starting from the root. Making it to a leaf corresponds to a match with a seed. Along the way, each character in the string must match one of the current node's children; if it does not, you can terminate the search (e.g. you would not consider any words starting with "g" or any words beginning with "cu").
There are various algorithms for constructing the tree, searching it, and modifying it on the fly, but I thought I would give a conceptual overview of the solution instead of a specific one, since I don't know of the best algorithm for it.
Conceptually, an algorithm you might use to search the tree is related to the idea behind radix sort: a fixed number of categories or values that a character in a string might take on at a given position.
This lets you check one word against your word-list. Since you're looking for this word-list as sub-strings of your input string, there's going to be more to it than this.
Edit: As other answers have mentioned, the Aho-Corasick algorithm is a more sophisticated approach to string matching, consisting of a trie with additional links for taking "shortcuts" through the tree and a different search pattern to accompany them. (As an interesting note, Alfred Aho is also a contributor to the popular compiler textbook, Compilers: Principles, Techniques, and Tools, as well as the algorithms textbook, The Design and Analysis of Computer Algorithms. He is also a former member of Bell Labs. There does not seem to be much public information about Margaret J. Corasick.)
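A minimal sketch of the plain-trie idea described above, before any Aho-Corasick failure links are added: build the trie from the seeds, then restart a walk at every position of the input. The types and function names are mine:

#include <array>
#include <cstddef>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 256> child{};  // indexed by byte value
    bool terminal = false;
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (unsigned char c : word) {
        if (!node->child[c]) node->child[c] = std::make_unique<TrieNode>();
        node = node->child[c].get();
    }
    node->terminal = true;
}

// Naive containment check: restart a trie walk at every position of 'text'.
bool contains_any(const TrieNode& root, const std::string& text) {
    for (std::size_t i = 0; i < text.size(); ++i) {
        const TrieNode* node = &root;
        for (std::size_t j = i; j < text.size(); ++j) {
            node = node->child[static_cast<unsigned char>(text[j])].get();
            if (!node) break;
            if (node->terminal) return true;
        }
    }
    return false;
}

The Aho-Corasick refinement mentioned in the edit removes the restart at every position by adding failure links, which is what makes the scan linear in the input length.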
You can use the Aho–Corasick algorithm.
It builds a trie/automaton where some vertices are marked as terminal, which means the string contains one of the seeds.
It's built in O(sum of dictionary word lengths) and gives the answer in O(test string length). A minimal sketch follows the list of advantages below.
Advantages:
It works specifically with several dictionary words, and the check time doesn't depend on the number of words (if we don't consider cases where it doesn't fit in memory, etc.)
The algorithm is not hard to implement (compared to suffix structures, at least)
You may make it case insensitive by lowercasing each symbol if it's ASCII (non-ASCII chars don't match anyway)
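A compact, from-scratch sketch of such an automaton (narrow ASCII strings, lowercase letters only, for brevity; all names are mine and this is an illustration rather than a tuned or reviewed implementation):

#include <cctype>
#include <queue>
#include <string>
#include <vector>

class AhoCorasick {
    struct Node {
        int next[26];          // goto transitions for 'a'..'z'
        int fail = 0;          // failure link
        bool terminal = false; // some pattern ends at (or is a suffix of) this state
        Node() { for (int& n : next) n = -1; }
    };
    std::vector<Node> nodes{Node{}};   // node 0 is the root

    static int code(char c) {
        int u = std::tolower(static_cast<unsigned char>(c));
        return (u >= 'a' && u <= 'z') ? u - 'a' : -1;
    }

public:
    void add(const std::string& word) {
        int v = 0;
        for (char c : word) {
            int k = code(c);
            if (k < 0) continue;               // this sketch only handles letters
            if (nodes[v].next[k] == -1) {
                nodes[v].next[k] = static_cast<int>(nodes.size());
                nodes.emplace_back();
            }
            v = nodes[v].next[k];
        }
        nodes[v].terminal = true;
    }

    void build() {                             // BFS over the trie to fill failure links
        std::queue<int> q;
        for (int k = 0; k < 26; ++k) {
            int u = nodes[0].next[k];
            if (u == -1) nodes[0].next[k] = 0;
            else { nodes[u].fail = 0; q.push(u); }
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            if (nodes[nodes[v].fail].terminal) nodes[v].terminal = true;
            for (int k = 0; k < 26; ++k) {
                int u = nodes[v].next[k];
                if (u == -1) nodes[v].next[k] = nodes[nodes[v].fail].next[k];
                else { nodes[u].fail = nodes[nodes[v].fail].next[k]; q.push(u); }
            }
        }
    }

    bool contains_any(const std::string& text) const {
        int v = 0;
        for (char c : text) {
            int k = code(c);
            if (k < 0) { v = 0; continue; }    // reset on non-letters (simplification)
            v = nodes[v].next[k];
            if (nodes[v].terminal) return true;
        }
        return false;
    }
};

Usage would be along the lines of calling add() for each seed, build() once, and then contains_any() on every received string.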
You should try a pre-existing regex utility. It may be slower than a hand-rolled algorithm, but regex is about matching multiple possibilities, so it is likely to already be several times faster than a hashmap or a simple comparison against all strings. I believe regex implementations may already use the Aho–Corasick algorithm mentioned by RiaD, so basically you will have at your disposal a well-tested and fast implementation.
If you have C++11 you already have a standard regex library
#include <string>
#include <regex>
int main(){
    // The input string is not shown in the original answer; any received string works here.
    std::string input_string = "does this line mention stackoverflow anywhere";
    std::regex self_regex("google|yahoo|stackoverflow");
    // regex_search looks for the pattern anywhere in the string;
    // regex_match would require the entire string to match.
    bool found = std::regex_search(input_string, self_regex);
    return found ? 0 : 1;
}
I expect this to generate the best possible minimum match tree, so I expect it to be really fast (and reliable!)
One of the faster ways is to use a suffix tree (https://en.wikipedia.org/wiki/Suffix_tree), but this approach has a huge disadvantage: it is a difficult data structure with a difficult construction. Ukkonen's algorithm allows building the tree from a string in linear time: https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm
Edit: As Matthieu M. pointed out, the OP asked if a string contains a keyword. My answer only works when the string equals the keyword or if you can split the string e.g. by the space character.
Especially with a high number of possible candidates that are known at compile time, using a perfect hash function with a tool like gperf is worth a try. The main principle is that you feed a generator with your seed strings and it generates a function containing a hash function which has no collisions for any of the seed values. At runtime you give the function a string and it calculates the hash, then it checks whether it is the only possible candidate corresponding to the hash value.
The runtime cost is hashing the string and then comparing against the only possible candidate (O(1) for seed size and O(1) for string length).
To make the comparison case insensitive you just use tolower on the seed and on your string.
Because the number of strings is not big (~100), you can use the following algorithm:
Calculate the maximum length of the words you have. Let it be N.
Create an array of checksums: int checks[N];
Let the checksum be the sum of all characters in the search phrase. You can calculate such a checksum for each word from your list (which is known at compile time), and create a std::map<int, std::vector<std::wstring>>, where the int is the checksum of a string and the vector contains all your strings with that checksum.
Create an array of such maps, one for each length (up to N); this can also be done at compile time.
Now move over the big string with a pointer. When the pointer points to character X, you add the value of X to all the integers in checks, and for each of them (numbers from 1 to N) subtract the value of character (X-K), where K is the index of that integer in the checks array. So you always have the correct checksum for every length stored in the checks array.
After that, look up in the map whether there exist strings with that (length, checksum) pair, and if so, compare them.
It should give false-positive results (when checksum and length match but the phrase does not) very rarely.
So, let's say R is the length of the big string. Then looping over it takes O(R).
At each step you perform N additions of a small number (a char value) and N subtractions of a small number (a char value), which is very fast. At each step you look up a counter in the checks array, and that is O(1), because it's one memory block.
Also, at each step you find the map in the array of maps, which is also O(1), because it too is one memory block.
And inside the map you search for strings with the correct checksum in O(log F), where F is the size of the map; it will usually contain no more than 2-3 strings, so in general we can pretend that this is also O(1).
You can also check whether any strings share a checksum (which is unlikely with just 100 words), and if none do, discard the maps entirely, storing pairs instead.
So, finally, that should give O(R), with quite a small constant.
This way of calculating the checksum can be changed, but it's quite simple and very fast, with really rare false positives.
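A rough sketch of the rolling-sum idea described above; the names are mine, the maps are built at startup here rather than at compile time, and case handling is left out, as in the description:

#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Rolling additive checksums for every window length 1..N, checked against a
// per-length map from checksum to candidate seed words.
bool contains_seed(const std::wstring& text, const std::vector<std::wstring>& seeds) {
    std::size_t N = 0;
    for (const auto& s : seeds) N = std::max(N, s.size());

    // maps[len]: checksum -> seeds of that length with that checksum
    std::vector<std::map<long long, std::vector<std::wstring>>> maps(N + 1);
    for (const auto& s : seeds) {
        long long sum = 0;
        for (wchar_t c : s) sum += c;
        maps[s.size()][sum].push_back(s);
    }

    std::vector<long long> checks(N + 1, 0);   // checks[len] = sum of the last 'len' chars
    for (std::size_t i = 0; i < text.size(); ++i) {
        for (std::size_t len = N; len >= 1; --len) {
            checks[len] += text[i];
            if (i >= len) checks[len] -= text[i - len];   // drop the char leaving the window
            if (i + 1 >= len) {
                auto it = maps[len].find(checks[len]);
                if (it != maps[len].end()) {
                    // Checksum and length match: verify to rule out false positives.
                    for (const auto& s : it->second)
                        if (text.compare(i + 1 - len, len, s) == 0) return true;
                }
            }
        }
    }
    return false;
}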
As a variant on DarioOO’s answer, you could get a possibly faster implementation of a regular expression match, by coding a lex parser for your strings. Though normally used together with yacc, this is a case where lex on its own does the job, and lex parsers are usually very efficient.
This approach might fall down if all your strings are long, as then an algorithm such as Aho-Corasick, Commentz-Walter or Rabin-Karp would probably offer significant improvements, and I doubt that lex implementations use any such algorithm.
This approach is harder if you have to be able to change the strings without rebuilding, but since flex is open source you could cannibalise its code.

Suffix tree vs Suffix array for LCS

I'm working on a program to find the longest common substring between multiple strings. I've narrowed my approach down to either using suffix arrays or a suffix tree. I want to see which is the better approach (if there is one) and why. Also, for suffix arrays I've seen a few algorithms for two strings but not any for more than two strings. Any solid examples would be appreciated; thanks again for the advice!
Note: I didn't see any other questions that specifically addressed this issue, but if they exist please point me in that direction!
If you have a substring that occurs in all sequences, then in a suffix array the pointers to each occurrence of that substring must sort close together. So you could attempt to find them by moving a window along the suffix array, where the window is just large enough to contain at least one occurrence of each sequence. You could do this in linear time by maintaining a table that tells you, for each sequence, how many times that sequence occurs within the window. Then, when you move the rear end of the window forwards, decrement the count for the sequence associated with the pointer you have just skipped over and, if necessary, move the forward end of the window just far enough to pick up a new occurrence of this sequence, and update the table.
Now you need to be able to find the length of the common prefix shared by all suffixes starting at the pointers in the window. This is the minimum LCP value occurring between adjacent pointers in the window. If you use a red-black tree, such as a Java TreeSet, with a key consisting of the LCP value as the most significant component and some tie-breaker such as the pointer itself as a less significant component, then you can maintain the minimum LCP value within the window at a cost of roughly log(window size) per window adjustment.
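A sketch of that sliding window, assuming the suffix array sa, the sequence id seq[i] of each suffix sa[i], the array lcp (where lcp[i] is the LCP of suffixes sa[i-1] and sa[i]) and the number of sequences k have already been computed; the multiset plays the role of the red-black tree above, since it allows duplicate LCP values directly. All names are mine:

#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Longest common substring of k sequences via a sliding window over the suffix
// array. Assumes the sequences were concatenated with unique separators, so no
// LCP value ever spans a separator.
std::size_t lcs_window(const std::vector<int>& sa, const std::vector<int>& seq,
                       const std::vector<int>& lcp, int k) {
    std::size_t n = sa.size(), best = 0, lo = 0;
    std::vector<int> count(k, 0);
    int covered = 0;                      // how many sequences appear in the window
    std::multiset<int> window_lcp;        // LCP values between adjacent window entries

    for (std::size_t hi = 0; hi < n; ++hi) {             // extend the front of the window
        if (count[seq[hi]]++ == 0) ++covered;
        if (hi > lo) window_lcp.insert(lcp[hi]);

        while (covered == k && count[seq[lo]] > 1) {     // shrink from the rear
            --count[seq[lo]];
            ++lo;
            window_lcp.erase(window_lcp.find(lcp[lo]));
        }
        if (covered == k && !window_lcp.empty())
            best = std::max<std::size_t>(best, static_cast<std::size_t>(*window_lcp.begin()));
    }
    return best;
}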

When do we actually use a Trie?

I am starting to read about tries. I also got references from friends here: Tutorials on Trie
I am not clear on the following:
It seems that, to go on and use a trie, one assumes that all the input strings that form the search space and are used to build the trie are separated at distinct word boundaries.
E.g. all the example tutorials I have seen use input such as:
S={ball, bid, byte, car, cat, mac, map etc...}
Then we build the trie from S and do our searches (really fast)
My question is: How did we end up with S to begin with?
I mean before starting to read about tries I imagined that S would be an arbitrarily long text e.g. A Shakespeare passage.
Then using a Trie we could find things really fast.
But it seems this is not the case.
Is the assumption here that the input passage (of Shakespeare for example) is pre-processed first extracting all the words to get S?
So if one wants to search for patterns (the same way as when you Google a query that also contains spaces and see all matching pages), a trie is not appropriate?
When can we know if a Trie is the data structure that we can actually use?
Tries are useful where you have a fixed dictionary you want to look up quickly. Compared to a hashtable, it may require less storage for a large dictionary but may well take longer to look up. One example where I have used it is for mapping URLs to operations on a web server, where there may be inheritance of functionality based on the prefix. Here, recursing down a trie enables appropriate lookup of all of the methods that need to be called for a particular URL. It would also be efficient for storing a dictionary.
For doing text searches you would typically represent documents using a token vector of lexemes with weights (perhaps based on occurrence frequency), and then search against that to get a ranking of documents against a particular search vector. There are a number of standard libraries to do this which I would suggest using rather than writing your own - particularly for removing stopwords, dealing with synonyms and stemming.
We can use tries for substring searching in linear time, without preprocessing the string every time. You can find a good tutorial on suffix tree generation here:
Ukkonen's suffix tree algorithm in plain English?
As the other answers have said, a trie is useful because it provides fast string look-ups (or, more generally, look-ups for any sequence). Some examples of where I've used tries:
My answer to this question uses a (slightly modified) trie for matching sentences: it is a trie based on a sequence of words, rather than a sequence of characters. (The other answers to that question probably demonstrate the trie in action more clearly.)
I've also used a trie in a game which had a large number of rooms with names (the total number and the names were defined at run time); each of these names had to be unique, and one had to be able to search for a room with a given name quickly. A hash table could also have been used, but in some ways a trie is simpler to implement and faster when using strings. (My trie implementation ended up being ~50 lines of C.)
The trie tag probably has many more examples.
There are multiple ways to use tries. The typical example is a lookup such as the one you have presented. However, tries can also be used to fully index a complete text. Either you use Ukkonen's suffix tree algorithm to produce a suffix trie, or you explicitly construct the suffix trie by storing suffixes (much slower than Ukkonen's algorithm, but also much simpler). As this is preprocessing, which needs to be done only once, speed is not that crucial.
For this you would just take your text and insert the full text, then chop off the first letter and insert the resulting text, chop off the second letter and insert, and so on.
So if we have the text "The Text" we would insert the following set:
{"The Text", "he Text", "e Text", " Text", "Text", "ext", "xt", "t"}
In the resulting suffix trie we can easily search for any kind of prefix. Also this is space efficient, because we do not need to store the whole string, since common prefixes are stored only once.
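A minimal sketch of the "insert every suffix" construction described above; it is quadratic in time and space, which is why Ukkonen's algorithm is preferred for long texts, but it is easy to follow. The names are mine:

#include <cstddef>
#include <map>
#include <memory>
#include <string>

struct SuffixTrieNode {
    std::map<char, std::unique_ptr<SuffixTrieNode>> child;
};

// Insert all suffixes of 'text': "The Text", "he Text", ..., "t".
void build_suffix_trie(SuffixTrieNode& root, const std::string& text) {
    for (std::size_t start = 0; start < text.size(); ++start) {
        SuffixTrieNode* node = &root;
        for (std::size_t i = start; i < text.size(); ++i) {
            auto& next = node->child[text[i]];
            if (!next) next = std::make_unique<SuffixTrieNode>();
            node = next.get();
        }
    }
}

// Any substring of 'text' is now a path from the root.
bool contains(const SuffixTrieNode& root, const std::string& pattern) {
    const SuffixTrieNode* node = &root;
    for (char c : pattern) {
        auto it = node->child.find(c);
        if (it == node->child.end()) return false;
        node = it->second.get();
    }
    return true;
}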
If you need to store much longer strings space efficiently it is best not only to store prefixes together but also suffixes. In that case you could build up a directed acyclic word graph (DAWG), which is very similar to a trie in conception.
So a trie in that sense allows finding arbitrary substrings, including partial words. If you are only interested in storing words, a different data structure should be used, for example an inverted list (if order is important) or a vector-space-based retrieval algorithm (in case word order does not matter).

Hard sorting problem - what type of algorithm should I be using?

The problem:
N nodes are related to each other by a 'closeness' factor ranging from 0 to 1, where a factor of 1 means that the two nodes have nothing in common and 0 means the two nodes are exactly alike.
If two nodes are both close to another node (i.e. they have a factor close to 0) then this doesn't mean that they will be close together, although probabilistically they do have a much higher chance of being close together.
The question:
If another node is placed in the set, find the node that it is closest to in the shortest possible amount of time.
This isn't a homework question, this is a real world problem that I need to solve - but I've never taken any algorithm courses etc so I don't have a clue what sort of algorithm I should be researching.
I can index all of the nodes before another one is added and gather closeness data between each node, but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution. Any ideas or help would be much appreciated :)
Because your 'closeness' metric obeys the triangle inequality, you should be able to use a variant of BK-trees to organize your elements. Adapting them to real numbers should simply be a matter of choosing an interval to quantize your numbers on, and otherwise using the standard BK-tree procedure. Some experimentation may be required - you might want to increase the resolution of the quantization as you progress down the tree, for instance.
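A sketch of such a BK-tree over a quantized closeness metric; the closeness(a, b) functor, the 0.01 quantization step and all names here are assumptions of this illustration, not part of any library:

#include <cmath>
#include <cstdlib>
#include <map>
#include <memory>

// Quantize the [0,1] closeness into integer buckets so it can key the children map.
inline int quantize(double closeness, double step = 0.01) {
    return static_cast<int>(std::round(closeness / step));
}

template <typename T, typename Dist>
struct BKTree {
    struct Node {
        T value;
        std::map<int, std::unique_ptr<Node>> children;  // keyed by quantized distance to 'value'
    };
    std::unique_ptr<Node> root;
    Dist dist;                                          // closeness(a, b) in [0,1], 0 = identical

    void insert(const T& v) {
        if (!root) { root = std::make_unique<Node>(Node{v, {}}); return; }
        Node* node = root.get();
        for (;;) {
            int d = quantize(dist(v, node->value));
            auto& child = node->children[d];
            if (!child) { child = std::make_unique<Node>(Node{v, {}}); return; }
            node = child.get();
        }
    }

    const T* nearest(const T& query) const {
        double best = 2.0;                              // larger than any closeness in [0,1]
        const T* best_value = nullptr;
        search(root.get(), query, best, best_value);
        return best_value;
    }

private:
    // Use the triangle inequality to skip subtrees that cannot contain anything
    // closer than the best value found so far.
    void search(const Node* node, const T& query, double& best, const T*& best_value) const {
        if (!node) return;
        double d = dist(query, node->value);
        if (d < best) { best = d; best_value = &node->value; }
        int dq = quantize(d);
        for (const auto& [key, child] : node->children) {
            // One extra bucket of slack absorbs the quantization error.
            if (std::abs(key - dq) <= quantize(best) + 1)
                search(child.get(), query, best, best_value);
        }
    }
};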
"but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution"
Without any other information about the relationships between nodes, this is the only way you can do it, since you have to figure out the closeness factor between the new node and each existing node. An O(n) algorithm can be a perfectly decent solution.
One addition you might consider - keep in mind we have no idea what data structure you are using for your objects - is to organize all present nodes into a graph, where nodes with factors below a certain threshold can be considered connected, so you can first check nodes that are more likely to be similar/related.
If you want the optimal algorithm in terms of speed, but O(n^2) space, then for each node create a sorted list of other nodes (ordered by closeness).
When you get a new node, you have to add it to the indexed list of all the other nodes, and all the other nodes need to be added to its list.
To find the closest node, just find the first node on any node's list.
Since you already need O(n^2) space (in order to store all the closeness information you need basically an NxN matrix where A[i,j] represents the closeness between i and j) you might as well sort it and get O(1) retrieval.
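A small sketch of that O(n^2)-space scheme, keeping for each node a list of all the others sorted by closeness; the names are mine:

#include <algorithm>
#include <vector>

struct Neighbour {
    double closeness;   // 0 = identical, 1 = nothing in common
    int    node;
};

// neighbours[i] holds every other node, kept sorted by closeness.
using NeighbourLists = std::vector<std::vector<Neighbour>>;

// Insert a new node given its closeness to every existing node (one value per
// existing node): O(n log n) to build its own list, plus n sorted insertions.
void add_node(NeighbourLists& lists, const std::vector<double>& closeness_to_existing) {
    int id = static_cast<int>(lists.size());
    std::vector<Neighbour> own;
    for (int j = 0; j < id; ++j) {
        own.push_back(Neighbour{closeness_to_existing[j], j});
        // Insert the new node into the existing node's sorted list.
        Neighbour entry{closeness_to_existing[j], id};
        auto& lst = lists[j];
        lst.insert(std::upper_bound(lst.begin(), lst.end(), entry,
                   [](const Neighbour& a, const Neighbour& b) { return a.closeness < b.closeness; }),
                   entry);
    }
    std::sort(own.begin(), own.end(),
              [](const Neighbour& a, const Neighbour& b) { return a.closeness < b.closeness; });
    lists.push_back(std::move(own));
}

// The closest node to i is then simply lists[i].front().node.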
If this closeness forms a linear spectrum (such that closeness to something implies closeness to other things that are close to it, and not being close implies not being close to those close), then you can simply do a binary or interpolation sort on insertion for closeness, handling one extra complexity: at each point you have to see if closeness increases or decreases below or above.
For example, if we consider letters - A is close to B but far from Z - then the pre-existing elements can be kept sorted, say: A, B, E, G, K, M, Q, Z. To insert say 'F', you start by comparing with the middle element, [3] G, and the one following that: [4] K. You establish that F is closer to G than K, so the best match is either at G or to the left, and we move halfway into the unexplored region to the left... 3/2=[1] B, followed by E, and we find E's closer to F, so the match is either at E or to its right. Halving the space between our earlier checks at [3] and [1], we test at [2] and find it equally-distant, so insert it in between.
EDIT: it may work better in probabilistic situations, and require less comparisons, to start at the ends of the spectrum and work your way in (e.g. compare F to A and Z, decide it's closer to A, see if A's closer or the halfway point [3] G). Also, it might be good to finish with a comparison to the closest few points either side of where the binary/interpolation led you.
ACM Surveys September 2001 carried two papers that might be relevant, at least for background. "Searching in Metric Spaces", lead author Chavez, and "Searching in High Dimensional Spaces - Index Structures for Improving the Performance of Multimedia Databases", lead author Bohm. From memory, if all you have is the triangle inequality, you can use it to some effect, but if you can trim your data down to a sensible number of dimensions, you can do better by using a search structure that knows about this dimensional structure.
Facebook has this thing where it puts you and all of your friends in a graph, then slowly moves everyone around until people are grouped together based on mutual friends and so on.
It looked to me like they just made anything <0.5 an attractive force, anything >0.5 a repulsive force, and moved people with every iteration based on the net force. After a couple hundred iterations, it was looking pretty darn good.
Note: this is not an algorithm it is a heuristic. In the facebook implementation I saw, two people were not able to reach equilibrium and kept dancing around each other. It turns out they were actually the same person with two different accounts.
Also, it took about 15 minutes on a decent computer and ~100 nodes. YMMV.
It looks suspiciously like a Nearest Neighbor Search problem (also called a similarity search)

Prefix search in a radix tree/patricia trie

I'm currently implementing a radix tree/patricia trie (whatever you want to call it). I want to use it for prefix searches in a dictionary on a severely underpowered piece of hardware. It's supposed to work more or less like auto-completion, i. e. showing a list of words that the typed prefix matches.
My implementation is based on this article, but the code therein doesn't include prefix searches, though the author says:
[...] Say you want to enumerate all the nodes that have keys with a common prefix "AB". You can perform a depth first search starting at that root, stopping whenever you encounter back edges.
But I don't see how that is supposed to work. For example, if I build a radix tree from these words:
illness
imaginary
imagination
imagine
imitation
immediate
immediately
immense
in
I will get the exact same "best match" for the prefixes "i" and "in" so that it seems difficult to me to gather all matching words just by traversing the tree from that best match.
Additionally, there is a radix tree implementation in Java that has an implemented prefix search in RadixTreeImpl.java. That code explicitly checks all nodes (starting from a certain node) for a prefix match - it actually compares bytes.
Can anyone point me to a detailed description on implementing a prefix search on radix trees? Is the algorithm used in the Java implementation the only way to do it?
Think about what your trie encodes. At each node, you have the path that led you to that node; so in your example, you start at Λ (that's a capital Lambda; this Greek font kind of sucks), the root node corresponding to an empty string. Λ has children for each letter used, so in your data set, you have one branch, for "i".
Λ
Λ→"i"
At the "i" node, there are two children, one for "m" and one for "n". The next letter is "n", so you take that,
Λ→"i"→"n"
and since the only word that starts "i","n" in your data set is "in", there are no children from "n". That's a match.
Now, let's say the data set, instead of having "in", had "infindibulum". (What SF I'm referencing is left as an exercise.) You'd still get to the "n" node the same way, but then if the next letter you get is "q", you know the word doesn't appear in your data set at all, because there's no "q" branch. At that point, you say "okay, no match." (Maybe you then start adding the word, maybe not, depending on the application.)
But if the next letter is "f", you can keep going. You can short circuit that with a little craft, though: once you reach a node that represents a unique path, you can hang the whole string off that node. When you get to that node, you know that the rest of the string must be "findibulum", so you've used the prefix to match the whole string, and return it.
How would you use that? In a lot of non-UNIX command interpreters, like the old VAX DCL, you could use any unique prefix of a command. So, the equivalent of ls(1) was DIRECTORY, but no other command started with DIR, so you could type DIR and that was as good as typing the whole word. If you couldn't remember the correct command, you could type just 'D' and hit (I think) ESC; the DCL CLI would return all the commands that started with D, which it could search extremely fast.
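A sketch of that enumeration, shown on a plain (uncompressed) trie for brevity; with a radix tree the downward walk compares edge labels instead of single characters, but the depth-first enumeration below the prefix node is the same. With the word list above, complete(root, "i") returns all nine words, while complete(root, "in") returns just "in". All names are mine:

#include <map>
#include <memory>
#include <string>
#include <vector>

struct PrefixNode {
    std::map<char, std::unique_ptr<PrefixNode>> child;
    bool word_end = false;
};

void insert_word(PrefixNode& root, const std::string& word) {
    PrefixNode* n = &root;
    for (char c : word) {
        auto& next = n->child[c];
        if (!next) next = std::make_unique<PrefixNode>();
        n = next.get();
    }
    n->word_end = true;
}

// Depth-first enumeration of every word stored below 'n'.
void collect(const PrefixNode& n, std::string& current, std::vector<std::string>& out) {
    if (n.word_end) out.push_back(current);
    for (const auto& [c, child] : n.child) {
        current.push_back(c);
        collect(*child, current, out);
        current.pop_back();
    }
}

// All words that start with 'prefix': walk down to the prefix node, then DFS below it.
std::vector<std::string> complete(const PrefixNode& root, const std::string& prefix) {
    const PrefixNode* n = &root;
    for (char c : prefix) {
        auto it = n->child.find(c);
        if (it == n->child.end()) return {};
        n = it->second.get();
    }
    std::vector<std::string> out;
    std::string current = prefix;
    collect(*n, current, out);
    return out;
}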
It turns out the GNU extensions for the standard c++ lib includes a Patricia trie implementation. It's found under the policy-based data-structures extension. See http://gcc.gnu.org/onlinedocs/libstdc++/ext/pb_ds/trie_based_containers.html
An alternative algorithm: Keep It Simple Stupid!
Just make a sorted list of your keywords. When you have a prefix, binary search to find where that prefix would be located in the list. All of your possible completions will be found starting at that index, ready to be accessed in place.
This algorithm will require only 5% of the code of a Patricia trie and will be easy to maintain, understand, and update. It is almost certain this simple list search will be more efficient as well.
The only downside is if you have huge numbers of long keywords with similar prefixes, a trie can save some storage since it doesn't need to keep the full prefix for every entry. In practice, if you have less than a few million words, this is not a savings because the pointer overhead of the tree will dominate. This savings is more for applications like searching databases of DNA strings with millions of characters, not text keywords.
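A sketch of that sorted-list alternative using std::lower_bound; all completions of the prefix sit in one contiguous run starting at the returned iterator, which is what "ready to be accessed in place" means above. The names are mine:

#include <algorithm>
#include <string>
#include <vector>

// 'words' must be sorted. Every completion of 'prefix' follows the first word
// that is not less than the prefix, and the run ends at the first word that no
// longer starts with it.
std::vector<std::string> complete_sorted(const std::vector<std::string>& words,
                                         const std::string& prefix) {
    std::vector<std::string> out;
    auto it = std::lower_bound(words.begin(), words.end(), prefix);
    for (; it != words.end() && it->compare(0, prefix.size(), prefix) == 0; ++it)
        out.push_back(*it);
    return out;
}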
Another alternative is a ternary search tree (more memory efficient): https://github.com/varunpant/TernaryTree/tree/master/TernaryTree