Common patterns in a database - regex

I need to find common patterns in a database of sequences of events. While searching for a solution I have considered the longest common substring problem and its Python implementation.
Note that I am not searching for the longest common substring only: I accept shorter common substrings appearing frequently in the database.
Can you suggest some algorithm, implementation tricks or general advice about this problem?
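Before reaching for heavier machinery, a brute-force baseline can make the goal concrete: count, for each contiguous substring up to some length, how many sequences contain it. The sketch below is only an illustration (the toy event sequences and the frequent_substrings helper are mine, not from any library) and will not scale to a large database.

from collections import Counter

def frequent_substrings(sequences, max_len=4, min_count=2):
    # Count, for each contiguous substring of up to max_len events,
    # the number of sequences that contain it at least once.
    counts = Counter()
    for seq in sequences:
        seen = set()
        for i in range(len(seq)):
            for j in range(i + 1, min(i + max_len, len(seq)) + 1):
                seen.add(tuple(seq[i:j]))
        counts.update(seen)
    return {sub: c for sub, c in counts.items() if c >= min_count}

db = ["abcab", "abcd", "zabc"]  # made-up event sequences
print(frequent_substrings(db, max_len=3, min_count=3))
# ('a', 'b', 'c') and its sub-substrings appear in all three sequences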

Another answer suggested Apriori. But Apriori is inappropriate if you want to find frequent sequences, because it does not take the ordering of events into account (and it is also an inefficient algorithm).
If you want to find subsequences that are common to several sequences, it would be more appropriate to use a sequential pattern mining algorithm such as PrefixSpan and SPAM.
If you want to make some predictions, another option would also be to use a sequential rule mining algorithm.
I have open-source Java implementations of sequential pattern mining and sequential rule mining algorithms that you can download from my website: http://www.philippe-fournier-viger.com/spmf/
I don't think that you could process 8 GB of data in one shot with these algorithms. But it could be a starting point. Actually, some of these algorithms could be adapted for the case of very large databases by implementing a disk-based strategy.
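To make the terminology concrete: in sequential pattern mining, the support of a pattern is the number of sequences that contain it as an ordered, not necessarily contiguous, subsequence. Here is a naive sketch of that notion (the event database and helper names are invented; real algorithms such as PrefixSpan avoid testing candidates one by one):

def contains_subsequence(sequence, pattern):
    # True if pattern occurs in sequence in order, gaps allowed
    it = iter(sequence)
    return all(event in it for event in pattern)

def support(database, pattern):
    # Number of sequences in the database containing the pattern
    return sum(contains_subsequence(seq, pattern) for seq in database)

db = [["a", "b", "x", "c"], ["a", "c"], ["b", "a", "c"]]
print(support(db, ["a", "c"]))  # 3: every sequence has an "a" followed later by a "c"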

Have you considered Frequent Itemset Mining methods such as Apriori?

Best string-comparison algorithm for regex

Given a regex, I want to compare it with a list of other regex, and output a similarity score.
There are several edit distance algorithms out there (e.g. Levenshtein distance), but they are a poor fit for comparing regexes, e.g.:
R1: [a-z0-9]+
R2: [0-9]{1}[a-z0-9]+
Distance: 9
In the example above, both regexes are quite similar; however, they have a rather high edit distance. I suppose an approach using character n-grams would be more suitable for such cases.
What algorithm/approach would you consider for this problem?
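As a rough baseline for the n-gram idea mentioned above, you could treat the two patterns purely as strings, compare them as sets of character n-grams, and score them with Jaccard similarity. This knows nothing about the languages the regexes actually describe; the helper names below are mine, for illustration only.

def char_ngrams(s, n=3):
    # Set of character n-grams, with sentinel padding at the ends
    padded = "^" + s + "$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard_similarity(r1, r2, n=3):
    a, b = char_ngrams(r1, n), char_ngrams(r2, n)
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(jaccard_similarity("[a-z0-9]+", "[0-9]{1}[a-z0-9]+"))  # 0.5, many shared n-grams
print(jaccard_similarity("[a-z0-9]+", "foo(bar)?"))          # 0.0, nothing shared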
It seems you're unlikely to improve upon the regular expression parsing algorithm present in an engine itself, because you're ultimately going to be making inferences about combinations of rules.
There are a number of open source regular expression engines, many listed on Wikipedia, possibly including the one you're using.
Without having looked at the internals myself (not an insignificant caveat), my recommendation is to see if it's possible to modify a regex engine (or leverage some pre-existing debugging or testing code) to output pertinent rules-processing metadata, sub-scores if you will, from which you can then calculate an aggregate. The engines ultimately do their work deterministically, so this is theoretically possible.
If it works, this will, amongst other things, enable you to classify constructs which you define as similar with similar weights, and possibly to ignore others entirely.

Sequential pattern or itemset FP-tree

FP-growth algorithms are used for Itemset Mining. Is there a way to use these algorithms for Sequential Pattern Mining instead of Itemset Mining?
The FPGrowth algorithm is defined to be used on transactions to find itemsets. Thus, it does not care about the order of items, and each item can only appear once in a transaction.
If you want to apply it to sequences to find sequential patterns, then this is a more general problem; in other words, itemset mining is a special case of sequential pattern mining. To handle it, you would need to generalize FPGrowth. First, you would need to modify the FP-tree to store sequences in which items can appear more than once, which means changing how the branches of the tree are created. You would also need to change how the links between nodes representing the same item are handled, since an item can appear multiple times in a sequence.
But is it really a good idea? I am not sure about it. There are many sequential pattern mining algorithms. For example, you can use one of the several implementations in my SPMF data mining library (http://www.philippe-fournier-viger.com/spmf/), written in Java, so you don't need to implement one yourself.

Why don't we use word ranks for string compression?

I have 3 main questions:
Let's say I have a large text file. (1) Is replacing the words with their rank an effective way to compress the file? (I got an answer to this question: it is a bad idea.)
Also, I have come up with a new compression algorithm. I read about some widely used compression models and found that they rely on some pretty advanced concepts, like statistical redundancy and probabilistic prediction. My algorithm does not use these concepts and is a rather simple set of rules to follow while compressing and decompressing. (2) My question is: am I wasting my time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes?
(3) Furthermore, if I manage to successfully compress a string, can I extend my algorithm to other content such as videos, images, etc.?
(I understand that the third question is difficult to answer without knowledge of the compression algorithm, but the algorithm is so rudimentary and nascent that I feel ashamed to share it. Please feel free to ignore the third question if you have to.)
Your question doesn't make sense as it stands (see answer #2), but I'll try to rephrase it and you can let me know if I've captured your question. Would modeling text using the probability of individual words make for a good text compression algorithm? Answer: no. That would be a zeroth-order model, and it would not be able to take advantage of higher-order correlations, such as the conditional probability of a word given the previous word. Simple existing text compressors that look for matching strings and varied character probabilities would perform better.
Yes, you are wasting your time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes. You should first learn about the techniques that have been applied over time to model data, textual and others, and the approaches to use the modeled information to compress the data. You need to study what has already been researched for decades before developing a new approach.
The compression part may extend, but the modeling part won't.
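To see what higher-order correlations buy you, compare the entropy of the word distribution (what a zeroth-order model can exploit) with the conditional entropy of a word given its predecessor (what a first-order model can exploit). The sketch below is only an illustration: the sample text is made up, and on such a tiny sample the bigram estimate is far too optimistic, but the direction of the effect is real.

import math
from collections import Counter

def unigram_entropy(words):
    # H(word): bits per word under a zeroth-order (frequency-only) model
    counts = Counter(words)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def conditional_entropy(words):
    # H(next word | current word), estimated from bigram counts
    pair_counts = Counter(zip(words, words[1:]))
    context_counts = Counter(words[:-1])
    total_pairs = sum(pair_counts.values())
    h = 0.0
    for (prev, _), c in pair_counts.items():
        h -= (c / total_pairs) * math.log2(c / context_counts[prev])
    return h

text = "the cat sat on the mat the cat ate the rat the dog sat on the mat".split()
print(unigram_entropy(text))      # about 2.7 bits per word
print(conditional_entropy(text))  # under 1 bit per word on this toy sample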
Do you mean having a ranking table of words sorted by frequency and assigning shorter "symbols" to the words that are repeated the most, thereby reducing the amount of information that needs to be transmitted?
That's basically how Huffman coding works. The problem with compression is that you always hit a limit somewhere along the road. Of course, if the set of things you try to compress follows a particular pattern/distribution, then it's possible to be really efficient about it, but for general purposes (audio/video/text/encrypted data that appears to be random) there is no "best" compression technique, and I believe there can't be one.
Huffman coding uses frequencies of letters. You can do the same with words, or with letter frequencies in more dimensions, i.e. the frequencies of combinations of letters.
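For concreteness, here is a minimal sketch of Huffman coding applied to word frequencies rather than letters (built with Python's heapq; a real compressor would also have to transmit the code table or agree on a word ranking in advance).

import heapq
from collections import Counter

def huffman_codes(freqs):
    # Build a prefix code: frequent symbols end up with shorter bit strings.
    # Heap entries are (frequency, tiebreak, {symbol: code_so_far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

words = "the cat sat on the mat the cat ate the rat".split()
codes = huffman_codes(Counter(words))
for word, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(word, code)  # "the" (the most frequent word) gets the shortest code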

Regular Expression for detecting repeated substrings is SLOW

I am trying to come up with a GNU extended regular expression that detects repeated substrings in a string of ASCII-encoded bits. I have an expression that works -- sort of. The problem is that it executes really slowly when given a string that could have many solutions.
The expression
([01]+)(\1)+
compiles quickly, but takes about a minute to execute against the string
1010101010101010101010101010101010101010101010101010101010
I am using the regex implementation from glibc 2.5-49 (which comes with CentOS 5.5).
FWIW, the pcre library executes quickly, as in gregexp or perl directly. So the obvious, but wrong, answer is "use libpcre". I cannot easily introduce a new dependency in my project. I need to work within the std C library that comes with CentOS/RHEL.
If the input string can be of any considerable length, or if performance is at all a concern, then one of the better ways to solve this problem is not with regex, but with a more sophisticated string data structure that facilitates these kinds of queries much more efficiently.
Such a data structure is a suffix tree. Given a string S, its suffix tree is essentially the Patricia trie of all of its suffixes. Despite its seeming complexity, it can be built in linear time.
[Image: suffix tree for "BANANA" (courtesy of Wikipedia)]
You can do many kinds of queries really efficiently with a suffix tree, e.g. finding all occurrences of a substring, the longest substring that occurs at least twice, etc. The kind of string you're after is called a tandem repeat. To facilitate this query you'd have to preprocess the suffix tree in linear time so that you can do lowest common ancestor queries in constant time.
This problem is very common in computational biology, where DNA can be viewed as a VERY long string over the letters ACGT. Thus, performance and efficiency are of the utmost importance, and these very sophisticated algorithms and techniques were devised for it.
You should look into either implementing these techniques from scratch for your binary sequence, or perhaps it's easier to map your binary sequence to a "fake" DNA string and then use one of the many tools available for gene research.
See also
Wikipedia/Tandem repeats
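If, as in the example, the case you care about is a string that consists entirely of a repeated block, you don't need a suffix tree or a backtracking regex at all: a string is a repetition of a shorter block exactly when it reappears in its own doubling at some offset other than 0 and its full length. A small sketch of that check (in Python rather than C, only for the whole-string case rather than tandem repeats embedded in a longer string; the function name is mine):

def repeated_block(s):
    # Return the shortest block b with s == b * k (k >= 2), else None.
    # s is periodic iff it occurs in s + s at an index i with 0 < i < len(s);
    # the smallest such i is the length of the repeating block.
    i = (s + s).find(s, 1)
    return s[:i] if i < len(s) else None

print(repeated_block("10" * 29))    # '10'  (a string like the one above)
print(repeated_block("01100110"))   # '0110'
print(repeated_block("1011"))       # None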

What is the best autocomplete/suggest algorithm, data structure? [C++/C]

We see that Google, Firefox, and some AJAX pages show a list of probable items while the user types characters.
Can someone give a good algorithm, data structure for implementing autocomplete?
A trie is a data structure that can be used to quickly find words that match a prefix.
Edit: Here's an example showing how to use one to implement autocomplete http://rmandvikar.blogspot.com/2008/10/trie-examples.html
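A minimal sketch of that idea (in Python for brevity, even though the question asks about C/C++): insert words into a trie, walk down to the node for the typed prefix, then collect every word in the subtree below it.

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()
    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    def complete(self, prefix):
        # All stored words starting with prefix
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results, stack = [], [(node, prefix)]
        while stack:
            node, word = stack.pop()
            if node.is_word:
                results.append(word)
            for ch, child in node.children.items():
                stack.append((child, word + ch))
        return results

t = Trie()
for w in ["car", "card", "care", "cat", "dog"]:
    t.insert(w)
print(t.complete("ca"))  # ['car', 'card', 'care', 'cat'] (order may vary)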
Here's a comparison of 3 different auto-complete implementations (though it's in Java not C++).
* In-Memory Trie
* In-Memory Relational Database
* Java Set
When looking up keys, the trie is marginally faster than the Set implementation. Both the trie and the set are a good bit faster than the relational database solution.
The setup cost of the Set is lower than the Trie or DB solution. You'd have to decide whether you'd be constructing new "wordsets" frequently or whether lookup speed is the higher priority.
These results are in Java; your mileage may vary with a C++ solution.
For large datasets, a good candidate for the backend would be Ternary search trees. They combine the best of two worlds: the low space overhead of binary search trees and the character-based time efficiency of digital search tries.
See the article in Dr. Dobb's Journal: http://www.ddj.com/windows/184410528
The goal is the fast retrieval of a finite result set as the user types. Let's first consider that to search for "computer science" you can start typing from "computer" or "science" but not from "omputer". So, given a phrase, generate the sub-phrases starting at each word. Now, for each of these phrases, feed them into the TST (ternary search tree). Each node in the TST will represent a prefix of a phrase that has been typed so far. We will store the best 10 (say) results for that prefix in that node. If there are many more candidates than the fixed number of results (10 here) for a node, there should be a ranking function to resolve the competition between results.
The tree can be rebuilt once every few hours, depending on the dynamism of the data. If the data is updated in real time, then I guess some other algorithm will give a better balance. In this case, the absolute requirement is lightning-fast retrieval of results for every keystroke typed, which the TST does very well.
More complications will arise if the suggestion of spelling corrections is involved. In that case, the edit distance algorithms will have to be considered as well.
For small datasets like a list of countries, a simple implementation of a trie will do. If you are going to implement such an autocomplete drop-down in a web application, the autocomplete widget of YUI3 will do everything for you once you provide the data as a list. If you use YUI3 only as the frontend for an autocomplete backed by large data, build the TST-based web service in C++ and then use the script-node data source of the autocomplete widget to fetch data from the web service instead of from a simple list.
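A bare-bones ternary search tree, without the per-node top-10 lists or the ranking function described above, might look like the following sketch (Python used here for brevity; the class and method names are mine):

class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.left = self.eq = self.right = None
        self.is_end = False

class TST:
    def __init__(self):
        self.root = None
    def insert(self, word):
        self.root = self._insert(self.root, word, 0)
    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = TSTNode(ch)
        if ch < node.ch:
            node.left = self._insert(node.left, word, i)
        elif ch > node.ch:
            node.right = self._insert(node.right, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_end = True
        return node
    def complete(self, prefix):
        # All stored phrases that start with the given prefix
        node = self._find(self.root, prefix, 0)
        if node is None:
            return []
        results = [prefix] if node.is_end else []
        self._collect(node.eq, prefix, results)
        return results
    def _find(self, node, prefix, i):
        if node is None:
            return None
        ch = prefix[i]
        if ch < node.ch:
            return self._find(node.left, prefix, i)
        if ch > node.ch:
            return self._find(node.right, prefix, i)
        if i + 1 == len(prefix):
            return node
        return self._find(node.eq, prefix, i + 1)
    def _collect(self, node, prefix, results):
        # In-order traversal yields completions in lexicographic order
        if node is None:
            return
        self._collect(node.left, prefix, results)
        if node.is_end:
            results.append(prefix + node.ch)
        self._collect(node.eq, prefix + node.ch, results)
        self._collect(node.right, prefix, results)

t = TST()
for phrase in ["computer science", "computer", "computation", "science"]:
    t.insert(phrase)
print(t.complete("comp"))  # ['computation', 'computer', 'computer science']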
Segment trees can be used to implement autocomplete efficiently.
If you want to suggest the most popular completions, a "Suggest Tree" may be a good choice:
Suggest Tree
For a simple solution: you generate 'candidates' within a minimum edit (Levenshtein) distance (1 or 2), then you test for the existence of each candidate with a hash container (set will suffice for a simple solution; otherwise use unordered_set from TR1 or Boost).
Example:
You wrote carr and you want car.
arr is generated by 1 deletion. Is arr in your unordered_set? No. crr is generated by 1 deletion. Is crr in your unordered_set? No. car is generated by 1 deletion. Is car in your unordered_set? Yes, you win.
Of course there is also insertion, deletion, transposition, etc.
You can see that the algorithm for generating the candidates is really where the time is spent, especially if you have a very small unordered_set.
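A minimal sketch of that candidate-generation idea (in Python; the plain set below stands in for the unordered_set, and the helper names are mine):

def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    # All strings at edit distance 1: deletions, transpositions,
    # substitutions and insertions.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

dictionary = {"car", "card", "care", "cart"}  # stand-in for the unordered_set

def suggestions(word):
    # Keep only candidates that actually exist in the dictionary
    return sorted(edits1(word) & dictionary)

print(suggestions("carr"))  # ['car', 'card', 'care', 'cart']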