sequential pattern or itemset fp tree - data-mining

FP-growth algorithms are used for Itemset Mining. Is there a way to use these algorithms for Sequential Pattern Mining instead of Itemset Mining?

The FPGrowth algorithm is defined over transactions and finds itemsets. It therefore ignores the order of items, and each item can appear at most once in a transaction.
If you want to apply it to sequences to find sequential patterns, you are facing a more general problem: itemset mining is a special case of sequential pattern mining. To handle it, you would need to generalize FPGrowth. First, you would need to modify the FP-tree to store sequences in which items can appear more than once, which means changing how the branches of the tree are created. You would also need to change how the links between nodes representing the same item are handled, since an item can appear several times in one sequence.
But is it really a good idea? I am not sure. There are already many sequential pattern mining algorithms. For example, you can use one of the several implementations in my SPMF data mining library (http://www.philippe-fournier-viger.com/spmf/ ), written in Java, so you don't need to implement one yourself.

Related

Clarification needed about min/sim hashing + LSH

I have a reasonable understanding of a technique to detect similar documents
consisting in first computing their minhash signatures (from their shingles, or
n-grams), and then use an LSH-based algorithm to cluster them efficiently
(i.e. avoid the quadratic complexity which would entail a naive pairwise
exhaustive search).
What I'm trying to do is to bridge three different algorithms, which I think are
all related to this minhash + LSH framework, but in non-obvious ways:
(1) The algorithm sketched in Section 3.4.3 of Chapter 3 of the book Mining of Massive Datasets (Rajaraman and Ullman), which seems to be the canonical description of minhashing
(2) Ryan Moulton's Simple Simhashing article
(3) Charikar's so-called SimHash algorithm, described in this article
I find this confusing because what I assume is that although (2) uses the term
"simhashing", it's actually doing minhashing in a way similar to (1), but with
the crucial difference that a cluster can only be represented by a single
signature (even though multiple hash functions might be involved), while two
documents have more chances of being similar with (1), because the signatures
can collide in multiple "bands". (3) seems like a different beast altogether, in
that the signatures are compared in terms of their Hamming distance, and the LSH
technique implies multiple sorting of the signatures, instead of banding them. I
also saw (somewhere else) that this last technique can incorporate a notion of
weighting, which can be used to put more emphasis on certain document parts, and
which seems to lack in (1) and (2).
So at last, my two questions:
(a) Is there a (satisfying) way in which to bridge those three algorithms?
(b) Is there a way to import this notion of weighting from (3) into (1)?
Article 2 is actually discussing minhash, but has erroneously called it simhash. That's probably why it is now deleted (it's archived here). Also, confusingly, it talks about "concatenating" multiple minhashes, which as you rightly observe reduces the chance of finding two similar documents. It seems to prescribe an approach that yields only a single "band", which will give very poor results.
Can the algorithms be bridged/combined?
Probably not. To see why, you should understand what the properties of the different hashes are, and how they are used to avoid n² comparisons between documents.
Properties of minhash and simhash:
Essentially, minhash generates multiple hashes per document, and when two documents are similar it is likely that a subset of these hashes will be identical. Simhash generates a single hash per document, and when two documents are similar it is likely that their simhashes will be similar (separated by a small Hamming distance).
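To make the contrast concrete, here is a minimal Python sketch of both signatures. It is an illustration under assumptions, not the code from any of the three references: the helper names (`_h`, `shingles`, `minhash`, `simhash`, `hamming`), the use of seeded BLAKE2b as the hash family, and the word 3-gram shingling are all choices made for this example.

```python
import hashlib

def _h(token: str, seed: int = 0) -> int:
    """Deterministic 64-bit hash of a token; the seed selects a hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def shingles(text: str, k: int = 3) -> set:
    """Set of word k-grams (assumed feature extraction for this sketch)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(features: set, num_hashes: int = 100) -> list:
    """One minimum per hash function: many hashes per document."""
    return [min(_h(f, seed) for f in features) for seed in range(num_hashes)]

def simhash(features: set, bits: int = 64) -> int:
    """One hash per document: each feature votes +1/-1 on every bit."""
    counts = [0] * bits
    for f in features:
        h = _h(f)
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two simhashes."""
    return bin(a ^ b).count("1")
```

Two minhash signatures are compared slot by slot (the fraction of equal slots estimates the Jaccard similarity of the feature sets), while two simhashes are compared with `hamming`.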
How they solve the n² problem:
With minhash you index all hashes to the documents that contain them; so if you are storing 100 hashes per document, then for each new incoming document you need to look up each of its 100 hashes in your index and find all documents that share at least (e.g.) 50% of them. This could mean building large temporary tallies, as hundreds of thousands of documents could share at least one hash.
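The lookup just described can be sketched as an inverted index with a tally. This is a rough illustration only: the class name `MinhashIndex`, the `min_shared` threshold, and the choice to key postings by (slot, hash value) are assumptions of this example, and signatures are taken to be plain lists of integers.

```python
from collections import Counter, defaultdict

class MinhashIndex:
    """Maps each (slot, hash value) pair to the documents that contain it."""

    def __init__(self, min_shared: float = 0.5):
        self.min_shared = min_shared          # fraction of slots that must match
        self.postings = defaultdict(set)      # (slot, value) -> set of doc ids

    def add(self, doc_id, signature):
        for slot, value in enumerate(signature):
            self.postings[(slot, value)].add(doc_id)

    def query(self, signature):
        """Return {doc_id: fraction of shared slots} for sufficiently similar docs."""
        tally = Counter()                     # the temporary tally mentioned above
        for slot, value in enumerate(signature):
            for doc_id in self.postings.get((slot, value), ()):
                tally[doc_id] += 1
        needed = self.min_shared * len(signature)
        return {d: n / len(signature) for d, n in tally.items() if n >= needed}
```

The `tally` counter is the part that can grow large when many documents share at least one hash with the query.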
With simhash there is a clever technique of storing each document's hash in multiple permutations in multiple sorted tables, such that any similar hash up to a certain Hamming distance (3, 4, 5, possibly as high as 6 or 7 depending on hash size and table structure) is guaranteed to be found nearby in at least one of these tables, differing only in the low-order bits. This makes searching for similar documents efficient, but restricts you to finding only very similar documents. See this simhash tutorial.
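A simplified variant of the sorted-table idea can be sketched as follows. It is not the scheme from the linked tutorial: it assumes the plain pigeonhole version, where a hash is cut into k+1 disjoint blocks so that any hash within Hamming distance k agrees exactly on at least one block, and each table stores the hashes rotated so that one block sits in the high-order bits. The class name `SimhashTables` and its parameters are hypothetical.

```python
import bisect

class SimhashTables:
    def __init__(self, bits: int = 64, max_distance: int = 3):
        self.bits = bits
        self.k = max_distance
        self.block_bits = bits // (max_distance + 1)   # k+1 disjoint blocks
        self.mask = (1 << bits) - 1
        self.tables = [[] for _ in range(max_distance + 1)]

    def _rotl(self, h: int, i: int) -> int:
        """Rotate left so that block i ends up in the high-order bits."""
        r = i * self.block_bits
        return ((h << r) | (h >> (self.bits - r))) & self.mask

    def _rotr(self, h: int, i: int) -> int:
        """Inverse of _rotl."""
        r = i * self.block_bits
        return ((h >> r) | (h << (self.bits - r))) & self.mask

    def add(self, h: int) -> None:
        for i, table in enumerate(self.tables):
            bisect.insort(table, self._rotl(h, i))     # keep each table sorted

    def query(self, h: int) -> set:
        """All stored hashes within max_distance bits of h."""
        found = set()
        shift = self.bits - self.block_bits
        for i, table in enumerate(self.tables):
            q = self._rotl(h, i)
            lo = (q >> shift) << shift                 # same top block, low bits 0
            hi = lo | ((1 << shift) - 1)               # same top block, low bits 1
            start = bisect.bisect_left(table, lo)
            end = bisect.bisect_right(table, hi)
            for cand in table[start:end]:              # neighbours in sorted order
                if bin(cand ^ q).count("1") <= self.k:
                    found.add(self._rotr(cand, i))
        return found
```

Each table only needs a binary search plus a scan of the entries that share the top block, which is why candidates differ from the query only in the lower-order bits of that table's permutation.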
Because minhash and simhash are used so differently, I don't see a way to bridge/combine them. You could theoretically have a minhash that generates 1-bit hashes and concatenate them into something like a simhash, but I don't see any benefit in this.
Can weighting be used in minhash?
Yes. Think of the minhashes as slots: if you store 100 minhashes per document, that's 100 slots you can fill. These slots don't all have to be filled from the same selection of document features. You could devote a certain number of slots to hashes calculated just from specific features (care should be taken, though, to always select features in such a way that the same features will be selected in similar documents).
So for example if you were dealing with journal articles, you could assign some slots for minhashes of document title, some for minhashes of article abstract, and the remainder of the slots for minhashes of the document body. You can even set aside some individual slots for direct hashes of things such as author's surname, without bothering about the minhash algorithm.
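A rough sketch of this slot allocation is shown below. The field names (`title`, `abstract`, `body`), the slot counts, and the helpers `_h` and `fielded_signature` are all made up for illustration, and features are simply lowercase words.

```python
import hashlib

def _h(token: str, seed: int) -> int:
    """Deterministic 64-bit hash; the seed selects the hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def fielded_signature(doc: dict, slots: list) -> list:
    """Build a minhash signature where each field owns a fixed range of slots.

    slots is a list of (field name, number of slots); more slots = more weight.
    """
    signature = []
    seed = 0
    for field, count in slots:
        features = set(doc.get(field, "").lower().split())
        for s in range(seed, seed + count):
            # default=0 marks an empty field (assumption for this sketch)
            signature.append(min((_h(f, s) for f in features), default=0))
        seed += count
    return signature

# Hypothetical allocation: 20 slots for the title, 30 for the abstract, 50 for the body.
SLOTS = [("title", 20), ("abstract", 30), ("body", 50)]
```

Because the same field always fills the same slot range with the same hash functions, two articles with identical titles agree on all title slots regardless of how much their bodies differ.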
It's not quite as elegant as how simhash does it, where in effect you're just creating a bitwise weighted average of all the individual feature hashes, then rounding each bit up to 1 or down to 0. But it's quite workable.

Common patterns in a database

I need to find common patterns in a database of sequences of events. So far I have looked at the longest common substring problem and the Python implementation as a possible solution.
Note that I am not searching for the longest common substring only: I accept shorter common substrings appearing frequently in the database.
Can you suggest some algorithm, implementation tricks or general advice about this problem?
The previous answer suggested Apriori. But Apriori is inappropriate if you want to find frequent sequences, because it does not consider time, i.e. the order of events (Apriori is also an inefficient algorithm).
If you want to find subsequences that are common to several sequences, it would be more appropriate to use a sequential pattern mining algorithm such as PrefixSpan or SPAM.
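To give an idea of what such an algorithm does, here is a minimal Python sketch in the spirit of PrefixSpan. It is not the SPMF implementation: it assumes each sequence is a list of single items (no itemsets per time step), that `min_support` is an absolute count, and that gaps between matched items are allowed. The function name `prefixspan` is just a label for this example.

```python
def prefixspan(sequences, min_support):
    """Return [(pattern, support)] for every frequent sequential pattern."""
    results = []

    def mine(prefix, projected):
        # projected: list of (sequence index, position where the suffix starts)
        counts = {}
        for sid, start in projected:
            for item in set(sequences[sid][start:]):   # count once per sequence
                counts[item] = counts.get(item, 0) + 1
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + [item]
            results.append((pattern, support))
            # Project: keep only what follows the first occurrence of the item.
            next_projected = []
            for sid, start in projected:
                seq = sequences[sid]
                if item in seq[start:]:
                    next_projected.append((sid, seq.index(item, start) + 1))
            mine(pattern, next_projected)

    mine([], [(i, 0) for i in range(len(sequences))])
    return results
```

For example, on the three sequences `a b c`, `a c b` and `a b` with `min_support=2`, the pattern `a b` is reported with support 3, while `b c` is dropped because only one sequence contains `b` followed by `c`.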
If you want to make some predictions, another option would also be to use a sequential rule mining algorithm.
I have open-source Java implementations of sequential pattern mining and sequential rule mining algorithms that you can download from my website: http://www.philippe-fournier-viger.com/spmf/
I don't think that you could process 8 GB of data in one shot with these algorithms. But it could be a starting point. Actually, some of these algorithms could be adapted for the case of very large databases by implementing a disk-based strategy.
Have you considered Frequent Itemset Mining methods such as Apriori?

Building an Intrusion Detection System using fuzzy logic

I want to develop an Intrusion Detection System (IDS) that might be used with one of the KDD datasets. In the present case, my dataset has 42 attributes and more than 4,000,000 rows of data.
I am trying to build my IDS using fuzzy association rules, hence my question: what is currently considered the best tool for fuzzy logic in this context?
Fuzzy association rule algorithms are often extensions of normal association rule algorithms like Apriori and FP-growth that model uncertainty using probability ranges. I thus assume that your data consists of quite uncertain measurements and that you therefore want to group the measurements into more general ranges such as 'low'/'medium'/'high'. From there on you can use any normal association rule algorithm to find the rules for your IDS (I'd suggest FP-growth, as it has lower complexity than Apriori for large data sets).
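The grouping step could look roughly like the sketch below, which turns one numeric record into a transaction of 'attribute=range' items that an ordinary association rule miner can consume. The cut-off values are invented for illustration and would have to be chosen from the actual data; the function names are hypothetical as well, and the bins here are crisp rather than fuzzy.

```python
def discretize(value: float, low_max: float, high_min: float) -> str:
    """Map a number to 'low', 'medium' or 'high' using two cut-offs."""
    if value < low_max:
        return "low"
    if value >= high_min:
        return "high"
    return "medium"

def to_transaction(record: dict, cutoffs: dict) -> set:
    """Turn a numeric record into a set of items for an itemset miner."""
    return {
        f"{attr}={discretize(record[attr], low_max, high_min)}"
        for attr, (low_max, high_min) in cutoffs.items()
    }

# Hypothetical cut-offs for two numeric attributes (illustrative values only).
CUTOFFS = {"duration": (1, 60), "src_bytes": (100, 1000)}
```

Each row of the dataset then becomes one transaction, for example `{"duration=low", "src_bytes=high"}`.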

Regular Expression for detecting repeated substrings is SLOW

I am trying to come up with a GNU extended regular expression that detects repeated substrings in a string of ASCII-encoded bits. I have an expression that works -- sort of. The problem is that it executes really slowly when given a string that could have many solutions.
The expression
([01]+)(\1)+
compiles quickly, but takes about a minute to execute against the string
1010101010101010101010101010101010101010101010101010101010
I am using the regex implementation from glibc 2.5-49 (comes with CentOS 5.5).
FWIW, the pcre library executes quickly, as in gregexp or perl directly. So the obvious, but wrong, answer is "use libpcre". I cannot easily introduce a new dependency in my project. I need to work within the std C library that comes with CentOS/RHEL.
If the input string can be of any considerable length, or if performance is at all a concern, then one of the better ways to solve this problem is not with regex, but with a more sophisticated string data structure that facilitates these kinds of queries much more efficiently.
Such a data structure is a suffix tree. Given a string S, its suffix tree is essentially the Patricia trie of all of its suffixes. Despite its seeming complexity, it can be built in linear time.
Suffix tree for "BANANA" (courtesy of Wikipedia)
You can do many kinds of queries really efficiently with a suffix tree, e.g. finding all occurrences of a substring, the longest substring that occurs at least twice, etc. The kind of strings that you're after are called tandem repeats. To facilitate this query you'd have to preprocess the suffix tree in linear time so you can do lowest common ancestor queries in constant time.
This problem is very common in computational biology, where DNA can be viewed as a VERY long string over the letters ACGT. Performance and efficiency are therefore of utmost importance, and these very sophisticated algorithms and techniques were devised.
You should look into either implementing these techniques from scratch for your binary sequence, or, perhaps more easily, mapping your binary sequence to a "fake" DNA string and then using one of the many tools available for gene research.
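Before investing in a suffix tree, it may help to see what is being searched for. The following Python sketch is a brute-force tandem repeat scan, explicitly not the suffix-tree method: it simply tries every start position and every unit length, which takes polynomial time in the length of the string. The function name `tandem_repeats` is an assumption of this example.

```python
def tandem_repeats(s: str) -> list:
    """Return (start, unit, copies) for every unit repeated back to back."""
    found = []
    n = len(s)
    for start in range(n):
        # A unit must fit at least twice into the rest of the string.
        for period in range(1, (n - start) // 2 + 1):
            unit = s[start:start + period]
            copies = 1
            while s.startswith(unit, start + copies * period):
                copies += 1
            if copies >= 2:
                found.append((start, unit, copies))
    return found
```

The scan reports every start position separately, so a long run such as `101010` shows up several times (from position 0 and again from position 2); keeping only the first hit mirrors what the regex `([01]+)(\1)+` is asked to detect.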
See also
Wikipedia/Tandem repeats

What is the best autocomplete/suggest algorithm,datastructure [C++/C]

Google, Firefox and some AJAX pages show a list of probable items while the user types characters.
Can someone give a good algorithm, data structure for implementing autocomplete?
A trie is a data structure that can be used to quickly find words that match a prefix.
Edit: Here's an example showing how to use one to implement autocomplete http://rmandvikar.blogspot.com/2008/10/trie-examples.html
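As a minimal sketch of the idea (in Python for brevity rather than C++, and not taken from the linked post), a trie with prefix completion can look like this. The class and method names are made up for the example.

```python
class Trie:
    def __init__(self):
        self.children = {}      # character -> child Trie node
        self.is_word = False    # True if a word ends at this node

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def complete(self, prefix: str, limit: int = 10) -> list:
        """Return up to `limit` stored words starting with prefix, in sorted order."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []       # no word has this prefix
        out = []
        stack = [(node, prefix)]
        while stack and len(out) < limit:
            cur, text = stack.pop()
            if cur.is_word:
                out.append(text)
            # Push in reverse so the smallest character is visited first.
            for ch in sorted(cur.children, reverse=True):
                stack.append((cur.children[ch], text + ch))
        return out
```

Finding the node for a prefix costs one step per typed character, independent of how many words are stored; only the enumeration of completions depends on the size of the subtree.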
Here's a comparison of 3 different auto-complete implementations (though it's in Java not C++).
* In-Memory Trie
* In-Memory Relational Database
* Java Set
When looking up keys, the trie is marginally faster than the Set implementation. Both the trie and the set are a good bit faster than the relational database solution.
The setup cost of the Set is lower than the Trie or DB solution. You'd have to decide whether you'd be constructing new "wordsets" frequently or whether lookup speed is the higher priority.
These results are in Java, your mileage may vary with a C++ solution.
For large datasets, a good candidate for the backend would be Ternary search trees. They combine the best of two worlds: the low space overhead of binary search trees and the character-based time efficiency of digital search tries.
See in Dr. Dobbs Journal: http://www.ddj.com/windows/184410528
The goal is the fast retrieval of a finite result set as the user types. Let's first consider that to search for "computer science" you can start typing from "computer" or "science" but not "omputer". So, given a phrase, generate the sub-phrases that start at a word boundary. Then feed each of these sub-phrases into the TST (ternary search tree). Each node in the TST will represent a prefix of a phrase that has been typed so far. We will store the best 10 (say) results for that prefix in that node. If there are many more candidates than the finite number of results (10 here) for a node, there should be a ranking function to resolve competition between two results.
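The scheme above can be sketched as follows (in Python for brevity; the answer suggests C++ for a production backend). The names `TSTNode` and `SuggestTST` are hypothetical, and a plain numeric score stands in for the ranking function, which is an assumption of this example.

```python
class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.top = []                 # best (score, phrase) pairs for this prefix

class SuggestTST:
    def __init__(self, k: int = 10):
        self.k = k
        self.root = None

    def add_phrase(self, phrase: str, score: float) -> None:
        words = phrase.split()
        for i in range(len(words)):   # every sub-phrase starting at a word
            self._insert(" ".join(words[i:]), phrase, score)

    def _insert(self, key, phrase, score):
        parent, attr, node, i = None, None, self.root, 0
        while i < len(key):
            ch = key[i]
            if node is None:          # create the missing node and link it
                node = TSTNode(ch)
                if parent is None:
                    self.root = node
                else:
                    setattr(parent, attr, node)
            if ch < node.ch:
                parent, attr, node = node, "lo", node.lo
            elif ch > node.ch:
                parent, attr, node = node, "hi", node.hi
            else:
                self._remember(node, phrase, score)
                i += 1
                parent, attr, node = node, "eq", node.eq

    def _remember(self, node, phrase, score):
        if any(p == phrase for _, p in node.top):
            return
        node.top.append((score, phrase))
        node.top.sort(key=lambda t: (-t[0], t[1]))   # ranking: score, then name
        del node.top[self.k:]                        # keep only the best k

    def suggest(self, prefix: str) -> list:
        node, last, i = self.root, None, 0
        while node is not None and i < len(prefix):
            ch = prefix[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                last, node, i = node, node.eq, i + 1
        if last is None or i < len(prefix):
            return []
        return [phrase for _, phrase in last.top]
```

Since every node already holds its best results, answering a keystroke is a single walk down the tree with no subtree traversal, which is what makes the per-keystroke lookup fast at the cost of a more expensive build.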
The tree can be built once every few hours, depending on the dynamism of the data. If the data is in real time, then I guess some other algorithm will give a better balance. In this case, the absolute requirement is the lightning-fast retrieval of results for every keystroke typed which it does very well.
More complications will arise if the suggestion of spelling corrections is involved. In that case, the edit distance algorithms will have to be considered as well.
For small datasets like a list of countries, a simple implementation of Trie will do. If you are going to implement such an autocomplete drop-down in a web application, the autocomplete widget of YUI3 will do everything for you after you provide the data in a list. If you use YUI3 as just the frontend for an autocomplete backed by large data, make the TST based web services in C++, and then use script node data source of the autocomplete widget to fetch data from the web service instead of a simple list.
Segment trees can be used for efficiently implementing auto complete
If you want to suggest the most popular completions, a "Suggest Tree" may be a good choice:
Suggest Tree
For a simple solution: generate each 'candidate' at a small edit (Levenshtein) distance (1 or 2) from the typed word, then test whether the candidate exists using a hash container (a set will suffice for a simple solution; otherwise use unordered_set from TR1 or Boost).
Example:
You wrote carr and you want car.
arr is generated by 1 deletion. Is arr in your unordered_set? No.
crr is generated by 1 deletion. Is crr in your unordered_set? No.
car is generated by 1 deletion. Is car in your unordered_set? Yes, you win.
Of course there are also insertions, substitutions, transpositions, etc.
You can see that generating the candidates is really where the time goes, especially if your unordered_set is very small.
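The candidate generation can be sketched in Python as follows, with a built-in `set` playing the role of `unordered_set`. The function names `edits1` and `suggest` and the lowercase ASCII alphabet are assumptions of this example; only edit distance 1 is covered.

```python
import string

def edits1(word: str, alphabet: str = string.ascii_lowercase) -> set:
    """All strings at edit distance 1 (with transpositions) from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | transposes | replaces | inserts) - {word}

def suggest(word: str, dictionary: set) -> set:
    """Return the word itself if known, else every known candidate at distance 1."""
    if word in dictionary:
        return {word}
    return edits1(word) & dictionary
```

The number of candidates grows with both the word length and the alphabet size, while each membership test against the set is cheap, which matches the observation that generation dominates the cost.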