What are the advantages of finding Maximal Frequent Itemsets - data-mining

I am studying association rule mining in data mining. There are three types of frequent itemsets:
Frequent Itemsets
Closed Frequent Itemsets
Maximal Frequent Itemsets
To generate the association rules, we can use frequent itemsets or closed frequent itemsets (the frequent itemsets can be derived from the closed frequent itemsets).
There are many algorithms for finding maximal frequent itemsets (MAFIA, Max-Miner, DepthProject, GenMax, ...).
What are the advantages of finding maximal frequent itemsets? What is the main idea?
Thanks.

The main idea is that when looking for long itemsets with low support, you will end up exhausting all your memory on frequent but uninteresting, redundant, short itemsets.
To experience this, get some large, real data and run itemset mining on that; not the toy examples used in lectures.

When the size of the dataset and the number of frequent itemsets in it are huge, finding all frequent itemsets is infeasible. The advantages of finding maximal frequent itemsets (MFIs) over frequent itemsets (FIs) are:
All MFIs can be generated without first finding all FIs.
Once we have found the maximal frequent itemsets, we can generate all frequent itemsets from them, because every subset of a frequent itemset is frequent; a single additional scan then suffices to count their supports.
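As a minimal sketch of this point (toy data, not tied to any particular library): recovering the collection of frequent itemsets from the maximal ones is just subset enumeration, although the exact supports still require one more pass over the database.

from itertools import combinations

def frequent_from_maximal(maximal_itemsets):
    # Every non-empty subset of a maximal frequent itemset is frequent,
    # so subset enumeration recovers the whole collection of frequent
    # itemsets; their exact supports need one extra database scan.
    frequent = set()
    for mfi in maximal_itemsets:
        items = sorted(mfi)
        for r in range(1, len(items) + 1):
            for subset in combinations(items, r):
                frequent.add(frozenset(subset))
    return frequent

# hypothetical maximal frequent itemsets
mfis = [frozenset({"a", "b", "c"}), frozenset({"b", "d"})]
print(sorted(sorted(s) for s in frequent_from_maximal(mfis)))
# [['a'], ['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b'], ['b', 'c'], ['b', 'd'], ['c'], ['d']]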

Related

Sequential pattern or itemset FP-tree

FP-growth algorithms are used for Itemset Mining. Is there a way to use these algorithms for Sequential Pattern Mining instead of Itemset Mining?
The FPGrowth algorithm is defined to be used on transactions to find itemsets. Thus, it does not care about the order of items, and each item can only appear once in a transaction.
If you want to apply it to sequences to find sequential patterns, then this is a more general problem. In other words, itemset mining is a special case of sequential pattern mining. To handle this problem, you would need to generalize FPGrowth. First, you would need to modify the FP-tree to store sequences where items can appear more than once. This means changing how the branches of the tree are created. You would also need to change how the links between nodes representing the same item are handled, since an item can appear multiple times per sequence.
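To make the input-format difference concrete, here is a toy sketch (hypothetical data) of the two kinds of databases; the ordering and the repeated items in the second one are exactly what a standard FP-tree cannot encode directly.

# A transaction database: order is irrelevant and each item appears
# at most once per transaction, which is what FPGrowth assumes.
transactions = [
    {"a", "b", "c"},
    {"b", "c"},
]

# A sequence database: order matters and items may repeat, so a
# standard FP-tree cannot represent it without the changes above.
sequences = [
    ["a", "b", "a", "c"],  # "a" occurs twice
    ["b", "c", "b"],
]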
But is it really a good idea? I am not sure about it. There are many sequential pattern mining algorithms. For example, you can use the several implementations in my SPMF data mining library (http://www.philippe-fournier-viger.com/spmf/), implemented in Java, so you don't need to implement one yourself.

Real reason for the speed-up in fastText

What is the real reason for the speed-up, given that the pipeline described in the fastText paper uses the same techniques, negative sampling and hierarchical softmax, as the earlier word2vec papers? I am not able to clearly understand the actual difference that makes this speed-up happen.
Is there that much of a speed-up?
I don't think there are any algorithmic breakthroughs which make the word2vec-equivalent word-vector training in FastText significantly faster. (And if you're using the character-ngrams option in FastText, to allow post-training synthesis of vectors for unseen words based on substrings shared with training-words, I'd expect the training to be slower, because every word requires training of its substring vectors as well.)
Any speedups in FastText are likely just because the code is well-tuned, with the benefit of more implementation experience.
To be efficient on datasets with a very large number of categories, fastText uses a hierarchical classifier instead of a flat structure, in which the different categories are organized in a tree (think binary tree instead of list). This reduces the time complexities of training and testing text classifiers from linear to logarithmic with respect to the number of classes. fastText also exploits the fact that classes are imbalanced (some classes appearing more often than others) by using the Huffman algorithm to build the tree used to represent categories. The depth in the tree of very frequent categories is therefore smaller than for infrequent ones, leading to further computational efficiency.
Reference link: https://research.fb.com/blog/2016/08/fasttext/
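The Huffman part is easy to sketch. Here is a minimal illustration with hypothetical label counts (not fastText's actual code) showing how imbalanced class frequencies translate into shorter code paths, i.e. cheaper evaluations, for frequent classes.

import heapq
import itertools

def huffman_depths(class_freqs):
    # Build a Huffman tree over class frequencies and return the depth
    # (code length) of each class: frequent classes end up closer to
    # the root, so computing their scores is cheaper.
    counter = itertools.count()  # tie-breaker so heapq never compares dicts
    heap = [(freq, next(counter), {label: 0}) for label, freq in class_freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

# hypothetical, imbalanced label counts
freqs = {"news": 5000, "sports": 3000, "tech": 1500, "arts": 400, "obits": 100}
print(huffman_depths(freqs))
# e.g. {'news': 1, 'sports': 2, 'tech': 3, 'arts': 4, 'obits': 4}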

Dataset limit dimension

My aim is to produce significant rules on diagnostic data.
I preprocessed my dataset in (non-sparse) ARFF format; I have 116000 instances and 28 attributes.
I apply the Apriori algorithm in Weka like this (using the Weka Explorer interface):
Apriori -N 20 -T 1 -C 0.8 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
The algorithm seems to take a long time to finish; it has now been running for more than three hours.
Is this normal? Is there a way to speed up the algorithm (preprocess the data in some other way, or choose other parameters for Apriori)? Or is Weka not the right tool for data of this size?
You could subsample or partition your dataset, run the Apriori algorithm on every partition (or only some of them), and then combine the obtained rules.
Some algorithms can take a long time to finish, proportional to several factors (number of instances, number of attributes, types of attributes), depending on the algorithm's spatial and temporal computational complexity. Weka is not particularly fast, partly because it is written in Java, which is not as fast as some other compiled languages.
Sometimes it is faster to run an algorithm several times on much smaller partitions of your dataset, because of the computational complexity mentioned above.
For example, if your algorithm takes time proportional to the square of the number of instances, cN^2, then running it 10 times on partitions 10 times smaller takes 10 * c(N/10)^2 = 0.1cN^2.
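Here is a minimal sketch of that partitioning idea, where mine_rules is a hypothetical stand-in for whatever miner you actually call (e.g. a wrapper around Weka's Apriori):

def mine_partitioned(instances, n_parts, mine_rules):
    # Split the dataset into n_parts chunks, mine each chunk separately,
    # and union the resulting rules. Note that rules found on a partition
    # are only locally frequent, so they should be re-checked (support
    # and confidence recomputed) against the full dataset.
    step = (len(instances) + n_parts - 1) // n_parts  # ceiling division
    rules = set()
    for start in range(0, len(instances), step):
        rules |= set(mine_rules(instances[start:start + step]))
    return rules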
Hope to have helped.
Weka, like many other data mining libraries, only offers the two most famous algorithms: Apriori and FPGrowth. Apriori is an old algorithm that is well known for being inefficient. Moreover, the Weka implementations of both Apriori and FPGrowth are slow.
If you want better Java implementations and more algorithms, you can check the SPMF open-source data mining library (I'm the founder), which offers the largest collection of pattern mining algorithms (more than 110 algorithms). For itemset mining, it offers Apriori and FPGrowth but also many other algorithms such as Eclat (2000), H-Mine (2005), LCM (the fastest at the FIMI 2004 competition), and some newer ones such as FIN (2014), PrePost (2014) and PrePost+ (2015), which can be faster than previous algorithms. Besides, it also offers many variations of these algorithms, such as for mining rare itemsets, correlated itemsets, high-utility itemsets, itemsets in uncertain data, association rules, closed patterns, sequential patterns, sequential rules, etc.
There are some performance evaluations on the website showing that the SPMF implementations are much faster than those of Weka for Apriori/FPGrowth.

Common patterns in a database

I need to find common patterns in a database of sequences of events. So I have considered the longest common substring problem and its Python implementation while searching for a solution.
Note that I am not searching for the longest common substring only: I accept shorter common substrings appearing frequently in the database.
Can you suggest some algorithm, implementation tricks or general advice about this problem?
The previous answer suggested Apriori. But Apriori is inappropriate if you want to find frequent sequences, because Apriori does not take the temporal ordering into account (also, Apriori is an inefficient algorithm).
If you want to find subsequences that are common to several sequences, it would be more appropriate to use a sequential pattern mining algorithm such as PrefixSpan and SPAM.
If you want to make some predictions, another option would also be to use a sequential rule mining algorithm.
I have open-source Java implementations of sequential pattern mining and sequential rule mining algorithms that you can download from my website: http://www.philippe-fournier-viger.com/spmf/
I don't think that you could process 8 GB of data in one shot with these algorithms. But it could be a starting point. Actually, some of these algorithms could be adapted for the case of very large databases by implementing a disk-based strategy.
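To make the task concrete, here is a naive baseline sketch (toy data; not PrefixSpan or SPAM, and far too slow for 8 GB) that counts ordered event pairs meeting a minimum support:

from itertools import combinations
from collections import Counter

def frequent_pairs(sequences, minsup):
    # Count ordered pairs (a, b) such that a occurs before b in a
    # sequence, and keep those appearing in at least minsup sequences.
    counts = Counter()
    for seq in sequences:
        seen = set()
        for i, j in combinations(range(len(seq)), 2):
            seen.add((seq[i], seq[j]))
        counts.update(seen)  # each pair counted once per sequence
    return {pair: c for pair, c in counts.items() if c >= minsup}

events = [["login", "search", "buy"], ["login", "buy"], ["search", "login", "buy"]]
print(frequent_pairs(events, minsup=2))
# {('login', 'buy'): 3, ('search', 'buy'): 2} (order may vary)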
Have you considered Frequent Itemset Mining methods such as Apriori?

What is the difference between Association rule mining & frequent itemset mining

I am new to data mining and confused about association rules and frequent itemset mining. To me they seem to be the same, but I need views from experts on this forum.
My question is
what is the difference between Association rule mining & frequent itemset mining?
Thanks
An association rule is something like "A,B → C", meaning that C tends to occur when A and B occur. An itemset is just a collection such as "A,B,C", and it is frequent if its items tend to co-occur. The usual way to look for association rules is to find all frequent itemsets and then postprocess them into rules.
The input of frequent itemset mining is:
a transaction database
a minimum support threshold minsup
The output is:
the set of all itemsets appearing in at least minsup transactions. An itemset is just an unordered set of items.
The input of association rule mining is:
a transaction database
a minimum support threshold minsup
a minimum confidence threshold minconf
The output is:
the set of all valid association rules. An association rule X --> Y is a relationship between two itemsets X and Y such that X and Y are disjoint and non-empty. A valid rule is a rule having a support higher than or equal to minsup and a confidence higher than or equal to minconf. The support is defined as sup(X --> Y) = sup(X U Y) / (number of transactions). The confidence is defined as conf(X --> Y) = sup(X U Y) / sup(X).
Now, the relationship between itemset mining and association rule mining is that it is very efficient to use the frequent itemsets to generate rules (see the paper by Agrawal et al., 1993, for more details about this idea). So association rule mining is broken down into two steps:
- mining frequent itemsets
- generating all valid association rules by using the frequent itemsets.
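To make these definitions concrete, here is a small sketch with toy transactions computing the support and confidence of one candidate rule:

def count(itemset, transactions):
    # number of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions)

# toy transaction database
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

X, Y = {"a"}, {"b"}
sup_rule = count(X | Y, db) / len(db)        # sup(X --> Y) = 3/5 = 0.6
conf_rule = count(X | Y, db) / count(X, db)  # conf(X --> Y) = 3/4 = 0.75
print(sup_rule, conf_rule)  # 0.6 0.75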
Frequent itemset mining is the first step of association rule mining.
Once you have generated all the frequent itemsets, you iterate over them one by one, enumerate all the possible association rules, calculate their confidence, and finally, if the confidence meets the minconf threshold, output that rule.
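Here is a sketch of that enumeration step, assuming a hypothetical support_of table filled in by the mining phase (supports given as transaction counts):

from itertools import combinations

def rules_from_itemset(itemset, support_of, minconf):
    # Enumerate candidate rules X --> Y with X U Y equal to the given
    # frequent itemset, keeping those whose confidence reaches minconf.
    items = frozenset(itemset)
    rules = []
    for r in range(1, len(items)):
        for antecedent in map(frozenset, combinations(sorted(items), r)):
            conf = support_of[items] / support_of[antecedent]
            if conf >= minconf:
                rules.append((set(antecedent), set(items - antecedent), conf))
    return rules

# hypothetical support counts produced by a prior mining step
sup = {frozenset("ab"): 3, frozenset("a"): 4, frozenset("b"): 4}
print(rules_from_itemset("ab", sup, minconf=0.7))
# [({'a'}, {'b'}, 0.75), ({'b'}, {'a'}, 0.75)]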
Frequent itemset mining is a step of association rule mining. After applying a frequent itemset mining algorithm such as Apriori or FPGrowth to the data, you get frequent itemsets. From these discovered frequent itemsets, you generate association rules (usually by subset generation).
Association rule mining also yields the frequent itemsets present in the given dataset. There are different types of algorithms for mining frequent itemsets, and they work on different data layouts: either the horizontal or the vertical format. The Apriori algorithm follows the horizontal format for mining frequent itemsets, while the Eclat algorithm follows the vertical format.
Association rule mining:
Association rule mining is used to find patterns in data. It finds features that occur together and are correlated.
Example:
For example, people who buy diapers are likely to buy baby powder. Or we can rephrase the statement by saying: if (people buy diapers), then (they buy baby powder). Note the if-then rule. This does not necessarily mean that if people buy baby powder, they buy diapers. In general, if A tends to imply B, it does not necessarily mean that B tends to imply A.
Frequent itemset mining:
Frequent itemset mining is used to find the common itemsets in data. Association rules can then be generated from the given transactional dataset.
Example:
If two items X and Y are frequently purchased together, then it is good to place them together in stores, or to offer a discount on one item with the purchase of the other. This can really increase sales. For example, it is likely that if a customer buys milk and bread, he/she also buys butter.
So the association rule is {milk, bread} => {butter}, and the seller can suggest that a customer who buys milk and bread also buy butter.