How to apply a maximum support threshold in the Apriori algorithm? - data-mining

I have an assignment question:
Suppose you are given a task of finding all the association rules in a database whose
supports are between 20% and 80%, and the accuracy of the rules should be above 70%. Change the algorithm scheme of Apriori to find all the rules satisfying the above requirement.
I know that to get rules with minimum support of 20%, I remove itemsets with less than 20% support from candidate sets. But I am confused about how to apply a maximum threshold on support.
I can't simply remove the itemsets with support greater than 80%, because pruning them would also eliminate their supersets, some of which may have support within the allowed range at later steps.
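A minimal sketch of the scheme this reasoning implies, in plain Python (all names are illustrative): prune candidates with the minimum threshold only, since support is anti-monotone, and apply the maximum threshold as a post-filter once all frequent itemsets are known.

```python
def apriori_min_max(transactions, min_sup=0.2, max_sup=0.8):
    """Frequent itemsets with support in [min_sup, max_sup].

    Only min_sup prunes the search: support is anti-monotone, so every
    superset of an infrequent itemset is infrequent.  max_sup must NOT
    prune, because a superset of a too-frequent itemset can still fall
    inside the allowed range; it is applied as a final filter instead.
    """
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    frequent = list(level)

    k = 2
    while level:
        # Join step: build k-item candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = [c for c in candidates if support(c) >= min_sup]
        frequent.extend(level)
        k += 1

    # The maximum threshold is applied only here, at the very end.
    return [s for s in frequent if support(s) <= max_sup]

# Rules are then generated from these itemsets as usual, keeping only
# those whose confidence is at least 0.7 for the assignment.
```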


Dataset limit dimension

My aim is to produce significant rules on diagnostic data.
I preprocessed my dataset in ARFF non-sparse format; I have 116,000 instances and 28 attributes.
I apply the Apriori algorithm in Weka like this (using the Weka Explorer interface):
Apriori -N 20 -T 1 -C 0.8 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
The algorithm seems to take a long time to finish; it has now been running for more than three hours.
Is this normal? Is there a way to speed up the algorithm (preprocess the data in some other way, or choose other parameters for Apriori)? Or is Weka not the right tool for data of this size?
You could subsample or partition your dataset, run the Apriori algorithm on each partition (or only some of them), and then combine the obtained rules.
Some algorithms can take a long time to finish, proportional to several factors (number of instances, number of attributes, type of attributes), depending on the algorithm's spatial and temporal computational complexity. Weka is not particularly fast, apart from being written in Java, which is also not as fast as other compiled languages.
Sometimes it is faster to run an algorithm several times on much smaller partitions of your dataset, due to the aforementioned computational complexity.
For example, if your algorithm takes time proportional to the square of the number of instances, cN^2, it is faster to run it 10 times on partitions that are 10 times smaller: 10 · c(N/10)^2 = 0.1cN^2.
Hope this helps.
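A minimal sketch of the partition-and-combine idea in plain Python; mine_rules is a hypothetical placeholder for whatever miner you already call (Weka, SPMF, ...).

```python
import random

def partition_and_mine(transactions, mine_rules, n_parts=10, seed=0):
    """Shuffle, split into n_parts partitions, mine each, union the rules.

    mine_rules(partition) is a hypothetical placeholder for whatever
    association rule miner you already use."""
    rng = random.Random(seed)
    shuffled = list(transactions)
    rng.shuffle(shuffled)
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    rules = set()
    for part in parts:
        rules |= set(mine_rules(part))
    # Rules mined on a partition only reflect that partition's support;
    # re-check combined rules on the full dataset if you need exact
    # global support and confidence.
    return rules
```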
Weka, like many other data mining libraries, only offers the two most famous algorithms: Apriori and FPGrowth. Apriori is an old algorithm that is well known for being inefficient. Moreover, the Weka implementations of both Apriori and FPGrowth are slow.
If you want better Java implementations and more algorithms, you can check the SPMF open-source data mining library (I'm the founder), which offers the largest collection of pattern mining algorithms (more than 110 algorithms). For itemset mining, it offers Apriori and FPGrowth but also many other algorithms such as Eclat (2000), HMine (2005), LCM (the fastest at the FIMI 2004 competition), and some newer ones such as FIN (2014), PrePost (2014), and PrePost+ (2015), which can be faster than the previous algorithms. Besides, it also offers many variations of these algorithms, such as for mining rare itemsets, correlated itemsets, high-utility itemsets, itemsets in uncertain data, association rules, closed patterns, sequential patterns, sequential rules, etc.
There are some performance evaluations on the website showing that the SPMF implementations are much faster than those of Weka for Apriori/FPGrowth.

Feature importance based on extremely randomized trees and feature redundancy

I am using the Scikit-learn Extremely Randomized Trees algorithm to get information about the relative feature importances, and I have a question about how "redundant features" are ranked.
If I have two features that are identical (redundant) and important to the classification, the extremely randomized trees cannot detect the redundancy: both features get a high ranking. Is there any other way to detect that two features are actually redundant?
Maybe you could extract the top n most important features and then compute pairwise Spearman's or Pearson's correlations among them, detecting redundancy only for the top informative features, since it might not be feasible to compute all pairwise feature correlations (quadratic in the number of features).
There might be more clever ways to do the same by exploiting the statistics of how often the features occur as nodes in the decision trees, though.
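A minimal sketch of that idea using scikit-learn and SciPy; the duplicated feature, the choice n = 5, and the 0.95 cutoff are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Toy data with one exactly redundant (duplicated) feature.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = np.hstack([X, X[:, [0]]])          # feature 10 duplicates feature 0

forest = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by importance and keep only the top n for the pairwise check.
n = 5
top = np.argsort(forest.feature_importances_)[::-1][:n]
rho, _ = spearmanr(X[:, top])          # columns are variables -> n x n matrix
for i in range(n):
    for j in range(i + 1, n):
        if abs(rho[i, j]) > 0.95:      # arbitrary redundancy cutoff
            print(f"features {top[i]} and {top[j]} look redundant "
                  f"(|rho| = {abs(rho[i, j]):.2f})")
```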

appropriate minimum support for itemset?

Please suggest any material about choosing an appropriate minimum support and confidence for itemsets!
I use the Apriori algorithm to search for frequent itemsets, but I still don't know what support and confidence values are appropriate. I would like to know what kinds of considerations go into deciding how big the support should be.
The answer is that the appropriate values depend on the data.
For some datasets, the best value may be 0.5, but for others it may be 0.05; it really depends on the data.
But if you set minsup = 0 and minconf = 0, some algorithms will run out of memory before terminating, or you may run out of disk space because there are too many patterns.
From my experience, the best way to choose minsup and minconf is to start with high values and then lower them gradually until you find enough patterns.
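That search is easy to automate; here is a minimal sketch in plain Python, where mine is a hypothetical placeholder for your frequent-pattern miner and all thresholds are illustrative defaults.

```python
def find_enough_patterns(mine, transactions, target=1000,
                         start=0.9, factor=0.5, floor=0.01):
    """Start with a high minsup and lower it until enough patterns appear.

    mine(transactions, minsup) is a hypothetical placeholder for your
    frequent-pattern miner; target, start, factor and floor are all
    illustrative defaults."""
    minsup = start
    while minsup > floor:
        patterns = mine(transactions, minsup)
        if len(patterns) >= target:
            return minsup, patterns
        minsup *= factor  # lower the threshold gradually and retry
    return floor, mine(transactions, floor)
```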
Alternatively, if you don't want to set minsup at all, you can use a top-k algorithm where, instead of specifying minsup, you specify that you want the k most frequent rules, for example k = 1000 rules.
If you are interested in top-k association rule mining, you can check my Java code here:
http://www.philippe-fournier-viger.com/spmf/
The algorithm is called TopKRules and the article describing it will be published next month.
Besides that, you need to know that there are many other interestingness measures besides support and confidence: lift, all-confidence, etc. To learn more about this, you can read the articles "On selecting interestingness measures for association rules" and "A Survey of Interestingness Measures for Association Rules". Basically, all measures have problems in some cases; no measure is perfect.
Hope this helps!
In any association rule mining algorithm, including Apriori, it is up to the user to decide what support and confidence values they want to provide. Depending on your dataset and your objectives, you decide the minSup and minConf.
Obviously, if you set these values lower, the algorithm will take longer to execute and you will get a lot of results.
The minimum support and minimum confidence parameters are a user preference. If you want a larger quantity of results (with lower statistical confidence), choose the parameters accordingly. In theory you can set them to 0: the algorithm will run, but it will take a long time, and the result will not be particularly useful, as it will contain just about anything.
So choose them so that the results suit your needs. Mathematically, any value is "correct".

Building an Intrusion Detection System using fuzzy logic

I want to develop an Intrusion Detection System (IDS) that might be used with one of the KDD datasets. In the present case, my dataset has 42 attributes and more than 4,000,000 rows of data.
I am trying to build my IDS using fuzzy association rules, hence my question: What is actually considered as the best tool for fuzzy logic in this context?
Fuzzy association rule algorithms are often extensions of normal association rule algorithms like Apriori and FP-growth that model uncertainty using fuzzy membership degrees. I thus assume that your data consists of fairly uncertain measurements, and that you therefore want to group the measurements into more general ranges such as 'low'/'medium'/'high'. From there on, you can use any normal association rule algorithm to find the rules for your IDS (I'd suggest FP-growth, as it has lower complexity than Apriori for large datasets).
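A minimal sketch of that preprocessing step in plain Python; the triangular membership functions and breakpoints are illustrative assumptions, not a prescribed design.

```python
def triangular(x, a, b, c):
    """Triangular fuzzy membership: rises from a, peaks at b, falls to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(value, lo, hi):
    """Map one numeric measurement to membership degrees for three
    illustrative labels; tune the breakpoints to each attribute."""
    mid = (lo + hi) / 2.0
    return {
        "low":    triangular(value, lo - (mid - lo), lo, mid),
        "medium": triangular(value, lo, mid, hi),
        "high":   triangular(value, mid, hi, hi + (hi - mid)),
    }

# e.g. fuzzify(0.7, lo=0.0, hi=1.0) -> {'low': 0.0, 'medium': 0.6, 'high': 0.4}
# Keep the label(s) with a high enough degree as items, then run a normal
# association rule miner such as FP-growth on the resulting transactions.
```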

what is the difference between Association rule mining & frequent itemset mining

I am new to data mining and confused about association rules and frequent itemset mining. To me, both seem the same, but I need views from the experts on this forum.
My question is:
what is the difference between Association rule mining & frequent itemset mining?
Thanks
An association rule is something like "A,B → C", meaning that C tends to occur when A and B occur. An itemset is just a collection such as "A,B,C", and it is frequent if its items tend to co-occur. The usual way to look for association rules is to find all frequent itemsets and then postprocess them into rules.
The input of frequent itemset mining is:
- a transaction database
- a minimum support threshold minsup
The output is:
the set of all itemsets appearing in at least minsup transactions. An itemset is just an unordered set of items.
The input of association rule mining is:
- a transaction database
- a minimum support threshold minsup
- a minimum confidence threshold minconf
The output is:
the set of all valid association rules. An association rule X --> Y is a relationship between two itemsets X and Y such that X and Y are disjoint and non-empty. A valid rule is a rule having a support greater than or equal to minsup and a confidence greater than or equal to minconf. The support is defined as sup(X --> Y) = sup(X U Y) / (number of transactions). The confidence is defined as conf(X --> Y) = sup(X U Y) / sup(X).
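A quick worked example with made-up numbers: in a database of 100 transactions, if {milk, bread} appears in 30 transactions and {milk} appears in 40, then for the rule milk --> bread we get sup = 30/100 = 30% and conf = 30/40 = 75%, so the rule is valid for, say, minsup = 20% and minconf = 70%.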
Now, the relationship between itemset mining and association rule mining is that it is very efficient to use the frequent itemsets to generate rules (see the paper by Agrawal 1993 for more details about this idea). So association rule mining is broken down into two steps:
- mining frequent itemsets
- generating all valid association rules by using the frequent itemsets.
Frequent itemset mining is the first step of association rule mining.
Once you have generated all the frequent itemsets, you iterate over them one by one, enumerate all the possible association rules, and calculate their confidence; finally, if the confidence is above minConfidence, you output that rule. A sketch of this generation loop follows.
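A minimal sketch of that loop in plain Python, assuming supports maps every frequent itemset (and all of its subsets) to its support, which frequent itemset mining guarantees are available.

```python
from itertools import combinations

def generate_rules(frequent, supports, min_conf=0.7):
    """Step two of association rule mining: turn frequent itemsets into
    rules X --> Y with conf(X --> Y) = sup(X U Y) / sup(X) >= min_conf.

    frequent: iterable of frozensets; supports: dict mapping each
    frequent itemset (and its subsets) to its support."""
    rules = []
    for itemset in frequent:
        if len(itemset) < 2:
            continue  # a rule needs a non-empty X and a non-empty Y
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                rhs = itemset - lhs
                conf = supports[itemset] / supports[lhs]
                if conf >= min_conf:
                    rules.append((lhs, rhs, supports[itemset], conf))
    return rules
```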
Frequent itemset mining is one step of association rule mining. After applying a frequent itemset mining algorithm such as Apriori or FPGrowth to the data, you get frequent itemsets. From these discovered frequent itemsets, you generate association rules (usually done by subset generation).
Association rule mining yields the frequent itemsets present in the given dataset. There are different types of algorithms for mining frequent itemsets, and they work in different ways, using either a horizontal or a vertical data format: the Apriori algorithm follows the horizontal format for mining frequent itemsets, while the Eclat algorithm follows the vertical format.
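To illustrate the vertical format that Eclat uses, here is a minimal sketch in plain Python (names are illustrative): each item maps to the set of transaction ids containing it, and an itemset's support is the size of the intersection of those sets.

```python
def build_tidsets(transactions):
    """Vertical format: map each item to the set of transaction ids
    (tids) that contain it."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

# Support of an itemset = size of the intersection of its tidsets:
#   sup({A, B}) = len(tidsets['A'] & tidsets['B']) / len(transactions)
# This intersection trick is what Eclat exploits, instead of rescanning
# the horizontal (one row per transaction) representation as Apriori does.
```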
Association Rule mining:
Association rule mining is used to find patterns in data: it finds features that occur together and are correlated.
Example:
For example, people who buy diapers are likely to buy baby powder. Or we can rephrase the statement by saying: if (people buy diapers), then (they buy baby powder). Note the if-then rule. This does not necessarily mean that if people buy baby powder, they buy diapers. In general, if A tends to imply B, it does not necessarily mean that B tends to imply A.
Frequent item set mining:
Frequent itemset mining is used to find the common itemsets in data; from these, association rules can be generated for the given transactional dataset.
Example:
If there are two items X and Y that are purchased frequently, then it's good to put them together in stores, or to offer a discount on one item with the purchase of the other. This can really increase sales. For example, one is likely to find that if a customer buys milk and bread, he/she also buys butter.
So the association rule is ['milk'] ^ ['bread'] => ['butter'], and the seller can suggest that the customer buy butter if he/she buys milk and bread.