What is the strong association rule if only candidate 1-itemsets remain after running the Apriori algorithm? - data-mining

When I work through an exercise, only candidate 1-itemsets remain after I remove the itemsets with low support. In that case, can I take something like Pencil->Pencil as the strong association rule and write it as the answer?

Related

How to apply a maximum support threshold in the Apriori algorithm?

I have an assignment question:
Suppose you are given a task of finding all the association rules in a database whose supports are between 20% and 80%, and the accuracy of the rules should be above 70%. Change the algorithm scheme of Apriori to find all the rules satisfying the above requirement.
I know that to get rules with minimum support of 20%, I remove itemsets with less than 20% support from candidate sets. But I am confused about how to apply a maximum threshold on support.
I can't simply remove the itemsets with support greater than 80%, because that would also eliminate their supersets, whose support may fall within the allowed range at later steps.
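One possible way to handle this, sketched here under assumptions rather than given as the expected answer: keep the usual minimum-support pruning during candidate generation (it is safe because support never increases when an itemset grows), never prune on the maximum, and apply the 80% bound only when rules are emitted. The sketch assumes that "accuracy" means confidence and that the support of a rule X -> Y is the support of X U Y; the function names `frequent_itemsets` and `rules_in_band` are made up for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Level-wise (Apriori-style) search that prunes only on the minimum support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    supports = {}
    while level:
        # Relative support of every candidate at this level.
        counted = {c: sum(c <= t for t in transactions) / n for c in level}
        kept = {c: s for c, s in counted.items() if s >= min_sup}
        supports.update(kept)
        # Join step: merge kept k-itemsets that differ by exactly one item.
        level = list({a | b for a, b in combinations(kept, 2)
                      if len(a | b) == len(a) + 1})
    return supports

def rules_in_band(transactions, min_sup, max_sup, min_conf):
    """Rules whose support lies in [min_sup, max_sup] and whose confidence exceeds min_conf."""
    supports = frequent_itemsets(transactions, min_sup)
    rules = []
    for itemset, sup in supports.items():
        # The maximum threshold is applied here, not during candidate pruning.
        if len(itemset) < 2 or sup > max_sup:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                conf = sup / supports[lhs]
                if conf > min_conf:
                    rules.append((set(lhs), set(itemset - lhs), sup, conf))
    return rules

# Toy data: item "a" occurs in 100% of the transactions (above the 80% bound),
# yet rules that contain "a" can still have a support inside the band.
transactions = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"a"}, {"a", "c"}]
for lhs, rhs, sup, conf in rules_in_band(transactions, 0.2, 0.8, 0.7):
    print(sorted(lhs), "->", sorted(rhs), sup, conf)
```

If the assignment intends the bound to apply to the antecedent or to individual items instead, the filter in `rules_in_band` would need to change accordingly.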

Job scheduling to minimise loss

I have a job scheduling problem. For each job we are given a start time, the time needed to complete it, and a deadline.
It is given that start time + time to complete <= deadline.
I am also given the loss that occurs if a job is not completed before its deadline. I have to design an algorithm that minimizes the total loss.
I have tried adapting the standard dynamic programming algorithm for maximizing the profit in job scheduling, but without success.
What algorithm can I use to solve the question?
Dynamic Programming isn't the right approach based on what you're aiming to optimize. You can find the optimized schedule by using a greedy approach.
Here's a thorough guide with sample code in your desired language (C++). The guide assumes that each job takes only 1 unit of time, which you can modify by using time_to_complete instead.
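For reference, here is a minimal sketch of the classic greedy for the unit-time case, written in Python rather than C++ for brevity. It assumes every job takes exactly one time unit and is available from time 0, so the start times from the question are ignored; the function name `schedule_unit_jobs` and the sample jobs are made up for illustration. With general processing times this greedy is not guaranteed to be optimal.

```python
def schedule_unit_jobs(jobs):
    """jobs: list of (name, deadline, loss); every job takes one time unit.

    Returns (slots, total_loss), where slots[t] is the job run in [t, t + 1).
    """
    max_deadline = max(d for _, d, _ in jobs)
    slots = [None] * max_deadline
    total_loss = 0
    # Consider the jobs that are most expensive to miss first.
    for name, deadline, loss in sorted(jobs, key=lambda j: j[2], reverse=True):
        # Try the latest free slot that still finishes by the deadline.
        for t in range(deadline - 1, -1, -1):
            if slots[t] is None:
                slots[t] = name
                break
        else:
            total_loss += loss  # no free slot left: the job misses its deadline
    return slots, total_loss

jobs = [("a", 2, 100), ("b", 1, 19), ("c", 2, 27), ("d", 1, 25), ("e", 3, 15)]
slots, total_loss = schedule_unit_jobs(jobs)
print(slots, total_loss)
```

Placing each job as late as possible keeps the earlier slots free for jobs with tighter deadlines.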
Your problem is similar to the knapsack one. Using a greedy approach is convenient if you aren't actually looking for the best possible solution, but just a "good enough" one.
The big advantage of the greedy approach is that its cost is considerably lower than that of more thorough approaches. If you need the best solution to your problem, though, I would say that backtracking is the way to go.
Since the deadline can be violated, the problem looks like a Total Weighted Tardiness Scheduling Problem. There are many flavors of it, but most problems under this umbrella are computationally hard, so Dynamic Programming (DP) would not be my first choice. In my experience, DP also poses difficulties during modeling and implementation. The same goes for mathematical programming "as-is". Some approaches that can be implemented more quickly are:
- constraint programming: a very short learning curve, and there are many libraries out there, including very good open source ones (most have a C++ API). Bonus: constraint programming can demonstrate optimality.
- ad hoc heuristics: (1) start with a constructive algorithm (like the greedy approach suggested by Ling Zhong and Flavio Giobergia), then (2) use some local search approach to improve it, and finally (3) embed the approach into a metaheuristic scheme. This way you can build on top of the previous step and learn a lot about the problem. Note: in general, heuristics cannot demonstrate optimality.
- special mention: local solver, a hybrid approach between the two above: it lets you model the problem using a formalism similar to constraint programming and then solves it using heuristics. It is very easy to learn, it usually lets you get started quickly and, in my tests, it provides remarkably good results.

Common patterns in a database

I need to find common patterns in a database of sequences of events. While searching for a solution, I have considered the longest common substring problem and its Python implementation.
Note that I am not searching for the longest common substring only: I accept shorter common substrings appearing frequently in the database.
Can you suggest some algorithm, implementation tricks or general advice about this problem?
The previous answer suggested Apriori. But Apriori is inappropriate if you want to find frequent sequences, because it does not take the order of events (time) into account (also, Apriori is an inefficient algorithm).
If you want to find subsequences that are common to several sequences, it would be more appropriate to use a sequential pattern mining algorithm such as PrefixSpan or SPAM.
If you want to make some predictions, another option would also be to use a sequential rule mining algorithm.
I have open-source Java implementations of sequential pattern mining and sequential rule mining algorithms that you can download from my website: http://www.philippe-fournier-viger.com/spmf/
I don't think that you could process 8 GB of data in one shot with these algorithms. But it could be a starting point. Actually, some of these algorithms could be adapted for the case of very large databases by implementing a disk-based strategy.
Have you considered Frequent Itemset Mining methods such as Apriori?

Building an Intrusion Detection System using fuzzy logic

I want to develop an Intrusion Detection System (IDS) that might be used with one of the KDD datasets. In the present case, my dataset has 42 attributes and more than 4,000,000 rows of data.
I am trying to build my IDS using fuzzy association rules, hence my question: what is currently considered the best tool for fuzzy logic in this context?
Fuzzy association rule algorithms are often extensions of ordinary association rule algorithms such as Apriori and FP-growth that model uncertainty using probability ranges. I therefore assume that your data consists of fairly uncertain measurements and that you want to group them into more general ranges such as 'low'/'medium'/'high'. From there you can use any ordinary association rule algorithm to find the rules for your IDS (I'd suggest FP-growth, as it has lower complexity than Apriori on large data sets).
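As a rough illustration of the grouping step, the sketch below turns one numeric record into a set of 'attribute=label' items that an ordinary association rule miner could consume. It uses crisp cut points rather than true fuzzy membership functions, and the attribute names, the cut points and the function name `to_items` are all hypothetical; real thresholds would have to come from the data.

```python
def to_items(record, cut_points):
    """Turn one numeric record into a set of 'attribute=label' items.

    cut_points maps each attribute to (low_max, medium_max): values up to
    low_max are 'low', values up to medium_max are 'medium', the rest 'high'.
    """
    items = set()
    for attr, value in record.items():
        low_max, medium_max = cut_points[attr]
        if value <= low_max:
            label = "low"
        elif value <= medium_max:
            label = "medium"
        else:
            label = "high"
        items.add(f"{attr}={label}")
    return items

# Hypothetical attributes and thresholds, for illustration only.
cut_points = {"duration": (1, 60), "src_bytes": (100, 1000)}
print(to_items({"duration": 0, "src_bytes": 5000}, cut_points))
```

Each converted record then plays the role of one transaction for FP-growth or Apriori.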

What is the difference between association rule mining and frequent itemset mining?

I am new to data mining and confused about association rules and frequent itemset mining. To me they seem to be the same thing, but I would like the views of the experts on this forum.
My question is:
What is the difference between association rule mining and frequent itemset mining?
Thanks
An association rule is something like "A,B → C", meaning that C tends to occur when A and B occur. An itemset is just a collection such as "A,B,C", and it is frequent if its items tend to co-occur. The usual way to look for association rules is to find all frequent itemsets and then postprocess them into rules.
The input of frequent itemset mining is:
- a transaction database
- a minimum support threshold minsup
The output is:
- the set of all itemsets appearing in at least minsup transactions. An itemset is just a set of items that is unordered.
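To make the input and output concrete, here is a small brute-force sketch (not Apriori, and only practical for tiny data) that counts every itemset and keeps those appearing in at least minsup transactions. The function name `frequent_itemsets` and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Return {itemset: count} for every itemset found in >= minsup transactions.

    Brute force: enumerates every non-empty subset of every transaction.
    """
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for subset in combinations(sorted(t), k):
                key = frozenset(subset)
                counts[key] = counts.get(key, 0) + 1
    return {s: c for s, c in counts.items() if c >= minsup}

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk"},
    {"bread", "butter"},
]
for itemset, count in frequent_itemsets(transactions, minsup=2).items():
    print(sorted(itemset), count)
```

Real algorithms such as Apriori or FP-growth produce the same output while avoiding the enumeration of all subsets.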
The input of association rule mining is:
- a transaction database
- a minimum support threshold minsup
- a minimum confidence threshold minconf
The output is:
- the set of all valid association rules. An association rule X-->Y is a relationship between two itemsets X and Y such that X and Y are disjoint and non-empty. A valid rule is a rule having a support higher than or equal to minsup and a confidence higher than or equal to minconf. The support is defined as sup(X-->Y) = sup(X U Y) / (number of transactions), where sup(X U Y) on the right-hand side is the number of transactions containing X U Y. The confidence is defined as conf(X-->Y) = sup(X U Y) / sup(X).
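The two definitions can be checked on a toy example. The sketch below is only an illustration: the helper names `support` and `confidence` and the four transactions are made up, and support is computed as a fraction of the transactions.

```python
def support(itemset, transactions):
    """Fraction of the transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """conf(X --> Y) = sup(X U Y) / sup(X)."""
    return support(x | y, transactions) / support(x, transactions)

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk"},
    {"bread", "butter"},
]
x, y = {"milk", "bread"}, {"butter"}
print(support(x | y, transactions))   # 1 of 4 transactions contains all three items
print(confidence(x, y, transactions)) # 1 of the 2 transactions with milk and bread has butter
```

Because the number of transactions cancels out in the ratio, the confidence is the same whether support is measured as a count or as a fraction.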
Now the relationship between frequent itemset mining and association rule mining is that it is very efficient to use the frequent itemsets to generate rules (see the paper by Agrawal 1993 for more details about this idea). So association rule mining is broken down into two steps:
- mining frequent itemsets
- generating all valid association rules by using the frequent itemsets.
Frequent itemset mining is the first step of Association rule mining.
Once you have generated all the frequent itemsets, you iterate over them one by one, enumerate all the possible association rules for each, and calculate their confidence. Finally, if the confidence is > minConfidence, you output that rule.
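A minimal sketch of that second step, assuming the frequent itemsets are already available as a dictionary from itemset to support count (the name `generate_rules` and the sample counts are made up for illustration):

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """freq maps frozenset -> support count; returns (X, Y, confidence) triples."""
    rules = []
    for itemset, count in freq.items():
        # Split the itemset into every non-empty antecedent X and consequent Y.
        for k in range(1, len(itemset)):
            for x in map(frozenset, combinations(sorted(itemset), k)):
                conf = count / freq[x]  # sup(X U Y) / sup(X)
                if conf > min_conf:
                    rules.append((x, itemset - x, conf))
    return rules

freq = {
    frozenset({"milk"}): 3,
    frozenset({"bread"}): 3,
    frozenset({"butter"}): 2,
    frozenset({"milk", "bread"}): 2,
    frozenset({"bread", "butter"}): 2,
}
for x, y, conf in generate_rules(freq, min_conf=0.7):
    print(sorted(x), "->", sorted(y), conf)
```

Every subset of a frequent itemset is itself frequent, so the lookup `freq[x]` always succeeds when `freq` is complete. Itemsets with a single item produce no rules, since they cannot be split into two non-empty parts.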
Frequent itemset mining is a step of association rule mining. After applying a frequent itemset mining algorithm such as Apriori or FP-Growth to the data, you get the frequent itemsets. From these discovered frequent itemsets, you generate the association rules (usually done by subset generation).
Association rule mining gives us the frequent itemsets that are present in the given dataset. There are different algorithms for mining the frequent itemsets, and they work on different data formats, either horizontal or vertical. The Apriori algorithm uses the horizontal format for mining the frequent itemsets, and the Eclat algorithm uses the vertical format.
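To illustrate the two layouts: in the horizontal format each transaction lists its items, while in the vertical format each item lists the ids of the transactions that contain it. The sketch below converts one into the other and counts support by intersecting id sets, which is the basic idea behind Eclat. The function names and toy data are assumptions, not part of any library.

```python
def to_vertical(transactions):
    """Convert horizontal data (tid -> items) into vertical tidsets (item -> tids)."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

def support_count(itemset, tidsets):
    """In the vertical layout, support is the size of the tidset intersection."""
    return len(set.intersection(*(tidsets[i] for i in itemset)))

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk"},
    {"bread", "butter"},
]
tidsets = to_vertical(transactions)
print(sorted(tidsets["milk"]))
print(support_count({"milk", "bread"}, tidsets))
```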
Association Rule mining:
Association rule mining is used to find patterns in data. It finds features that occur together and are correlated.
Example:
People who buy diapers are likely to buy baby powder. We can rephrase the statement by saying: if (people buy diapers), then (they buy baby powder). Note the if-then form of the rule. This does not necessarily mean that if people buy baby powder, they buy diapers. In general, if A tends to imply B, it does not necessarily mean that B tends to imply A.
Frequent item set mining:
Frequent item set mining is used to find the common item sets in data. Association rules can then be generated from the item sets found in the given transactional datasets.
Example:
If two items X and Y are frequently purchased together, then it is a good idea to place them together in stores or to offer a discount on one item with the purchase of the other. This can really increase sales. For example, we are likely to find that a customer who buys milk and bread also buys butter.
So the association rule is [‘milk’]^[‘bread’]=>[‘butter’], and the seller can suggest butter to a customer who buys milk and bread.