Is the Set of an Association Rule's Elements a Frequent Itemset? - data-mining

While generating the association rules for the frequent itemset, is it necessary to maintain the cardinality of the frequent itemset? For example: if the frequent itemset is {a,b,c,d,e}, should the rules for X->Y be generated in such a way that |X| + |Y| = |frequent itemset|?

The set of an association rule's elements is automatically a frequent itemset itself.
An association rule is obtained from a frequent itemset. Every item of a frequent itemset is a frequent item, and every combination of a frequent itemset's items is again a frequent itemset. Otherwise, the itemset containing them could not be frequent.
An association rule consists of a frequent itemset's items and/or combinations of those items. Therefore, the elements of an association rule are frequent items and/or frequent combinations of the frequent itemset's items.
Hence, the set of an association rule's elements is itself a frequent itemset: it is extracted from a frequent itemset, and all items and combinations of items of a frequent itemset are themselves frequent items or frequent itemsets.
E.g., every item a, b, c, d, e of the frequent itemset {a, b, c, d, e} is a frequent item. Every combination {a, b}, {a, c}, {a, d}, ..., {d, e} of items of the frequent itemset {a, b, c, d, e} is a frequent itemset. The set of any association rule's elements is a frequent itemset. Therefore, the set {a, b, c, d} of the association rule {a, b} => {c, d} extracted from the frequent itemset {a, b, c, d, e} is a frequent itemset, so |X| + |Y| does not have to equal the size of the itemset you started from.
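The argument above can be checked on a toy data set. The following is only a sketch: the five transactions, the minimum count of 2, the confidence threshold, and the helper names `count`, `nonempty_subsets`, and `rules_from` are all assumptions made for illustration, not part of any particular library.

```python
from fractions import Fraction
from itertools import combinations

def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions)

def nonempty_subsets(itemset):
    """Yield every non-empty subset of `itemset` as a frozenset."""
    items = sorted(itemset)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def rules_from(itemset, transactions, min_conf):
    """Yield (X, Y, confidence) with X, Y non-empty, disjoint, X | Y == itemset.

    Assumes `itemset` occurs in at least one transaction.
    """
    items = frozenset(itemset)
    whole = count(items, transactions)
    for x in nonempty_subsets(items):
        if x != items:
            conf = Fraction(whole, count(x, transactions))
            if conf >= min_conf:
                yield x, items - x, conf

# Hypothetical transactions; {a, b, c, d, e} occurs twice, so it is frequent
# for a minimum count of 2.
transactions = [frozenset(t) for t in ("abcde", "abcde", "abcd", "abe", "cd")]
min_count = 2

# Every non-empty subset of a frequent itemset is frequent as well.
all_subsets_frequent = all(
    count(s, transactions) >= min_count for s in nonempty_subsets("abcde")
)

# Rules can therefore be generated from the smaller frequent itemset {a, b, c, d}.
abcd_rules = {
    (x, y): conf for x, y, conf in rules_from("abcd", transactions, Fraction(7, 10))
}
print(all_subsets_frequent)
print(abcd_rules[(frozenset("ab"), frozenset("cd"))])
```

Within one call to `rules_from`, X and Y always partition the itemset passed in, but that itemset may be any frequent subset of {a, b, c, d, e}.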

Related

DAG and Top Sort

"Arranging the vertices of a DAG according to increasing pre-number results in a topological sort." is not a true statement apparently, but I'm not seeing why it isn't. If the graph is directed and doesn't have cycles, then shouldn't the order in which we visit the vertices necessarily be the correct order in which we sort it topologically?
Arranging by increasing pre-number does not guarantee a valid topological sort. Consider this graph, with edges A → C, B → C, and C → D:
    A
    ↓
B → C → D
The two valid topological orders of this graph are:
A, B, C, D
B, A, C, D
If you were to visit the nodes beginning with C, one possible pre-number order would be:
C, D, A, B
That is not a valid topological order. An even simpler example is this graph:
B → A
There is exactly one valid topological order (B, A), but if we were to visit A first and sort by pre-number, the resulting order (A, B) would be backwards.
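The first counterexample can be reproduced with a short depth-first search. This is a sketch, assuming the graph has edges A → C, B → C, and C → D and that the search is started from C; the function names `dfs_orders` and `is_topological` are made up for illustration. It also shows the usual fix: sorting by decreasing post-number (reverse finishing order) does give a topological order of a DAG.

```python
def dfs_orders(graph, start_order):
    """Run DFS, starting new searches in `start_order`.

    Returns (pre, post): vertices by increasing pre-number and post-number.
    """
    pre, post, seen = [], [], set()

    def visit(v):
        seen.add(v)
        pre.append(v)          # pre-number assigned on first visit
        for w in graph[v]:
            if w not in seen:
                visit(w)
        post.append(v)         # post-number assigned when v is finished

    for v in start_order:
        if v not in seen:
            visit(v)
    return pre, post

def is_topological(graph, order):
    """True if every edge u -> v has u before v in `order`."""
    pos = {v: i for i, v in enumerate(order)}
    return all(pos[u] < pos[v] for u in graph for v in graph[u])

graph = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
pre, post = dfs_orders(graph, ["C", "D", "A", "B"])

print(pre, is_topological(graph, pre))                # pre-number order
print(post[::-1], is_topological(graph, post[::-1]))  # reverse post-number order
```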

Why Capacity-1 in integer knapsack?

The dynamic programming solution to the integer knapsack problem,
For a knapsack of capacity C, and for n items, where ith item has the size Si and value Vi, is:
M(C)=max(M(C-1), M(C-Si)+Vi), where i goes from 1 to n
Here M is an array. M(C) denotes the maximum value of a knapsack of capacity C.
What is the use of M(C-1) in this relation? I mean the solution should just be this:
M(C)=max(M(C-Si)+Vi), where i goes from 1 to n
I think all the cases that M(C-1) covers are covered in M(C).
If I'm wrong, please give me an example situation.
I think you have the setup of the formula a bit confused; specifically, you've mixed up the capacity of the bag with a subproblem of n-1 items. Let's redefine things a bit.
Let P denote the problem, as represented by a list of n items.
Further, let Pk represent the subproblem consisting of items at indices 1...k from the original problem, where 1 <= k <= n. Thus Pn is equivalent to P.
For each item at index i, let Vi denote the value of that item and Si denote the size of that item.
Let C be the capacity of the bag, C >= 0
Let M(Pk, C) denote the optimal solution to the problem described by Pk with a bag of capacity C. M(Pk, C) returns the list of items included in the solution (and thus also returns the value of the optimal solution and the excess capacity in the bag).
For each item, we could either include it in the optimal solution, or not include it in the optimal solution. Clearly, the optimal solution is whichever of these two options is preferable. The only corner case to consider is if the item in question cannot fit in the bag. In this case we must exclude it.
We can rely on recursion to cover every item for us, and thus have no need for iteration. Altogether:
M(Pk,C) = if(Sk > C) M(P(k-1), C) else max(M(P(k-1),C), Vk + M(P(k-1),C-Sk))
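A minimal sketch of this recurrence follows. It makes two assumptions that the text above does not state: it returns only the value of the optimal solution (not the list of items), and it uses M(P0, C) = 0 as the base case for an empty item list. The function name `knapsack` and the sample data are illustrative.

```python
from functools import lru_cache

def knapsack(sizes, values, capacity):
    """Best total value when each item may be used at most once."""

    @lru_cache(maxsize=None)
    def m(k, c):
        # m(k, c) is the value of M(Pk, C): items 1..k, capacity c.
        if k == 0:
            return 0                      # assumed base case: no items left
        size, value = sizes[k - 1], values[k - 1]
        skip = m(k - 1, c)                # exclude item k
        if size > c:
            return skip                   # item k cannot fit
        return max(skip, value + m(k - 1, c - size))  # exclude vs. include

    return m(len(sizes), capacity)

# Hypothetical instance: the best choice is the items of size 3 and 4.
print(knapsack((1, 3, 4, 5), (1, 4, 5, 7), 7))
```

Memoizing on (k, c) keeps the recursion from recomputing the same subproblem, which is what makes this a dynamic programming solution and not plain exhaustive search.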

Divide lists and increment every element of list by its depth in Prolog

I'm trying to increment every element of list by its depth. For example:
foo([0, 0, [0]], [1, 1, [2]]) -> true
Also, I'd like to do that without any built-in Prolog list predicates. Any ideas on how to solve this?
For divide_half, you need to verify 2 things (which can be done independently): that the lists define some kind of split, and that parts of the split have close enough lengths.
For the second, start by trying to FIND each element, then modify that to track the depth, and finally modify THAT to build up a copy of the list w/ the elements modified (though I don't understand what it means to one-increment something by a number).
This definition divides a list without any built-in predicates:
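% halve(+List, -Front, -Back): Front is the first half of List, Back the rest.
halve(List, A, B) :- halve(List, List, A, B).
% The second argument advances two elements per step, the first only one,
% so the first argument has reached the midpoint when the second runs out.
halve(B, [], [], B).
halve(B, [_], [], B).
halve([H|T], [_,_|T2], [H|A], B) :-
    halve(T, T2, A, B).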
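For the depth-increment part of the question, the recursion outlined above (walk the list, carry the current depth, rebuild a copy) can be sketched as follows. This is written in Python only to show the shape of the recursion and the expected result; the name `add_depth` is made up, and a Prolog version would need one clause per case (empty list, nested list head, number head).

```python
def add_depth(items, depth=1):
    """Return a copy of `items` with each number increased by its nesting depth."""
    result = []
    for x in items:
        if isinstance(x, list):
            # A nested list: recurse one level deeper.
            result.append(add_depth(x, depth + 1))
        else:
            # A plain element: add the depth of the list that contains it.
            result.append(x + depth)
    return result

print(add_depth([0, 0, [0]]))
```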

Prolog - how to check if a list includes certain elements?

I am trying out Prolog for the first time and am having a little difficulty using lists.
Say I have a list of elements. I want to check that the list has the following elements:
All of: A1, A2, A3, A4, A5
One of: B1, B2, B3, B4
Two of: C1, C2, C3, C4, C5, C6
For example, [A1, A2, B2, C1, A3, A4, C4, A5] meets the requirements and [A2, A1, C1, B1, A3, A4] does not.
How would I go about writing something that returns Yes/True if a list meets the requirements and No/False otherwise? Similarly, how about writing something that returns the missing values from the list needed to meet the requirements?
You asked a lot of questions! Let me get you started with some predicates that solve most of your requirements.
First let's tackle the case of checking that all entries of one list are also in the other list:
subset([],_).
subset([H|T],List) :-
    member(H,List),
    subset(T,List).
This simple recursion makes use of the familiar member/2 predicate to verify that each entry in the list specified by the first argument of subset/2 is also in the list specified by the second argument. [For simplicity I've assumed that the entries of these lists are distinct. A more elaborate version would be needed if we wanted to verify that multiple instances of an entry of the first list are matched to at least that many instances in the second list.]
Okay, how about a check that (at least) one of a first list belongs also to the second list? Obviously this is a different predicate than the one above. Instead of all items in the first list, the goal is to be satisfied if there exists any one item in the first list that belongs to the second list.
intersects([H|_],List) :-
    member(H,List),
    !.
intersects([_|T],List) :-
    intersects(T,List).
This recursion fails if it reaches an empty list for the first argument, but succeeds if at any point before that a member of the first list is found that belongs to the second list. [This predicate would work fine even if multiple instances of an item occur in either list. However, we'd need to refine the logic if we wanted to check that exactly one item of the first list belongs to the second list, and that would entail worrying about whether multiple instances are consistent with or counter to the exact count of one.]
What if we want to generalize this check, to verify (at least) N items of the first list are in the second one? The resulting predicate will require a third argument:
intersectsAtLeast(_,_,N) :- N =< 0, !.
intersectsAtLeast([H|T],L,N) :-
    member(H,L),
    !,
    M is N-1,
    intersectsAtLeast(T,L,M).
intersectsAtLeast([_|T],L,N) :-
    intersectsAtLeast(T,L,N).
This recursion works through the list, decrementing the third argument by one each time an item on the first list turns out to be in the second list as well, and succeeding once the count is reduced to 0 (or less). [Again the code here needs more work if the lists can have repetitions.]
Finally, you ask about writing something that "returns the missing values" needed to meet the requirements. This is not well defined in the case of checking for one or more items on both lists, since a "missing value" might be any one of a number of possible items. In the special case where we asked for all the items on the first list to belong to the second list, the "missing values" can be determined (if any).
missingValues([],_,[]).
missingValues([H|T],L,K) :-
    member(H,L),
    !,
    missingValues(T,L,K).
missingValues([H|T],L,[H|K]) :-
    missingValues(T,L,K).
Here the recursion "moves" items from the input first list to the output "missing items" third list if and only if they do not appear in the second list.
One final note about your questions concerns notation. In Prolog variables are identifiers that start with a capital letter or an underscore, so the use of A1, A2, etc. as items on the list is heading for trouble if those are treated as "unknowns" rather than (as I assume you meant) distinct atoms (constants). Switching to lowercase letters would solve that.

Split up a collection, for each subset respecting probabilities for properties of its items

For a small game (for which I am a bit forced to use C++, so STL-based solutions can be interesting here), I encountered the following neat problem. I was wondering if there is any literature on the subject that I could read, or clever implementations.
Collection S of unique items {E1, E2, E3}, each item E having a set of properties, {P1, P2, P3...}
This collection should be split up in S1, S2, S3, S4. It is defined how large S1..4 have to be exactly. We can assume the collection can be correctly split up in those sizes for the remainder of the problem.
Now, for S1, a number of constraints can appear, {C1, C2..}, which specify that for instance, no items with the property P1 may appear in it. Another constraint could be that it should favour the items with property P2 with a factor of 0.8 (we can assume these types of constraints are normalized for all of the subsets per property).
The "weighting" is not that hard to implement. I simply fill some array with candidate numbers; the ones with higher weight are represented more often in this array. I then select a random element of the array. The size of the array determines accuracy/granularity (in my case, a small array suffices).
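As a sketch of this weighting step (in Python for brevity, not the C++ of the game itself), a weighted pick can also be made directly from the weights without materialising a fill array, which removes the granularity limit. The function name `pick_subset` and the weights 0.8 / 0.2 are assumptions for illustration; `random.choices` is the standard-library call that does the weighted draw.

```python
import random

def pick_subset(weights, rng=random):
    """Pick one subset name; `weights` maps subset name -> relative weight."""
    names = list(weights)
    # random.choices draws with probability proportional to each weight.
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
picks = [pick_subset({"S1": 0.8, "S2": 0.2}, rng) for _ in range(10000)]
share_s1 = picks.count("S1") / len(picks)
print(share_s1)
```

A forbidden subset can be expressed by giving it weight 0 for the item in question, as long as at least one other subset keeps a positive weight.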
The problem is forbidding some items to appear. It can easily lead to a situation where one item in S needs to be placed in one of the subsets S1, S2, S3, or S4, but this can no longer happen because the subsets are either all full, or the ones that are not full have a specific constraint that this item cannot appear in the set. So you have to backtrack the placement. Doing so too often may violate the weighted probability too much.
What is this problem called, or does it easily map to another (probably NP) problem?
EDIT: Example:
S = {A, B, C, D, E, F, G, H, I, J, K, L, M }
S1 = [ 0.8 probability of having VOWEL, CANNOT HAVE I or K, SIZE = 6 ]
S2 = [ 0.2 probability of having VOWEL, CANNOT HAVE M, B, E, SIZE = 7 ]
Now, suppose we start filling by FOR(LETTER IN S):
LETTER A, create a fill array based on property constraints (0.8 vs 0.2):
[ 1, 1, 1, 1, 1, 1, 1, 2, 2].
Pick a random element from that array: 1.
Now, put A in S1.
For letter I, for instance, the only candidate would be 2, since S1 has a constraint that I cannot appear in it.
Keep doing this, eventually you might end up with:
C = { M } // one more letter to distribute
S1 = A, B, D, E, F, G
S2 = C, H, I, J, K, L
Now, where to place M? It cannot be placed in S1, since that one is full, and it cannot be placed in S2 because S2 has a constraint that M cannot be placed in it.
The only way is to backtrack some placement, but then we might mess with the weighted distribution too much (for instance, giving S2 one vowel of S1, which flips around the natural distribution).
Note that this becomes slightly more complex (in the sense that more backtracks would be needed) when more subsets are in play, instead of just 2.
This resembles a constraint satisfaction problem (CSP) with hard and soft constraints. There are a couple of standard algorithms for that, but you have to check whether they apply to your particular problem instance.
Check Wikipedia for starters.
How about this heuristic:
1. Taking into consideration limitations due to constraints and full sets, locate any elements that only meet the criteria for a single set and place them there. If at any point one of these insertions causes a set to become full, re-evaluate which elements meet the criteria for only a single set.
2. Now look only at elements that could fit in exactly two sets. For each element, compute the differences in the required probabilities for each set if you added that element vs. if you did not. Insert the element into the set where the insert gives the best short-term result (first fit / greedy algorithm). If an insert fills up a set, re-evaluate which elements meet the criteria for only two sets.
3. Continue for elements that fit in 3 sets, 4 sets, ..., n sets.
At this point all elements will be placed into sets meeting all the constraints, but the probabilities are probably not optimal. You could continue by swapping elements between the sets (only allowing swaps that don't violate constraints), using a gradient descent or random-restart hill climbing algorithm on a function describing how closely all the probabilities are met. This will tend to converge towards the optimal solution but is not guaranteed to reach it. Continue until you meet your requirements to within an acceptable amount, or until a fixed time limit is reached, or until the possible improvement falls below a set threshold.
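The placement steps can be sketched as a single "most constrained item first" loop, which re-evaluates the feasible sets after every insertion. This is a simplified sketch in Python, not the full heuristic: the function name `place_items`, the `(capacity, forbidden_items)` layout, and the default tie-breaker (prefer the set with the most free room) are assumptions, the probability-based scoring is left to a caller-supplied `score` function, and the swap / hill-climbing phase is not included.

```python
def place_items(items, subsets, score=None):
    """Greedy placement, always handling the item with the fewest options next.

    subsets: name -> (capacity, forbidden_items). Returns name -> list of items.
    Raises ValueError if some item has no feasible subset left.
    """
    placed = {name: [] for name in subsets}
    if score is None:
        # Assumed default: prefer the subset with the most free room.
        score = lambda item, name, placed: subsets[name][0] - len(placed[name])

    def feasible(item):
        # Subsets that are not full and do not forbid this item.
        return [name for name, (cap, forbidden) in subsets.items()
                if len(placed[name]) < cap and item not in forbidden]

    remaining = list(items)
    while remaining:
        # Re-evaluated on every pass, so a newly full subset is noticed.
        item = min(remaining, key=lambda it: len(feasible(it)))
        options = feasible(item)
        if not options:
            raise ValueError(f"no subset can take {item!r}")
        best = max(options, key=lambda name: score(item, name, placed))
        placed[best].append(item)
        remaining.remove(item)
    return placed

# The example from the question: S1 forbids I and K, S2 forbids M, B and E.
subsets = {"S1": (6, set("IK")), "S2": (7, set("MBE"))}
result = place_items("ABCDEFGHIJKLM", subsets)
print(result)
```

Because M, B, and E are placed while S1 still has room, the dead end from the question does not occur here; in general a greedy pass like this can still fail, which is why the sketch raises an error where a full solution would backtrack.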