I am currently working on a project for my university. In this project I am building a cross-selling model with association rule mining. As a result, I have tons of rules, but I am not sure how to rank them to find the best ones.
Which option would be better?
Option 1: Confidence=20% Lift= 5
Option 2: Confidence = 50% Lift = 2
I know confidence is important, but I have heard lift is very important as well. Should I sacrifice some confidence for more lift, or keep them balanced?
It depends on what the aim of the association rule mining is. For example:
- a database of 100,000 transactions
- 2,000 transactions contain {(a, b)}
- 800 transactions contain {(a, b, c)}
Support of itemset {(a, b, c)}: (800 / 100,000) * 100 = 0.8%.
The support of an itemset indicates how often a random transaction of the database contains all items of the itemset.
Confidence of association rule {(a, b)} -> {(c)}: (800 / 2,000) * 100 = 40%.
The confidence of an association rule indicates how often a random transaction of the database that contains the antecedent of the rule also contains the consequent.
Lift of association rule {(a, b)} -> {(c)}: 40 / ((5,000 / 100,000) * 100) = 40 / 5 = 8.
The lift is the ratio of the confidence of an association rule to its expected confidence. The confidence of the rule here is 40%. Expected confidence in this context is the confidence you would see if the occurrence of {(a, b)} in a transaction did not increase the probability that {(c)} occurs in the same transaction.
E.g. if {(c)} occurs in 5,000 transactions of the database, then the expected confidence is (5,000 / 100,000) * 100 = 5%.
A lift value higher than 1 indicates that the association rule is useful. A lift value less than or equal to 1 indicates that the rule is not useful: in that case the antecedent and the consequent of the rule behave as if they were independent of each other. Knowing that a transaction contains {(a, b)} then tells you nothing more about {(c)} occurring than chance alone would.
E.g. if all 100,000 transactions of the database contain {(c)}, the expected confidence for {(c)} is (100,000 / 100,000) * 100 = 100%. The lift is 40 / 100 = 0.4, which is less than 1, so the association rule {(a, b)} -> {(c)} is not useful: {(c)} is in every transaction, so whether or not {(a, b)} is present, {(c)} will be there anyway. There is no use in the association.
Here the circle closes: it depends on the aim of the association rule mining. If the aim is to create especially strong association rules, the confidence needs to be especially high. If the purpose is to create especially useful association rules, the lift needs to be especially high.
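To make the arithmetic above concrete, here is a minimal Python sketch that computes support, confidence, and lift from the counts used in the example (the variable names are mine):

n_transactions = 100_000      # size of the database
n_antecedent = 2_000          # transactions containing {(a, b)}
n_both = 800                  # transactions containing {(a, b, c)}
n_consequent = 5_000          # transactions containing {(c)}

support = n_both / n_transactions                    # 0.008 -> 0.8%
confidence = n_both / n_antecedent                   # 0.40  -> 40%
expected_confidence = n_consequent / n_transactions  # 0.05  -> 5%
lift = confidence / expected_confidence              # 8.0

print(support, confidence, lift)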
What does the Type mean and what do Type 0, 1, 2 signify in Ethers.js?
I only found https://eips.ethereum.org/EIPS/eip-2718#transactions as relevant. Can someone please help me out on this one?
hash: '0x3dd67f90200eb9902fb64ee
type: 2,
accessList: [],
blockHash: '0xe295188fa2aa6e9a5b
blockNumber: 15472730,
transactionIndex: 63,
confirmations: 2,
from: '0x227c4Bf958A9Ab93EA04ee
From: transaction-types-on-ethereum
Ethereum has different transaction types that specify what a
transaction should do. For example, sending Ether to an address,
deploying a contract, etc.
Before the recent Berlin network upgrade, there were essentially four
different transaction "types" in Ethereum.
Previously, Ethereum had one format for transactions. Each transaction consists of a nonce, gas price, gas limit, to address, value, data, v, r, and s. These fields are RLP-encoded, to look something like this:
RLP([nonce, gasPrice, gasLimit, to, value, data, v, r, s])
EIP-2718 defines a new generalised envelope for typed transactions. In
the new standard, transactions look like this:
TransactionType || TransactionPayload
TransactionType: a number between 0 and 0x7f, for a total of 128
possible transaction types.
This is the use case:
This new approach makes it possible for new EIPs to introduce a
transaction type without introducing more unnecessary complexity in
the existing transaction format, and it makes it a lot easier for the
different Ethereum tools (clients, libraries) to distinguish between
different transactions.
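Here is a minimal, illustrative Python sketch (not an Ethers.js call) of how the EIP-2718 envelope makes typed transactions easy to distinguish from legacy ones: the first byte of a typed transaction is the TransactionType (0x00 to 0x7f), while a legacy RLP-encoded transaction starts with a byte of 0xc0 or higher. The example bytes are made up.

def transaction_type(raw: bytes) -> str:
    # EIP-2718: TransactionType || TransactionPayload, with TransactionType in [0x00, 0x7f]
    first = raw[0]
    if first <= 0x7f:
        return f"typed transaction, TransactionType = {first}"
    # legacy transactions are RLP lists, whose first byte is 0xc0 or above
    return "legacy (type 0) transaction"

print(transaction_type(bytes([0x02, 0xf8, 0x72])))  # typed transaction, TransactionType = 2
print(transaction_type(bytes([0xf8, 0x6c])))        # legacy (type 0) transaction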
Read about TransactionType 2 in what-does-eip-1559-change/
The response to the high and unpredictable transaction fees in Ethereum was EIP-1559. It introduced a new concept: the base cost of gas, a protocol-set value that changes smoothly depending on the network load (if the expected number of transactions is low, the base cost smoothly decreases; as soon as there are more of them, it smoothly increases). From now on, whenever a transaction of the new type is sent, the base cost of gas multiplied by the amount of gas spent is used for the calculation, no more and no less. This eliminates the possibility of errors when sending transactions with disproportionately large commissions, and it is easier for wallet programs to calculate the cost of commissions, as they are formed according to a fairly simple and predictable scheme. The new type of transaction is called a "type 2 transaction", and the canonical (legacy) transactions are called "type 0 transactions".
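As a rough Python sketch of the fee arithmetic the quoted explanation describes for a type 2 transaction (the numbers below are made up, and the actual protocol also allows an optional priority tip that the quote does not cover):

base_fee_per_gas = 15_000_000_000   # 15 gwei, hypothetical base cost of gas
gas_used = 21_000                   # gas spent by a plain ETH transfer

fee_in_wei = base_fee_per_gas * gas_used
print(fee_in_wei / 10**18, "ETH")   # 0.000315 ETH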
Can someone help me interpret the AWS Personalize solution version metrics in layman’s terms or, at the very least, tell me what these metrics should ideally look like?
I have no knowledge of Machine Learning and wanted to take advantage of Personalize as it is marketed as a 'no-previous-knowledge-required' ML SaaS. However, the “Solution version metrics” in my solution results seem to require a fairly high level of math knowledge.
My Solution version metrics are as follows:
Normalized discounted cumulative gain
At 5: 0.9881, At 10: 0.9890, At 25: 0.9898
Precision
At 5: 0.1981, At 10: 0.0993, At 25: 0.0399
Mean reciprocal rank
At 25: 0.9833
Research
I have looked through the Personalize Developer's Guide which includes a short definition of each metric on page 72. I also attempted to skim through the Wikipedia articles on discounted cumulative gain and mean reciprocal rank. From reading, this is my interpretation of each metric:
NDCG = Consistency of relevance of recommendations; Is the first recommendation as relevant as the last?
Precision = Relevance of recommendations to user; How relevant are your recommendations to users across the board?
MRR = Relevance of first recommendation in the list versus the others in the list; How relevant is your first recommendation to each user?
If these interpretations are right, then my solution metrics indicate that I am highly consistent about recommending irrelevant content. Is that a valid conclusion?
Alright, my company has Developer Tier Support so I was able to get an answer to this question from AWS.
Answer Summary
The metrics are better the closer they are to '1'. My interpretation of my metrics was pretty much correct but my conclusion was not.
Apparently, these metrics (and Personalize in general) do not take into account how much a user likes an item. Personalize only cares how soon a relevant recommendation gets to the user. This makes sense because if you get the 25th item in a queue and don't like anything you've seen, you are not likely to continue looking.
Given this, what's happening in my solution is that the first-ish recommendation is relevant but none of the others are.
Detailed Answer from AWS
I will start with the relatively easier question first: what are the ideal values for these metrics, so that one solution version can be preferred over another?
The answer to the above question is that for each metric, higher numbers are better. [1] If you have more than one solution version, prefer the solution version with higher values for these metrics. Please note that you can create a number of solution versions by overriding the default recipe parameters [2] and by using hyperparameters [3].
The second question: how to understand and interpret the metrics for an AWS Personalize solution version?
I can confirm from my research that the definitions and interpretations you provided for these metrics are valid.
Before I explain each metric, here is a primer on one of the main concepts in machine learning: how are these metrics calculated?
The model training step during the creation of a solution version splits the input dataset into two parts, a training dataset (~70%) and a test dataset (~30%). The training dataset is used during model training. Once the model is trained, it is used to predict values for the test dataset. Once a prediction is made, it is validated against the known (and correct) value in the test dataset. [4]
I researched further to find more resources that explain the concept behind these metrics and to elaborate on the example provided in the AWS documentation. [1]
"mean_reciprocal_rank_at_25"
Let’s first understand Reciprocal Rank:
For example, a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user, i.e. A, B, C, D, E.
Once these 5 recommended movies are compared against the actual movies liked by that user (in the test dataset) we find out that only movie B and E are actually liked by the user.
The Reciprocal Rank will only consider the first relevant (correct according to test dataset) recommendation which is movie B located at rank 2 and it will ignore the movie E located at rank 5. Thus the Reciprocal Rank will be 1/2 = 0.5
Now let’s expand the above example to understand Mean Reciprocal Rank: [5] Let’s assume that we ran predictions for three users and the movies below were recommended.
User 1: A, B, C, D, E (user liked B and E, thus the Reciprocal Rank is 1/2)
User 2: F, G, H, I, J (user liked H and I, thus the Reciprocal Rank is 1/3)
User 3: K, L, M, N, O (user liked K, M and N, thus the Reciprocal Rank is 1)
The Mean Reciprocal Rank will be sum of all the individual Reciprocal Ranks divided by the total number of queries ran for predictions, which is 3.
(1/2 + 1/3 + 1)/3 = (0.5+0.33+1)/3 = (1.83)/3 = 0.61
In case of AWS Personalize Solution version metrics, the mean of the reciprocal ranks of the first relevant recommendation out of the top 25 recommendations over all queries is called “mean_reciprocal_rank_at_25”.
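To make the calculation concrete, here is a minimal Python sketch of mean reciprocal rank over the three users from the worked example above (the helper name is mine):

recommendations = {
    "user1": ["A", "B", "C", "D", "E"],
    "user2": ["F", "G", "H", "I", "J"],
    "user3": ["K", "L", "M", "N", "O"],
}
liked = {
    "user1": {"B", "E"},
    "user2": {"H", "I"},
    "user3": {"K", "M", "N"},
}

def reciprocal_rank(ranked, relevant):
    # 1 / position of the first relevant item; 0 if no item in the list is relevant
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

mrr = sum(reciprocal_rank(recommendations[u], liked[u]) for u in recommendations) / len(recommendations)
print(round(mrr, 2))  # 0.61, matching the worked example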
"precision_at_K"
It can be described as the capability of a model to deliver the relevant elements with the fewest recommendations.
The concept of precision is described in the following free video available at Coursera. [6] A very good article on the same topic can be found here. [7]
Let’s consider the same example: a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user, i.e. A, B, C, D, E. Once these 5 recommended movies are compared against the actual movies liked by that user (the correct values in the test dataset), we find out that only movies B and E are actually liked by the user.
The precision_at_5 is 2 correctly predicted movies out of 5 recommended movies, i.e. 2/5 = 0.4.
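A minimal Python sketch of precision@K for the same single-user example (the helper name is mine):

recommended = ["A", "B", "C", "D", "E"]   # top-5 recommendations
relevant = {"B", "E"}                     # movies the user actually liked (test set)

def precision_at_k(ranked, relevant, k):
    top_k = ranked[:k]
    return sum(1 for item in top_k if item in relevant) / k

print(precision_at_k(recommended, relevant, 5))  # 0.4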
"normalized_discounted_cumulative_gain_at_K"
This metric uses the concepts of logarithm and logarithmic scale to assign weighting factors to relevant items (correct values in the test dataset). A full description of logarithms and logarithmic scales is beyond the scope of this document. The main objective of using a logarithmic scale is to compress wide-ranging quantities into a smaller range.
discounted_cumulative_gain_at_K
Let’s consider the same example: a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user, i.e. A, B, C, D, E. Once these 5 recommended movies are compared against the actual movies liked by that user (the correct values in the test dataset), we find out that only movies B and E are actually liked by the user.
To produce the discounted cumulative gain (DCG) at 5, each relevant item is assigned a weighting factor (using a logarithmic scale) based on its position in the top 5 recommendations. The value produced by this formula is called the "discounted value".
The formula is 1/log(1 + position).
As B is at position 2, its discounted value is 1/log(1 + 2).
As E is at position 5, its discounted value is 1/log(1 + 5).
The discounted cumulative gain (DCG) is calculated by adding the discounted values of both relevant items: DCG = 1/log(1 + 2) + 1/log(1 + 5).
normalized_discounted_cumulative_gain_at_K
First of all, what is “ideal DCG”?
In the above example the ideal prediction would look like B, E, A, C, D, i.e. the relevant items would be at positions 1 and 2. To produce the "ideal DCG" at 5, each relevant item is again assigned a weighting factor (using the same logarithmic scale) based on its position in the top 5 recommendations.
The formula is 1/log(1 + position).
As B is at position 1, its discounted value is 1/log(1 + 1).
As E is at position 2, its discounted value is 1/log(1 + 2).
The ideal DCG is calculated by adding the discounted values of both relevant items: ideal DCG = 1/log(1 + 1) + 1/log(1 + 2).
The normalized discounted cumulative gain (NDCG) is the DCG divided by the “ideal DCG”.
DCG / ideal DCG = (1/log(1 + 2) + 1/log(1 + 5)) / (1/log(1 + 1) + 1/log(1 + 2)) = 0.6241
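A minimal Python sketch of the same NDCG@5 calculation, using the 1/log(1 + position) weighting described above (natural log here; the ratio is the same for any log base):

import math

recommended = ["A", "B", "C", "D", "E"]
relevant = {"B", "E"}

def dcg(ranked, relevant):
    # sum of 1/log(1 + position) over the relevant items in the ranking
    return sum(1.0 / math.log(1 + pos)
               for pos, item in enumerate(ranked, start=1)
               if item in relevant)

ideal_ranking = ["B", "E", "A", "C", "D"]   # relevant items ranked first
ndcg = dcg(recommended, relevant) / dcg(ideal_ranking, relevant)
print(round(ndcg, 4))  # 0.6241, matching the worked example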
I hope the information provided above is helpful in understanding the concept behind these metrics.
[1] https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html
[2] https://docs.aws.amazon.com/personalize/latest/dg/customizing-solution-config.html
[3] https://docs.aws.amazon.com/personalize/latest/dg/customizing-solution-config-hpo.html
[4] https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54
[5] https://www.blabladata.com/2014/10/26/evaluating-recommender-systems/
[6] https://www.coursera.org/lecture/ml-foundations/optimal-recommenders-4EQc2
[7] https://medium.com/@bond.kirill.alexandrovich/precision-and-recall-in-recommender-systems-and-some-metrics-stuff-ca2ad385c5f8
I analyzed a supermarket database with an association rules algorithm. Although the minimum confidence (0.04) and minimum support (0.002) are low, the results I got are trivial rules (fresh items that are bought daily), for example:
Tomato --> Cucumber
Milk --> eggs
I don’t think these rules are of any benefit.
I use SQL Server Business Intelligence for the analysis.
Is it possible that my database simply cannot help me with forecasting or other problems?
Confidence, by itself, does not tell you much. You also need to consider lift (how much more likely are cucumbers when tomato is present than when tomato is not present). You can also use the chi-square value calculated from a contingency table with counts of cucumbers when tomatoes are present and not present. A higher chi-square value generally indicates a more interesting rule.
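As a minimal sketch of the suggestion above, assuming made-up counts (the real counts would come from your transaction table), here is how lift and the chi-square value could be computed from a 2x2 contingency table in Python (scipy assumed available):

from scipy.stats import chi2_contingency

n = 10_000            # total transactions (hypothetical)
n_tomato = 2_000      # transactions containing tomato
n_cucumber = 1_500    # transactions containing cucumber
n_both = 600          # transactions containing both

confidence = n_both / n_tomato            # P(cucumber | tomato)
lift = confidence / (n_cucumber / n)      # > 1 means a positive association

table = [
    [n_both, n_tomato - n_both],                                 # tomato present
    [n_cucumber - n_both, n - n_tomato - n_cucumber + n_both],   # tomato absent
]
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"confidence={confidence:.2f} lift={lift:.2f} chi2={chi2:.1f} p={p_value:.3g}")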
I am trying to come up with an algorithm to find the top-3 most frequently used adjectives for a product in the same sentence. I want to use association rule mining (the Apriori algorithm).
For that I am planning to use Twitter data. I can more or less decompose tweets into sentences and then, with filtering, I can find product names and the adjectives used with them.
For instance, after filtering I have data like:
ipad mini, great
ipad mini, horrible
samsung galaxy s2, best
...
etc.
Product names and adjectives are previously defined. So I have a set of product names and set of adjectives that I am looking for.
I have read a couple of papers about sentiment analysis and rule mining, and they all say the Apriori algorithm is used. But they don't say how they used it and they don't give details.
Therefore how can I reduce my problem to association rule mining problem?
What values should I use for minsup and minconf?
How can I modify Apriori algorithm to solve this problem?
What I'm thinking is:
I should find frequent adjectives separately for each product. Then by sorting I can get the top-3 adjectives. But I do not know if this is correct.
Finding the top-3 most used adjectives for each product is not association rule mining.
For Apriori to yield good results, you must be interested in itemsets of length 4 and more. Apriori pruning starts at length 3, and begins to yield major gains at length 4. At length 2, it is mostly enumerating all pairs. And if you are only interested in (product, adjective) pairs, then Apriori is doing much more work than necessary.
Instead, use counting. Use hash tables. If you really have exabytes of data, use approximate counting and heavy-hitter algorithms. (But most likely, you don't have exabytes of data after extracting those pairs...)
Don't bother to investigate association rule mining if you only need to solve this much simpler problem.
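As a sketch of that counting approach in plain Python (a hash table of counters keyed by product, using the pairs from the question):

from collections import Counter, defaultdict

pairs = [
    ("ipad mini", "great"),
    ("ipad mini", "horrible"),
    ("samsung galaxy s2", "best"),
    # ... more (product, adjective) pairs extracted from tweets
]

counts = defaultdict(Counter)
for product, adjective in pairs:
    counts[product][adjective] += 1

for product, adjective_counts in counts.items():
    print(product, adjective_counts.most_common(3))   # top-3 adjectives per product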
Association rule mining is really only for finding patterns such as
pasta, tomato, onion -> basil
and more complex rules. The contribution of Apriori is to reduce the number of candidates when going from length n-1 -> n for length n > 2. And it gets more effective when n > 3.
Reducing your problem to Association Rule Mining (ARM)
Create a feature vector containing all the topics and adjectives. If a feed contains a topic (or adjective), place a 1 for it in the tuple, else 0. For example, let us assume the topics are Samsung and Apple, the adjectives are good and horrible, and a feed contains "Samsung good". Then the corresponding tuple for it is:
Samsung  Apple  good  horrible
   1       0      1      0
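A minimal Python sketch of building such a tuple (topic and adjective lists taken from the example above):

topics = ["Samsung", "Apple"]
adjectives = ["good", "horrible"]
columns = topics + adjectives

feed_tokens = {"Samsung", "good"}          # the example feed "Samsung good"
row = tuple(1 if col in feed_tokens else 0 for col in columns)
print(dict(zip(columns, row)))             # {'Samsung': 1, 'Apple': 0, 'good': 1, 'horrible': 0}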
Modification to Apriori Algorithm required
Generate association rules of the type 'topic' --> 'adjective' using a constrained Apriori algorithm; 'topic' --> 'adjective' is the constraint.
How to set MinSup and MinConf :
Read the paper entitled "Mining top-k association rules" and implement it with k = 3 for the top-3 adjectives.
I've been doing some work for my exams, which are in a few days, and I'm going through some past papers, but unfortunately there are no corresponding answers. I've answered the question and was wondering if someone could tell me if I am correct.
My question is:
(c) A transactional dataset, T, is given below:
t1: Milk, Chicken, Beer
t2: Chicken, Cheese
t3: Cheese, Boots
t4: Cheese, Chicken, Beer
t5: Chicken, Beer, Clothes, Cheese, Milk
t6: Clothes, Beer, Milk
t7: Beer, Milk, Clothes
Assume that minimum support is 0.5 (minsup = 0.5).
(i) Find all frequent itemsets.
Here is how I worked it out:
Item : Amount
Milk : 4
Chicken : 4
Beer : 5
Cheese : 4
Boots : 1
Clothes : 3
Now, because minsup is 0.5 and there are 7 transactions, an itemset must appear in at least 3.5, i.e. 4, transactions. So you eliminate Boots and Clothes and form pairs of the remaining items, giving:
{items} : Amount
{Milk, Chicken} : 2
{Milk, Beer} : 4
{Milk, Cheese} : 1
{Chicken, Beer} : 3
{Chicken, Cheese} : 3
{Beer, Cheese} : 2
Which leaves {Milk, Beer} as the only frequent 2-itemset, as it is the only pair at or above the minsup?
I agree you should go for the Apriori Algorithm.
The Apriori algorithm is based on the idea that for a pair of items to be frequent, each individual item must also be frequent.
If the hamburger-ketchup pair is frequent, the hamburger itself must also appear frequently in the baskets. The same can be said about the ketchup.
So for the algorithm, a threshold X is established to define what is and what is not frequent: if an item appears more than X times, it is considered frequent.
The first step of the algorithm is to pass over each item in each basket and calculate its frequency (count how many times it appears). This can be done with a hash table of size N, where position y of the table holds the frequency of item y. If item y has a frequency greater than X, it is said to be frequent.
In the second step of the algorithm, we iterate through the items again, computing the frequency of pairs in the baskets. The catch is that we compute this only for pairs whose items are individually frequent: if item y and item z are each frequent on their own, we then compute the frequency of the pair. This condition greatly reduces the number of pairs to compute and the amount of memory required.
Once this is calculated, the pairs whose frequencies are greater than the threshold are said to be frequent itemsets.
(http://girlincomputerscience.blogspot.com.br/2013/01/frequent-itemset-problem-for-mapreduce.html)
There are two ways to solve the problem:
using the Apriori algorithm
using FP-Growth (frequent pattern growth)
Assuming that you are using Apriori, the answer you got is correct.
The algorithm is simple:
First you count frequent 1-itemsets and exclude the itemsets below minimum support.
Then you count frequent 2-itemsets by combining frequent items from the previous iteration and exclude the itemsets below the support threshold.
The algorithm can go on until no itemsets are above the threshold.
In the problem given to you, you only get one 2-itemset above the threshold, so you can't move further.
There is a solved example of further steps on Wikipedia here.
You can refer "Data Mining Concepts and Techniques" by Han and Kamber for more examples.
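For completeness, here is a minimal Python sketch of the two counting passes on the dataset from the question (minsup = 0.5 of 7 transactions, i.e. at least 4 occurrences):

from collections import Counter
from itertools import combinations

transactions = [
    {"Milk", "Chicken", "Beer"},
    {"Chicken", "Cheese"},
    {"Cheese", "Boots"},
    {"Cheese", "Chicken", "Beer"},
    {"Chicken", "Beer", "Clothes", "Cheese", "Milk"},
    {"Clothes", "Beer", "Milk"},
    {"Beer", "Milk", "Clothes"},
]
min_count = 0.5 * len(transactions)   # 3.5, so an itemset needs at least 4 occurrences

# Pass 1: frequent 1-itemsets
item_counts = Counter(item for t in transactions for item in t)
frequent_items = {i for i, c in item_counts.items() if c >= min_count}
print(sorted(frequent_items))         # ['Beer', 'Cheese', 'Chicken', 'Milk']

# Pass 2: candidate pairs built only from frequent items (the Apriori pruning step)
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t & frequent_items), 2):
        pair_counts[pair] += 1
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_count}
print(frequent_pairs)                 # {('Beer', 'Milk'): 4}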
OK, to start you must first understand that data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Now, the amount of raw data stored in corporate databases is exploding. From trillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. (One terabyte = one trillion bytes. A terabyte is equivalent to about 2 million books!) For instance, every day, Wal-Mart uploads 20 million point-of-sale transactions to an AT&T massively parallel system with 483 processors running a centralized database. Raw data by itself, however, does not provide much information. In today's fiercely competitive business environment, companies need to rapidly turn these terabytes of raw data into significant insights into their customers and markets to guide their marketing, investment, and management strategies.
Now you must understand that association rule mining is an important model in data mining. Its mining algorithms discover all item associations (or rules) in the data that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. Minsup controls the minimum number of data cases that a rule must cover. Minconf controls the predictive strength of the rule. Since only one minsup is used for the whole database, the model implicitly assumes that all items in the data are of the same nature and/or have similar frequencies in the data. This is, however, seldom the case in real-life applications. In many applications, some items appear very frequently in the data, while others rarely appear. If minsup is set too high, the rules that involve rare items will not be found. To find rules that involve both frequent and rare items, minsup has to be set very low. This may cause a combinatorial explosion, because the frequent items will be associated with one another in all possible ways. This dilemma is called the rare item problem. One proposed technique to solve this problem allows the user to specify multiple minimum supports to reflect the natures of the items and their varied frequencies in the database. In rule mining, different rules may then need to satisfy different minimum supports depending on which items are in the rules.
Given a set of transactions T (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf).
I hope that once you understand the very basics of data mining that the answer to this question shall become apparent.