Sentiment analysis with association rule mining - data-mining
I am trying to come up with an algorithm that finds the top-3 adjectives most frequently used with a product in the same sentence. I want to use association rule mining (the Apriori algorithm).
For that I am planning to use Twitter data. I can more or less decompose tweets into sentences, and then by filtering I can find the product names and the adjectives used with them.
For instance, after filtering I have data like:
ipad mini, great
ipad mini, horrible
samsung galaxy s2, best
...
etc.
Product names and adjectives are previously defined. So I have a set of product names and set of adjectives that I am looking for.
I have read a couple of papers about sentiment analysis and rule mining, and they all say the Apriori algorithm is used. But they don't say how they used it and they don't give details.
Therefore, how can I reduce my problem to an association rule mining problem?
What values should I use for minsup and minconf?
How can I modify Apriori algorithm to solve this problem?
What I'm thinking is:
I should find the frequent adjectives separately for each product. Then by sorting I can get the top-3 adjectives. But I do not know if this is correct.
Finding the top-3 most used adjectives for each product is not association rule mining.
For Apriori to yield good results, you must be interested in itemsets of length 4 and more. Apriori pruning starts at length 3, and begins to yield major gains at length 4. At length 2, it is mostly enumerating all pairs. And if you are only interested in pairs (product, adjective), then Apriori is doing much more work than necessary.
Instead, use counting. Use hash tables. If you really have exabytes of data, use approximate counting and heavy hitter algorithms. (But most likely, you don't have exabytes of data after extracting those pairs...)
Don't bother to investigate association rule mining if you only need to solve this much simpler problem.
Association rule mining is really only for finding patterns such as
pasta, tomato, onion -> basil
and more complex rules. The contribution of Apriori is to reduce the number of candidates when going from length n-1 -> n for length n > 2. And it gets more effective when n > 3.
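The counting approach suggested above can be sketched in a few lines. This is only a minimal illustration, assuming the (product, adjective) pairs have already been extracted; the function name top_adjectives is hypothetical, and the repeated "great" row is made up so that the counts differ.

```python
from collections import Counter, defaultdict

def top_adjectives(pairs, k=3):
    """Return the k most frequent adjectives for each product."""
    counts = defaultdict(Counter)  # product -> hash table of adjective counts
    for product, adjective in pairs:
        counts[product][adjective] += 1
    return {p: [adj for adj, _ in c.most_common(k)] for p, c in counts.items()}

# Sample pairs in the style of the question; the third row is invented.
pairs = [
    ("ipad mini", "great"),
    ("ipad mini", "horrible"),
    ("ipad mini", "great"),
    ("samsung galaxy s2", "best"),
]
result = top_adjectives(pairs)
print(result)
```

One pass over the data and one hash table per product is all that is needed, which is why no candidate generation or minsup/minconf tuning comes into play.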
Reducing your problem to Association Rule Mining (ARM)
Create a feature vector with one column for every topic and every adjective. If a feed contains a topic or adjective, place 1 in its column of the tuple, otherwise 0. For example, assume the topics are Samsung and Apple, the adjectives are good and horrible, and a feed contains "Samsung good". Then the corresponding tuple is:
Samsung Apple good horrible
1 0 1 0
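A minimal sketch of this encoding, assuming each feed has already been reduced to the set of topics and adjectives it mentions (the function name encode is hypothetical):

```python
def encode(feed_items, columns):
    """Map a feed to a 0/1 tuple with one position per topic or adjective."""
    return [1 if col in feed_items else 0 for col in columns]

columns = ["Samsung", "Apple", "good", "horrible"]
row = encode({"Samsung", "good"}, columns)
print(row)  # [1, 0, 1, 0]
```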
Modification to Apriori Algorithm required
Generate association rules of type 'topic' --> 'adjective' using a constrained Apriori algorithm, where the form 'topic' --> 'adjective' is the constraint.
How to set MinSup and MinConf:
Read the paper entitled "Mining Top-K Association Rules". Top-k mining asks for the number of rules k instead of a minsup threshold. Implement it with k=3 to get the 3 top adjectives.
Related
Classify K-means in Text Mining
The goal is to create a computer-generated news site that aggregates headlines from different news sources around the world. Looking at the centroid table results (https://ibb.co/n1mvnbk), I want to understand the following. I used K=5 and I am using TF-IDF.
What do those numbers mean?
When an attribute is zero in multiple clusters, what does it mean?
When I sort the centroid table by each cluster in descending order, I find some words or attributes that have a higher value in this cluster while having zero values in other clusters. Does this mean that these words occur more or less frequently in this cluster?
How can I discuss the clustering model? Do all the clusters make sense, and why?
Do you think k=5 is a good choice for this dataset, or do I need to choose 3? How can I classify that?
I believe K=5 denotes the number of clusters you are looking for in the current dataset. On that basis, 5 centroids will be placed and the data points will be grouped around them. Do you think k=5 is a good choice for this dataset? It is hard to predict this way; it has to be determined from the data itself. You might use the elbow method to identify a suitable number of clusters for a given dataset. This method is based on WCSS (within-cluster sum of squares), the sum of the squared distances between the points and the centroid of their cluster.
Those numbers are the average tf-idf values of the cluster. So a 0 means that the word does not occur in the cluster, and the highest-valued words are the most characteristic words for the cluster.
Note that for text you'll want to use spherical k-means rather than regular k-means.
Choosing k is a big problem. Forget the elbow method; it never works except on toy examples. Experiment with different k and choose the one that is most convincing or most useful. None of the usual heuristics for choosing the k in k-means will work here, I fear (VRC is IMHO the best). The main reason is that the data cannot be well partitioned into k clusters. There is no reason to assume there are exactly k topics in the world, nor that every document contains only one topic. Instead, the topics will form a complex structure themselves. For example, there is Trump, but there is also the Trump-Erdogan meeting, and there is the impeachment. These are not disjoint. But you will also have articles that don't fit into any of these topics. The effect is that the true best k would likely be very, very large, as large as the number of articles (and hence not useful).
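To make the "average tf-idf" reading concrete, here is a small sketch of how a centroid row arises. The vocabulary and the tf-idf values are invented for illustration; only the averaging itself reflects what k-means does.

```python
def centroid(vectors):
    """Component-wise mean of the tf-idf vectors assigned to one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical vocabulary and tf-idf vectors of two documents in one cluster.
vocab = ["trump", "impeachment", "football"]
cluster_docs = [
    [0.5, 0.25, 0.0],
    [0.25, 0.75, 0.0],
]
c = centroid(cluster_docs)

# Sort words by centroid weight to see which ones characterise the cluster.
top = sorted(zip(vocab, c), key=lambda pair: -pair[1])
print(top)
```

A centroid entry of exactly 0, as for "football" here, can only happen if no document in the cluster contains the word.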
Association rule mining
I have a dataset with mostly integer values. I want to apply association rule mining on it. I have taken a look at the popular algorithms like Apriori, etc. but all of them work on data which have boolean values, i.e., either the item exists in the transaction or doesn't. Is there an algorithm which lets us account for values of the attributes in addition to their counts? (I plan to normalize the data to have values between 0 and 1)
You can "hack" around this limitation if your numbers are integers (why normalize to 0 to 1?) and small. The transaction
apple banana apple
becomes
apple banana apple_2
which would allow you to find association rules like
banana => apple, apple_2
But you need to mix in some clever filters so that you do not get useless rules like
apple_2 => apple
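A rough sketch of this trick follows. The cumulative naming scheme (apple, apple_2, apple_3, ...) for counts above 2 is an assumption, as are the helper names, and the filter assumes that no real item name already ends in an underscore followed by digits.

```python
from collections import Counter

def encode_counts(transaction):
    """Turn repeated items into extra boolean items: apple, apple_2, ..."""
    items = set()
    for item, n in Counter(transaction).items():
        items.add(item)
        for level in range(2, n + 1):
            items.add(f"{item}_{level}")
    return items

def base_and_level(token):
    """Split 'apple_2' into ('apple', 2); a plain 'apple' is level 1."""
    name, sep, level = token.rpartition("_")
    if sep and level.isdigit():
        return name, int(level)
    return token, 1

def is_trivial(antecedent, consequent):
    """True if the antecedent already implies every consequent item."""
    have = {}
    for token in antecedent:
        name, level = base_and_level(token)
        have[name] = max(have.get(name, 0), level)
    return all(have.get(base_and_level(t)[0], 0) >= base_and_level(t)[1]
               for t in consequent)

encoded = encode_counts(["apple", "banana", "apple"])
print(sorted(encoded))
```

With these helpers, is_trivial({"apple_2"}, {"apple"}) flags the useless rule, while banana => apple, apple_2 passes the filter.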
Item-item collaborative filtering is quite similar to similarity-based data mining techniques like association rule mining. Moreover, collaborative filtering was built to handle continuous and ordinal values, such as star ratings or a Likert scale: this is usually preference information from users. Content-based filtering is probably your best bet for the situation you describe. It allows for item attributes and weights (that do not change per user for that item), then takes in user preference for each item (that does change per user for that item). If you want both preference (counts) and attributes to change for each user-item pair, I don't know of an algorithm that handles that. Usually algorithms are built for one input per user-item pair.
Yes. There are some variations of the itemset mining problem that will let you specify additional information. For example, high utility itemset mining algorithms let you specify a quantity for each item occurring in a transaction, as well as a weight for each item.
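As a sketch of what "quantity" and "weight" mean in that setting: the utility of an itemset is commonly defined as the sum, over the transactions containing the whole itemset, of quantity times weight for each of its items. The data and the function name below are made up, and this only computes the measure; it is not a mining algorithm.

```python
def itemset_utility(itemset, transactions, weights):
    """Sum quantity * weight of the itemset's items over all transactions
    that contain every item of the itemset."""
    total = 0
    for quantities in transactions:
        if all(item in quantities for item in itemset):
            total += sum(quantities[item] * weights[item] for item in itemset)
    return total

# Hypothetical data: per-transaction quantities and a per-item weight.
transactions = [
    {"apple": 2, "banana": 1},
    {"apple": 1},
    {"apple": 3, "banana": 2},
]
weights = {"apple": 5, "banana": 2}

u_pair = itemset_utility({"apple", "banana"}, transactions, weights)
u_apple = itemset_utility({"apple"}, transactions, weights)
print(u_pair, u_apple)
```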
Difference between GSP and the General Apriori method
The GSP algorithm is an Apriori-based method with some enhancements. After reading several descriptions, I still could not figure out the enhancements brought by GSP compared to the general Apriori algorithm. Is it that the order of the itemsets is taken into account? Could you give me an example, as I am a newbie in data mining? Thank you in advance.
Apriori finds frequent itemsets in transactions. A transaction is just an unordered set of items. Apriori outputs patterns that are sets of items. GSP finds frequent sequential patterns in sequences. A sequence is an ordered list of transactions. GSP outputs patterns that are subsequences. If you want to try Apriori and GSP, you can get the Java source code in the SPMF open source data mining library.
The difference between the two is that Apriori is for itemset mining and GSP is for sequence mining. It is based on Apriori but takes into account the order of the items and thus finds sequences. Hence abc is different from cba for example.
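The abc versus cba distinction can be illustrated with the two containment tests involved. This is a simplified sketch: each element of a sequence is a single item here, whereas in GSP each element is itself an itemset; the function names are hypothetical.

```python
def contains_itemset(transaction, pattern):
    """Itemset mining (Apriori): order is irrelevant, only membership counts."""
    return set(pattern) <= set(transaction)

def contains_subsequence(sequence, pattern):
    """Sequence mining (GSP): pattern items must appear in the same order."""
    remaining = iter(sequence)
    # 'item in remaining' advances the iterator, so order is enforced.
    return all(item in remaining for item in pattern)

data = ["a", "b", "c"]
print(contains_itemset(data, ["c", "b", "a"]))      # True: same set of items
print(contains_subsequence(data, ["c", "b", "a"]))  # False: wrong order
print(contains_subsequence(data, ["a", "c"]))       # True: gaps are allowed
```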
GSP is an Apriori-based method for sequential pattern mining, like AprioriAll. GSP adds some features that address AprioriAll's limitations: time constraints, a sliding time window, and taxonomies. You can find a complete explanation here: http://www.philippe-fournier-viger.com/spmf/GSP96.pdf
Data Mining and Frequent Datasets
I've been doing some work for my exams in a few days and I'm going through some past papers, but unfortunately there are no corresponding answers. I've answered the question and I was wondering if someone could tell me if I am correct. My question is:
(c) A transactional dataset, T, is given below:
t1: Milk, Chicken, Beer
t2: Chicken, Cheese
t3: Cheese, Boots
t4: Cheese, Chicken, Beer
t5: Chicken, Beer, Clothes, Cheese, Milk
t6: Clothes, Beer, Milk
t7: Beer, Milk, Clothes
Assume that minimum support is 0.5 (minsup = 0.5).
(i) Find all frequent itemsets.
Here is how I worked it out:
Item : Amount
Milk : 4
Chicken : 4
Beer : 5
Cheese : 4
Boots : 1
Clothes : 3
Now because the minsup is 0.5 you eliminate boots and clothes and make a combo of the remaining, giving:
{items} : Amount
{Milk, Chicken} : 2
{Milk, Beer} : 4
{Milk, Cheese} : 1
{Chicken, Beer} : 3
{Chicken, Cheese} : 3
{Beer, Cheese} : 2
Which leaves milk and beer as the only frequent item set then, as it is the only one above the minsup?
I agree you should go for the Apriori algorithm. The Apriori algorithm is based on the idea that for a pair of items to be frequent, each individual item must also be frequent. If the hamburger-ketchup pair is frequent, the hamburger itself must also appear frequently in the baskets. The same can be said about the ketchup. So the algorithm uses a threshold X to define what is or is not frequent: if an item appears more than X times, it is considered frequent. The first step of the algorithm is to pass over each item in each basket and calculate its frequency (count how many times it appears). This can be done with a hash table of size N, where position y of the table holds the frequency of item y. If item y has a frequency greater than X, it is said to be frequent. In the second step of the algorithm, we iterate through the baskets again, computing the frequency of pairs. The catch is that we count only pairs whose items are individually frequent. So if item y and item z are each frequent on their own, we then compute the frequency of the pair. This condition greatly reduces the number of pairs to count and the amount of memory needed. Once this is calculated, the pairs with frequencies greater than the threshold are the frequent itemsets. (http://girlincomputerscience.blogspot.com.br/2013/01/frequent-itemset-problem-for-mapreduce.html)
There are two ways to solve the problem:
using the Apriori algorithm
using FP-growth
Assuming that you are using Apriori, the answer you got is correct. The algorithm is simple. First you count the 1-itemsets and exclude those below the minimum support. Then you count the 2-itemsets formed by combining the frequent items from the previous iteration, and again exclude those below the support threshold. The algorithm goes on until no itemsets reach the threshold. In the problem given to you, only one set of 2 items reaches the threshold, so you can't go further. There is a solved example with further steps on Wikipedia. You can refer to "Data Mining: Concepts and Techniques" by Han and Kamber for more examples.
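The level-wise procedure described above can be checked against the exam dataset with a short sketch. This is a plain, unoptimised illustration rather than a reference implementation; the function name apriori and its return format are choices made here.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    frequent = {}
    k = 1
    while level:
        frequent.update((s, support(s)) for s in level)
        k += 1
        prev = set(level)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        level = [c for c in candidates if support(c) >= minsup]
    return frequent

transactions = [
    {"Milk", "Chicken", "Beer"},
    {"Chicken", "Cheese"},
    {"Cheese", "Boots"},
    {"Cheese", "Chicken", "Beer"},
    {"Chicken", "Beer", "Clothes", "Cheese", "Milk"},
    {"Clothes", "Beer", "Milk"},
    {"Beer", "Milk", "Clothes"},
]
frequent = apriori(transactions, 0.5)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), round(frequent[itemset], 2))
```

With 7 transactions and minsup = 0.5, an itemset needs at least 4 occurrences. The complete list of frequent itemsets is therefore the four single items Milk, Chicken, Beer and Cheese, plus the pair {Milk, Beer}; no 3-itemset can be formed from a single frequent pair.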
OK, to start, you must first understand that data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Now, the amount of raw data stored in corporate databases is exploding. From trillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. (One terabyte = one trillion bytes. A terabyte is equivalent to about 2 million books!) For instance, every day, Wal-Mart uploads 20 million point-of-sale transactions to an AT&T massively parallel system with 483 processors running a centralized database. Raw data by itself, however, does not provide much information. In today's fiercely competitive business environment, companies need to rapidly turn these terabytes of raw data into significant insights into their customers and markets to guide their marketing, investment, and management strategies.
Now you must understand that association rule mining is an important model in data mining. Its mining algorithms discover all item associations (or rules) in the data that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. Minsup controls the minimum number of data cases that a rule must cover. Minconf controls the predictive strength of the rule. Since only one minsup is used for the whole database, the model implicitly assumes that all items in the data are of the same nature and/or have similar frequencies in the data.
This is, however, seldom the case in real-life applications. In many applications, some items appear very frequently in the data, while others rarely appear. If minsup is set too high, those rules that involve rare items will not be found. To find rules that involve both frequent and rare items, minsup has to be set very low. This may cause combinatorial explosion because those frequent items will be associated with one another in all possible ways. This dilemma is called the rare item problem. This paper proposes a novel technique to solve this problem. The technique allows the user to specify multiple minimum supports to reflect the natures of the items and their varied frequencies in the database. In rule mining, different rules may need to satisfy different minimum supports depending on what items are in the rules.
Given a set of transactions T (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf). I hope that once you understand the very basics of data mining, the answer to this question will become apparent.
Improving classification results with Weka J48 and Naive Bayes Multinomial classifiers
I have been using Weka’s J48 and Naive Bayes Multinomial (NBM) classifiers upon frequencies of keywords in RSS feeds to classify the feeds into target categories. For example, one of my .arff files contains the following data extracts:
@attribute Keyword_1_nasa_Frequency numeric
@attribute Keyword_2_fish_Frequency numeric
@attribute Keyword_3_kill_Frequency numeric
@attribute Keyword_4_show_Frequency numeric
…
@attribute RSSFeedCategoryDescription {BFE,FCL,F,M, NCA, SNT,S}
@data
0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,12,0,0,0,0,0,20,0,0,0,0,0,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
…
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,FCL
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,F
…
20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,78,0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,221,72,0,72,0,SNT
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,SNT
0,0,0,0,0,0,11,0,0,0,0,0,0,0,19,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,17,0,0,0,0,0,0,0,0,0,0,0,0,0,20,0,S
And so on: there’s a total of 570 rows, where each one contains the frequencies of the keywords in a feed for a day. In this case, there are 57 feeds for 10 days, giving a total of 570 records to be classified.
Each keyword is prefixed with a surrogate number and postfixed with ‘Frequency’. I am using 10-fold cross-validation for both the J48 and NBM classifiers on a 'black box' basis. Other parameters used are also defaults, i.e. 0.25 confidence and a minimum number of objects of 2 for J48. So far, across varying numbers of days, date ranges and actual keyword frequencies, my classification rates with both J48 and NBM have been consistently in the 50 - 60% range. But I would like to improve this if possible. I have reduced the decision tree confidence level, sometimes as low as 0.1, but the improvements are very marginal. Can anyone suggest any other way of improving my results?
To give more information, the basic process here involves a diverse collection of RSS feeds where each one belongs to a single category. For a given date range, e.g. 01 - 10 Sep 2011, the text of each feed's item elements is combined. The text is then validated to remove words with numbers, accents and so on, and stop words (a list of 500 stop words from MySQL is used). The remaining text is then indexed in Lucene to work out the most popular 64 words. Each of these 64 words is then searched for in the description elements of the feeds for each day within the given date range. As part of this, the description text is also validated in the same way as the title text and again indexed by Lucene. So a popular keyword from the title such as 'declines' is stemmed to 'declin': then if any similar words are found in the description elements which also stem to 'declin', such as 'declined', the frequency for 'declin' is taken from Lucene's indexing of the word from the description elements. The frequencies shown in the .arff file match on this basis, i.e. on the first line above, 'nasa', 'fish', 'kill' are not found in the description items of a particular feed in the BFE category for that day, but 'show' is found 34 times.
Each line represents occurrences in the description items of a feed for a day for all 64 keywords. So I think that the low frequencies are not due to stemming. Rather, I see them as the inevitable result of some keywords being popular in feeds of one category but not appearing in other feeds at all. Hence the sparseness shown in the results. Generic keywords may also be pertinent here. The other possibilities are differences in the numbers of feeds per category, where more feeds are in categories like NCA than S, or that the keyword selection process itself is at fault.
You don't mention anything about stemming. In my opinion you could get better results if you performed word stemming and the WEKA evaluation was based on the keyword stems. For example, suppose that your WEKA model is built with the keyword surfing and a new RSS feed contains the word surf. There should be a match between these two words. There are many freely available stemmers for several languages. For the English language, some available options for stemming are:
The Porter stemmer
Stemming based on the WordNet dictionary
In case you would like to perform stemming using the WordNet dictionary, there are libraries and frameworks that integrate with WordNet. Below you can find some of them:
MIT Java WordNet Interface (JWI)
RiTa
Java WordNet Library (JWNL)
EDITED after more information was provided
I believe that the key point in the specified case is the selection of the "most popular 64 words". The selected words or phrases should be keywords or keyphrases. So the challenge here is keyword or keyphrase extraction. There are several books, papers and algorithms written about keyword/keyphrase extraction. The University of Waikato has implemented in Java a famous algorithm called the Keyphrase Extraction Algorithm (KEA). KEA extracts keyphrases from text documents and can be used either for free indexing or for indexing with a controlled vocabulary. The implementation is distributed under the GNU General Public License. Another issue that should be taken into consideration is part-of-speech (POS) tagging. Nouns carry more information than the other POS tags. Therefore you may get better results if you check the POS tags and make sure the selected 64 words are mostly nouns.
In addition, according to Anette Hulth's published paper Improved Automatic Keyword Extraction Given More Linguistic Knowledge, her experiments showed that keywords/keyphrases mostly have, or are contained in, one of the following five patterns:
ADJECTIVE NOUN (singular or mass)
NOUN NOUN (both sing. or mass)
ADJECTIVE NOUN (plural)
NOUN (sing. or mass) NOUN (pl.)
NOUN (sing. or mass)
In conclusion, a simple action that in my opinion could improve your results is to find the POS tag for each word and select mostly nouns in order to evaluate the new RSS feeds. You can use WordNet to find the POS tag for each word and, as I mentioned above, there are many libraries on the web that integrate with the WordNet dictionary. Of course stemming is also essential for the classification process and has to be maintained. I hope this helps.
Try turning off stemming altogether. The Stanford Intro to IR authors provide a rough justification of why stemming hurts, and at the very least does not help, in text classification contexts. I have tested stemming myself on a custom multinomial naive Bayes text classification tool (I get accuracies of 85%). I tried the 3 Lucene stemmers available from org.apache.lucene.analysis.en version 4.4.0, which are EnglishMinimalStemFilter, KStemFilter and PorterStemFilter, plus no stemming, and I did the tests on small and larger training document corpora. Stemming significantly degraded classification accuracy when the training corpus was small, and left accuracy unchanged for the larger corpus, which is consistent with the Intro to IR statements.
Some more things to try:
Why only 64 words? I would increase that number by a lot, but preferably you would not have a limit at all.
Try tf-idf (term frequency, inverse document frequency). What you're using now is just tf. If you multiply this by idf you can mitigate problems arising from common and uninformative words like "show". This is especially important given that you're using so few top words.
Increase the size of the training corpus.
Try shingling to bi-grams, tri-grams, etc., and combinations of different N-grams (you're now using just unigrams).
There's a bunch of other knobs you could turn, but I would start with these. You should be able to do a lot better than 60%. 80% to 90% or better is common.
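The tf-idf suggestion can be sketched as follows. The three rows use the 'nasa' and 'show' counts from the first two BFE rows and the first SNT row of the question's extract; treating each row as a document, the idf variant log(N / df) and the function name tf_idf are assumptions made for illustration.

```python
import math

def tf_idf(tf_rows):
    """Weight raw term counts by idf = log(N / df), column by column."""
    n_docs = len(tf_rows)
    n_terms = len(tf_rows[0])
    # Document frequency: in how many rows does each term occur at all?
    df = [sum(1 for row in tf_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[tf * w for tf, w in zip(row, idf)] for row in tf_rows]

terms = ["nasa", "show"]
tf_rows = [
    [0, 34],   # BFE feed, day 1
    [0, 12],   # BFE feed, day 2
    [20, 19],  # SNT feed
]
weighted = tf_idf(tf_rows)
for row in weighted:
    print([round(value, 2) for value in row])
```

Because "show" occurs in every row, its idf is log(1) = 0 and it stops contributing, while "nasa" keeps a positive weight in the one feed where it occurs.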