how to identify which cluster belongs to which emotion - weka

I am new to Weka. I need to identify a set of emotions in blog documents using Weka's clustering methods. For emotion detection I use several feature values, each represented as an attribute. For example, my data set looks like this:
@relation emotion
@attribute pos real -> total number of times each part of speech (noun, verb, adjective, adverb) occurs in the document / total number of words in the document
@attribute Positive_Words real -> count of positive words in the document / total number of words in the document
@attribute Negative_Words real -> count of negative words in the document / total number of words in the document
@attribute Emotion_Words real -> count of emotion words in the document / total number of words in the document
@attribute First_Sent_Weight real -> weight given to the first sentence of each blog / total number of sentences in the document
@data
0.4, 0.24, 0.43, 0.32, 0.65
0.32, 0.5, 0.74, 0.8, 0.43
I have 5000 instances, one per blog document, each holding the feature values computed for that document. These instances are passed to the K-means clustering algorithm in Weka, and 6 clusters are generated. My question is how to identify which cluster belongs to which emotion. Please suggest any ideas. Thanks in advance.
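For reference, the lexicon-based ratio attributes described above could be computed along these lines. This is only a minimal sketch in Python: whitespace tokenization, the `ratio_features` name and the tiny word lists are assumptions made for illustration, not part of Weka.

```python
def ratio_features(text, positive, negative, emotion):
    """Return the lexicon-based ratio attributes for one document."""
    tokens = text.lower().split()
    total = len(tokens)

    def ratio(lexicon):
        # Share of tokens that appear in the given word list.
        return sum(1 for t in tokens if t in lexicon) / total if total else 0.0

    return {
        "Positive_Words": ratio(positive),
        "Negative_Words": ratio(negative),
        "Emotion_Words": ratio(emotion),
    }

# Hypothetical word lists; real ones would come from an emotion lexicon.
features = ratio_features(
    "I am happy but a bit sad today",
    positive={"happy"},
    negative={"sad"},
    emotion={"happy", "sad"},
)
print(features)
```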

Related

AutoML VISION Google SingleLabel Classification output TopK results

Currently the AutoML Vision API outputs a single label with its score.
For example:
I trained the model with 3 classes:
A
B
C
Then, when I use Test & Use and upload another image, I get only:
[CURRENT OUTPUT]
Class A and 0.988437 / 0.99
Is there a way to get this type of output with the top K classes (for example the top 3, k=3)?
[DESIRED OUTPUT]
Class A and 0.988437 / 0.99
Class C and 0.3551 / 0.36
Class B and 0.1201 / 0.12
Sorted based on their Score.
Thanks in Advance.
Single-label classification assigns a single label to each classified image and it returns only one predicted class.
Multi-label is more suited for your use case as it allows an image to be assigned multiple labels.
In the UI (which is what you seem to be using) you can specify the type of classification you want your custom model to perform when you create your dataset.
If, for any reason, you would like the option to get the scores of all (or the top k) predicted classes with single-label classification, I suggest that you raise a Feature Request.
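If you do end up with a score for every class (for example from a multi-label model), producing the sorted top-k list is just a sort on your side. A minimal sketch, assuming the scores have already been collected into a plain dict; the `top_k` helper and the dict layout are illustrative and not part of the AutoML Vision API:

```python
def top_k(scores, k=3):
    """Return the k (label, score) pairs with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical per-class scores, using the numbers from the question.
scores = {"A": 0.988437, "B": 0.1201, "C": 0.3551}
for label, score in top_k(scores, k=3):
    print(f"Class {label} and {score} / {score:.2f}")
```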

supervised keyphrase extraction weka or other tool

How can I use WEKA to find keyphrases with a supervised method?
I have to train a model for keyphrase extraction, so I have a training corpus (for every document, a corresponding file that contains its keyphrases or keywords).
I also have a corpus for testing the supervised model (documents without keyphrase files), so the model should output a list of keyphrases for every document.
My question is how to input the documents into Weka. Should I add a row for every document, like this?
@attribute doc string
@data
"Docu1............"
"Docu2............"
...
..
"DocuN............"
Now, how do I input the files that contain the keyphrases for every document, so that the model can learn from them?
First you need to choose which features you want to use. The most basic algorithm is based only on tf-idf values:
https://code.google.com/p/kea-algorithm/
But you can extend these features with your own task-specific features too,
for example the first occurrence of the phrase. You can find some possible features in this article: http://www.aclweb.org/anthology/S/S10/S10-1040.pdf
Then you have to choose a machine learning algorithm, train it on your training set, and evaluate it on your test set.
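One way to bring the keyphrase files into the picture is to turn every candidate phrase of a document into one training instance: its feature values plus a yes/no class saying whether the phrase appears in that document's keyphrase file. A minimal sketch in plain Python, assuming whitespace tokenization and candidates of one or two words; the function names and the two features (relative frequency and relative first occurrence) are illustrative choices, not Weka or KEA API:

```python
def candidates(tokens, max_len=2):
    """Yield (phrase, start_index) for every n-gram of up to max_len tokens."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n]), i

def build_instances(text, keyphrases):
    """One row per distinct candidate: (phrase, rel_freq, first_occurrence, label)."""
    tokens = text.lower().split()
    gold = {k.lower() for k in keyphrases}
    counts, first = {}, {}
    for phrase, start in candidates(tokens):
        counts[phrase] = counts.get(phrase, 0) + 1
        first.setdefault(phrase, start)  # keep the earliest position only
    rows = []
    for phrase in counts:
        rel_freq = counts[phrase] / len(tokens)
        first_occ = first[phrase] / len(tokens)
        label = "yes" if phrase in gold else "no"
        rows.append((phrase, rel_freq, first_occ, label))
    return rows

# Toy document and its (hypothetical) keyphrase file contents.
rows = build_instances("weka clustering groups documents with weka", ["weka", "clustering"])
for row in rows:
    print(row)
```

Rows like these can then be written out as an ARFF file with two numeric attributes and a {yes,no} class. The tf-idf weighting mentioned above would replace the plain relative frequency once document frequencies over the whole corpus are available.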

WEKA - weather prediction

I am pretty new to the concepts of machine learning and clustering. I have installed Weka and am trying to figure out how it works. Currently, I have my training data as below.
@relation weather
@attribute year real
@attribute temperature real
@attribute warmer {yes,no}
@data
1956 , 68.98585 , yes
1957 , 67.52131 , yes
1958 , 65.853386 , no
1959 , 66.32705 , yes
1960 , 65.89773 , no
So, I am trying to build a model which should predict if it is getting warmer each and every year.
If I have to predict if 1961 is warmer or cooler, should I provide my test data like below?
@relation weather
@attribute year real
@attribute temperature real
@data
1961 , 70.98585
I have removed the column warmer, which I want to predict using the training set I provided earlier. I can use any algorithm that Weka provides (J48, BayesNet etc). Can someone please help me understand the concepts?
You don't need to make the training and test sets yourself, Weka will do that for you. Even if you do, don't delete the value to predict from the test set -- Weka will make sure that everything happens properly, but needs the actual value to determine whether a prediction is correct or not and tell you how your model performs.
Your problem is a classification problem, i.e. you want to predict the label "yes" or "no". Not all of the algorithms in Weka are applicable, but the ones that are not are greyed out (if you use the GUI).
On a more general note, you're unlikely to get good results with the data that you have. This is more of a time series prediction task (i.e. given these past values, how will it develop in the future), for which Weka doesn't really offer the algorithms. You can find some more information on Wikipedia.
To get better models with Weka, you could add the temperature value from the previous year (or the previous 2 years) as a feature, but ultimately it sounds like you want to use something that can do time series analysis and predictions.
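Adding the previous year's temperature as a feature could look roughly like this. It is a minimal sketch that assumes the rows are sorted by year with no gaps; the `add_lags` helper and the `temperature_prev1` attribute name are made up for illustration, and the labels are simply carried over from the question.

```python
rows = [
    (1956, 68.98585, "yes"),
    (1957, 67.52131, "yes"),
    (1958, 65.853386, "no"),
    (1959, 66.32705, "yes"),
    (1960, 65.89773, "no"),
]

def add_lags(rows, n_lags=1):
    """Insert the previous n_lags temperatures into each row.

    Rows that do not have enough history (the first n_lags) are dropped.
    """
    out = []
    for i in range(n_lags, len(rows)):
        year, temp, label = rows[i]
        lags = [rows[i - j][1] for j in range(1, n_lags + 1)]
        out.append((year, temp, *lags, label))
    return out

# Print the lagged data as ARFF so it can be loaded into Weka.
print("@relation weather_lagged")
print("@attribute year real")
print("@attribute temperature real")
print("@attribute temperature_prev1 real")
print("@attribute warmer {yes,no}")
print("@data")
for row in add_lags(rows, n_lags=1):
    print(",".join(str(v) for v in row))
```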

Weka : training and test set are not compatible

Each row of my training and test datasets has intensity values for the pixels in an image, with the last column holding the label that tells which digit is represented in the image; the label can be any number from 0 to 9 in the training set and is always ? in the test set. I loaded the training dataset in Weka Explorer, passed the data through the NumericToNominal filter and used the RemovePercentage filter to split the data in a 70-30 ratio, with the 30% file being used as a cross-validation set. I built a classifier and saved the model.
Then I loaded the test data, which has ? as the label for each row, applied the NumericToNominal filter and saved it as an ARFF file. Now, when I load the test data and try to use the model against it, I always get the error message "training and test set are not compatible". Both datasets have undergone the same processing. What could possibly have gone wrong?
As you can read in the ARFF manual (http://www.cs.waikato.ac.nz/ml/weka/arff.html):
Nominal values are defined by providing an <nominal-specification> listing the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}
For example, the class value of the Iris dataset can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
So when you apply NumericToNominal to your test file separately, one or more attributes can end up with a different set of possible values in the train and test ARFF files. It really can happen, and it has bothered me many times. One solution is to check your ARFF files manually (if they are not too big), or just copy the attribute declarations from the header of the train file, e.g.
@attribute 'My first binary attribute' {0,1}
(...)
@attribute 'My last binary attribute' {0,1}
into the test file. That should work.
Alternatively, you can use batch filtering, which applies the filter to the train and test files together so that both end up with the same header.
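For larger files, the manual check can be scripted. The following is a minimal sketch in Python that only looks at the `@attribute` lines of two ARFF headers and reports declarations that differ; the helper names and the toy headers are made up for illustration, and it does not check attribute order, which also has to match.

```python
import re

ATTR = re.compile(r"""@attribute\s+('[^']*'|"[^"]*"|\S+)\s+(.*)""", re.IGNORECASE)

def read_attributes(arff_text):
    """Return {attribute_name: type or tuple of nominal values} from an ARFF header."""
    attrs = {}
    for line in arff_text.splitlines():
        line = line.strip()
        if line.lower().startswith("@data"):
            break  # the header ends where the data section starts
        m = ATTR.match(line)
        if m:
            name = m.group(1).strip("'\"")
            spec = m.group(2).strip()
            if spec.startswith("{"):
                spec = tuple(v.strip() for v in spec.strip("{}").split(","))
            attrs[name] = spec
    return attrs

def header_differences(train_text, test_text):
    """List (name, train_spec, test_spec) for attributes that are missing or differ."""
    train, test = read_attributes(train_text), read_attributes(test_text)
    diffs = [(n, train[n], test.get(n)) for n in train if train[n] != test.get(n)]
    diffs += [(n, None, test[n]) for n in test if n not in train]
    return diffs

# Toy headers: pixel1 never takes the value 1 in the test file.
train = "@relation digits\n@attribute pixel1 {0,1,255}\n@attribute label {0,1,2}\n@data\n0,1\n"
test = "@relation digits\n@attribute pixel1 {0,255}\n@attribute label {0,1,2}\n@data\n0,?\n"
for name, in_train, in_test in header_differences(train, test):
    print(name, in_train, in_test)
```

With real files you would pass the contents of the two ARFF files instead of the inline strings.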

Improving classification results with Weka J48 and Naive Bayes Multinomial classifiers

I have been using Weka’s J48 and Naive Bayes Multinomial (NBM) classifiers on
frequencies of keywords in RSS feeds to classify the feeds into target
categories.
For example, one of my .arff files contains the following data extracts:
@attribute Keyword_1_nasa_Frequency numeric
@attribute Keyword_2_fish_Frequency numeric
@attribute Keyword_3_kill_Frequency numeric
@attribute Keyword_4_show_Frequency numeric
…
@attribute RSSFeedCategoryDescription {BFE,FCL,F,M,NCA,SNT,S}
@data
0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,12,0,0,0,0,0,20,0,0,0,0,0,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
…
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,FCL
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,F
…
20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,7
8,0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,
221,72,0,72,0,SNT
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,SNT
0,0,0,0,0,0,11,0,0,0,0,0,0,0,19,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,0,0,0,0,0,0
,0,0,0,0,0,0,0,0,0,17,0,0,0,0,0,0,0,0,0,0,0,0,0,20,0,S
And so on: there is a total of 570 rows, where each one contains the
frequencies of the keywords in a feed for a day. In this case, there are 57 feeds for
10 days giving a total of 570 records to be classified. Each keyword is prefixed
with a surrogate number and postfixed with ‘Frequency’.
I am using 10-fold cross-validation for both the J48 and NBM classifiers on a
'black box' basis. Other parameters used are also defaults, i.e. 0.25 confidence
and a minimum number of objects of 2 for J48.
So far, across varying numbers of days, date ranges and actual keyword
frequencies, my classification rates with both J48 and NBM have been
consistently in the 50 - 60% range. But I would like to improve this if possible.
I have reduced the decision tree confidence level, sometimes as low as 0.1 but
the improvements are very marginal.
Can anyone suggest any other way of improving my results?
To give more information, the basic process here involves a diverse collection of RSS feeds where each one belongs to a single category.
For a given date range, e.g. 01 - 10 Sep 2011, the text of each feed's item elements are combined. The text is then validated to remove words with numbers, accents and so on, and stop words (a list of 500 stop words from MySQL is used). The remaining text is then indexed in Lucene to work out the most popular 64 words.
Each of these 64 words is then searched for in the description elements of the feeds for each day within the given date range. As part of this, the description text is also validated in the same way as the title text and again indexed by Lucene. So a popular keyword from the title such as 'declines' is stemmed to 'declin': then if any similar words are found in the description elements which also stem to 'declin', such as 'declined', the frequency for 'declin' is taken from Lucene's indexing of the word from the description elements.
The frequencies shown in the .arff file match on this basis, i.e. on the first line above, 'nasa', 'fish', 'kill' are not found in the description items of a particular feed in the BFE category for that day, but 'show' is found 34 times. Each line represents occurrences in the description items of a feed for a day for all 64 keywords.
So I think that the low frequencies are not due to stemming. Rather, I see them as the inevitable result of some keywords being popular in feeds of one category but not appearing in other feeds at all. Hence the sparseness shown in the results. Generic keywords may also be pertinent here.
The other possibilities are differences in the numbers of feeds per category where more feeds are in categories like NCA than S, or the keyword selection process itself is at fault.
You don't mention anything about stemming. In my opinion you could get better results if you performed word stemming and the WEKA evaluation was based on the keyword stems.
For example, let's suppose that your WEKA model is built with the keyword "surfing" and a new RSS feed contains the word "surf". There should be a match between these two words.
There are many freely available stemmers for several languages.
For the English language some available options for stemming are:
The Porter stemmer
Stemming based on the WordNet dictionary
In case you would like to perform stemming using the WordNet dictionary, there are libraries and frameworks that integrate with WordNet.
Below you can find some of them:
MIT Java WordNet interface (JWI)
Rita
Java WordNet Library (JWNL)
EDITED after more information was provided
I believe that the key point in this case is the selection of the "most popular 64 words". The selected words or phrases should be keywords or keyphrases, so the challenge here is keyword or keyphrase extraction.
There are several books, papers and algorithms on keyword/keyphrase extraction. The University of Waikato has implemented in Java a well-known algorithm called the Keyphrase Extraction Algorithm (KEA). KEA extracts keyphrases from text documents and can be used either for free indexing or for indexing with a controlled vocabulary. The implementation is distributed under the GNU General Public License.
Another issue that should be taken into consideration is part-of-speech (POS) tagging. Nouns carry more information than the other POS tags. Therefore you may get better results if you check the POS tags and make sure the selected 64 words are mostly nouns.
In addition, according to Anette Hulth's paper Improved Automatic Keyword Extraction Given More Linguistic Knowledge, her experiments showed that keywords/keyphrases mostly match or are contained in one of the following five patterns:
ADJECTIVE NOUN (singular or mass)
NOUN NOUN (both sing. or mass)
ADJECTIVE NOUN (plural)
NOUN (sing. or mass) NOUN (pl.)
NOUN (sing. or mass)
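Once the words are tagged, the five patterns above can be checked mechanically. A minimal sketch in Python, assuming the tokens already carry Penn Treebank style tags (JJ for adjective, NN for singular or mass noun, NNS for plural noun); the tag mapping, the `matching_phrases` name and the hand-tagged example are assumptions for illustration:

```python
# The five patterns, written as tag sequences (assumed Penn Treebank tags).
PATTERNS = [
    ("JJ", "NN"),    # ADJECTIVE NOUN (singular or mass)
    ("NN", "NN"),    # NOUN NOUN (both sing. or mass)
    ("JJ", "NNS"),   # ADJECTIVE NOUN (plural)
    ("NN", "NNS"),   # NOUN (sing. or mass) NOUN (pl.)
    ("NN",),         # NOUN (sing. or mass)
]

def matching_phrases(tagged):
    """Return every token span whose tag sequence equals one of PATTERNS."""
    found = []
    for i in range(len(tagged)):
        for pattern in PATTERNS:
            window = tagged[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                found.append(" ".join(word for word, _ in window))
    return found

# Hand-tagged input; in practice the tags would come from a POS tagger.
tagged = [
    ("solar", "JJ"), ("flare", "NN"), ("hits", "VBZ"),
    ("weather", "NN"), ("satellites", "NNS"),
]
print(matching_phrases(tagged))
```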
In conclusion, a simple action that in my opinion could improve your results is to find the POS tag for each word and select mostly nouns in order to evaluate the new RSS feeds. You can use WordNet to find the POS tag for each word and, as I mentioned above, there are many libraries on the web that integrate with the WordNet dictionary. Of course, stemming is also essential for the classification process and has to be kept.
I hope this helps.
Try turning off stemming altogether. The Stanford Intro to IR authors provide a rough justification of why stemming hurts, and at the very least does not help, in text classification contexts.
I have tested stemming myself on a custom multinomial naive Bayes text classification tool (I get accuracies of 85%). I tried the 3 Lucene stemmers available from org.apache.lucene.analysis.en version 4.4.0, which are EnglishMinimalStemFilter, KStemFilter and PorterStemFilter, plus no stemming, and I did the tests on small and larger training document corpora. Stemming significantly degraded classification accuracy when the training corpus was small, and left accuracy unchanged for the larger corpus, which is consistent with the Intro to IR statements.
Some more things to try:
Why only 64 words? I would increase that number by a lot, but preferably you would not have a limit at all.
Try tf-idf (term frequency, inverse document frequency). What you're using now is just tf. If you multiply this by idf you can mitigate problems arising from common and uninformative words like "show". This is especially important given that you're using so few top words.
Increase the size of the training corpus.
Try shingling to bi-grams, tri-grams, etc, and combinations of different N-grams (you're now using just unigrams).
There's a bunch of other knobs you could turn, but I would start with these. You should be able to do a lot better than 60%. 80% to 90% or better is common.
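To make the tf-idf and shingling suggestions concrete, here is a minimal sketch in plain Python. It assumes the raw keyword counts are available as rows of numbers (like the rows in the .arff file) and uses the simple unsmoothed weighting tf * log(N / df); the function names and the toy counts are made up for illustration.

```python
import math

def bigrams(tokens):
    """Shingle a token list into adjacent word pairs."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tf_idf(count_rows):
    """Reweight raw count vectors as tf * log(N / df).

    df is the number of rows in which the term occurs at least once.
    """
    n_rows = len(count_rows)
    n_terms = len(count_rows[0])
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_rows / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in count_rows]

# Toy counts: column 0 is a word that occurs in every row (like "show"),
# column 1 is a word that occurs in a single row only.
counts = [[34, 0], [12, 0], [10, 7]]
weighted = tf_idf(counts)
print(weighted)
print(bigrams(["nasa", "probe", "lands"]))
```

With this unsmoothed variant a word that occurs in every row gets weight zero, which is the point: it carries no information about the category. Smoothed idf variants are also common if dropping such words entirely is too aggressive.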