Output confusion matrix in Weka from command line - weka

I've saved a random forest model to a file using Weka 3.7.9, and I'm now trying to evaluate it against another (very big) test set on some large Amazon EC2 machines. I'm using the following command line:
> java -server -Xmx60000m -cp weka.jar weka.classifiers.Evaluation
weka.classifiers.trees.RandomForest -T test.arff -l random-forest.model
-i -no-cv
However, the only output I have is something like this:
=== Error on test data ===
Correctly Classified Instances 3252532 80.0686 %
Incorrectly Classified Instances 809651 19.9314 %
Kappa statistic 0.2884
Mean absolute error 0.2539
Root mean squared error 0.3608
Coverage of cases (0.95 level) 98.7413 %
Total Number of Instances 4062183
However, I'm also looking for output like this:
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.804 0.295 0.731 0.804 0.766 0.512 0.826 0.803 buyer
0.705 0.196 0.783 0.705 0.742 0.512 0.826 0.798 non-buyer
Weighted Avg. 0.755 0.245 0.757 0.755 0.754 0.512 0.826 0.801
=== Confusion Matrix ===
a b <-- classified as
61728 15004 | a = buyer
22662 54066 | b = non-buyer
Please note that, even if I run the full training method again, like this:
> java -Xmx60000m -cp weka.jar weka.classifiers.Evaluation
weka.classifiers.trees.RandomForest -t train.arff -T test.arff
-I 10 -K 0 -S 1 -num-slots 8 -d random-forest.model -i -no-cv
it still doesn't show the confusion matrix for the test data (only for the training data).

It works when you omit the -no-cv option.
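If you need the per-class statistics and the confusion matrix regardless of the command-line options, you can also drive the evaluation programmatically through Weka's Evaluation API: toClassDetailsString() and toMatrixString() produce the "Detailed Accuracy By Class" and "Confusion Matrix" sections you are after. Below is a minimal sketch, assuming the model was serialized (e.g. via -d) to random-forest.model, test.arff has the class as its last attribute, and the file names match the ones in the question.
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateSavedModel {
    public static void main(String[] args) throws Exception {
        // Load the test set and mark the last attribute as the class
        Instances test = DataSource.read("test.arff");
        test.setClassIndex(test.numAttributes() - 1);

        // Load the previously serialized random forest
        Classifier model = (Classifier) SerializationHelper.read("random-forest.model");

        // Evaluate the loaded model on the test set
        Evaluation eval = new Evaluation(test);
        eval.evaluateModel(model, test);

        // Print summary, per-class details, and the confusion matrix
        System.out.println(eval.toSummaryString("=== Error on test data ===\n", false));
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}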

Related

Weka j48 output

I'm confused about the numbers at the end of the branches of a J48 tree. For example, using the weather.nominal data, the tree looks the same whether the Test options are set to Use training set, Cross-validation, or Percentage split.
This is the output:
J48 pruned tree
------------------
outlook = sunny
| humidity = high: no (3.0)
| humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)
According to the textbook by the authors of this software, in an example using this exact data they say, "In the tree structure, a colon introduces the class label that has been assigned to a particular leaf, followed by the number of instances that reach that leaf, expressed as a decimal number because of the way the algorithm uses fractional instances to handle missing values. If there were incorrectly classified instances (there aren’t in this example) their number would appear, too: thus 2.0/1.0 means that two instances reached that leaf, of which one is classified incorrectly"
So this means that no instances were incorrectly classified in the above tree with the weather.nominal dataset.
On the other hand, when the test options are set to either 'Use training set' or 'Percentage split' (with the default random seed), there are incorrectly classified instances. For example, with a 60% split, it shows the following:
=== Evaluation on test split ===
=== Summary ===
Correctly Classified Instances 2 40 %
Incorrectly Classified Instances 3 60 %
There seems to be a contradiction here but I must be missing something. Is the tree shown initially not the tree that is built with the 60 percentage split?
That is not stated anywhere as far as I have seen but I can't think of any other explanation.
Just for completeness, the data is here:
outlook,temperature,humidity,windy,play
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
If you take a closer look at the output, you will see the following:
=== Classifier model (full training set) ===
The model that is being depicted there is the model that was trained on the full dataset, not your split.
The next section has the following heading:
=== Evaluation on test split ===
The statistics that you are referring to come from a different model: one trained on the 60% training portion of your split and evaluated on the remaining 40%.
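If you want to see the tree that is actually built on the 60% training portion (rather than the one built on the full dataset), you can reproduce the percentage split programmatically. This is a minimal sketch, assuming the data is in weather.nominal.arff and mimicking the Explorer's behaviour of randomizing before splitting; the seed and file name are placeholders.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplitTree {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Randomize and split 60/40, as the Explorer does for "Percentage split"
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.60);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);
        System.out.println(tree);   // the tree built on the 60% split, not on the full set

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());
    }
}
Comparing this printed tree with the one shown under "Classifier model (full training set)" makes the difference between the two models visible.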

Using rrdtool to monitor few servers

Please help me understand. I found a simple script on the official site for updating an RRDtool database, but I need a single RRD database covering all of my servers. What is the best approach, and how should I go about it? Should every server send its data to the RRDtool host, which then updates the database, or should the RRDtool host poll all of the servers and do the updates locally?
#!/bin/sh
a=0
while [ "$a" = 0 ]; do
    # Query per-process memory usage on the target host via SNMP
    snmpwalk -c public 192.168.1.250 hrSWRunPerfMem > snmp_reply

    # Sum the per-process values (reported in KBytes) to get total memory used
    total_mem=`awk 'BEGIN {tot_mem=0}
                    { if ($NF == "KBytes")
                        {tot_mem=tot_mem+$(NF-1)}
                    }
                    END {print tot_mem}' snmp_reply`

    # N lets rrdtool substitute the current time as the update timestamp
    rrdtool update target.rrd N:$total_mem

    # sleep until the next 300-second boundary
    perl -e 'sleep 300 - time % 300'
done # end of while loop

Weka Document Clustering: Doc ID not visible in the output

I have to crawl Wikipedia to get HTML pages of countries. I have successfully crawled. Now to build clusters, I have to do KMeans. I am using Weka for that.
I have used this code to convert my directory into arff format:
https://weka.wikispaces.com/file/view/TextDirectoryToArff.java
Here is its output (screenshot omitted).
Then I opened that file in Weka and applied the StringToWordVector filter (the parameter screenshot is omitted here; the settings appear in the Run information below).
Then I performed KMeans. The output I am getting is:
=== Run information ===
Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 5000 -S 10
Relation: text_files_in_files-weka.filters.unsupervised.attribute.StringToWordVector-R1,2-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"-weka.filters.unsupervised.attribute.StringToWordVector-R-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"
Instances: 28
Attributes: 1040
[list of attributes omitted]
Test mode:evaluate on training data
=== Model and evaluation on training set ===
kMeans
Number of iterations: 2
Within cluster sum of squared errors: 1915.0448503841326
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(28) (22) (6)
====================================================================================
...
bolsheviks 0.3652 0.3044 0.5878
book 0.3229 0.3051 0.3883
border 0.4329 0.5509 0
border-left-style 0.4329 0.5509 0
border-left-width 0.3375 0.4295 0
border-spacing 0.3124 0.3304 0.2461
border-width 0.5128 0.2785 1.372
boundary 0.309 0.3007 0.3392
brazil 0.381 0.3744 0.4048
british 0.4387 0.2232 1.2288
brown 0.2645 0.2945 0.1545
cache-control=max-age=87840 0.4913 0.4866 0.5083
california 0.5383 0.5085 0.6478
called 0.4853 0.6177 0
camp 0.4591 0.5451 0.1437
canada 0.3176 0.3358 0.251
canadian 0.2976 0.1691 0.7688
capable 0.2475 0.315 0
capita 0.388 0.1188 1.375
carbon 0.3889 0.445 0.1834
caribbean 0.4275 0.5441 0
carlsbad 0.548 0.5339 0.5998
caspian 0.4737 0.5345 0.2507
category 0.2216 0.2821 0
censorship 0.2225 0.0761 0.7596
center 0.4829 0.4074 0.7598
central 0.211 0.0805 0.6898
century 0.2645 0.2041 0.4862
chad 0.3636 0.0979 1.3382
challenger 0.5008 0.6374 0
championship 0.6834 0.8697 0
championships 0.2891 0.1171 0.9197
characteristics 0.237 0 1.1062
charon 0.5643 0.4745 0.8934
china
...
Time taken to build model (full training data) : 0.05 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 22 ( 79%)
1 6 ( 21%)
How can I check which DocID is in which cluster? I have searched a lot but didn't find anything.
Also, is there any other good Java library for KMeans and agglomerative clustering?
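One way to map documents back to clusters is to keep the unfiltered instances around and cluster a filtered copy, relying on the fact that Weka filters preserve instance order. The following is a minimal sketch under these assumptions: the ARFF produced by TextDirectoryToArff contains the document text as a string attribute plus an identifying attribute (here hypothetically named "filename"), and StringToWordVector leaves only attributes the clusterer can handle. File names, attribute names, and parameter values are placeholders.
import weka.clusterers.SimpleKMeans;
import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class DocClusterLookup {
    public static void main(String[] args) throws Exception {
        // Unfiltered data straight from TextDirectoryToArff (path is a placeholder)
        Instances raw = DataSource.read("text_files.arff");

        // Vectorize the text; the original strings are replaced by word counts,
        // but the instance order is preserved by the filter
        StringToWordVector stwv = new StringToWordVector();
        stwv.setWordsToKeep(1000);
        stwv.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, stwv);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);
        km.buildClusterer(vectors);

        // Look up the original document for each clustered instance by index
        Attribute id = raw.attribute("filename");   // hypothetical id attribute
        for (int i = 0; i < vectors.numInstances(); i++) {
            int cluster = km.clusterInstance(vectors.instance(i));
            String doc = (id != null) ? raw.instance(i).stringValue(id) : ("instance " + i);
            System.out.println(doc + " -> cluster " + cluster);
        }
    }
}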

How can we use clustering results in weka ?

I am using Weka for my internship, but I have only a little knowledge of data mining. Maybe someone knows how I can apply the following results to my data sets to get all the data grouped by cluster? The method I use now is to compute the distance between each instance's attributes and the mean value of each cluster and then assign it to the nearest cluster, but this method is too rough for me.
=== Run information ===
Scheme:weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: wcet_cluster6 - Copie-weka.filters.unsupervised.attribute.Remove-R1-3,5-weka.filters.unsupervised.attribute.Remove-R5-12
Instances: 467
Attributes: 4
max
alt
stmt
bb
Test mode:evaluate on training data
=== Model and evaluation on training set ===
EM
Number of clusters selected by cross validation: 6
Cluster
Attribute 0 1 2 3 4 5
(0.28) (0.11) (0.25) (0.16) (0.04) (0.17)
==================================================================
max
mean 9.0148 10.9112 11.2826 10.4329 11.2039 10.0546
std. dev. 1.8418 2.7775 3.0263 2.5743 2.2014 2.4614
alt
mean 0.0003 19.6467 0.4867 2.4565 44.191 8.0635
std. dev. 0.0175 5.7685 0.5034 1.3647 10.4761 3.3021
stmt
mean 0.7295 77.0348 3.2439 12.3971 140.9367 33.9686
std. dev. 1.0174 21.5897 2.3642 5.1584 34.8366 11.5868
bb
mean 0.4362 53.9947 1.4895 7.2547 114.7113 22.2687
std. dev. 0.5153 13.1614 0.9276 3.5122 28.0919 7.6968
Time taken to build model (full training data) : 4.24 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 163 ( 35%)
1 50 ( 11%)
2 85 ( 18%)
3 73 ( 16%)
4 18 ( 4%)
5 78 ( 17%)
Log likelihood: -9.09081
Thanks for your help!!
I think no-one can really answer this. Some tips off the top of my head.
You have used the EM clustering algorithm, see animated gif on wikipedia page. From Weka's Documentation Synopsis:
"EM assigns a probability distribution to each instance which
indicates the probability of it belonging to each of the clusters. "
Is this complex output really what you want?
It also selects a number of clusters for you (unless you constrain that number).
In Weka 3.7 you can use the unsupervised attribute filter "ClusterMembership" in the Preprocess dialog to replace your dataset with the result of the cluster assignments. You need to select one reference attribute, though; by default it selects the last one. This can create hard-to-interpret output.
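If the goal is simply to tag every instance with its cluster so the data can be grouped by cluster, one alternative (a minimal sketch, not the only way) is the AddCluster filter, which appends the assigned cluster as a new attribute instead of replacing your attributes with membership probabilities. The file name below is a placeholder, and the EM settings mirror the -N -1 (clusters chosen by cross-validation) from the question.
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class GroupByCluster {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("wcet_cluster6.arff");  // placeholder path

        // EM with the number of clusters chosen by cross-validation (-N -1)
        EM em = new EM();
        em.setNumClusters(-1);

        // AddCluster appends a nominal "cluster" attribute holding the assignment
        AddCluster addCluster = new AddCluster();
        addCluster.setClusterer(em);
        addCluster.setInputFormat(data);
        Instances clustered = Filter.useFilter(data, addCluster);

        // The appended attribute is the last one; print each instance with its cluster
        int clusterAttr = clustered.numAttributes() - 1;
        for (int i = 0; i < clustered.numInstances(); i++) {
            System.out.println(clustered.instance(i) + " -> "
                    + clustered.instance(i).stringValue(clusterAttr));
        }
    }
}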

Recommended values for OpenCV SVM parameters

Any idea on the recommended parameters for OpenCV SVM? I'm playing with the letter_recog.cpp in the OpenCV sample directory, however, the SVM accuracy is very poor! In one run I only got 62% accuracy:
$ ./letter_recog_modified -data /home/cobalt/opencv/samples/data/letter-recognition.data -save svm_letter_recog.xml -svm
The database /home/cobalt/opencv/samples/data/letter-recognition.data is loaded.
Training the classifier ...
data.size() = [16 x 20000]
responses.size() = [1 x 20000]
Recognition rate: train = 64.3%, test = 62.2%
The default parameters are:
model = SVM::create();
model->setType(SVM::C_SVC);
model->setKernel(SVM::LINEAR);
model->setC(1);
model->train(tdata);
Switching to trainAuto() didn't help; it gave me a weird 0% test accuracy:
model = SVM::create();
model->setType(SVM::C_SVC);
model->setKernel(SVM::LINEAR);
model->trainAuto(tdata);
Result:
Recognition rate: train = 0.0%, test = 0.0%
Update using yangjie's answer:
$ ./letter_recog_modified -data /home/cobalt/opencv/samples/data/letter-recognition.data -save svm_letter_recog.xml -svm
The database /home/cobalt/opencv/samples/data/letter-recognition.data is loaded.
Training the classifier ...
data.size() = [16 x 20000]
responses.size() = [1 x 20000]
Recognition rate: train = 58.8%, test = 57.5%
The result is no longer 0% but the accuracy is worse than the 62% earlier.
Using the RBF kernel with trainAuto() is even worse:
$ ./letter_recog_modified_rbf -data /home/cobalt/opencv/samples/data/letter-recognition.data -save svm_letter_recog.xml -svm
The database /home/cobalt/opencv/samples/data/letter-recognition.data is loaded.
Training the classifier ...
data.size() = [16 x 20000]
responses.size() = [1 x 20000]
Recognition rate: train = 18.5%, test = 11.6%
Parameters:
model = SVM::create();
model->setType(SVM::C_SVC);
model->setKernel(SVM::RBF);
model->trainAuto(tdata);
I suggest trying the RBF kernel instead of the linear one.
In many, many cases it is the best choice...
I debugged the sample code and found the reason.
The responses are a Mat of the ASCII codes of the letters.
However, the predicted labels returned from an SVM trained by SVM::trainAuto range from 0 to 25, corresponding to the 26 classes. This can also be observed by looking at <class_labels>...</class_labels> in the output file svm_letter_recog.xml.
Therefore in test_and_save_classifier, r = model->predict( sample ) and responses.at<int>(i) are apparently not equal.
I also found that if we use SVM::train, the class labels are 65-90 (the ASCII codes of 'A' to 'Z') instead, which is why you got a normal result at first.
Solution
I am not sure whether it is a bug. But if you want to use SVM::trainAuto in this sample now, you can change
test_and_save_classifier(model, data, responses, ntrain_samples, 0, filename_to_save);
in build_svm_classifier to
test_and_save_classifier(model, data, responses, ntrain_samples, 'A', filename_to_save);
Update
trainAuto and train should produce the same class_labels. The problem was introduced by an earlier bug fix, so I have created a pull request to OpenCV to fix it.