I was wondering why the percentage of correctly classified instances differs between the Explorer and the Experimenter in Weka. I have checked that I am using 10-fold cross-validation in both, and that all other parameters are the same!
Anyone have any ideas?
Thanks
I have the solution, provided by Mark Hall after I emailed the Weka mailing list. Here is the difference between the Explorer and the Experimenter:
The Experimenter operates differently from the Explorer. The Explorer
sums evaluation metrics over the folds of the cross validation - e.g.
percent correct is computed by summing all the correctly classified
instances over the test folds and then dividing by the total number of
instances. The Experimenter, on the other hand, computes averages over
the folds. Furthermore, the default in the Experimenter is to run 10
repetitions of 10-fold cross-validation (so 100 folds are averaged over).
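To make the distinction concrete, here is a minimal Python sketch of the two ways of aggregating per-fold results. The fold counts are made up for illustration and the function names are hypothetical; this is not Weka's code.

```python
def pooled_accuracy(folds):
    """Sum correct predictions over all folds, then divide by the total count."""
    total_correct = sum(correct for correct, _ in folds)
    total_size = sum(size for _, size in folds)
    return total_correct / total_size


def averaged_accuracy(folds):
    """Compute accuracy per fold, then average the per-fold values."""
    return sum(correct / size for correct, size in folds) / len(folds)


# Made-up (correct, fold_size) pairs with unequal fold sizes.
folds = [(9, 10), (8, 10), (10, 11)]
print(f"pooled:   {pooled_accuracy(folds):.4f}")
print(f"averaged: {averaged_accuracy(folds):.4f}")
```

When every fold has the same size, the two calculations give the same number, so any remaining gap would have to come from elsewhere, such as the 10 repetitions mentioned above.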
I'm not sure, but the random seed might differ between the Explorer and the Experimenter. If so, the instances will be assigned to folds differently, which leads to a different percentage.
I am using the Weka GUI for classifying sensor data.
I have measurements from 10 people, and the data is sorted by participant, so the first 10% of the rows belong to participant 1, the second 10% to participant 2, etc.
I would like to use 10 fold cross validation to build a model on 9 participants and test it on the remaining participant. In my case I believe I could accomplish this by simply not randomizing the data splits.
How would I best go about doing this?
I don't know how to do this in the Explorer.
In the KnowledgeFlow GUI, there is a CrossValidationFoldMaker used to create cross-validation folds. This has an option to Preserve instances order, which says it preserves the order of instances rather than randomly shuffling.
There's a video describing the KnowledgeFlow interface here:
https://www.youtube.com/watch?v=sHSgoVX9z-8&t=7s
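As a rough illustration of why keeping the instance order gives one participant per test fold, here is a small Python sketch of contiguous, unshuffled folds. The function name and the toy data (10 participants with 5 rows each) are hypothetical, and it only works out this neatly if every participant has the same number of rows. Whether a given Weka component splits exactly this way is an assumption you should verify.

```python
def contiguous_folds(n_instances, n_folds):
    """Yield (train_indices, test_indices) for each fold without shuffling."""
    base, extra = divmod(n_instances, n_folds)
    start = 0
    for fold in range(n_folds):
        # Spread any remainder over the first `extra` folds.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_instances))
        yield train, test
        start += size


# Hypothetical data: 10 participants, 5 rows each, sorted by participant.
participant_of_row = [p for p in range(10) for _ in range(5)]
for train, test in contiguous_folds(len(participant_of_row), 10):
    held_out = {participant_of_row[i] for i in test}
    seen = {participant_of_row[i] for i in train}
    print(sorted(held_out), "held out; trained on", len(seen), "participants")
```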
I tried to run a simple classification on the iris.arff dataset in Weka, using the J48 algorithm. I used cross-validation with 10 folds and - if I'm not wrong - all the default settings for J48.
The result is a 96% accuracy with 6 incorrectly classified instances.
Here's my question: according to this, the second number in the tree visualization is the number of wrongly classified instances in each leaf, so why is their sum 3 rather than 6?
EDIT: when I run the algorithm with different test options I get different results in terms of accuracy (and therefore number of errors), but when I visualize the tree I always get the same tree with the same 3 errors. I still can't explain why.
The second number in the tree visualization is not the number of the wrongly classified instances in each leaf - it's the total weight of those wrongly classified instances.
Did you, by any chance, weight some of those instances with 0.5 instead of 1?
Another option is that you are actually looking at two different models: one where the full training set is used to build the classifier (classifier.buildClassifier(instances)) and another where you run cross-validation (eval.crossValidateModel(...)) with 10 train/test folds. The first model produces the visualised tree with fewer errors (larger training set), while the models from CV produce the output statistics with more errors. This would explain why you get different stats when changing the test options but still see the same tree, which is built on the full set.
For the record: if you train (and visualise) the tree with the full dataset, you will appear to have fewer errors, but your model will actually be overfitted and the obtained performance measures will probably not be realistic. As such, your results from CV are much more useful and you should visualise the tree from that model.
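Here is a tiny Python sketch of that effect. To keep it self-contained it swaps in a toy 1-nearest-neighbour classifier instead of J48 and leave-one-out instead of 10-fold CV, and the data points are made up. The point is only that evaluating on the data the model was built from looks better than evaluating on held-out data.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def training_set_accuracy(data):
    """Build on all the data and evaluate on that same data."""
    return sum(predict_1nn(data, x) == y for x, y in data) / len(data)


def leave_one_out_accuracy(data):
    """Evaluate each point with a model built on all the other points."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += predict_1nn(rest, x) == y
    return hits / len(data)


# Made-up one-dimensional data: (feature value, class label).
data = [(1.0, "a"), (2.0, "a"), (2.9, "b"), (4.0, "a"), (5.0, "b"), (6.1, "b")]
print("evaluated on training data:", training_set_accuracy(data))
print("evaluated with leave-one-out:", leave_one_out_accuracy(data))
```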
I've only been using Weka for a couple of weeks but I am absolutely blown away by how great it is!
But I have a question, I have a dataset with a target column which is either True or False.
6709 instances in my dataset are True
25318 instances are False.
I want to randomly add duplicates of my True instances to produce a new dataset with 25318 True and 25318 False.
The only filter I can find which does this is the supervised Resample filter however I am having trouble understanding what parameters I should use.
(there might be a better filter to do what I want)
I've got some success with these parameters
biasToUniformClass = 1.0
invertSelection = False
noReplacement = False
randomSeed = 1
sampleSizePercent = 157.5 (a magic number I've arrived at by trial and error)
This produces 25277 True and 25165 False. Not exactly what I want, but quite close.
The problem is that I can't figure out how to arrive at the magic number. I'm also not getting exactly the number of instances that I really want.
Is there a better filter for this purpose?
If not, is there a way to calculate the sampleSizePercent magic number?
Any help is greatly appreciated :)
Supplemental question, am I best to run NominalToBinary on my boolean columns to ensure they are Binary? I'm using a NaiveBayes classifier (at the moment) and I don't have any missing instances.
Jason
I think the tricky part of this question is getting a perfect balance using the Resample Filter. This is because, as it is stated in the description, it 'Produces a random sub-sample of a dataset using either sampling with replacement or without replacement'. If these cases are being drawn randomly, there is no guarantee that you will get an equal measure between the two classes.
As for the magic number, it is determined by the total number of cases that you would like to have once the filter is applied. In your case, that is 50636 (2 x 25318) instead of 32027 (6709 + 25318), so the ratio is 50636 / 32027 = 1.581. Since sampleSizePercent is a percentage, that means a value of about 158.1. However, as stated above, you may not get an exact match of true and false cases.
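The arithmetic, spelled out in a few lines of Python using the class counts from the question (the variable names are just for illustration):

```python
n_true = 6709
n_false = 25318

# Target: both classes as large as the current majority class.
target_total = 2 * max(n_true, n_false)
current_total = n_true + n_false

# Size of the output as a percentage of the input.
sample_size_percent = 100.0 * target_total / current_total
print(f"target total:  {target_total}")
print(f"current total: {current_total}")
print(f"sampleSizePercent is roughly {sample_size_percent:.1f}")
```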
If you really need an exact figure, you could use your favourite spreadsheet and preprocess the data. One possible method is to randomise the true cases (using a separate column), sort them, and copy cases until their number matches that of the false cases. It's not an automated solution, and it lives outside of Weka, but I have used this method before and it does the job reasonably quickly.
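If a script is easier than a spreadsheet, the same idea can be sketched in Python. This is only an illustration under stated assumptions: the function name and the row layout are hypothetical, and reading and writing the actual ARFF/CSV file is left out.

```python
import random


def oversample_to_balance(rows, label_of, seed=1):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # Keep every original row and add random duplicates to reach the target.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    rng.shuffle(balanced)
    return balanced


# Stand-in rows with the class counts from the question; the label is the last field.
rows = [("t", True)] * 6709 + [("f", False)] * 25318
balanced = oversample_to_balance(rows, label_of=lambda row: row[1])
counts = {label: sum(1 for r in balanced if r[1] == label) for label in (True, False)}
print(counts)
```

Unlike a purely random resample, this keeps every original instance and only adds duplicates, so the class counts come out exactly equal.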
Hope this Helps!
I use the Weka Explorer, which has a percentage split option for train/test. It divides the dataset into a training set and a test set according to the given percentage. I don't know whether Weka's RandomForest will have repeated instances in its training set, or whether the repeats will affect the result.
As far as I know, RandomForest uses bootstrapping, so the training sample contains repeated instances and has the same size as the dataset.
whether it will have repeated instances in the training set in Weka's RF
Yes, it draws a bootstrap sample, so there will be repeated instances. Have a look at the answer here: Exact implementation of RandomForest in Weka 3.7
and whether the repeats will affect the result
Well, that is the nature of Random Forest; it's how it works. But remember that only the training data contains repeated instances; the test set used for evaluation is left untouched.
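A generic sketch of what a bootstrap sample looks like, in Python. This is not Weka's implementation; the 80/20 split and the names are made up for illustration.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(1)

# Hypothetical 80/20 percentage split of a 100-instance dataset.
train_indices = list(range(0, 80))
test_indices = list(range(80, 100))

# Each tree would be trained on a sample like this, drawn only from the training part.
sample = [train_indices[i] for i in bootstrap_indices(len(train_indices), rng)]
print("sample size:", len(sample))
print("distinct training instances in sample:", len(set(sample)))
print("overlap with test set:", len(set(sample) & set(test_indices)))
```

The sample has the same size as the training set but contains fewer distinct instances, and none of them come from the test set.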
I recently started using Weka and I'm trying to classify tweets as positive or negative using Naive Bayes. I have a training set of tweets that I labelled myself and a test set of tweets that all have the label "positive". When I run Naive Bayes, I get the following results:
Correctly classified instances: 69 92%
Incorrectly classified instances: 6 8%
Then, if I change the labels of the tweets in the test set to "negative" and run Naive Bayes again, the results are reversed:
Correctly classified instances: 6 8%
Incorrectly classified instances: 69 92%
I thought that correctly classified instances show the accuracy of Naive Bayes and that it should be the same no matter what labels the tweets in the test set have. Is there something wrong with my data, or am I misunderstanding the meaning of correctly classified instances?
Thanks a lot for your time,
Nantia
The labels on the test set are supposed to be the actual correct classification. Performance is computed by asking the classifier to give its best guess about the classification for each instance in the test set. Then the predicted classifications are compared to the actual classifications to determine accuracy. Therefore, if you flip the 'correct' values that you give it, the results will be flipped as well.
Based on your training set, 69 of the 75 instances in your test set (92%) are classified as positive. If the labels for the test set, that is, the correct answers, indicate that they are all positive, then that makes 92% correct. If the test set (and thus the classification) is the same, but you switch the correct answers, then of course the percentage correct flips as well, to 8%.
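A short Python sketch of that comparison, using the counts from the question's output (69 and 6 out of 75). The predictions stay fixed and only the "actual" labels change; the helper name is hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of instances where the prediction matches the given label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# The classifier's predictions do not depend on the test labels.
predicted = ["positive"] * 69 + ["negative"] * 6

all_positive = ["positive"] * 75
all_negative = ["negative"] * 75

print("test labels all positive:", accuracy(predicted, all_positive))
print("test labels all negative:", accuracy(predicted, all_negative))
```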
Keep in mind that in order to evaluate a classifier, you need the true labels of the test set. Otherwise you can't compare the classifier's answers with the true answers. It seems to me that you might have misunderstood this. You can obtain the labels for unseen data, if that is what you want, but in that case you can't evaluate classifier accuracy.