I tried to run a simple classification on the iris.arff dataset in Weka, using the J48 algorithm. I used cross-validation with 10 folds and - if I'm not mistaken - all the default settings for J48.
The result is a 96% accuracy with 6 incorrectly classified instances.
Here's my question: according to this, the second number shown at each leaf in the tree visualization is the number of wrongly classified instances in that leaf, so why do these numbers sum to 3 rather than 6?
EDIT: running the algorithm with different test options I obtain different results in terms of accuracy (and therefore a different number of errors), but when I visualize the tree I always get the same tree with the same 3 errors. I still can't explain why.
The second number in the tree visualization is not the number of wrongly classified instances in each leaf - it's the total weight of those wrongly classified instances.
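For example, a J48 leaf printed as

Iris-versicolor (48.0/1.0)

means that instances with a total weight of 48 reached that leaf, of which a total weight of 1 was misclassified (the numbers here are just illustrative).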
Did you, by any chance, weight some of those instances with 0.5 instead of 1?
Another option is that you are actually building two different models: one where you use the full training set to build the classifier (classifier.buildClassifier(instances)), and another where you run cross-validation (eval.crossValidateModel(...)) with 10 train/test folds. The first model produces the visualised tree with fewer errors (larger training set), while the second, cross-validated model produces the output statistics with more errors. This would explain why you get different stats when changing the test options but always the same tree, which is built on the full set.
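To make the distinction concrete, here is a minimal sketch of the two evaluations with the Weka Java API (assuming iris.arff is on disk; variable names are placeholders):

import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

Instances data = new Instances(new FileReader("iris.arff"));
data.setClassIndex(data.numAttributes() - 1);

// Model 1: J48 built on the full training set - this is the tree that gets visualised
J48 tree = new J48();
tree.buildClassifier(data);

// Model 2: 10-fold cross-validation - this is what the accuracy/error statistics report
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(new J48(), data, 10, new Random(1));
System.out.println(eval.toSummaryString());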
For the record: if you train (and visualise) the tree on the full dataset, it will appear to make fewer errors, but your model will actually be overfitted and the obtained performance measures will probably not be realistic. As such, your results from CV are much more useful and you should visualise the tree from that model.
I'm using the OpenCV decision trees to create a classifier. I would like to know if it is possible to retrain that model (which can be saved to and loaded from a .yml file) by adding new data. The version of OpenCV I'm using is 2.4.
I was thinking of something like this:
CvDTree dtree;
dtree.load("existingTree.yml");                        // load the previously trained tree
dtree.train(newValues, CV_ROW_SAMPLE, newResponses);   // train again, but only on the new samples
newValues contains only the new samples and newResponses contains the classes for those values. Would this generate a new decision tree trained with both the old values from the first training process and these new ones?
I didn't find any information about this in the OpenCV documentation.
Short answer: No
Long answer: During training, when a decision tree is passed a training set, each split node in the tree learns a feature and a corresponding threshold, and the branches of the tree terminate in leaf nodes that store the prediction values. If you have already trained a decision tree, then it has already learned all the features, thresholds and prediction values from that training set. Training it again with additional data does not add to those parameters - it discards them and rebuilds the tree from the new data alone.
Another way to look at this is to think of a Random Forest, which is an ensemble of trees. Provided your new dataset is not too different from the data the model has previously seen, you can train a new tree on the new data and add it to the group of previously trained trees. During prediction, you average the predictions of all the trees to get an overall prediction.
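The averaging step itself is simple. Here is a library-agnostic sketch in Java (the Tree interface is a hypothetical stand-in for whatever tree implementation you use; this is not the OpenCV API):

import java.util.List;

// Hypothetical per-tree interface: each tree returns a probability per class.
interface Tree {
    double[] predictProbabilities(double[] sample);
}

class Ensemble {
    // Average the per-class predictions of all trees to get an overall prediction.
    static double[] predict(List<Tree> trees, double[] sample, int numClasses) {
        double[] avg = new double[numClasses];
        for (Tree t : trees) {
            double[] p = t.predictProbabilities(sample);
            for (int c = 0; c < numClasses; c++) {
                avg[c] += p[c] / trees.size();
            }
        }
        return avg;   // the overall prediction is the class with the highest average
    }
}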
I am using the Weka Java API for training a model and making predictions. I am able to build a classifier based on 3 algorithms: Decision Trees, Naïve Bayes and Random Forest, and then classify a test instance and get a probability distribution over the target classes.
My question is: how do I show the reasoning/basis for the prediction in a consumable, easily understandable form? Why was a given instance classified as 'A', 'B' or 'C'? The end user would also like to know the logic behind the classification.
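For reference, the workflow described above looks roughly like this with the Weka API (a sketch, assuming train and test are loaded Instances with the class index set; J48 shown, but NaiveBayes or RandomForest can be swapped in):

import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;

J48 model = new J48();
model.buildClassifier(train);

Instance inst = test.instance(0);
double[] dist = model.distributionForInstance(inst);   // probability per target class
for (int c = 0; c < dist.length; c++) {
    System.out.println(train.classAttribute().value(c) + ": " + dist[c]);
}
// For a tree learner such as J48, model.toString() prints the learned tree,
// which is one human-readable view of the logic behind a classification.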
I'm trying to use a J48 tree to perform a text categorization task. I have read a lot of papers and websites that explain how to perform classification on datasets whose instances are single-labeled.
In my case I have only multi-labeled data in my training set. How can I handle these data in a single decision tree? Or is the only solution to generate as many trees as there are labels?
You can use a tree with an adapted entropy formula. You must define beforehand whether your dataset has hierarchical labels:
papers and code
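If you instead fall back to the approach mentioned in the question (one binary tree per label, i.e. binary relevance), a minimal Weka sketch could look like this, assuming you have already exported one single-label copy of the training data per label; the label names and file naming are hypothetical:

import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import weka.classifiers.trees.J48;
import weka.core.Instances;

String[] labels = {"sports", "politics", "economy"};   // placeholder label names
Map<String, J48> models = new HashMap<String, J48>();

for (String label : labels) {
    // one single-label dataset per label, with that label's yes/no column as the last attribute
    Instances d = new Instances(new FileReader("train_" + label + ".arff"));
    d.setClassIndex(d.numAttributes() - 1);
    J48 tree = new J48();
    tree.buildClassifier(d);
    models.put(label, tree);
}
// At prediction time, run every tree on the document and collect all labels predicted "yes".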
I'm using Weka and would like to perform regression with random forests. Specifically, I have a dataset:
Feature1,Feature2,...,FeatureN,Class
1.0,X,...,1.4,Good
1.2,Y,...,1.5,Good
1.2,F,...,1.6,Bad
1.1,R,...,1.5,Great
0.9,J,...,1.1,Horrible
0.5,K,...,1.5,Terrific
.
.
.
Rather than learning to predict the most likely class, I want to learn the probability distribution over the classes for a given feature vector. My intuition is that using just the RandomForest model in Weka would not be appropriate, since it would be attempting to minimize its absolute error (maximum likelihood) rather than its squared error (conditional probability distribution). Is that intuition right? Is there a better model to be using if I want to perform regression rather than classification?
Edit: I'm now thinking that this may not actually be a problem. Presumably, classifiers are learning the conditional probability P(Class | Feature1,...,FeatureN), and the resulting classification is just finding the c in Class that maximizes that probability distribution. Therefore, a RandomForest classifier should be able to give me the conditional probability distribution. I just had to think about it some more. If that's wrong, please correct me.
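If that reasoning holds, the distribution can be read straight off the trained model via distributionForInstance. A minimal sketch with the Weka Java API (assuming data is the loaded dataset with the nominal Class attribute set as the class):

import weka.classifiers.trees.RandomForest;
import weka.core.Instances;

RandomForest rf = new RandomForest();
rf.buildClassifier(data);

// Estimated P(Class = c | Feature1,...,FeatureN) for the first instance, one entry per class
double[] dist = rf.distributionForInstance(data.instance(0));
for (int c = 0; c < dist.length; c++) {
    System.out.println(data.classAttribute().value(c) + " -> " + dist[c]);
}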
If you want to predict the probabilities for each class explicitly, you need different input data. That is, you would need to replace the value to predict. Instead of one data set with the class label, you would need n data sets (for n different labels) with aggregated data for each unique feature vector. Your data would look something like
Feature1,...,Good
1.0,...,0.5
0.3,...,1.0
and
Feature1,...,Bad
1.0,...,0.8
0.3,...,0.1
and so on. You would need to learn one model for each class and run them separately on any data to be classified. That is, for each label you learn a model to predict a number that is the probability of being in that class, given a feature vector.
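As a sketch of that scheme with the Weka Java API (file names and class names are placeholders; each per-class ARFF is assumed to end with a numeric target column holding the aggregated probability for that class):

import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import weka.classifiers.Classifier;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;

Map<String, Classifier> perClass = new HashMap<String, Classifier>();
for (String cls : new String[]{"Good", "Bad", "Great", "Horrible", "Terrific"}) {
    Instances d = new Instances(new FileReader("train_" + cls + ".arff"));
    d.setClassIndex(d.numAttributes() - 1);   // numeric target: probability of this class
    RandomForest rf = new RandomForest();     // RandomForest accepts a numeric class in recent Weka versions
    rf.buildClassifier(d);
    perClass.put(cls, rf);
}
// At prediction time, perClass.get(cls).classifyInstance(inst) gives the estimated
// probability of cls for that feature vector; do this once per class.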
If you don't need the probabilities to be predicted explicitly, have a look at the Bayesian classifiers in Weka, which make use of probabilities in the models that they learn.