Mismatch between errors in the data summary and the tree visualization in Weka

I tried to run a simple classification on the iris.arff dataset in Weka, using the J48 algorithm. I used cross-validation with 10 folds and - if I'm not wrong - all the default settings for J48.
The result is a 96% accuracy with 6 incorrectly classified instances.
Here's my question: according to this, the second number in the tree visualization is the number of wrongly classified instances in each leaf, so why is their sum 3 rather than 6?
EDIT: running the algorithm with different test options, I obtain different results in terms of accuracy (and therefore number of errors), but when I visualize the tree I always get the same tree with the same 3 errors. I still can't explain why.

The second number in the tree visualization is not the number of the wrongly classified instances in each leaf - it's the total weight of those wrongly classified instances.
Did you, by any chance, give some of those instances a weight of 0.5 instead of 1?
Another option is that you are actually building two different models: one where you use the full training set to build the classifier (classifier.buildClassifier(instances)) and another where you run cross-validation (eval.crossValidateModel(...)) with 10 train/test folds. The first model produces the visualised tree with fewer errors (larger training set), while the cross-validation run produces the output statistics with more errors. This would explain why you get different statistics when you change the test options but always the same tree, which is built on the full set.
For the record: if you train (and visualise) the tree on the full dataset, it will appear to make fewer errors, but the model will be overfitted and the obtained performance measures will probably not be realistic. As such, your results from CV are much more useful, and you should base your assessment of the tree on them.
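To make the distinction concrete, here is a minimal sketch, assuming a Python environment with scikit-learn (a CART decision tree stands in for J48, so the exact counts will differ from Weka's; the point is the gap between errors on the full training set and cross-validated errors):
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# errors of the tree fitted on ALL instances (this is what the tree visualization reflects)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
resub_errors = (tree.predict(X) != y).sum()

# errors estimated by 10-fold cross-validation (this is what the summary statistics reflect)
cv_pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=10)
cv_errors = (cv_pred != y).sum()

print("errors on the full training set:", resub_errors)
print("errors estimated by 10-fold CV: ", cv_errors)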

Related

Interpreting results using J48 for a divided attribute of interest in x levels (WEKA)

I'm new to data mining and Weka. I built a classifier in the Weka GUI using J48 (evaluated on the training set) for an attribute of interest with five levels. I have to evaluate the precision of the model, but I don't really know how to do it. Some information that may be of interest:
=== Detailed Accuracy By Class ===
Precision:  0.80   ?   0.67   0.56   ?   ?
First, I would like to know the meaning of the "?" in the precision column. When I tried an attribute of interest with two levels I got no "?". The tree is also bigger now than when I divided the attribute into two levels. I wonder whether this means that an attribute of interest with five levels generates a less efficient tree in terms of classification and computation time. This seems quite likely, as the proportion of Correctly Classified Instances when the attribute had two levels was up to 72%.
Thank you in advance, all interesting answers will be rewarded!
"I would like to know the meaning of the "?" in the precision column"
Note that for these same classes the TP and FP rates are 0. It appears that J48 has not assigned any of your observations to these classes.
Are these classes relatively small? If so, you might want to consider using the ClassBalancer filter. This will use weights to make all classes look the same size.
Of course, after you get the model, you need to "convert back" to the real situation. This is similar to correcting for physically oversampling or undersampling. See my answer here: https://stats.stackexchange.com/questions/211174/how-to-exact-prediction-from-over-sampled-dataundoing-oversampling/257507#257507
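As a small illustration of why the "?" appears (plain Python with made-up labels, not the asker's data): precision for a class is TP / (TP + FP), which is undefined when the class is never predicted.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "b", "a", "b", "b"]   # class "c" is never predicted

for cls in ["a", "b", "c"]:
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t != cls)
    # when tp + fp == 0, precision is undefined; Weka prints "?" in that case
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    print(cls, "precision:", "?" if precision is None else round(precision, 2))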

Overfitting with random forest though very successful cross validation results

I have moderate experience with data science. I have a data set with 9500 observations and more than 4500 features, most of which are highly correlated. Here is briefly what I have tried: I have dropped columns with fewer than 6000 non-NAs and have imputed NAs with the corresponding column's median when there are at least 6000 non-NAs. As for correlation, I have kept only features having at most 0.7 correlation with the others. By doing so, I have reduced the number of features to about 750. Then I have used those features in my binary classification task with a random forest.
My data set is highly unbalanced, with a 10:1 ratio of 0s to 1s. When I apply RF with 10-fold CV, I observe very good results in each fold (AUC of 99%), which is too good to be true, while on my test set I get much worse results, such as an AUC of 0.7. Here is my code:
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init(port=23, nthreads=4)

# fs_rf is a pandas DataFrame prepared earlier; split by time and drop the Year column
train = fs_rf[fs_rf['Year'] <= '201705']
test = fs_rf[fs_rf['Year'] > '201705']
train = train.drop('Year', axis=1)
test = test.drop('Year', axis=1)
test.head()

# convert to H2OFrames and make the binary target a factor
train = h2o.H2OFrame(train)
train['BestWorst2'] = train['BestWorst2'].asfactor()
test = h2o.H2OFrame(test)
test['BestWorst2'] = test['BestWorst2'].asfactor()

training_columns = train.drop('BestWorst2', axis=1).col_names
response_column = 'BestWorst2'

# random forest with 10-fold cross-validation on the training frame
model = H2ORandomForestEstimator(ntrees=100, max_depth=20, nfolds=10, balance_classes=True)
model.train(x=training_columns, y=response_column, training_frame=train)

# evaluate on the held-out "future" test set
performance = model.model_performance(test_data=test)
print(performance)
How could I avoid this over-fitting? I have tried many different parameters in grid search but none of them improved the results.
This is not what I would call "overfitting". The reason you are seeing really good cross-validation metrics compared to your test metrics is that you have time-series data and so you can't use k-fold cross-validation to give you an accurate estimate of performance.
Performing k-fold cross-validation on a time-series dataset will give you overly-optimistic performance metrics because you are not respecting the time-series component in your data. Regular k-fold cross-validation will randomly sample from your whole dataset to create a train & validation set. Essentially, your validation strategy is "cheating" because you have "future" data included in your CV training sets (if that makes any sense).
I can see by your code that you understand that you need to train with "past" data and predict on "future" data, but if you want to read more about this topic, I'd recommend this article or this article.
One solution is to simply look at test set performance as a way to evaluate your model. Another option is to use what's called "rolling" or "time-series" cross-validation, but H2O does not currently support that (though it seems like it might be added soon). Here's a ticket for this if you want to keep track of the progress.
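For readers outside H2O, here is a minimal sketch of rolling cross-validation using scikit-learn's TimeSeriesSplit on made-up data (the features, target and sizes are placeholders, not the asker's dataset): each validation fold lies strictly after its training fold, so no "future" rows leak into training.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))       # stand-in features, assumed ordered by time
y = rng.integers(0, 2, size=1000)     # stand-in binary target

aucs = []
for train_idx, valid_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[valid_idx])[:, 1]
    aucs.append(roc_auc_score(y[valid_idx], proba))

print("per-fold AUC:", [round(a, 3) for a in aucs])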

WEKA cross validation discretization

I'm trying to improve the accuracy of my WEKA model by applying an unsupervised discretize filter. I need to decide on the number of bins and whether equal-frequency binning should be used. Normally, I would optimize this using a training set.
However, how do I determine the bin size and whether equal frequency binning should be used when using cross-validation? My initial idea was to use the accuracy result of the classifier in multiple cross-validation tests to find the optimal bin size. However, isn't it wrong, despite using cross-validation, to use this same set to also test the accuracy of the model, because I then have an overfitted model? What would then be a correct way of determining the bin sizes?
I also tried the supervised discretize filter to determine the bin sizes, however this results in only a single bin. Does this mean that my data is too random and therefore cannot be split into multiple bins?
Yes, you are correct in both your idea and your concerns for the first issue.
What you are trying to do is Parameter Optimization. This term is usually used when you try to optimize the parameters of your classifier, e.g., the number of trees for the Random Forest or the C parameter for SVMs. But you can apply it as well to pre-processing steps and filters.
What you have to do in this case is a nested cross-validation. (You should check https://stats.stackexchange.com/ for more information, for example here or here). It is important that the final classifier, including all pre-processing steps like binning and such, has never seen the test set, only the training set. This is the outer cross-validation.
For each fold of the outer cross-validation, you need to do an inner cross-validation on the training set to determine the optimal parameters for your model.
I'll try to "visualize" it with a simple 2-fold cross-validation:

Data set
########################################

Split for the outer cross-validation (2-fold):
####################  ####################
    training set            test set

Split the outer training set again for the inner cross-validation:
##########  ##########
 training      test

Evaluate the parameters on the inner folds:
##########   ##########
build with   evaluate with
bin size  5  acc 70%
bin size 10  acc 80%
bin size 20  acc 75%
...
=> optimal bin size: 10

Outer cross-validation (2-fold):
####################   ####################
    training set            test set
  apply bin size 10,
    train model           evaluate model
Parameter optimization can be computationally very expensive. If you have 3 parameters with 10 possible values each, that makes 10x10x10 = 1000 parameter combinations you need to evaluate for each outer fold.
This is a topic of machine learning by itself, because you can do everything from the naive grid search to evolutionary search here. Sometimes you can use heuristics. But you need to do some kind of parameter optimization every time.
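Here is a minimal sketch of that nested scheme in Python with scikit-learn, just to show the mechanics (KBinsDiscretizer plays the role of Weka's unsupervised Discretize filter, and the Naive Bayes classifier and parameter grid are placeholder choices):
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)

# binning and classifier form one pipeline, so the binning is re-fitted inside
# every training fold and never sees the corresponding test fold
pipe = Pipeline([
    ("bin", KBinsDiscretizer(encode="ordinal")),
    ("clf", GaussianNB()),
])
param_grid = {
    "bin__n_bins": [3, 5, 10],
    "bin__strategy": ["uniform", "quantile"],   # equal-width vs. equal-frequency
}

# inner CV picks the bin settings; outer CV estimates the performance of the whole procedure
inner = GridSearchCV(pipe, param_grid, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=2)
print("outer-fold accuracies:", outer_scores)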
As for your second question: This is really hard to tell without seeing your data. But you should post that as a separate question anyway.

Methods do not give a confusion matrix in Weka

I want to do classification in Weka. I am using several methods (Random Tree, Random Forest, Decision Table, RandomSubspace...), but they give results like the output below.
=== Cross-validation ===
=== Summary ===

Correlation coefficient          0.1678
Mean absolute error              0.4832
Root mean squared error          0.4931
Relative absolute error         96.6501 %
Root relative squared error     98.6323 %
Total Number of Instances   100000
However, I want results such as accuracy and a confusion matrix. How can I get them?
Note: when I use a small dataset, it does give a confusion matrix. Could this be related to the size of the dataset?
The output of the training/testing in Weka depends on the type of the attribute that you are trying to predict. If your attribute is nominal, you will get a confusion matrix and accuracy value. If your attribute is numeric, you will get a correlation coefficient.
For the small and large datasets that you mention, what is the type of the attribute that you are predicting in each?
I have run a 2-class problem using J48 and RandomForest with 100000 instances and the confusion matrix appeared correctly. I additionally increased the problem complexity to run 20 different classes and the confusion matrix appeared correctly as well.
If you look under 'More options...', please ensure that 'Output confusion matrix' is checked and see if this resolves the issue.
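To see the same nominal-vs-numeric distinction outside Weka, here is a hedged sketch with scikit-learn on a built-in stand-in dataset (in Weka the equivalent fix is to make the class attribute nominal, for example with the NumericToNominal filter):
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)   # y is a 0/1 column
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# target treated as numeric -> regression-style metrics (like the summary above)
reg = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
print("MAE:", mean_absolute_error(y_te, reg.predict(X_te)))

# target treated as nominal -> accuracy and a confusion matrix
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr.astype(str))
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te.astype(str), pred))
print(confusion_matrix(y_te.astype(str), pred))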

Meaning of correctly classified instances in Weka

I recently started using Weka and I'm trying to classify tweets as positive or negative using Naive Bayes. I have a training set with tweets that I labelled myself and a test set with tweets that all have the label "positive". When I run Naive Bayes, I get the following results:
Correctly classified instances:   69  (92%)
Incorrectly classified instances:  6   (8%)
Then if I change the labels of the tweets in the test set to "negative" and run Naive Bayes again, the results are inverted:
Correctly classified instances:    6   (8%)
Incorrectly classified instances: 69  (92%)
I thought that correctly classified instances showed the accuracy of Naive Bayes and that it should be the same no matter what labels the tweets in the test set have. Is there something wrong with my data, or do I misunderstand the meaning of correctly classified instances?
Thanks a lot for your time,
Nantia
The labels on the test set are supposed to be the actual correct classification. Performance is computed by asking the classifier to give its best guess about the classification for each instance in the test set. Then the predicted classifications are compared to the actual classifications to determine accuracy. Therefore, if you flip the 'correct' values that you give it, the results will be flipped as well.
Based on your output, 69 of the 75 test instances (92%) are classified as positive by the model. If the labels in the test set, that is the correct answers, indicate that they are all positive, then those 69 (92%) count as correct. If the test set (and thus the predicted classifications) stays the same but you switch the correct answers, then of course the percentage correct will also be the opposite.
Keep in mind that in order to evaluate a classifier, you need the true labels of the test set. Otherwise you can't compare the classifier's answers with the true answers. It seems to me that you might have misunderstood this. You can obtain predicted labels for unseen data, if that is what you want, but in that case you can't evaluate the classifier's accuracy.
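A tiny sketch of the arithmetic behind those two runs (75 test tweets, 69 of which the classifier predicts as positive, as in the output above): the predictions are fixed and only the "true" labels change, so flipping the test labels flips the score.
predictions = ["pos"] * 69 + ["neg"] * 6   # what the classifier predicted

all_positive = ["pos"] * 75                # test file labelled all "positive"
all_negative = ["neg"] * 75                # same test file relabelled "negative"

def accuracy(truth, predicted):
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

print("all-positive labels:", accuracy(all_positive, predictions))   # 0.92
print("all-negative labels:", accuracy(all_negative, predictions))   # 0.08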