Weka classifier meta Vote - weka

I am using Majority voting combination rule in weka. There are 4 classifiers in total. I am wondering what will happen if there is a tie in the number of votes

Weka API : ".....the random number generator used for breaking ties in majority voting..".
See :http://fiji.sc/javadoc/weka/classifiers/meta/Vote.html
If you are interested in code , this is how they do it:
// Resolve the ties according to a uniform random distribution
int majorityIndex = majorityIndexes.get(m_Random
.nextInt(majorityIndexes.size()));

Related

Interpreting results using J48 for a divided attribute of interest in x levels (WEKA)

I'm new to data mining and Weka. I built a classifier with J48 in Weka using the GUI, with J48 (training set) for an attribute of interest in five levels. I have to evaluate the precision of the model, but I don't know very well how to do it! Some information may be of interest:
== Detailed Accuracy By Class ===
Precision
0.80
?
0.67
0.56
?
?
First, I would like to know the meaning of the "?" in the precision column. When probing with an attribute of interest in two levels I got no "?". The tree is bigger now than when dividing into two levels. I am questioning if this means that taking an attribute of interest in five levels could generate a less efficient tree in terms of classification and computation time. This seems quite obvious as the number of Correctly Classified Instances when the attribute had 2 levels were up to 72%.
Thank you in advance, all interesting answers will be rewarded!
"I would like to know the meaning of the "?" in the precision column"
Note that for these same classes the TP and FP rates are 0. It appears that J48 has not assigned any of your observations to these classes.
Are these classes relatively small? If so, you might want to consider using the ClassBalancer filter. This will use weights to make all classes look the same size.
Of course, after you get the model you need to "convert back" to the real situation. This is similar for correcting for physically oversampling or undersampling. See my answer here: https://stats.stackexchange.com/questions/211174/how-to-exact-prediction-from-over-sampled-dataundoing-oversampling/257507#257507

Overfitting with random forest though very successful cross validation results

I have moderate experience with data science. I have a data set with 9500 observations and more than 4500 features most of which are highly correlated. Here is briefly what I have tried: I have dropped columns where there are less than 6000 non-NAs and have imputed NAs with their corresponding columns' median values when there are at least 6000 non-NAs. As for correlation, I have kept only features having at most 0.7 correlation with others. By doing so, I have reduced the number of features to about 750. Then I have used those features in my binary classification task in random forest.
My data set is highly unbalanced where ratio of (0:1) is (10:1). So when I apply RF with 10-fold cv, I observe too good results in each cv (AUC of 99%) which is to good to be true and in my test set I got way worse results such as 0.7. Here is my code:
import h2o
from h2o.estimators import H2ORandomForestEstimator
h2o.init(port=23, nthreads=4)
train = fs_rf[fs_rf['Year'] <= '201705']
test = fs_rf[fs_rf['Year'] > '201705']
train = train.drop('Year',axis=1)
test = test.drop('Year',axis=1)
test.head()
train = h2o.H2OFrame(train)
train['BestWorst2'] = train['BestWorst2'].asfactor()
test = h2o.H2OFrame(test)
test['BestWorst2'] = test['BestWorst2'].asfactor()
training_columns = train.drop('BestWorst2',axis=1).col_names
response_column = 'BestWorst2'
model = H2ORandomForestEstimator(ntrees=100, max_depth=20, nfolds=10, balance_classes=True)
model.train(x=training_columns, y=response_column, training_frame=train)
performance = model.model_performance(test_data=test)
print(performance)
How could I avoid this over-fitting? I have tried many different parameters in grid search but none of them improved the results.
This is not what I would call "overfitting". The reason you are seeing really good cross-validation metrics compared to your test metrics is that you have time-series data and so you can't use k-fold cross-validation to give you an accurate estimate of performance.
Performing k-fold cross-validation on a time-series dataset will give you overly-optimistic performance metrics because you are not respecting the time-series component in your data. Regular k-fold cross-validation will randomly sample from your whole dataset to create a train & validation set. Essentially, your validation strategy is "cheating" because you have "future" data included in your CV training sets (if that makes any sense).
I can see by your code that you understand that you need to train with "past" data and predict on "future" data, but if you want to read more about this topic, I'd recommend this article or this article.
One solution is to simply look at test set performance as way to evaluate your model. Another option is to use what's called "rolling" or "time-series" cross-validation, but H2O does not currently support that (though it seems like it might be added soon). Here's a ticket for this if you want to keep track of the progress.

Negative test score for random forest

Hi I am using a random forest classifier to product logerror. The log error contains both =ve & -ve values. After running the classifier with different settings. i am able to get training test score of around 0.8 but the test score is always negative. why is that so?
should i be using abs(log error) for prediction or is my choice for random forest wrong?
Choice of the random forest might be wrong but you better check it in context of data as if you have shared data here it would be easy to help you at the exact point. But, I suggest you try Knn if your total observations are around 1000-2000.
Also, if you are using any kind of encoding to convert categorical data to nominal please use only one hot encoding as other my add values to attributes.
You should have checked correlation of attributes to the target variable as the low correlation of target variable in test data may result in the negative score.
Apart from above all, distribution of data plays a vital role in randomforest regresssion. So, try to check distribution and apply methods such as box-cox to convert data in normal distribution.

What is the advantage of using weighted average F measure in weka

In weka I have seen the F-measure of the 'yes' class and 'no' class seperately. But what is the advantage of using the weighted average F-measure to compare the performance of the models. Please help me to find the answer :)
Let's start with a smart example, classifying protein interactions in text using machine learning, where our classifier has attempted to classify sentences into two classes: (1) positive class (2) negative class. Positive class contains sentences that describe protein interactions and negative class comprises sentences that do not describe protein interactions. As a researcher, my focus will be the F-score of my classifiers for positive class. Why? Because I am interested to see my classifier's performance on classifying sentences that contain protein interactions and I do not care about its ability to classify negative sentences. Therefore, I will consider only the F-score of the positive class.
However, for another classical problem like spam classification, where our classifier classifies emails into two classes: (1) hams and (2) spams, the scenario is a bit different. As a researcher, I would like to know my classifier's ability to classify hams as well as spams. At that point, I can either check the F-scores of each class independently or in an aggregated fashion. The weighted average of F-scores of ham and spam class is a means to check the performance of our classifier for both (in this case both, for multi-class problems read all) classes. Because the weighted F-measure is just the sum of all F-measures, each weighted according to the number of instances with that particular class label and for two classes, it is calculated as follows:
Weighted F-Measure=((F-Measure for n class X number of instances from n class)+(F-Measure for y class X number of instances from y class))/total instances in dataset.
So, the bottom line is- if the classification is sensitive for all the classes, use the weighted average of F-scores of all classes.
As far as I remember, It can better handle "extreme" precision or recall (P, R) numbers, when one or both are close to either 0 or 1. (They are generally negatively correlated).
This might happen when you want to apply different algorithms on a dataset and you end up with some precision/recall numbers that you need to compare.
Turns out that the simple average (P+R)/2 is too simplistic.
If you have a dataset where either precision or recall are close to 1 or zero, F-measure still takes the other one into account, somewhat arbitrarily.
(The name itself does not mean anything).
Andrew Ng explains it well in his machine-learning course, week 6 "Handling skewed data"

Regression Tree Forest in Weka

I'm using Weka and would like to perform regression with random forests. Specifically, I have a dataset:
Feature1,Feature2,...,FeatureN,Class
1.0,X,...,1.4,Good
1.2,Y,...,1.5,Good
1.2,F,...,1.6,Bad
1.1,R,...,1.5,Great
0.9,J,...,1.1,Horrible
0.5,K,...,1.5,Terrific
.
.
.
Rather than learning to predict the most likely class, I want to learn the probability distribution over the classes for a given feature vector. My intuition is that using just the RandomForest model in Weka would not be appropriate, since it would be attempting to minimize its absolute error (maximum likelihood) rather than its squared error (conditional probability distribution). Is that intuition right? Is there a better model to be using if I want to perform regression rather than classification?
Edit: I'm actually thinking now that in fact it may not be a problem. Presumably, classifiers are learning the conditional probability P(Class | Feature1,...,FeatureN) and the resulting classification is just finding the c in Class that maximizes that probability distribution. Therefore, a RandomForest classifier should be able to give me the conditional probability distribution. I just had to think about it some more. If that's wrong, please correct me.
If you want to predict the probabilities for each class explicitly, you need different input data. That is, you would need to replace the value to predict. Instead of one data set with the class label, you would need n data sets (for n different labels) with aggregated data for each unique feature vector. Your data would look something like
Feature1,...,Good
1.0,...,0.5
0.3,...,1.0
and
Feature1,...,Bad
1.0,...,0.8
0.3,...,0.1
and so on. You would need to learn one model for each class and run them separately on any data to be classified. That is, for each label you learn a model to predict a number that is the probability of being in that class, given a feature vector.
If you don't need the probabilities to be predicted explicitly, have a look at the Bayesian classifiers in Weka, which make use of probabilities in the models that they learn.