Negative test score for random forest - python-2.7

Hi, I am using a random forest classifier to predict logerror. The logerror contains both +ve and -ve values. After running the classifier with different settings, I am able to get a training score of around 0.8, but the test score is always negative. Why is that so?
Should I be using abs(logerror) for prediction, or is my choice of random forest wrong?

The choice of random forest might be wrong, but you should check that in the context of your data; if you had shared the data here, it would be easier to point at the exact problem. That said, I suggest you try kNN if your total number of observations is around 1000-2000. (Also note that if this is scikit-learn, score() for a regressor reports R², which goes negative whenever the model predicts worse than a constant mean prediction, so a high training score with a negative test score usually points to overfitting.)
Also, if you are using any kind of encoding to convert categorical data to numeric values, please use only one-hot encoding, as other encodings may impose an artificial ordering on the attribute values.
You should also check the correlation of the attributes with the target variable, as low correlation with the target variable in the test data may result in a negative score.
Apart from all of the above, the distribution of the data plays a vital role in random forest regression. So check the distribution and apply methods such as Box-Cox to bring the data closer to a normal distribution.
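For reference, one common parameterization of the Box-Cox transform is shown below. Note that it is only defined for strictly positive values, so with a logerror that takes both signs you would first need to shift it (or use the related Yeo-Johnson transform, which handles negative values):

$$ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases} $$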

Related

Empty confusion matrix in Weka with test data

I am classifying iris data using a decision tree (C4.5), random forest and naive Bayes. I am using the datasets downloaded as iris-train and iris-test. When I train all the classifiers, everything is fine, with proper results in the 'Classifier output', 'Detailed Accuracy By Class' and 'Confusion Matrix'. But when I select the iris-test file under the Weka Explorer's Classify test options, choose 'Output predictions' as CSV under 'More options', and click Start, I get the result shown in the figure below. The 'Classifier output' shows the classified samples correctly, but the 'Detailed Accuracy By Class' and 'Confusion Matrix' contain all zeros. Any suggestion as to where I am going wrong in selecting the options? Thank you.
The confusion matrix shows you how well your trained classifier performs by comparing the actual class of the instances in the test set with the class that was predicted by the classifier. But you are supplying a test set with no class information, so there's nothing to compare against. This is why you see
Total Number of Instances 0
Ignored Class Unknown Instances 120
in the output in your screenshot.
Typically you would first evaluate the performance of your classifier using cross-validation, or a test set that has class information. Then you can use the trained classifier to classify unknown data, for example using the Re-evaluate model on current test set right-click option as described in the help.
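For completeness, the same first step can be done from Java. A minimal sketch, assuming ARFF files named iris-train.arff and iris-test.arff (with class labels present in the test file) and J48, i.e. C4.5, as a stand-in classifier:

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvalWithLabeledTest {
    public static void main(String[] args) throws Exception {
        // Load training and test sets; the test set must contain class labels
        Instances train = DataSource.read("iris-train.arff");
        Instances test = DataSource.read("iris-test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Train any classifier (J48 here as an example)
        J48 tree = new J48();
        tree.buildClassifier(train);

        // Compare predictions against the actual classes in the test set
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // the confusion matrix
    }
}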

How do I generate confidence intervals/standard error of the mean for AUC?

When I use a classifier like LibSVM/SMO under 10-fold cross-validation, I get an output for ROC area (AUC), but there is no confidence interval given (even though the evaluation was performed 10 times). Is there a way to output this?
With "Output predictions" under "Classifier evaluation options" you get a prediction for each instance; with these prediction values you can compute the AUC and a confidence interval in another program (for example R, or http://www.biosoft.hacettepe.edu.tr/easyROC/). I suggest using leave-one-out cross-validation.

Regression Tree Forest in Weka

I'm using Weka and would like to perform regression with random forests. Specifically, I have a dataset:
Feature1,Feature2,...,FeatureN,Class
1.0,X,...,1.4,Good
1.2,Y,...,1.5,Good
1.2,F,...,1.6,Bad
1.1,R,...,1.5,Great
0.9,J,...,1.1,Horrible
0.5,K,...,1.5,Terrific
.
.
.
Rather than learning to predict the most likely class, I want to learn the probability distribution over the classes for a given feature vector. My intuition is that using just the RandomForest model in Weka would not be appropriate, since it would be attempting to minimize its absolute error (maximum likelihood) rather than its squared error (conditional probability distribution). Is that intuition right? Is there a better model to be using if I want to perform regression rather than classification?
Edit: I'm actually thinking now that in fact it may not be a problem. Presumably, classifiers are learning the conditional probability P(Class | Feature1,...,FeatureN) and the resulting classification is just finding the c in Class that maximizes that probability distribution. Therefore, a RandomForest classifier should be able to give me the conditional probability distribution. I just had to think about it some more. If that's wrong, please correct me.
If you want to predict the probabilities for each class explicitly, you need different input data. That is, you would need to replace the value to predict. Instead of one data set with the class label, you would need n data sets (for n different labels) with aggregated data for each unique feature vector. Your data would look something like
Feature1,...,Good
1.0,...,0.5
0.3,...,1.0
and
Feature1,...,Bad
1.0,...,0.8
0.3,...,0.1
and so on. You would need to learn one model for each class and run them separately on any data to be classified. That is, for each label you learn a model to predict a number that is the probability of being in that class, given a feature vector.
If you don't need the probabilities to be predicted explicitly, have a look at the Bayesian classifiers in Weka, which make use of probabilities in the models that they learn.
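As the question's edit suggests, you can also read class membership estimates straight off a trained classifier with distributionForInstance; for RandomForest these are the fractions of trees voting for each class. A minimal sketch, assuming placeholder ARFF files with a nominal class as the last attribute:

import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassDistribution {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff"); // placeholder files
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        RandomForest rf = new RandomForest();
        rf.buildClassifier(train);

        // Estimated P(class | features) for the first test instance:
        // for a random forest, the fraction of trees voting for each class
        double[] probs = rf.distributionForInstance(test.instance(0));
        for (int c = 0; c < probs.length; c++) {
            System.out.printf("%s: %.3f%n", train.classAttribute().value(c), probs[c]);
        }
    }
}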

Meaning of correctly classified instances in Weka

I recently started using Weka and I'm trying to classify tweets into positive or negative using Naive Bayes. So I have a training set with tweets that I labelled myself and a test set with tweets that all have the label "positive". When I run Naive Bayes, I get the following results:
Correctly classified instances: 69 92%
Incorrectly classified instances: 6 8%
Then if I change the labels of the tweets in the test set to "negative" and run Naive Bayes again, the results are reversed:
Correctly classified instances: 6 8%
Incorrectly classified instances: 69 92%
I thought that "correctly classified instances" shows the accuracy of Naive Bayes and that it should be the same no matter what labels the tweets in the test set have. Is there something wrong with my data, or do I misunderstand the meaning of correctly classified instances?
Thanks a lot for your time,
Nantia
The labels on the test set are supposed to be the actual correct classification. Performance is computed by asking the classifier to give its best guess about the classification for each instance in the test set. Then the predicted classifications are compared to the actual classifications to determine accuracy. Therefore, if you flip the 'correct' values that you give it, the results will be flipped as well.
Based on your output, 69 of the 75 test instances (92%) are predicted as positive. If the labels of the test set, that is, the correct answers, indicate that they are all positive, then that makes 92% correct. If the test set (and thus the predictions) stays the same but you switch the correct answers, then of course the percentage correct flips to the 6 instances (8%) predicted as negative.
Keep in mind that in order to evaluate a classifier, you need the true labels of the test set; otherwise you can't compare the classifier's answers with the true answers. It seems to me that you might have misunderstood this. You can obtain predicted labels for unseen data, if that is what you want, but in that case you can't evaluate classifier accuracy.

WEKA: how to get the score from classifyInstance?

I'm using FilteredClassifier.classifyInstance() to classify my instances in Weka.
I have 2 classes (true and false) and many positives, so I actually need to know the score of each instance to pick the best positive.
Do you know how I could get the score from my Weka classifier?
Thanks
Update: I've also tried distributionForInstance, but for each instance I always get an array of [1.0, 0.0].
I actually need to compare several instances to see which one is the most reliable, which one has the best chance of having been classified correctly.
distributionForInstance(Instance anInstance) is the method you need. It gives you a double array showing the confidence for each of your classes. I am using Weka 3.6 and it works well for me. If you always get the same values, your classifier is not trained well and is not discriminative at all; in that case you should always get the same class predicted. Did you balance your training set?
distributionForInstance(Instance anInstance) seems right.
Maybe it is not working for you because the classifier doesn't know you need the confidence values? For example, for LibSVM in Weka's Java API, you need to call setProbabilityEstimates(true) in order to get usable scores.
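A short sketch of that flag, assuming train (an Instances object) and inst (an Instance) are already prepared elsewhere:

import weka.classifiers.functions.LibSVM;

LibSVM svm = new LibSVM();
svm.setProbabilityEstimates(true); // without this, distributionForInstance tends to return hard 0/1 answers
svm.buildClassifier(train);
double[] scores = svm.distributionForInstance(inst); // one confidence value per class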
After you have run the classifier on your data, you can visualize the results by right-clicking on the entry in the "Result list". There are lots of other functions in this right-click menu that will let you obtain scores from Weka classifiers.
Suppose that your model is already trained.
Then, you can make predictions with distributionForInstance. This command produces an array consisting of two items (because there are two classes in your dataset: true and false):
double[] distributions = model.distributionForInstance(new_instance);
Then the index of the greatest item in the distributions array is the classification result.
Assume that distributions = {0.9638458988630731, 0.03615410113692686}. In this case, your new instance would be classified as class 0, because the 1st item is greater than the 2nd item in the distributions array.
You can also get this index with the classifyInstance command:
double classifiedIndex = model.classifyInstance(new_instance);
classifiedIndex value would be 0 for distributions = {0.9638458988630731, 0.03615410113692686}.
Finally, you can get the class name, true or false, instead of the class index:
new_instance.setClassValue(classifiedIndex); // first, assign the classified index to new_instance
String classifiedText = new_instance.stringValue(new_instance.classIndex()); // read the nominal value of the class attribute
This code block produces false.
You might examine this GitHub project for both regression and classification.