WEKA: how to get the score from classifyInstance? - weka

I'm using a FilteredClassifier.classifyInstance() to classify my instances in weka.
I have 2 classes (true and false) and I have many positives, so I actually need to know the score of each isntance to get the best positive.
You know how I could get the score from my weka classifier ?
thanks
Update: I've also tried to use distributionForInstance, but for each instance I always get an array with [1.0, 0.0].
I actually need to compare several instances to see which one is the most reliable, which one has more changes to have been classified correctly.

distributionForInstance(Instance anInstance) is the method you need. It gives you a Double array showing the confidence for each of your classes. I am using Weka 3.6. and it works well for me. If you always get the same values, your classifier is not trained well and not discriminative at all. In that case, you should always get the same class predicted. Did you balance your training set?

distributionForInstance(Instance anInstance) seems right.
Maybe it is not working for you because the classifier doesn't know you'd need the confidence values? For example for LibSVM on Weka Java, you need to set setProbabilityEstimates to true, in order to use the scores.

After you have run the classifier on your data, you can visualize the data by right clicking on the test in the " Result list " There are lots of other funcitons on this right click menu that will allow you to gain scores from weka classifiers.

Suppose that your model is already trained.
Then, you can make predictions with distributionForInstance. This command produces an array consisting of two items (because there are two classes on your dataset: true and false)
double[] distributions = model.distributionForInstance(new_instance);
After then, index of the greatest item in distributions array would be classification result.
Assume that distributions = {0.9638458988630731, 0.03615410113692686}. In this case, your new instance would be classified as class_0 because 1st item is greater than 2nd item in distributions array.
You can also get this index with classifyInstance command.
double classifiedIndex = model.classifyInstance(new_instance);
classifiedIndex value would be 0 for distributions = {0.9638458988630731, 0.03615410113692686}.
Finally, you can get the class name as true or false instead of class index.
new_instance.setClassValue(classifiedIndex); //firstly, assigned classified index to new_instance.
String classifiedText = new_instance.stringValue(new_instance.numAttributes());
This code block produces false.
You might examine this GitHub project for both regression and classification.

Related

Can I make Amazon SageMaker deliver a recommendation based on historic data instead of a probability score?

We have a huge set of data in CSV format, containing a few numeric elements, like this:
Year,BinaryDigit,NumberToPredict,JustANumber, ...other stuff
1954,1,762,16, ...other stuff
1965,0,142,16, ...other stuff
1977,1,172,16, ...other stuff
The thing here is that there is a strong correlation between the third column and the columns before that. So I have pre-processed the data and it's now available in a format I think is perfect:
1954,1,762
1965,0,142
1977,1,172
What I want is a predicition on the value in the third column, using the first two as input. So in the case above, I want the input 1965,0 to return 142. In real life this file is thousands of rows, but since there's a pattern, I'd like to retrieve the most possible value.
So far I've setup a train job on the CSV file using the Linear Learner algorithm, with the following settings:
label_size = 1
feature_dim = 2
predictor_type = regression
I've also created a model from it, and setup an endpoint. When I invoke it, I get a score in return.
response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
ContentType='text/csv',
Body=payload)
My goal here is to get the third column prediction instead. How can I achieve that? I have read a lot of the documentation regarding this, but since I'm not very familiar with AWS, I might as well have used the wrong algorithms for what I am trying to do.
(Please feel free to edit this question to better suit AWS terminology)
For csv input, the label should be in the first column, as mentioned here: So you should preprocess your data to put the label (the column you want to predict) on the left.
Next, you need to decide whether this is a regression problem or a classification problem.
If you want to predict a number that's as close as possible to the true number, that's regression. For example, the truth might be 4, and the model might predict 4.15. If you need an integer prediction, you could round the model's output.
If you want the prediction to be one of a few categories, then you have a classification problem. For example, we might encode 'North America' = 0, 'Europe' = 1, 'Africa' = 2, and so on. In this case, a fractional prediction wouldn't make sense.
For regression, use 'predictor_type' = 'regressor' and for classification with more than 2 classes, use 'predictor_type' = 'multiclass_classifier' as documented here.
The output of regression will contain only a 'score' field, which is the model's prediction. The output of multiclass classification will contain a 'predicted_label' field, which is the model's prediction, as well as a 'score' field, which is a vector of probabilities representing the model's confidence. The index with the highest probability will be the one that's predicted as the 'predicted_label'. The output formats are documented here.
predictor_type = regression is not able to return the predicted label, according to
the linear-learner documentation:
For inference, the linear learner algorithm supports the application/json, application/x-recordio-protobuf, and text/csv formats. For binary classification models, it returns both the score and the predicted label. For regression, it returns only the score.
For more information on input and output file formats, see Linear
Learner Response Formats for inference, and the Linear Learner Sample
Notebooks.

Empty confusion matrix in Weka with test data

I am classifying iris data using DECISION TREE (C4.5), RANDOM FOREST and NAIVE BAYES. I am using the dataset downloaded from iris-train and iris-test. When I train the all networks everything is fine with proper results with 'classifier output', 'Detailed accuracy with class' and 'confusion matrix'. But, when I select the iris-test data in the Weka-explorer-classify-test options and select the iris-test file and in 'more options' select 'output prediction' as 'csv' and click start, I am getting the result as shown in the figure below. The 'classifier output' is showing the classified samples correctly, but, 'Detailed accuracy with class' and 'confusion matrix' is with all values zeros. Any suggestion where I am going wrong in selecting any option. Thank you.
The confusion matrix shows you how well your trained classifier performs by comparing the actual class of the instances in the test set with the class that was predicted by the classifier. But you are supplying a test set with no class information, so there's nothing to compare against. This is why you see
Total Number of Instances 0
Ignored Class Unknown Instances 120
in the output in your screenshot.
Typically you would first evaluate the performance of your classifier using cross-validation, or a test set that has class information. Then you can use the trained classifier to classify unknown data, for example using the Re-evaluate model on current test set right-click option as described in the help.

Weka Resample to balance instances in binary dataset

I've only been using Weka for a couple of weeks but I am absolutely blown away by how great it is!
But I have a question, I have a dataset with a target column which is either True or False.
6709 instances in my dataset are True
25318 instances are False.
I want to randomly add duplicates of my True instances to produce a new dataset with 25318 True and 25318 False.
The only filter I can find which does this is the supervised Resample filter however I am having trouble understanding what parameters I should use.
(there might be a better filter to do what I want)
I've got some success with these parameters
biasToUniformClass = 1.0
invertSelection = False
noReplacement = False
randomSeed = 1
sampleSizePercent = 157.5 (a magic number I've arrived at by trial and error)
This produces 25277 True and 25165 False. Not exactly what I want, but quite close.
The problem is that I cant figure out how to arrive at the magic number. I'm also not getting exactly the numbers of instances that I really want.
Is there a better filter for this purpose?
If not, is there a way to calculate the sampleSizePercent magic number?
Any help is greatly appreciated :)
Supplemental question, am I best to run NominalToBinary on my boolean columns to ensure they are Binary? I'm using a NaiveBayes classifier (at the moment) and I don't have any missing instances.
Jason
I think the tricky part of this question is getting a perfect balance using the Resample Filter. This is because, as it is stated in the description, it 'Produces a random sub-sample of a dataset using either sampling with replacement or without replacement'. If these cases are being drawn randomly, there is no guarantee that you will get an equal measure between the two classes.
As for the magic number, this would be associated with the total number of cases that you would like to have when the filter is applied. In your case, it would be 50636 instead of 32027. In this case, your magic number would be 50636 / 32027 = 1.581. However, as stated above, you may not get an exact match of true and false cases.
If you really need an exact figure, you could use your favourite spreadsheet and preprocess the data. One possible method is to randomise the true cases (in a separate column), sort and copy all of the cases until the number matches the false one. It's not an automated solution, and the solution is outside of Weka, but I have used this method before and does the job reasonably quickly.
Hope this Helps!

Regression Tree Forest in Weka

I'm using Weka and would like to perform regression with random forests. Specifically, I have a dataset:
Feature1,Feature2,...,FeatureN,Class
1.0,X,...,1.4,Good
1.2,Y,...,1.5,Good
1.2,F,...,1.6,Bad
1.1,R,...,1.5,Great
0.9,J,...,1.1,Horrible
0.5,K,...,1.5,Terrific
.
.
.
Rather than learning to predict the most likely class, I want to learn the probability distribution over the classes for a given feature vector. My intuition is that using just the RandomForest model in Weka would not be appropriate, since it would be attempting to minimize its absolute error (maximum likelihood) rather than its squared error (conditional probability distribution). Is that intuition right? Is there a better model to be using if I want to perform regression rather than classification?
Edit: I'm actually thinking now that in fact it may not be a problem. Presumably, classifiers are learning the conditional probability P(Class | Feature1,...,FeatureN) and the resulting classification is just finding the c in Class that maximizes that probability distribution. Therefore, a RandomForest classifier should be able to give me the conditional probability distribution. I just had to think about it some more. If that's wrong, please correct me.
If you want to predict the probabilities for each class explicitly, you need different input data. That is, you would need to replace the value to predict. Instead of one data set with the class label, you would need n data sets (for n different labels) with aggregated data for each unique feature vector. Your data would look something like
Feature1,...,Good
1.0,...,0.5
0.3,...,1.0
and
Feature1,...,Bad
1.0,...,0.8
0.3,...,0.1
and so on. You would need to learn one model for each class and run them separately on any data to be classified. That is, for each label you learn a model to predict a number that is the probability of being in that class, given a feature vector.
If you don't need the probabilities to be predicted explicitly, have a look at the Bayesian classifiers in Weka, which make use of probabilities in the models that they learn.

Meaning of correctly classified instances weka

I recently started using weka and I'm trying to classify tweets into positive or negative using Naive Bayes. So I have a training set with tweets that I gave the label for and a test set with tweets that all have the label "positive". When I ran Naive Bayes, I get the following results:
Correctly classified instances: 69 92%
Incorrectly classified instances: 6 8%
Then if I change the labels of the tweets in the test set to "negative" and ran again Naive Bayes, the results are inversed:
Correctly classified instances: 6 8%
Incorrectly classified instances: 69 92%
I thought that correctly classified instances show the accuracy of Naive Bayes and that it should be the same no matter the labels of the tweets in test set. Is there something wrong with my data or I don't understand correctly the meaning of correctly classified instances?
Thanks a lot for your time,
Nantia
The labels on the test set are supposed to be the actual correct classification. Performance is computed by asking the classifier to give its best guess about the classification for each instance in the test set. Then the predicted classifications are compared to the actual classifications to determine accuracy. Therefore, if you flip the 'correct' values that you give it, the results will be flipped as well.
Based on your training set, 69.92% of your instances are classified as positive. If the labels for the test set, that is the correct answers, indicate that they are all positive, then that makes 69.92% correct. If the test set (and thus the classification) is the same, but you switch the correct answers, then of course, the percentage correct will also be the opposite.
Keep in mind that in order to evaluate a classifier, you need the true labels of the test set. Otherwise you can't compare the classifier's answers with the true answers. It seems to me that you might have misunderstood this. You can obtain the labels for unseen data, if that is what you want, but in that case you can't evaluate classifier accuracy.