I'm working on a web service written in Java that uses Weka's J48 algorithm to classify some attributes. It first builds the classifier and then classifies an instance using the classifier tree.
This is part of the code I have for the classifydata method:
fc.buildClassifier(train);
for (int i = 0; i < test.numInstances(); i++)
{
double pred = fc.classifyInstance(test.instance(i));
predicated = (test.classAttribute().value((int) pred));
}
Here fc is the FilteredClassifier that was set up previously, train is the data used to build the classifier, and test holds the instances to classify.
I'm also not sure whether this code classifies correctly; it would be nice if you could confirm that.
What I really want is the "accuracy percentage". I don't know if that is the right name, but I don't know how else to refer to it. Basically, I want something that returns how confident the classification result is. Imagine I have a simple tree with only two classes, "1" and "2", and I classify an instance and the result is "2". Now I want something that tells me how likely it is that the instance really is a "2".
I hope I made myself clear, because this is fairly new to me as well.
For this you have to use the distributionForInstance() method:
double[] probabilityDistribution = fc.distributionForInstance(test.instance(i));
Then if you have the two class values "1" and "2" (and you added the attribute/class values in that order to your class attribute), you can get the probabilities with which the given test instance is of one of the two class values by:
// Probability of the test instance being a "1"
double classAtt1Prob = probabilityDistribution[0];
// Probability of the test instance being a "2"
double classAtt2Prob = probabilityDistribution[1];
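As a minimal sketch of how such a distribution can be turned into a predicted class plus a confidence value, the helper below picks the index with the largest probability. It is plain Java with no Weka dependency; the class name DistributionUtil, the method argMax, and the sample numbers are hypothetical and only stand in for what distributionForInstance() would return.

```java
/** Plain-Java helper; the array stands in for the output of distributionForInstance(). */
final class DistributionUtil {

    /** Returns the index of the largest probability (ties go to the lowest index). */
    static int argMax(double[] distribution) {
        if (distribution.length == 0) {
            throw new IllegalArgumentException("empty distribution");
        }
        int best = 0;
        for (int i = 1; i < distribution.length; i++) {
            if (distribution[i] > distribution[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical distribution for the class values "1" and "2", in that order.
        double[] probabilityDistribution = {0.25, 0.75};
        String[] classValues = {"1", "2"};
        int predicted = argMax(probabilityDistribution);
        System.out.println("Predicted class " + classValues[predicted]
                + " with probability " + probabilityDistribution[predicted]);
    }
}
```

The probability at the winning index is the per-instance confidence the question asks about, which is different from the overall accuracy of the classifier on a test set.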
We have a huge set of data in CSV format, containing a few numeric elements, like this:
Year,BinaryDigit,NumberToPredict,JustANumber, ...other stuff
1954,1,762,16, ...other stuff
1965,0,142,16, ...other stuff
1977,1,172,16, ...other stuff
The thing here is that there is a strong correlation between the third column and the columns before that. So I have pre-processed the data and it's now available in a format I think is perfect:
1954,1,762
1965,0,142
1977,1,172
What I want is a prediction of the value in the third column, using the first two as input. So in the case above, I want the input 1965,0 to return 142. In real life this file has thousands of rows, but since there's a pattern, I'd like to retrieve the most likely value.
So far I've set up a training job on the CSV file using the Linear Learner algorithm, with the following settings:
label_size = 1
feature_dim = 2
predictor_type = regression
I've also created a model from it and set up an endpoint. When I invoke it, I get a score in return.
response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
ContentType='text/csv',
Body=payload)
My goal here is to get the third column prediction instead. How can I achieve that? I have read a lot of the documentation on this, but since I'm not very familiar with AWS, I may well have used the wrong algorithm for what I am trying to do.
(Please feel free to edit this question to better suit AWS terminology)
For CSV input, the label should be in the first column, as mentioned here. So you should preprocess your data to put the label (the column you want to predict) on the left.
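A minimal sketch of that preprocessing step, in plain Java: it moves the field at a given position to the front of each comma-separated row. The class name LabelFirstCsv and the method moveLabelFirst are hypothetical, and the code assumes simple rows without quoted fields that contain commas.

```java
import java.util.ArrayList;
import java.util.List;

/** Reorders a CSV row so that the label column comes first. */
final class LabelFirstCsv {

    /** Moves the field at labelIndex to the front; all other fields keep their order. */
    static String moveLabelFirst(String row, int labelIndex) {
        String[] fields = row.split(",", -1); // -1 keeps trailing empty fields
        if (labelIndex < 0 || labelIndex >= fields.length) {
            throw new IllegalArgumentException("label index out of range: " + labelIndex);
        }
        List<String> reordered = new ArrayList<>();
        reordered.add(fields[labelIndex]);
        for (int i = 0; i < fields.length; i++) {
            if (i != labelIndex) {
                reordered.add(fields[i]);
            }
        }
        return String.join(",", reordered);
    }

    public static void main(String[] args) {
        // Rows from the question: Year, BinaryDigit, NumberToPredict.
        String[] rows = {"1954,1,762", "1965,0,142", "1977,1,172"};
        for (String row : rows) {
            System.out.println(moveLabelFirst(row, 2));
        }
    }
}
```

With the sample rows from the question, the first output line is 762,1954,1, so the value to predict now sits in the leftmost column.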
Next, you need to decide whether this is a regression problem or a classification problem.
If you want to predict a number that's as close as possible to the true number, that's regression. For example, the truth might be 4, and the model might predict 4.15. If you need an integer prediction, you could round the model's output.
If you want the prediction to be one of a few categories, then you have a classification problem. For example, we might encode 'North America' = 0, 'Europe' = 1, 'Africa' = 2, and so on. In this case, a fractional prediction wouldn't make sense.
For regression, use 'predictor_type' = 'regressor' and for classification with more than 2 classes, use 'predictor_type' = 'multiclass_classifier' as documented here.
The output of regression will contain only a 'score' field, which is the model's prediction. The output of multiclass classification will contain a 'predicted_label' field, which is the model's prediction, as well as a 'score' field, which is a vector of probabilities representing the model's confidence. The index with the highest probability will be the one that's predicted as the 'predicted_label'. The output formats are documented here.
predictor_type = regression is not able to return the predicted label, according to the linear-learner documentation:
For inference, the linear learner algorithm supports the application/json, application/x-recordio-protobuf, and text/csv formats. For binary classification models, it returns both the score and the predicted label. For regression, it returns only the score.
For more information on input and output file formats, see Linear Learner Response Formats for inference, and the Linear Learner Sample Notebooks.
I'm using WEKA Explorer to run a 10-fold cross-validation. I output the predictions to a CSV file. Because the 10-fold approach shuffles the order of the data, I do not know which specific instance is correctly or incorrectly classified.
I mean, by looking at the CSV I do not know which specific 1 or 0 is classified as 1 or 0. Is there any way to see the classification result for every specific instance in the test set of every fold? For example, it would be great if the CSV recorded the ID of the instance being classified.
One alternative would be to implement the 10-fold approach manually; i.e., I could create the 10 ARFF files and then run a 90/10 percentage split on each of them (preserving order). This solution looks rather elaborate, time-consuming and error-prone.
Thanks for your help!
To do that you need to do the following for every fold:
double[] res = new double[testSet.numInstances()];
for (int j = 0; j < testSet.numInstances(); j++) {
    // predicted class index for the j-th test instance
    res[j] = classifier.classifyInstance(testSet.get(j));
}
Now the res array holds the classification result for every instance in the test set, and you can use this information however you want.
For example, you can print the attributes of each instance followed by its classification result. If the attributes are strings, print them (before applying the filter) with testSet.get(j).stringValue(PositionOfAttributeYouWantToPrint).
Note that if the classification result is a nominal value, you can print it using this:
testSet.classAttribute().value((int) res[j])
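To keep track of which original row ends up in which fold, one option is to shuffle row IDs instead of the rows themselves. The sketch below does this in plain Java; FoldTracker and folds are hypothetical names, and the split is a simple unstratified one, so it is an assumption-laden illustration and not a reproduction of how WEKA Explorer assigns folds.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Splits row IDs 0..n-1 into k folds so every prediction can be traced back to its row. */
final class FoldTracker {

    /** Shuffles the IDs with the given seed and deals them round-robin into k folds. */
    static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            ids.add(i);
        }
        Collections.shuffle(ids, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int pos = 0; pos < n; pos++) {
            folds.get(pos % k).add(ids.get(pos));
        }
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(23, 10, 1L);
        for (int f = 0; f < folds.size(); f++) {
            // These IDs form the test set of fold f; all other IDs form its training set.
            System.out.println("fold " + f + " tests rows " + folds.get(f));
        }
    }
}
```

Writing the row ID next to each prediction in the CSV then shows exactly which instance was classified correctly or incorrectly in each fold.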
Use case: selecting the "optimal threshold" for a logistic model built with statsmodels' Logit to predict binary classes (or multinomial, but integer, classes).
Is there something built in for selecting the threshold of a (say, logistic) model in Python? For small data sets, I remember optimizing the threshold by picking the point that maximizes the buckets of correctly predicted labels (true "0" and true "1"), best seen from the graph here:
http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
I also know intuitively that if I set alpha values, it should give me a "threshold" that I can use below. How should I compute the threshold given a reduced model whose variables are all significant at 95% confidence? Obviously, setting the threshold at >0.5 -> "1" would be too naive, and since I am looking at 95% confidence this threshold should be "smaller", meaning p > 0.2 or something.
This would then mean something like a range of "critical values": the label should be "1" inside it and "0" otherwise.
What I want is something like this:
test_scores = smf.Logit(y_train,x_train,missing='drop').fit()
threshold =0.2
# test_scores.predict(x_train, transform=False) gives the continuous class probability, so to transform it into labels I need to compare it against a threshold (or use x_test if I am testing the model)
y_predicted_train = np.array(test_scores.predict(x_train,transform=False) > threshold, dtype=float)
table = np.histogram2d(y_train, y_predicted_train, bins=2)[0]
# will do the similar on "test" data
# crude way of selecting an optimal threshold
from scipy.stats import ks_2samp
import numpy as np
ks_2samp(y_train, y_predicted_train)
(0.39963996399639962, 0.958989)
# must get < 95% here, and keep modifying the threshold as above until I fail to reject the null at 95%
# where y_train holds the REAL values and y_predicted_train the predictions on the TRAIN dataset. Note that to get y_predicted_train as binary, I already did the thresholding as above
Questions:
1. How can I select the threshold in an objective way, i.e. reduce the percentage of misclassified labels? Say I care more about missing a "1" (a false negative) than about mispredicting a "0" as a "1" (a false positive), and want to reduce that error. This I get from the ROC curve. The ROC curve in statsmodels (roc_curve) assumes that I have already done the labelling for the y_predicted class and am just revalidating it on the test set (correct me if my understanding is wrong). I also think that using the confusion matrix will not solve the problem of picking the threshold.
2. Which brings me to: how should I consume the output of these built-in functions (oob, confusion_matrix) to select the optimal threshold (first on the train sample, and then fine-tune it over the test and cross-validation samples)?
I also looked up the official documentation of K-S tests in scipy here-
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
Related -:
Statistics Tests (Kolmogorov and T-test) with Python and Rpy2
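One objective criterion that matches the K-S intuition above is to scan candidate thresholds over the predicted probabilities and keep the one that maximizes the true positive rate minus the false positive rate (often called Youden's J). The largest such gap is the two-sample K-S distance between the score distributions of the two classes. The sketch below is plain Java and independent of statsmodels; ThresholdPicker, bestThreshold and the sample scores are hypothetical. It weighs both error types equally, which is an assumption; asymmetric costs would call for a different objective.

```java
/** Picks a probability threshold by maximizing TPR - FPR over the observed scores. */
final class ThresholdPicker {

    /** Predicts 1 when score >= threshold; labels must be 0 or 1 and contain both classes. */
    static double bestThreshold(double[] scores, int[] labels) {
        int positives = 0;
        int negatives = 0;
        for (int y : labels) {
            if (y == 1) positives++; else negatives++;
        }
        if (positives == 0 || negatives == 0) {
            throw new IllegalArgumentException("need at least one example of each class");
        }
        double best = Double.NaN;
        double bestJ = Double.NEGATIVE_INFINITY;
        for (double candidate : scores) { // every observed score is a candidate threshold
            int tp = 0;
            int fp = 0;
            for (int i = 0; i < scores.length; i++) {
                if (scores[i] >= candidate) {
                    if (labels[i] == 1) tp++; else fp++;
                }
            }
            double j = (double) tp / positives - (double) fp / negatives;
            if (j > bestJ) {
                bestJ = j;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical predicted probabilities and true labels.
        double[] scores = {0.1, 0.2, 0.3, 0.45, 0.5, 0.7, 0.9};
        int[] labels = {0, 0, 0, 1, 0, 1, 1};
        System.out.println("chosen threshold: " + bestThreshold(scores, labels));
    }
}
```

For the sample data the scan settles on 0.45, where all three positives are caught at the cost of one false positive. As the question suggests, a threshold chosen on the training sample should still be checked on held-out data.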
I was wondering if there's a way to train a model using Naive Bayes and then apply it to a single record. I'm new to Weka, so I don't know if this is possible. Also, is there a way to store the classifier output in a file?
The answer is yes, since Naive Bayes is a simple probabilistic model based on Bayes' theorem that can be used for classification tasks.
For classification with Naive Bayes, as with other classifiers, you first need to train the model on a sample dataset. Once trained, the model can be applied to any record.
Of course there will always be some probability of error with this approach, but that depends mostly on the quality of your sample and the properties of your data set.
I haven't used Weka directly, only as an extension for RapidMiner, but the same principles should apply. Once the model is trained, you should be able to see or print the model parameters.
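To make the train-then-apply idea concrete, here is a small categorical Naive Bayes in plain Java with add-one smoothing. It is a stand-in for illustration, not Weka's NaiveBayes class; TinyNaiveBayes, its methods and the toy weather data are all hypothetical, and a record passed to classify is assumed to have no more attributes than the training rows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Tiny categorical Naive Bayes with add-one smoothing; a stand-in, not Weka's NaiveBayes. */
final class TinyNaiveBayes {
    private final Map<String, Integer> classCounts = new HashMap<>();
    private final Map<String, Integer> valueCounts = new HashMap<>();
    private final List<Set<String>> seenValues = new ArrayList<>();
    private int total = 0;

    /** Training step: count classes and (class, attribute, value) co-occurrences. */
    void train(String[][] rows, String[] labels) {
        for (int r = 0; r < rows.length; r++) {
            classCounts.merge(labels[r], 1, Integer::sum);
            total++;
            for (int a = 0; a < rows[r].length; a++) {
                if (seenValues.size() <= a) {
                    seenValues.add(new HashSet<>());
                }
                seenValues.get(a).add(rows[r][a]);
                valueCounts.merge(key(labels[r], a, rows[r][a]), 1, Integer::sum);
            }
        }
    }

    /** Applies the trained counts to one record and returns the most probable class. */
    String classify(String[] record) {
        String best = null;
        double bestLogProb = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> c : classCounts.entrySet()) {
            double logProb = Math.log((double) c.getValue() / total); // class prior
            for (int a = 0; a < record.length; a++) {
                int count = valueCounts.getOrDefault(key(c.getKey(), a, record[a]), 0);
                // add-one smoothing over the values seen for this attribute
                logProb += Math.log((count + 1.0) / (c.getValue() + seenValues.get(a).size()));
            }
            if (logProb > bestLogProb) {
                bestLogProb = logProb;
                best = c.getKey();
            }
        }
        return best;
    }

    private static String key(String label, int attribute, String value) {
        return label + "|" + attribute + "|" + value;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"sunny", "hot"}, {"sunny", "mild"}, {"rainy", "mild"},
            {"overcast", "hot"}, {"rainy", "cool"}
        };
        String[] labels = {"no", "no", "yes", "yes", "yes"};
        TinyNaiveBayes model = new TinyNaiveBayes();
        model.train(rows, labels);
        // Apply the trained model to a single new record.
        System.out.println(model.classify(new String[] {"rainy", "mild"}));
    }
}
```

The model is built once from the sample data and can then be queried one record at a time, which is the workflow the answer describes.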
I am currently searching for the same answer, while using Java.
I created an ARFF file containing training data and used the program http://weka.wikispaces.com/file/view/WekaDemo.java as an example to train and evaluate the classifier.
I still need to figure out how to save and load a model in Java and (more importantly) how to test against a single record.
WekaDemo.java
...
public void execute() throws Exception {
// run filter
m_Filter.setInputFormat(m_Training);
Instances filtered = Filter.useFilter(m_Training, m_Filter);
// train classifier on complete file for tree
m_Classifier.buildClassifier(filtered);
// 10fold CV with seed=1
m_Evaluation = new Evaluation(filtered);
m_Evaluation.crossValidateModel(
m_Classifier, filtered, 10, m_Training.getRandomNumberGenerator(1));
//TODO Save model
//TODO Load model
//TODO Test against a single information
}
...
Edit 1:
Save and loading a model is explained here: How to test existing model with new instance in weka, using java code?
In http://weka.wikispaces.com/Use+WEKA+in+your+Java+code#Classification-Classifying%20instances there is a quick how to for classifying a single instance.
//load model (saved from user interface)
Classifier tree = (Classifier) weka.core.SerializationHelper.read("/some/where/j48.model");
// load unlabeled data
Instances unlabeled = new Instances( new BufferedReader(new FileReader("/some/where/unlabeled.arff")));
// set class attribute
unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
// create copy
Instances labeled = new Instances(unlabeled);
// label instances
for (int i = 0; i < unlabeled.numInstances(); i++) {
double clsLabel = tree.classifyInstance(unlabeled.instance(i));
labeled.instance(i).setClassValue(clsLabel);
System.out.println(clsLabel + " -> " + unlabeled.classAttribute().value((int) clsLabel));
double[] dist = tree.distributionForInstance(unlabeled.instance(i));
for (int j = 0; j < dist.length; j++) {
System.out.println(unlabeled.classAttribute().value(j) + ": " + dist[j]);
}
}
Edit: This method doesn't train, evaluate and save a model. That is something I usually do using the Weka GUI. ( http://weka.wikispaces.com/Serialization )
This method uses a tree type model in the example with a nominal class, but that should be easily converted to a Naive Bayes example.
I'm using FilteredClassifier.classifyInstance() to classify my instances in Weka.
I have 2 classes (true and false) and many positives, so I actually need to know the score of each instance to find the best positive.
Do you know how I could get the score from my Weka classifier?
Thanks
Update: I've also tried to use distributionForInstance, but for each instance I always get an array with [1.0, 0.0].
I actually need to compare several instances to see which one is the most reliable, i.e. which one has the best chance of having been classified correctly.
distributionForInstance(Instance anInstance) is the method you need. It gives you a double array showing the confidence for each of your classes. I am using Weka 3.6 and it works well for me. If you always get the same values, your classifier is not trained well and does not discriminate at all. In that case, you should also always get the same class predicted. Did you balance your training set?
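Once each instance has such a distribution, finding the "best positive" is a matter of sorting by the probability of the positive class. A minimal plain-Java sketch follows; ConfidenceRanking, rankByClassProbability and the sample distributions are hypothetical, and each row of the input is assumed to be what distributionForInstance returned for one instance.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Orders instances by the probability their distribution assigns to one class. */
final class ConfidenceRanking {

    /** Returns instance indices from the highest to the lowest probability for classIndex. */
    static List<Integer> rankByClassProbability(double[][] distributions, int classIndex) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < distributions.length; i++) {
            order.add(i);
        }
        order.sort(Comparator
                .comparingDouble((Integer i) -> distributions[i][classIndex])
                .reversed());
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical distributions for three instances; index 0 is assumed to be "true".
        double[][] distributions = {{0.6, 0.4}, {0.95, 0.05}, {0.8, 0.2}};
        System.out.println(rankByClassProbability(distributions, 0)); // prints [1, 2, 0]
    }
}
```

If every distribution is [1.0, 0.0], as in the update to the question, all instances tie and the ranking carries no information.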
distributionForInstance(Instance anInstance) seems right.
Maybe it is not working for you because the classifier doesn't know you'd need the confidence values? For example for LibSVM on Weka Java, you need to set setProbabilityEstimates to true, in order to use the scores.
After you have run the classifier on your data, you can visualize the data by right-clicking on the test in the "Result list". There are lots of other functions on this right-click menu that will let you obtain scores from Weka classifiers.
Suppose that your model is already trained.
Then you can make predictions with distributionForInstance. This call produces an array with two items, because there are two classes in your dataset: true and false.
double[] distributions = model.distributionForInstance(new_instance);
After that, the index of the greatest item in the distributions array is the classification result.
Assume that distributions = {0.9638458988630731, 0.03615410113692686}. In this case, your new instance would be classified as class_0 because 1st item is greater than 2nd item in distributions array.
You can also get this index with classifyInstance command.
double classifiedIndex = model.classifyInstance(new_instance);
classifiedIndex value would be 0 for distributions = {0.9638458988630731, 0.03615410113692686}.
Finally, you can get the class name as true or false instead of class index.
new_instance.setClassValue(classifiedIndex); // first, assign the classified index to new_instance
String classifiedText = new_instance.stringValue(new_instance.classIndex()); // then read the class attribute as text
This code block produces false.
You might examine this GitHub project for both regression and classification.