WEKA: identifying instances in 10-fold cross-validation CSV prediction output

I'm using the WEKA Explorer to run a 10-fold cross-validation and I output the predictions to a CSV file. Because the 10-fold procedure shuffles the order of the data, I cannot tell which specific instances are correctly or incorrectly classified.
In other words, by looking at the CSV I do not know which specific 1 or 0 was classified as 1 or 0. Is there any way to see the classification result for every specific instance in the test set of every fold? For example, it would be great if the CSV recorded the ID of the instance being classified.
One alternative would be for me to implement the 10-fold approach manually; i.e., I could create the 10 ARFF files and then run a 90/10 percentage split (preserving order) on each of them. That solution looks rather elaborate, labour-intensive and error-prone.
Thanks for your help!

To do that, you need to do the following for every fold:
double[] res = new double[testSet.numInstances()];
for (int j = 0; j < testSet.numInstances(); j++) {
    res[j] = classifier.classifyInstance(testSet.get(j));
}
Now the res array holds the classification result for every instance in the test set, and you can use this information however you like.
For example, you can print the attributes of each instance (e.g., if an attribute is a string, you can print it with testSet.get(j).stringValue(positionOfAttributeYouWantToPrint), taken before applying any filter) followed by the classification result.
Note that if the class attribute is nominal, you can print the predicted label with:
testSet.classAttribute().value((int) res[j])
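If you'd rather not build the ten ARFF files by hand, the same per-fold bookkeeping can be done programmatically. Below is a minimal sketch (assumptions: mydata.arff is a placeholder file name, J48 is a placeholder classifier, and the AddID filter is used to tag every instance with an ID so it can still be recognised after shuffling; the FilteredClassifier/Remove combination keeps that ID from being used as a predictor):
import java.util.Random;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddID;
import weka.filters.unsupervised.attribute.Remove;

public class PerFoldPredictions {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mydata.arff");   // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);

        // Tag every instance with an ID (AddID inserts it as the first attribute).
        AddID addId = new AddID();
        addId.setInputFormat(data);
        data = Filter.useFilter(data, addId);
        data.setClassIndex(data.numAttributes() - 1);

        int folds = 10;
        Instances randData = new Instances(data);
        randData.randomize(new Random(1));   // same idea as the Explorer's shuffling
        randData.stratify(folds);

        for (int fold = 0; fold < folds; fold++) {
            Instances train = randData.trainCV(folds, fold, new Random(1));
            Instances test = randData.testCV(folds, fold);

            // Hide the ID from the learner so it is not used as a predictor.
            Remove removeId = new Remove();
            removeId.setAttributeIndices("first");
            FilteredClassifier classifier = new FilteredClassifier();
            classifier.setFilter(removeId);
            classifier.setClassifier(new J48());   // placeholder classifier
            classifier.buildClassifier(train);

            for (int j = 0; j < test.numInstances(); j++) {
                double pred = classifier.classifyInstance(test.get(j));
                System.out.println("fold=" + fold
                        + ",id=" + (int) test.get(j).value(0)
                        + ",actual=" + test.classAttribute().value((int) test.get(j).classValue())
                        + ",predicted=" + test.classAttribute().value((int) pred));
            }
        }
    }
}
Each printed line ties a prediction back to the original instance through its ID, which is essentially the information you were hoping to find in the Explorer's CSV output.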

Related

Can I make Amazon SageMaker deliver a recommendation based on historic data instead of a probability score?

We have a huge set of data in CSV format, containing a few numeric elements, like this:
Year,BinaryDigit,NumberToPredict,JustANumber, ...other stuff
1954,1,762,16, ...other stuff
1965,0,142,16, ...other stuff
1977,1,172,16, ...other stuff
The thing here is that there is a strong correlation between the third column and the columns before that. So I have pre-processed the data and it's now available in a format I think is perfect:
1954,1,762
1965,0,142
1977,1,172
What I want is a prediction of the value in the third column, using the first two columns as input. So in the case above, I want the input 1965,0 to return 142. In real life this file has thousands of rows, but since there's a pattern, I'd like to retrieve the most likely value.
So far I've setup a train job on the CSV file using the Linear Learner algorithm, with the following settings:
label_size = 1
feature_dim = 2
predictor_type = regression
I've also created a model from it and set up an endpoint. When I invoke it, I get a score in return.
response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                   ContentType='text/csv',
                                   Body=payload)
My goal here is to get a prediction of the third column instead. How can I achieve that? I have read a lot of the documentation about this, but since I'm not very familiar with AWS, I may well have chosen the wrong algorithm for what I am trying to do.
(Please feel free to edit this question to better suit AWS terminology)
For CSV input, the label should be in the first column, as mentioned in the Linear Learner documentation. So you should preprocess your data to put the label (the column you want to predict) in the leftmost column.
Next, you need to decide whether this is a regression problem or a classification problem.
If you want to predict a number that's as close as possible to the true number, that's regression. For example, the truth might be 4, and the model might predict 4.15. If you need an integer prediction, you could round the model's output.
If you want the prediction to be one of a few categories, then you have a classification problem. For example, we might encode 'North America' = 0, 'Europe' = 1, 'Africa' = 2, and so on. In this case, a fractional prediction wouldn't make sense.
For regression, use 'predictor_type' = 'regressor'; for classification with more than 2 classes, use 'predictor_type' = 'multiclass_classifier', as described in the documentation.
The output of regression will contain only a 'score' field, which is the model's prediction. The output of multiclass classification will contain a 'predicted_label' field, which is the model's prediction, as well as a 'score' field, which is a vector of probabilities representing the model's confidence. The index with the highest probability will be the one that's predicted as the 'predicted_label'. The output formats are documented here.
predictor_type = regression is not able to return the predicted label, according to the linear learner documentation:
For inference, the linear learner algorithm supports the application/json, application/x-recordio-protobuf, and text/csv formats. For binary classification models, it returns both the score and the predicted label. For regression, it returns only the score.
For more information on input and output file formats, see Linear Learner Response Formats for inference, and the Linear Learner Sample Notebooks.

Weka not displaying Correctly Classified Instances in the output

I am new to Weka. I have a CSV dataset with 5000 samples; here are 20 of them. When I load this dataset into Weka it looks OK, but when I run the kNN algorithm it gives a result it is not supposed to give. Here is the sample data.
a,b,c,d
74,85,123,1
73,84,122,1
72,83,121,1
70,81,119,1
70,81,119,1
69,80,118,1
70,81,119,1
70,81,119,1
76,87,125,1
76,87,125,1
82,92,146,2
74,86,140,2
68,80,134,2
64,76,130,2
64,75,132,2
83,96,152,2
72,85,141,2
71,83,141,2
69,81,139,2
65,79,137,2
Here is the result:
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.6148
Mean absolute error 0.2442
Root mean squared error 0.4004
Relative absolute error 50.2313 %
Root relative squared error 81.2078 %
Total Number of Instances 5000
It is supposed to give this kind of result instead:
Correctly classified instances: 69 92%
Incorrectly classified instances: 6 8%
What could be the problem? What am I missing? I tried the other algorithms as well, but they all give the same kind of output. I have used the sample Weka datasets and they all work as expected.
The IBk algorithm can be used for regression (predicting the value of a numeric response for each instance) as well as for classification (predicting which class each instance belongs to).
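A quick way to check which of the two modes you will get is to look at the type of the class attribute. A small sketch (data is assumed to be your loaded Instances object with the class index already set):
if (data.classAttribute().isNumeric()) {
    System.out.println("numeric class -> IBk will perform regression");
} else if (data.classAttribute().isNominal()) {
    System.out.println("nominal class -> IBk will perform classification");
}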
It looks like all the values of the class attribute in your dataset (column d in your CSV) are numbers. When you load this data into Weka, Weka therefore guesses that this attribute should be treated as a numeric one rather than a nominal one. You can tell this has happened in the Preprocess tab: the class attribute's histogram is drawn as a single numeric distribution instead of as bars coloured by class.
The result you're seeing when you run IBk is the result of a regression fit (predicting a numeric value of column d for each instance) instead of a classification (selecting the most likely nominal value of column d for each instance).
To get the result you want, you need to tell Weka to treat this attribute as nominal. When you load the CSV file in the Preprocess tab, check Invoke options dialog in the file dialog window. Then, when you click Open, the loader's options window appears.
Its nominalAttributes field is where you can give Weka a list of attributes that should be treated as nominal even though they look numeric. Entering 4 there specifies that the fourth attribute (column) in the input is a nominal attribute. Now IBk should behave as you expect.
You could also do this by applying the NumericToNominal unsupervised attribute filter to the already loaded data, again specifying attribute 4; otherwise the filter will be applied to all attributes.
The ARFF format used for the Weka sample datasets includes a declaration of each attribute's type. After you've imported (or filtered) your dataset as above, you can save it as ARFF and then reload it later without having to go through the same process, as in the sketch below.
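For completeness, here is a minimal programmatic sketch of the same idea using Weka's CSVLoader, the NumericToNominal filter and ArffSaver (the file names myfile.csv and myfile.arff are placeholders):
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;

public class CsvToNominalArff {
    public static void main(String[] args) throws Exception {
        // Load the CSV file.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("myfile.csv"));      // placeholder path
        Instances data = loader.getDataSet();

        // Convert only the 4th attribute (the class column) from numeric to nominal.
        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setAttributeIndices("4");
        toNominal.setInputFormat(data);
        data = Filter.useFilter(data, toNominal);
        data.setClassIndex(data.numAttributes() - 1);

        // Save as ARFF so the attribute types are remembered next time.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("myfile.arff"));        // placeholder path
        saver.writeBatch();
    }
}
After that, loading myfile.arff will give you a nominal class attribute straight away.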

Same predictions for all inputs with a fine-tuned Inception v3 model

I am trying to fine-tune an Inception v3 model with 2 categories. These are the steps I followed:
1. Created sharded files from custom data using build_image_data.py, after changing the number of classes and examples in imagenet_data.py, and used a labelsfile.txt.
2. Changed the values accordingly in flowers_data.py and trained the model using flowers_train.py.
3. Froze the model and got a protobuf file.
4. My input node (x) expects a batch of size 32 and size 299x299x3, so I hacked my way around this by duplicating my test image 32 times to create an input batch.
5. Using the input and output nodes, the input batch and the script below, I am able to print the prediction scores.
image_data = create_test_batch(args.image_name)
graph=load_graph(args.frozen_model_filename)
x = graph.get_tensor_by_name('prefix/batch_processing/Reshape:0')
y = graph.get_tensor_by_name('prefix/tower_0/logits/predictions:0')
with tf.Session(graph=graph) as sess:
    y_out = sess.run(y, feed_dict={x: image_data})
    print(y_out)
I got a result that looks like this (32 rows, one per image in the batch):
[[ 0.02264258 0.16756369 0.80979371][ 0.02351799 0.16782859 0.80865341].... [ 0.02205461 0.1794569 0.7984885 ][ 0.02153662 0.16436867 0.81409472]]
For any input image, the maximum score is always in column 3, which means I get the same prediction for every input.
Is there anything I am missing in my process? Can anyone help me with this issue?
I am using Python 2.7 on Ubuntu 16.04 in a cloud VM.
I was also facing the same problem, and I found out that I had not preprocessed my test set in the same way as my training set. In my case I fixed the problem by following the same preprocessing steps for both the test and the training set.
This is the problem in mine:
Training:
import cv2
import numpy as np
from keras.preprocessing import image

train_image = []
for i in files:
    img = cv2.imread(i)
    img = cv2.resize(img, (100, 100))
    img = image.img_to_array(img)
    img = img / 255
    train_image.append(img)
X = np.array(train_image)
But while preprocessing the test set I forgot to normalize the images (i.e., doing img = img/255), so adding this img = img/255 step to my test-set preprocessing later solved the problem.
Test Set:
test_image = []
for i in files:
    img = cv2.imread(i)
    img = cv2.resize(img, (100, 100))
    img = image.img_to_array(img)
    img = img / 255          # the normalization step that was missing at first
    test_image.append(img)
I had a similar issue recently, and what I found out was that I had pre-processed my test data differently from the method used in the fine-tuning program. Basically the data ranges were different: for the training images the pixels ranged from 0 to 255, while for the test images the pixels ranged from 0 to 1. That's why, when I fed my test data into the model, it output the same predictions all the time: the pixels of the test data were in such a small range that they made no difference to the model.
Hope that helps even though it might not be your case.

Building Speech Dataset for LSTM binary classification

I'm trying to do binary LSTM classification using theano.
I have gone through the example code; however, I want to build my own.
I have a small set of "Hello" and "Goodbye" recordings that I am using. I preprocess these by extracting their MFCC features and saving the features to text files. I have 20 speech files (10 of each word) and I generate a text file for each recording, so 20 text files containing the MFCC features. Each file is a 13x56 matrix.
My problem now is: how do I use these text files to train the LSTM?
I am relatively new to this. I have gone through some literature on it as well, but have not really gained a good understanding of the concept.
Any simpler way of using LSTMs would also be welcome.
There are many existing implementations, for example a TensorFlow implementation and a Kaldi-focused implementation with all the scripts; it is better to check them first.
Theano is too low-level; you might try Keras instead, as described in its tutorial. You can run the tutorial "as is" to understand how things go.
Then, you need to prepare the dataset. You need to turn your data into sequences of data frames, and for every data frame in a sequence you need to assign an output label.
Keras supports two types of RNN layers: layers returning sequences and layers returning simple values. You can experiment with both; in code you just set return_sequences=True or return_sequences=False.
To train with sequences, you can assign a dummy label to all frames except the last one, which gets the label of the word you want to recognize. You need to place the inputs and output labels into arrays. So it will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,...,1], [0,0,....,2]]
In X every element is a vector of 13 floats. In Y every element is just a number: 0 for intermediate frames and the word ID for the final frame.
To train with just labels, you again place the inputs and output labels into arrays, and the output array is simpler. So the data will be:
X = [[word1frame1, word1frame2, ..., word1framen],[word2frame1, word2frame2,...word2framen]]
Y = [[0,0,1], [0,1,0]]
Note that the output is vectorized (np_utils.to_categorical) to turn the labels into vectors instead of plain numbers.
Then you create the network architecture. You can have 13 floats as input and a vector as output. In the middle you might have one fully connected layer followed by one LSTM layer. Do not use layers that are too big; start with small ones.
Then you feed this dataset into model.fit and it trains the model. You can estimate the model quality on a held-out set after training.
You will have a problem with convergence since you have only 20 examples. You need far more examples, preferably thousands, to train an LSTM; with so little data you will only be able to use very small models.

WEKA: how to get the score from classifyInstance?

I'm using FilteredClassifier.classifyInstance() to classify my instances in Weka.
I have 2 classes (true and false) and many positives, so I actually need to know the score of each instance in order to pick the best positive.
Do you know how I could get the score from my Weka classifier?
thanks
Update: I've also tried to use distributionForInstance, but for each instance I always get an array with [1.0, 0.0].
I actually need to compare several instances to see which one is the most reliable, i.e., which one has the best chance of having been classified correctly.
distributionForInstance(Instance anInstance) is the method you need. It gives you a double array with the confidence for each of your classes. I am using Weka 3.6 and it works well for me. If you always get the same values, your classifier is not well trained and not discriminative at all; in that case you will always get the same class predicted. Did you balance your training set?
distributionForInstance(Instance anInstance) seems right.
Maybe it is not working for you because the classifier doesn't know you need the confidence values? For example, for LibSVM in Weka's Java API you need to set probabilityEstimates to true (via setProbabilityEstimates) in order to get usable scores.
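For example, a minimal sketch with the LibSVM wrapper (weka.classifiers.functions.LibSVM from the LibSVM package; train and test are assumed to be Instances you already have, and StringToWordVector is only a placeholder for whatever filter your FilteredClassifier uses):
LibSVM svm = new LibSVM();
svm.setProbabilityEstimates(true);       // without this, distributionForInstance tends to return hard 0/1 values

FilteredClassifier fc = new FilteredClassifier();
fc.setClassifier(svm);
fc.setFilter(new StringToWordVector());  // placeholder filter; use your own here
fc.buildClassifier(train);

double[] dist = fc.distributionForInstance(test.get(0));
System.out.println(java.util.Arrays.toString(dist));  // one probability per class value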
After you have run the classifier on your data, you can visualize the results by right-clicking on the entry in the "Result list". There are lots of other functions on this right-click menu that will let you obtain scores from Weka classifiers.
Suppose that your model is already trained.
Then, you can make predictions with distributionForInstance. This command produces an array consisting of two items (because there are two classes in your dataset: true and false).
double[] distributions = model.distributionForInstance(new_instance);
Then the index of the greatest item in the distributions array is the classification result.
Assume that distributions = {0.9638458988630731, 0.03615410113692686}. In this case, your new instance would be classified as class_0, because the 1st item is greater than the 2nd item in the distributions array.
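As a small illustration (model and new_instance are the objects from the snippet above; weka.core.Utils ships with Weka), the index of the greatest item can be taken directly:
double[] distributions = model.distributionForInstance(new_instance);
int predictedIndex = weka.core.Utils.maxIndex(distributions);                 // 0 in the example above
String predictedLabel = new_instance.classAttribute().value(predictedIndex);  // the corresponding class name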
You can also get this index with the classifyInstance command.
double classifiedIndex = model.classifyInstance(new_instance);
classifiedIndex would be 0 for distributions = {0.9638458988630731, 0.03615410113692686}.
Finally, you can get the class name, true or false, instead of the class index.
new_instance.setClassValue(classifiedIndex); // first, assign the classified index to new_instance
String classifiedText = new_instance.stringValue(new_instance.classIndex());
This code block produces false.
You might examine this GitHub project for both regression and classification.