Weka classification and predicted class - weka

I'm trying to classify an unlabelled string using Weka. I'm not an expert in data mining, so I have been struggling with the different terms. I provide the training data and set the unlabelled string after running the M5Rules classifier. I do get an output, but I have no idea what it means:
run:
{17 1,35 1,64 1,135 1,205 1,214 1,215 1,284 1,288 1,309 1,343 1,461 1,493 1,500 1,552 1,806 -0.038168} | -0.03816793850062397
-0.03816793850062397 ->
Results
======
Correlation coefficient 0
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 1
BUILD SUCCESSFUL (total time: 1 second)
The source code is as follows:
public Categorizer(){
    try{
        // *** READ ARFF FILES ***
        //BufferedReader trainReader = new BufferedReader(new FileReader("c:/Users/Yehia A.Salam/Desktop/dd/training-data.arff"));//File with text examples
        //BufferedReader classifyReader = new BufferedReader(new FileReader("c:/Users/Yehia A.Salam/Desktop/dd/test-data.arff"));//File with text to classify
        // Create training data instances
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("c:/Users/Yehia A.Salam/Desktop/dd/training-data"));
        Instances dataRaw = loader.getDataSet();
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(dataRaw);
        Instances dataTraining = Filter.useFilter(dataRaw, filter);
        dataTraining.setClassIndex(dataRaw.numAttributes() - 1);
        // Create test data instances
        loader.setDirectory(new File("c:/Users/Yehia A.Salam/Desktop/dd/test-data"));
        dataRaw = loader.getDataSet();
        Instances dataTest = Filter.useFilter(dataRaw, filter);
        dataTest.setClassIndex(dataTest.numAttributes() - 1);
        // Classify
        FilteredClassifier model = new FilteredClassifier();
        model.setFilter(new StringToWordVector());
        model.setClassifier(new M5Rules());
        model.buildClassifier(dataTraining);
        for (int i = 0; i < dataTest.numInstances(); i++) {
            dataTest.instance(i).setClassMissing();
            double cls = model.classifyInstance(dataTest.instance(i));
            dataTest.instance(i).setClassValue(cls);
            System.out.println(dataTest.instance(i).toString() + " | " + cls);
            System.out.println(cls + " -> " + dataTest.instance(i).classAttribute().value((int) cls));
            // evaluate classifier and print some statistics
            Evaluation eval = new Evaluation(dataTraining);
            eval.evaluateModelOnce(cls, dataTest.instance(i));
            System.out.println(eval.toSummaryString("\nResults\n======\n", false));
        }
    }
    catch(FileNotFoundException e){
        System.err.println(e.getMessage());
    }
    catch(IOException i){
        System.err.println(i.getMessage());
    }
    catch(Exception o){
        System.err.println(o.getMessage());
    }
}
And finally, a couple of screenshots in case I did anything wrong in the folder hierarchy:

tl;dr:
You set the class index to a random feature
You have to use a classifier, not a regression algorithm
The problem is how you initialize your data sets. Although Weka usually puts the class in the last column, the TextDirectoryLoader doesn't. In fact, you don't need to set the class index manually because it is already set, so remove the lines
dataTraining.setClassIndex(dataRaw.numAttributes() - 1);
dataTest.setClassIndex(dataTest.numAttributes() - 1);
(The first line is wrong anyway, because you use the number of attributes from the raw data set, but choose the column of the already filtered data set.)
If you then run your code, you will get this:
weka.classifiers.functions.LinearRegression: Cannot handle binary class!
As I suspected, M5Rules is a regression algorithm, not a classifier. If you use a classifier like J48 or RandomForest, you will get a more sensible output. Just change the line
model.setClassifier(new M5Rules());
to
model.setClassifier(new RandomForest());
As for your output, here is what I make of it:
{17 1,35 1,64 1,135 1,205 1,214 1,215 1,284 1,288 1,309 1,343 1,461 1,493 1,500 1,552 1,806 -0.038168} | -0.03816793850062397
-0.03816793850062397 ->
is the result of the lines
System.out.println(dataTest.instance(i).toString() + " | " + cls);
System.out.println(cls + " -> " + dataTest.instance(i).classAttribute().value((int) cls));
So you see the features of your instance serialized as sparse ARFF followed by | and the class.
Usually the class should be an integer index, but according to its documentation M5Rules is meant for regression problems, so you get continuous values rather than discrete classes; in your case, -0.03816793850062397.
Since you (incorrectly) set a numerical feature as the class label, M5Rules didn't complain and gave you an output. If you use an actual classifier, you will get your labels "health" or "travel".
The rest are standard statistics about the classifier's performance, but they are pretty useless for a single classified instance. It looks like the one sample was classified correctly, so all errors are zero.
Correlation coefficient 0
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0 %
Root relative squared error 0 %
Total Number of Instances 1

Just in case someone else gets the same error with M5P, check whether the ARFF file is just a header or empty.
Otherwise try
model.buildClassifier(....)
instead of
model.setClassifier(....);
That solved it for me.


k-fold cross validation: how to filter data based on a randomly generated integer variable in Stata

The following seems obvious, yet it does not behave as I would expect. I want to do k-fold cross validation without using SSC packages, and thought I could just filter my data and run my own regressions on the subsets.
First I generate a variable with a random integer between 1 and 5 (5-fold cross validation), then I loop over each fold number. I want to filter the data by the fold number, but using a boolean filter fails to filter anything. Why?
Bonus: what would be the best way to capture all of the test MSEs and average them? In Python I would just make a list or a numpy array and take the average.
gen randint = floor((6-1)*runiform()+1)
recast int randint
forval b = 1(1)5 {
xtreg c.DepVar /// // training set
c.IndVar1 ///
c.IndVar2 ///
if randint !=`b' ///
, fe vce(cluster uuid)
xtreg c.DepVar /// // test set, needs to be performed with model above, not a
c.IndVar1 /// // new model...
c.IndVar2 ///
if randint ==`b' ///
, fe vce(cluster uuid)
}
EDIT: Test set needs to be performed with model fit to training set. I changed my comment in the code to reflect this.
Ultimately, the filtering issue came down to the fact that I was using a scalar in quotes to define the bounds. I had:
replace randint = floor((`varscalar'-1)*runiform()+1)
instead of just
replace randint = floor((varscalar-1)*runiform()+1)
When and where to use the quotes in Stata is confusing to me. I cannot just use varscalar in a loop, I have to use `=varscalar', but I can for some reason use varscalar - 1 and get the expected result. Interestingly, I cannot use
replace randint = floor((`varscalar')*runiform()+1)
I suppose I should just use
replace randint = floor((`=varscalar')*runiform()+1)
So why is it ok to use the version with the minus one and without the equals sign??
The answer below is still extremely helpful and I learned much from it.
As a matter of fact, two different things are going on here that are not necessarily directly related. 1) How to filter data with a randomly generated integer value and 2) k-fold cross-validation procedure.
For the first one, I will leave an example below that could help you work things out in Stata, using tools that transfer easily to other problems (such as generating and manipulating a matrix to store the metrics). However, I would call neither your sketch nor my example "k-fold cross-validation", mainly because both fit a separate model on the training data and on the testing data. Strictly speaking, the model should be fit on the training data only, and those parameters should then be used to assess its performance on the testing data.
For further references on the procedure Scikit-learn has done brilliant work explaining it with several visualizations included.
That being said, here is something that could be helpful.
clear all
set seed 4
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
gen byte randint = runiformint(1, 5)
tab randint
/*
randint | Freq. Percent Cum.
------------+-----------------------------------
1 | 17 17.00 17.00
2 | 18 18.00 35.00
3 | 21 21.00 56.00
4 | 19 19.00 75.00
5 | 25 25.00 100.00
------------+-----------------------------------
Total | 100 100.00
*/
// create a matrix to store results
matrix res = J(5,4,.)
matrix colnames res = "R2_fold" "MSE_fold" "R2_hold" "MSE_hold"
matrix rownames res ="1" "2" "3" "4" "5"
// show formatted empty matrix
matrix li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 . . . .
2 . . . .
3 . . . .
4 . . . .
5 . . . .
*/
// loop over different samples
forvalues b = 1/5 {
// run the model using fold == `b'
qui reg y x1 x2 if randint ==`b'
// save R squared training
matrix res[`b', 1] = e(r2)
// save rmse training
matrix res[`b', 2] = e(rmse)
// run the model using fold != `b'
qui reg y x1 x2 if randint !=`b'
// save R squared training (?)
matrix res[`b', 3] = e(r2)
// save rmse testing (?)
matrix res[`b', 4] = e(rmse)
}
// Show matrix with stored metrics
mat li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 .50949187 1.2877728 .74155365 1.0070531
2 .89942838 .71776458 .66401888 1.089422
3 .75542004 1.0870525 .68884359 1.0517139
4 .68140328 1.1103964 .71990589 1.0329239
5 .68816084 1.0017175 .71229925 1.0596865
*/
// some matrix algebra workout to obtain the mean of the metrics
mat U = J(rowsof(res),1,1)
mat sum = U'*res
/* create vector of column (variable) means */
mat mean_res = sum/rowsof(res)
// show the average of the metrics across the folds
mat li mean_res
/*
mean_res[1,4]
R2_fold MSE_fold R2_hold MSE_hold
c1 .70678088 1.0409408 .70532425 1.0481599
*/

Using For loop on nested list

I'm using a nested list to hold data in a Cartesian coordinate type system.
The data is a list of categories which could be 0,1,2,3,4,5,255 (just 7 categories).
The data is held in a list formatted thus:
stack = [[0,1,0,0],
[2,1,0,0],
[1,1,1,3]]
Each list represents a row and each element of a row represents a data point.
I'm keen to hang on to this format because I am using it to generate images and thus far it has been extremely easy to use.
However, I have run into problems running the following code:
for j in range(len(stack)):
    stack[j].append(255)
    stack[j].insert(0, 255)
This is intended to iterate through each row adding a single element 255 to the start and end of each row. Unfortunately it adds 12 instances of 255 to both the start and end!
This makes no sense to me. Presumably I am missing something very trivial but I can't see what it might be. As far as I can tell it is related to the loop: if I write stack[0].append(255) outside of the loop it behaves normally.
The code is obviously part of a much larger script. The script runs multiple For loops, a couple of which are range(12) but which should have closed by the time this loop is called.
So - am I missing something trivial or is it more nefarious than that?
Edit: full code
step_size = 12, the code above is the part that inserts "right and left borders"
def classify(target_file, output_file):
    import numpy
    import cifar10_eval  # want to hijack functions from the evaluation script
    target_folder = "Binaries/"  # finds target file in "Binaries"
    destination_folder = "Binaries/Maps/"  # destination for output file
    # open the meta file to retrieve x,y dimensions
    file = open(target_folder + target_file + "_meta" + ".txt", "r")
    new_x = int(file.readline())
    new_y = int(file.readline())
    orig_x = int(file.readline())
    orig_y = int(file.readline())
    segment_dimension = int(file.readline())
    step_size = int(file.readline())
    file.close()
    # run cifar10_eval and create predictions vector (formatted as a list)
    predictions = cifar10_eval.map_interface(new_x * new_y)
    del predictions[(new_x * new_y):]  # get rid of excess predictions (that are an artefact of the fixed batch size)
    print("# of predictions: " + str(len(predictions)))
    # check that we are mapping the whole picture! (evaluation functions don't necessarily use the full data set)
    if len(predictions) != new_x * new_y:
        print("Error: number of predictions from cifar10_eval does not match metadata for this file")
        return
    # copy predictions to a nested list to make extraction of x/y data easy
    # also eliminates need to keep metadata - x/y dimensions are stored via the shape of the output vector
    stack = []
    for j in range(new_y):
        stack.append([])
        for i in range(new_x):
            stack[j].append(predictions[j*new_x + i])
    predictions = None  # clear the variable to free up memory
    # iterate through map list and explode each category to cover more pixels
    # assigns a step_size x step_size area to each classification input to achieve correspondence with original image
    new_stack = []
    for j in range(len(stack)):
        row = stack[j]
        new_row = []
        for i in range(len(row)):
            for a in range(step_size):
                new_row.append(row[i])
        for b in range(step_size):
            new_stack.append(new_row)
    stack = new_stack
    new_stack = None
    new_row = None  # clear the variables to free up memory
    # add a border to the image to indicate that some information has been lost
    # border also ensures that map has 1-1 correspondence with original image which makes processing easier
    # calculate border dimensions
    top_and_left_thickness = int((segment_dimension - step_size) / 2)
    right_thickness = int(top_and_left_thickness + (orig_x - (top_and_left_thickness * 2 + step_size * new_x)))
    bottom_thickness = int(top_and_left_thickness + (orig_y - (top_and_left_thickness * 2 + step_size * new_y)))
    print(top_and_left_thickness)
    print(right_thickness)
    print(bottom_thickness)
    print(len(stack[0]))
    # add the right then left borders
    for j in range(len(stack)):
        for b in range(right_thickness):
            stack[j].append(255)
        for b in range(top_and_left_thickness):
            stack[j].insert(0, 255)
    print(stack[0])
    print(len(stack[0]))
    # add the top and bottom borders
    row = []
    for i in range(len(stack[0])):
        row.append(255)  # create a blank row
    for b in range(top_and_left_thickness):
        stack.insert(0, row)  # append the blank row to the top x many times
    for b in range(bottom_thickness):
        stack.append(row)  # append the blank row to the bottom of the map
    # we have our final output
    # repackage this as a numpy array and save for later use
    output = numpy.asarray(stack, numpy.uint8)
    numpy.save(destination_folder + output_file + ".npy", output)
    print("Category mapping complete, map saved as numpy pickle: " + output_file + ".npy")
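A likely explanation for the twelve copies, judging from the full code: the "explode" step calls new_stack.append(new_row) step_size times with the same list object, so each group of 12 rows in stack is really one list referenced 12 times. The border loop then visits that one list 12 times and appends to it on every visit. The sketch below reproduces the effect on a tiny example and shows one possible fix (appending a copy); step_size is set to 3 here purely to keep the output short, and the variable names aliased and independent are made up for the illustration.

```python
step_size = 3  # small stand-in for the 12 used in the question

# Aliased: the same list object is appended step_size times.
row = [0, 1]
aliased = []
for _ in range(step_size):
    aliased.append(row)

for j in range(len(aliased)):
    aliased[j].append(255)  # every j mutates the one shared list

# Independent: append a copy each time, so each row is its own object.
row = [0, 1]
independent = []
for _ in range(step_size):
    independent.append(list(row))

for j in range(len(independent)):
    independent[j].append(255)  # each row is mutated exactly once

print(aliased[0])      # [0, 1, 255, 255, 255]
print(independent[0])  # [0, 1, 255]
```

In the script, the equivalent change would be new_stack.append(list(new_row)). This also explains why stack[0].append(255) outside the loop looked normal: a single call adds a single element, even though it shows up in all the aliased rows.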

Retrieve values from an array - get "cannot call value of non-function type String"

I'm trying to retrieve a value from an array, based on an index parsed from a string of digits. I'm stuck on this error, and the other answers to similar questions in this forum appear to be for more advanced developers (this is my first iOS app).
The app will eventually look up weather reports ("MAFOR" groupings of 5 digits each) from a web site, parse each group and lookup values from arrays for wind direction, speed, forecast period etc using each character.
The playground code is below, appreciate any help on where I am going wrong (look for ***)
//: Playground - noun: a place where people can play
import UIKit
var str = "Hello, playground"
// create array for Forecast Period
let forecastPeriodArray = ["Existing conditions at beginning","3 hours","6 hours","9 hours","12 hours","18 hours","24 hours","48 hours","72 hours","Occasionally"]
// create array for Wind Direction
let windDirectionArray = ["Calm","Northeast","East","Southeast","South","Southwest","West","Northwest","North","Variable"]
// create array for Wind Velocity
let windVelocityArray = ["0-10 knots","11-16 knots","17-21 knots","22-27 knots","28-33 knots","34-40 knots","41-47 knots","48-55 knots","56-63 knots","64-71 knots"]
// create array for Forecast Weather
let forecastWeatherArray = ["Moderate or good visibility (> 3 nm.","Risk of ice accumulation (temp 0C to -5C","Strong risk of ice accumulkation (air temp < -5C)","Mist (visibility 1/2 to 3 nm.)","Fog (visibility less than 1/2 nm.)","Drizzle","Rain","Snow, or rain and snow","Squally weather with or without showers","Thunderstorms"]
// retrieve full MAFOR line of several information groups (this will be pulled from a web site)
var myMaforLineString = "11747 19741 13757 19751 11730 19731 11730 13900 11630 13637"
// split into array components wherever " " is encountered
var myMaforArray = myMaforLineString.components(separatedBy: " ")
let count = myMaforArray.count
print("There are \(count) items in the array")
// Go through each group and parse out the needed digits
for maforGroup in myMaforArray {
    print("MAFOR group \(maforGroup)")
    // get Forecast Period
    var idx = maforGroup.index(maforGroup.startIndex, offsetBy: 1)
    var periodInt = maforGroup[idx]
    print("periodInt is \(periodInt)")
    // *** here is where I am stuck... trying to use the periodInt index value to retrieve the description from the ForecastPeriodArray
    var periodDescription = forecastPeriodArray(periodInt)
    print("Forecast period = (forecastPeriodArray(periodInt)")
    // get Wind Direction
    idx = maforGroup.index(maforGroup.startIndex, offsetBy: 2)
    var directionInt = maforGroup[idx]
    print("directionInt is \(directionInt)")
    // get Wind Velocity
    idx = maforGroup.index(maforGroup.startIndex, offsetBy: 3)
    var velocityInt = maforGroup[idx]
    print("velocityInt is \(velocityInt)")
    // get Weather Forecast
    idx = maforGroup.index(maforGroup.startIndex, offsetBy: 4)
    var weatherInt = maforGroup[idx]
    print("weatherInt is \(weatherInt)")
}
@shallowThought was close.
You are trying to access an array by its index, therefore use the array[index] notation. But your index has to be of the correct type. forecastPeriodArray[periodInt] therefore does not work since periodInt is not an Int as the name would suggest. Currently it is of type Character which does not make much sense.
What you are probably trying to achieve is convert the character to an integer and use that to access the array:
var periodInt = Int(String(maforGroup[idx]))!
You might want to add error handling for the case when the character does not actually represent an integer.

Why does Relation.size sometimes return a Hash in Rails 4

I can run a query in two different ways to return a Relation.
When I check the size of the Relation, one query gives a Fixnum as expected; the other gives a Hash that maps each value in the Relation's GROUP BY clause to its number of occurrences.
In Rails 3 I assume it always returned a Fixnum, as I never had a problem, whereas with Rails 4 it sometimes returns a Hash, and a statement like Rel.size.zero? gives the error:
undefined method `zero?' for {}:Hash
Am I best off just using the .blank? method to check for zero records, to be sure of avoiding unexpected errors?
Here is a snippet of code with logging statements for the two queries, and the resulting log:
CODE:
assessment_responses1=AssessmentResponse.select("process").where("client_id=? and final = ?",self.id,false).group("process")
logger.info("-----------------------------------------------------------")
logger.info("assessment_responses1.class = #{assessment_responses1.class}")
logger.info("assessment_responses1.size.class = #{assessment_responses1.size.class}")
logger.info("assessment_responses1.size value = #{assessment_responses1.size}")
logger.info("............................................................")
assessment_responses2=AssessmentResponse.select("distinct process").where("client_id=? and final = ?",self.id,false)
logger.info("assessment_responses2.class = #{assessment_responses2.class}")
logger.info("assessment_responses2.size.class = #{assessment_responses2.size.class}")
logger.info("assessment_responses2.size values = #{assessment_responses2.size}")
logger.info("-----------------------------------------------------------")
LOG
-----------------------------------------------------------
assessment_responses1.class = ActiveRecord::Relation::ActiveRecord_Relation_AssessmentResponse
(0.5ms) SELECT COUNT(`assessment_responses`.`process`) AS count_process, process AS process FROM `assessment_responses` WHERE `assessment_responses`.`organisation_id` = 17 AND (client_id=43932 and final = 0) GROUP BY process
assessment_responses1.size.class = Hash
CACHE (0.0ms) SELECT COUNT(`assessment_responses`.`process`) AS count_process, process AS process FROM `assessment_responses` WHERE `assessment_responses`.`organisation_id` = 17 AND (client_id=43932 and final = 0) GROUP BY process
assessment_responses1.size value = {"6 Month Review(1)"=>3, "Assessment(1)"=>28, "Assessment(2)"=>28}
............................................................
assessment_responses2.class = ActiveRecord::Relation::ActiveRecord_Relation_AssessmentResponse
(0.5ms) SELECT COUNT(distinct process) FROM `assessment_responses` WHERE `assessment_responses`.`organisation_id` = 17 AND (client_id=43932 and final = 0)
assessment_responses2.size.class = Fixnum
CACHE (0.0ms) SELECT COUNT(distinct process) FROM `assessment_responses` WHERE `assessment_responses`.`organisation_id` = 17 AND (client_id=43932 and final = 0)
assessment_responses2.size values = 3
-----------------------------------------------------------
size on an ActiveRecord::Relation that has not been loaded yet translates to count, because it tries to get the number of rows from the database. But when you call count on a grouped Relation, you receive a Hash.
The keys of this hash are the grouped column's values; the values of this hash are the respective counts.
AssessmentResponse.group(:client_id).count # this will return a Hash
AssessmentResponse.group(:client_id).size # this will also return a Hash
This is true for the following methods: count, sum, average, maximum, and minimum.
If you want to check for rows being present or not, simply use exists? i.e. do the following:
AssessmentResponse.group(:client_id).exists?
Instead of this:
AssessmentResponse.group(:client_id).count.zero?

Log base 2 calculation in python

I am trying to calculate the average disorder in ID trees. My code is below:
Republican_yes = yes.count('Republican')
Democrat_yes = yes.count('Democrat')
Republican_no = no.count('Republican')
Democrat_no = no.count('Democrat')
Indep_yes = yes.count('Independent')
Indep_no = no.count('Independent')
disorder_yes= Republican_yes/len(yes)*(math.log(float(Republican_yes)/len(yes),2))+ Democrat_yes/len(yes)*(math.log(float(Democrat_yes)/len(yes),2))+Indep_yes/len(yes)*(math.log(float(Indep_yes)/len(yes),2))
disorder_no= Republican_no/len(no)*(math.log(float(Republican_no)/len(no),2))+Democrat_no/len(no)*(math.log(float(Democrat_no)/len(no),2))+Indep_no/len(no)*(math.log(float(Indep_no)/len(no),2))
avgdisorder = -len(yes)/(len(yes)+len(no))*disorder_yes - len(no)/(len(yes)+len(no))*disorder_no
return avgdisorder
why do I keep getting math domain error?
Check whether the lengths are 0 before dividing by them, otherwise the calculation will raise an error.
if len(yes):
    disorder_yes= Republican_yes/len(yes)*(math.log(float(Republican_yes)/len(yes),2))+ Democrat_yes/len(yes)*(math.log(float(Democrat_yes)/len(yes),2))+Indep_yes/len(yes)*(math.log(float(Indep_yes)/len(yes),2))
if len(no):
    disorder_no= Republican_no/len(no)*(math.log(float(Republican_no)/len(no),2))+Democrat_no/len(no)*(math.log(float(Democrat_no)/len(no),2))+Indep_no/len(no)*(math.log(float(Indep_no)/len(no),2))
if len(yes) or len(no):
    avgdisorder = -len(yes)/(len(yes)+len(no))*disorder_yes - len(no)/(len(yes)+len(no))*disorder_no
If you want, you can always add the else clause for all 3 if statements as per your requirement.
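One more thing worth checking: math.log(0, 2) raises "ValueError: math domain error", so the same failure appears whenever any single count is zero (for example, no 'Independent' entries in yes), even if the lists themselves are non-empty. Below is a minimal sketch that sidesteps this by only looping over labels that actually occur, so every probability is strictly positive. The helper names disorder and average_disorder are made up for this example, and the explicit float() is there on the assumption that the code may run under Python 2, where / on two integers truncates.

```python
import math


def disorder(labels):
    """Entropy in bits of a list of labels; 0.0 for an empty list."""
    n = len(labels)
    if n == 0:
        return 0.0
    total = 0.0
    for label in set(labels):                # only labels that occur, so p > 0
        p = labels.count(label) / float(n)   # float() avoids integer division
        total -= p * math.log(p, 2)
    return total


def average_disorder(yes, no):
    """Size-weighted average of the disorder of the two branches."""
    n = len(yes) + len(no)
    if n == 0:
        return 0.0
    return (len(yes) / float(n)) * disorder(yes) + (len(no) / float(n)) * disorder(no)


# Example with a label that is missing from one branch:
yes = ['Republican', 'Democrat']
no = ['Democrat', 'Democrat']
print(average_disorder(yes, no))
```

Skipping absent labels matches the usual convention that a term with probability 0 contributes nothing to the entropy.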