Discretize Weka attributes into specific intervals

I need to discretize a column in Weka. The column name is age and it holds numeric values, for example ranging from 2 to 90.
I need to discretize the age attribute into specific ranges, based on the following categories:
Youth: 15 to <=25, Adult: >25 to <=64, Senior: >64
How is this possible in Weka? How can I label and adjust the intervals?

Neither the supervised nor the unsupervised version of the Discretize filter will allow you to do that.
But you can achieve that goal by building a filter chain using MultiFilter:
1. Use MathExpression to apply your manual binning strategy using nested ifelse expressions. Set ignoreRange to the attribute that you want to convert and also select invertSelection. As expression use something like: ifelse(A<=25,0,ifelse(A<=64,1,2)) (25 or lower will be turned into 0, 64 or lower into 1, and the rest into 2).
2. Convert the generated bin values into nominal labels using NumericToNominal. Define the attribute you want to convert in attributeIndices.
3. Finally, rename the numeric-looking labels 0, 1, 2 into more meaningful ones using RenameNominalValues. Specify the attribute you want to update in selectedAttributes and use 0:Youth,1:Adult,2:Senior as valueReplacements.
The following MultiFilter setup converts the 7th attribute in a dataset in such a fashion (just copy it and paste it in the Weka Explorer via the right-click menu):
weka.filters.MultiFilter -F "weka.filters.unsupervised.attribute.MathExpression -E ifelse(A<=25,0,ifelse(A<=64,1,2)) -V -R 7" -F "weka.filters.unsupervised.attribute.NumericToNominal -R 7" -F "weka.filters.unsupervised.attribute.RenameNominalValues -R 7 -N 0:Youth,1:Adult,2:Senior" -S 1
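To see what the filter chain computes, here is the same binning and labeling logic as a plain-Java sketch (independent of Weka, purely to illustrate what the nested ifelse expression and the rename step do):

```java
public class AgeBinning {

    // mirrors the MathExpression step: ifelse(A<=25,0,ifelse(A<=64,1,2))
    static int bin(double age) {
        return age <= 25 ? 0 : (age <= 64 ? 1 : 2);
    }

    // mirrors the RenameNominalValues step: 0:Youth,1:Adult,2:Senior
    static String label(double age) {
        String[] names = {"Youth", "Adult", "Senior"};
        return names[bin(age)];
    }

    public static void main(String[] args) {
        for (double age : new double[]{18, 25, 40, 64, 70}) {
            System.out.println(age + " -> " + label(age));
        }
    }
}
```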

Related

In Weka, how can I use GridSearch to fine-tune 5 algorithms?

In Weka, I want to apply 5 algorithms (NB, SVM, LR, KNN, and RF). Further, I want to fine-tune their parameters by using GridSearch.
My question is: How can I know the values of their parameters?
Instead of the GridSearch package, I recommend using MultiSearch.
The latter allows you to optimize not just two parameters, but also a single one or more than two. Of course, the more parameters (and individual parameter values) you want to explore, the larger the search space and the longer the evaluation will take.
Furthermore, it also supports non-numeric parameters (e.g., boolean ones - use ListParameter in that case instead of MathParameter).
In terms of what parameters to optimize, you have to look through the parameters in the Weka GUI, e.g., the Weka Explorer, and make a decision.
When you click on a classifier's panel, you bring up its options in the GenericObjectEditor. When clicking on the More button in that dialog, you can view the description of the available options.
E.g., the options for NaiveBayes are as follows:
OPTIONS
useKernelEstimator -- Use a kernel estimator for numeric attributes rather than a normal distribution.
numDecimalPlaces -- The number of decimal places to be used for the output of numbers in the model.
batchSize -- The preferred number of instances to process if batch prediction is being performed. More or fewer instances may be provided, but this gives implementations a chance to specify a preferred batch size.
debug -- If set to true, classifier may output additional info to the console.
displayModelInOldFormat -- Use old format for model output. The old format is better when there are many class values. The new format is better when there are fewer classes and many attributes.
doNotCheckCapabilities -- If set, classifier capabilities are not checked before classifier is built (Use with caution to reduce runtime).
useSupervisedDiscretization -- Use supervised discretization to convert numeric attributes to nominal ones.
Only useKernelEstimator and useSupervisedDiscretization will influence the model; all other options only influence the display of the model or output debugging information.
You can then use these names (aka properties) in MultiSearch to reference the options that you want to evaluate.
E.g., here are the two ListParameter setups for NaiveBayes that will make up the search space:
weka.core.setupgenerator.ListParameter -property useKernelEstimator -list "false true"
weka.core.setupgenerator.ListParameter -property useSupervisedDiscretization -list "false true"
However, since these two parameters cannot be used in conjunction, you will have to wrap each in a ParameterGroup object to have them explored separately:
weka.core.setupgenerator.ParameterGroup -search "weka.core.setupgenerator.ListParameter -property useSupervisedDiscretization -list \"false true\""
weka.core.setupgenerator.ParameterGroup -search "weka.core.setupgenerator.ListParameter -property useKernelEstimator -list \"false true\""
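To get a feel for how the search space grows with more parameters: the setup generator effectively enumerates every combination of the listed parameter values (and with ParameterGroup, each group is explored separately). A minimal plain-Java sketch of that expansion (this is an illustration, not the MultiSearch API):

```java
import java.util.ArrayList;
import java.util.List;

public class SearchSpace {

    // enumerate the Cartesian product of the per-parameter value lists
    static List<List<String>> expand(List<List<String>> params) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>());
        for (List<String> values : params) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> combo : result) {
                for (String v : values) {
                    List<String> extended = new ArrayList<>(combo);
                    extended.add(v);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        // two boolean parameters -> 2 * 2 = 4 classifier setups to evaluate
        List<List<String>> space = expand(List.of(
            List.of("false", "true"),    // e.g., useKernelEstimator
            List.of("false", "true")));  // e.g., useSupervisedDiscretization
        System.out.println(space.size() + " setups: " + space);
    }
}
```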
The best setup (as a command-line) will be output with the classifier model, e.g.:
weka.classifiers.meta.MultiSearch:
Classifier: weka.classifiers.bayes.NaiveBayes -K

Partial results on Weka attribute selection

When I run PCA in the WEKA GUI using "Select attributes", I don't get complete results, just partial results with dots at the end.
0.8205 1 -0.493Capacity at 10th Cycle-0.483Capacity at 5th Cycle-0.473Capacity at 50th Cycle-0.261S [M]in Electrolyte -0.256C wt %...
Is there any way to solve this particular issue?
By default, a maximum of 5 attribute names are included in the generated names.
If you want all of them, use -1 for the -A option (or maximumAttributeNames property in the GOE).
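On the command line (or in the GOE setup string), that could look something like the following; -A -1 is the part that lifts the cap, while -R here is just the usual varianceCovered threshold:

```
weka.attributeSelection.PrincipalComponents -R 0.95 -A -1
```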

Weka not display Correctly classified instances as output

I am new to Weka. I have a dataset in CSV format with 5000 samples; here are 20 of them. When I upload this dataset into Weka, it looks OK, but when I run the kNN algorithm it gives a result that it is not supposed to give. Here is the sample data:
a,b,c,d
74,85,123,1
73,84,122,1
72,83,121,1
70,81,119,1
70,81,119,1
69,80,118,1
70,81,119,1
70,81,119,1
76,87,125,1
76,87,125,1
82,92,146,2
74,86,140,2
68,80,134,2
64,76,130,2
64,75,132,2
83,96,152,2
72,85,141,2
71,83,141,2
69,81,139,2
65,79,137,2
Here is the result:
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.6148
Mean absolute error 0.2442
Root mean squared error 0.4004
Relative absolute error 50.2313 %
Root relative squared error 81.2078 %
Total Number of Instances 5000
It is supposed to give this kind of result:
Correctly classified instances: 69 92%
Incorrectly classified instances: 6 8%
What could be the problem? What am I missing? I tried all the other algorithms, but they give the same kind of output. I have used the sample Weka datasets and they all work as expected.
The IBk algorithm can be used for regression (predicting the value of a numeric response for each instance) as well as for classification (predicting which class each instance belongs to).
It looks like all the values of the class attribute in your dataset (column d in your CSV) are numbers. When you load this data into Weka, Weka therefore guesses that this attribute should be treated as a numeric one, not a nominal one. You can tell this has happened because the histogram in the Preprocess tab looks something like this:
instead of like this (coloured by class):
The result you're seeing when you run IBk is the result of a regression fit (predicting a numeric value of column d for each instance) instead of a classification (selecting the most likely nominal value of column d for each instance).
To get the result you want, you need to tell Weka to treat this attribute as nominal. When you load the CSV file in the Preprocess tab, check Invoke options dialog in the file dialog window. Then when you click Open, you'll get this window:
The field nominalAttributes is where you can give Weka a list of which attributes are nominal ones even if they look numeric. Entering 4 here will specify that the fourth attribute (column) in the input is a nominal attribute. Now IBk should behave as you expect.
You could also do this by applying the NumericToNominal unsupervised attribute filter to the already loaded data, again specifying attribute 4, otherwise the filter will apply to all the attributes.
The ARFF format used for the Weka sample datasets includes a specification of which attributes are which type. After you've imported (or filtered) your dataset as above, you can save it as ARFF and you'll then be able to reload it without having to go through the same process.
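For reference, once the class attribute is nominal, saving as ARFF makes the types explicit in the header. For the sample data above it would look something like this (the relation name is arbitrary):

```
@relation mydata

@attribute a numeric
@attribute b numeric
@attribute c numeric
@attribute d {1,2}

@data
74,85,123,1
73,84,122,1
...
```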

Weka: ReplaceMissingValues for a test file

I am a bit worried about using Weka's ReplaceMissingValues to impute the missing values only for the test ARFF dataset but not for the training dataset. Below is the command line:
java -classpath weka.jar weka.filters.unsupervised.attribute.ReplaceMissingValues -c last -i "test_file_with_missing_values.arff" -o "test_file_with_filled_missing_values.arff"
From a previous post (Replace missing values with mean (Weka)), I learned that Weka's ReplaceMissingValues simply replaces each missing value with the mean of the corresponding attribute. This implies that the mean needs to be computed for each attribute. While computing this mean is perfectly fine for the training file, it is not okay for the test file.
This is because in the typical test scenario we should not assume that we know the mean of a test attribute when imputing its missing values. We may only have a single test record with multiple attributes to classify, instead of an entire set of test records in a test file. Therefore, we should instead impute the missing values based on the means computed from the training data. The above command would then be incorrect, as it would need another input: the means of the training attributes.
Has anybody thought about this before? How do you work around this by using weka?
Easy, see Batch Filtering
Instances train = ... // from somewhere
Instances test = ...  // from somewhere
ReplaceMissingValues filter = new ReplaceMissingValues();
filter.setInputFormat(train); // initialize the filter with the training set's format
Instances newTrain = Filter.useFilter(train, filter); // configures the filter (e.g., computes the means) on the training data
Instances newTest = Filter.useFilter(test, filter);   // applies the training statistics to the test set
The filter is initialized using the training data and then applied on both training and test data.
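Conceptually, what the initialized filter carries over to the test data is one mean per numeric attribute, computed on the training batch. A plain-Java sketch of that behavior (not the Weka implementation; NaN marks a missing value here):

```java
public class MeanImputer {
    double[] means;

    // learn one mean per column from the training data, ignoring missing (NaN) cells
    void fit(double[][] train) {
        int cols = train[0].length;
        means = new double[cols];
        for (int c = 0; c < cols; c++) {
            double sum = 0;
            int n = 0;
            for (double[] row : train) {
                if (!Double.isNaN(row[c])) {
                    sum += row[c];
                    n++;
                }
            }
            means[c] = n > 0 ? sum / n : 0;
        }
    }

    // replace missing cells with the stored *training* means
    double[] transform(double[] row) {
        double[] out = row.clone();
        for (int c = 0; c < out.length; c++) {
            if (Double.isNaN(out[c])) out[c] = means[c];
        }
        return out;
    }
}
```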
The problem is when you apply the ReplaceMissingValue filter outside any processing pipeline, because after writing the filtered data, you can't distinguish between "real" values and "imputed" values anymore. This is why you should do everything that needs to be done in a single pipeline, e.g., using the FilteredClassifier:
java -classpath weka.jar weka.classifiers.meta.FilteredClassifier
-t "training_file_with_missing_values.arff"
-T "test_file_with_missing_values.arff"
-F weka.filters.unsupervised.attribute.ReplaceMissingValues
-W weka.classifiers.functions.MultilayerPerceptron -- -L 0.3 -M 0.2 -H a
This example will initialize the ReplaceMissingValues filter using the "training_file_with_missing_values.arff" data set, then apply the filter to "test_file_with_missing_values.arff" (with the means learned from the training set), then train a multilayer perceptron on the filtered training data and predict the class of the filtered test data.

Weka: Train and test set are not compatible

I'm trying to classify some web posts using weka and naive bayes classifier.
First I manually classified many posts (about 100 negative and 100 positive) and I created an .arff file with this form:
@relation classtest
@attribute 'post' string
@attribute 'class' {positive,negative}
@data
'RT @burnreporter: Google has now indexed over 30 trillion URLs. Wow. #LeWeb',positive
'A special one for me Soundcloud at #LeWeb ',positive
'RT @dianaurban: Lost Internet for 1/2 hour at a conference called #LeWeb. Ironic, yes?',negative
.
.
.
Then I open the Weka Explorer, load that file, and apply the StringToWordVector filter to split the posts into single-word attributes.
Then, after doing the same with my test dataset, I select the Naive Bayes classifier in Weka's Classify tab and choose my file under "Supplied test set", but it returns "Train and test set are not compatible". What can I do? Thanks!
Probably the ordering of the attributes is different in train and test sets.
You can use batch filtering as described in http://weka.wikispaces.com/Batch+filtering
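For example, a batch-filtering call with StringToWordVector from the command line could look like this (file names are placeholders; -b enables batch mode, -i/-o are the training input/output files and -r/-s the test input/output files, as in the batch-filtering documentation):

```
java -cp weka.jar weka.filters.unsupervised.attribute.StringToWordVector -b -i trainData.arff -o trainVec.arff -r testData.arff -s testVec.arff
```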
I used the batch filter but I still have a problem. Here is what I did:
java -cp /usr/share/java/weka.jar weka.filters.unsupervised.attribute.NumericToNominal -R last -b -i trainData.arff -o trainDataProcessed.csv.arff -r testData.arff -s testDataProcessed.csv.arff
I then get the error below:
Input file formats differ.
Later, I figured out two ways to make the trained model work on a supplied test set.
Method 1.
Use the Knowledge Flow. For example, something like this: CSVLoader (for the train set) -> ClassAssigner -> TrainingSetMaker -> (classifier of your choice) -> ClassifierPerformanceEvaluator -> TextViewer, and CSVLoader (for the test set) -> ClassAssigner -> TestSetMaker -> (the same classifier instance as above) -> PredictionAppender -> CSVSaver. First load the data from the CSVLoader (or ArffLoader) for the training set; the model will be trained. After that, load the data from the loader for the test set. It will evaluate the model (the classifier, in this example) on the supplied test set, and you can see the result in the TextViewer (connected to the ClassifierPerformanceEvaluator) and get the saved result from the CSVSaver or ArffSaver connected to the PredictionAppender. An additional column, "classified as", will be added to the output file. In my case, I used "?" in the class column of the supplied test set, since the class labels were not available.
Method 2.
Combine the training and test sets into one file; then the exact same filter can be applied to both. Afterwards you can separate the training and test sets again by applying an instance filter. Since I use "?" as the class label in the test set, it is not visible among the instance filter indices. Hence, just select the indices that you can see in the attribute values to be removed when applying the instance filter, and you will be left with only the test data. Save it and load it under "Supplied test set" on the Classify page; this time it will work. I guess it is the class attribute that causes the "train and test set are not compatible" issue, as many classifiers require a nominal class attribute, whose values are converted to indices into the available class values, according to http://weka.wikispaces.com/Why+do+I+get+the+error+message+%27training+and+test+set+are+not+compatible%27%3F