Interpreting classifier output parameters like -K, -W, -A in KNN in Weka

I am trying to figure out how the single-letter option flags in Weka's classifier output map to the actual hyperparameter names. I need this so I can change parameters for any classifier, and possibly reuse the same values in scikit-learn.

Check the OptionHandler.listOptions method: it returns an enumeration of all the available options of a classifier, each one with its flag and a description.
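For example, a minimal sketch (assuming Weka 3.8's Java API; IBk is Weka's KNN implementation, where -K is the number of neighbours) that prints every flag together with its description, plus the classifier's current settings:

import java.util.Enumeration;

import weka.classifiers.lazy.IBk;
import weka.core.Option;

public class ListKnnOptions {
    public static void main(String[] args) {
        IBk knn = new IBk(); // IBk = k-nearest neighbours
        Enumeration<Option> opts = knn.listOptions();
        while (opts.hasMoreElements()) {
            Option o = opts.nextElement();
            // synopsis() is the flag, e.g. "-K <number of neighbours>",
            // description() is the human-readable explanation.
            System.out.println(o.synopsis());
            System.out.println("    " + o.description());
        }
        // getOptions() shows the classifier's current settings as flags.
        System.out.println(String.join(" ", knn.getOptions()));
    }
}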

Related

In Weka, how can I use GridSearch to fine tune 5 algorithms?

In Weka, I want to apply 5 algorithms (NB, SVM, LR, KNN, and RF). Further, I want to fine-tune their parameters by using GridSearch.
My question is: How can I know the values of their parameters?
Instead of the GridSearch package, I recommend using MultiSearch.
The latter allows you to optimize not just two parameters, but also a single one or more than two. Of course, the more parameters (and individual parameter values) you want to explore, the larger the search space and the longer the evaluation will take.
Furthermore, it also supports non-numeric parameters (e.g., boolean ones; use ListParameter in that case instead of MathParameter).
In terms of what parameters to optimize, you have to look through the parameters in the Weka GUI, e.g., the Weka Explorer, and make a decision.
When you click on a classifier's panel, you bring up its options in the GenericObjectEditor. When clicking on the More button in that dialog, you can view the description of the available options.
E.g., the options for NaiveBayes are as follows:
OPTIONS
useKernelEstimator -- Use a kernel estimator for numeric attributes rather than a normal distribution.
numDecimalPlaces -- The number of decimal places to be used for the output of numbers in the model.
batchSize -- The preferred number of instances to process if batch prediction is being performed. More or fewer instances may be provided, but this gives implementations a chance to specify a preferred batch size.
debug -- If set to true, classifier may output additional info to the console.
displayModelInOldFormat -- Use old format for model output. The old format is better when there are many class values. The new format is better when there are fewer classes and many attributes.
doNotCheckCapabilities -- If set, classifier capabilities are not checked before classifier is built (Use with caution to reduce runtime).
useSupervisedDiscretization -- Use supervised discretization to convert numeric attributes to nominal ones.
Only useKernelEstimator and useSupervisedDiscretization will influence the model; all the other options only affect how the model is displayed or whether debugging information is output.
You can then use these names (aka properties) in MultiSearch to reference the options that you want to evaluate.
E.g., here are the two ListParameter setups for NaiveBayes that will make up the search space:
weka.core.setupgenerator.ListParameter -property useKernelEstimator -list "false true"
weka.core.setupgenerator.ListParameter -property useSupervisedDiscretization -list "false true"
However, since these two parameters cannot be used in conjunction, you will have to wrap each in a ParameterGroup object to have them explored separately:
weka.core.setupgenerator.ParameterGroup -search "weka.core.setupgenerator.ListParameter -property useSupervisedDiscretization -list \"false true\""
weka.core.setupgenerator.ParameterGroup -search "weka.core.setupgenerator.ListParameter -property useKernelEstimator -list \"false true\""
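If you prefer to assemble these setups from Java code rather than the GUI, one option is to instantiate them from the same command-line strings via Weka's generic Utils.forName helper. A minimal sketch, assuming the multisearch-weka-package is on the classpath and that the setup generator classes implement OptionHandler (their command-line syntax suggests they do):

import weka.core.OptionHandler;
import weka.core.Utils;

public class BuildSearchSetup {
    public static void main(String[] args) throws Exception {
        // One of the ParameterGroup definitions from above, as a single string.
        String setup = "weka.core.setupgenerator.ParameterGroup -search "
            + "\"weka.core.setupgenerator.ListParameter -property useKernelEstimator -list \\\"false true\\\"\"";
        String[] options = Utils.splitOptions(setup);
        String classname = options[0];
        options[0] = "";  // standard Weka idiom: blank out the class name before setOptions
        OptionHandler group = (OptionHandler) Utils.forName(Object.class, classname, options);
        // Round-trip check: print the options the object was actually configured with.
        System.out.println(classname + " " + Utils.joinOptions(group.getOptions()));
    }
}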
The best setup (as a command-line) will be output with the classifier model, e.g.:
weka.classifiers.meta.MultiSearch:
Classifier: weka.classifiers.bayes.NaiveBayes -K
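To reuse the reported best setup programmatically, here is a small sketch (assuming Weka 3.7+; the command line is copied verbatim from the output above) that turns it back into a configured classifier object:

import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.core.Utils;

public class RebuildBestSetup {
    public static void main(String[] args) throws Exception {
        String best = "weka.classifiers.bayes.NaiveBayes -K";  // as reported by MultiSearch
        String[] options = Utils.splitOptions(best);
        String classname = options[0];
        options[0] = "";
        Classifier cls = AbstractClassifier.forName(classname, options);
        // cls is now a NaiveBayes instance with -K (useKernelEstimator) enabled.
        System.out.println(Utils.toCommandLine(cls));
    }
}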

Why doesn't this data work with the Naïve Bayes algorithm?

It is in ARFF format. If you're not familiar with ARFF, it is basically a CSV file with a header: everything after the @data marker is in CSV format.
For clarification, I am trying to use the dataset on Weka but the option to use Naïve Bayes is greyed out.
Every classifier, clusterer, filter, etc. in Weka can only handle certain types of data, as described by its capabilities (which you can check in the GUI). These capabilities are compared against the data; in case of a mismatch, the GUI won't allow you to apply the algorithm.
Long story short: the dport attribute is of type string which NaiveBayes can't handle. You can convert that attribute into a nominal one using the StringToNominal filter.
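If you prefer doing the conversion in code, a minimal sketch (the file name and attribute index are placeholders; assuming a recent Weka version where StringToNominal takes an -R attribute range):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToNominal;

public class ConvertStringAttribute {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("capture.arff");  // placeholder file name
        StringToNominal filter = new StringToNominal();
        // -R takes a 1-based attribute index range; put the index of dport here.
        filter.setOptions(new String[]{"-R", "2"});
        filter.setInputFormat(data);
        Instances converted = Filter.useFilter(data, filter);
        // The attribute at that position is now nominal instead of string.
        System.out.println(converted.attribute(1));
    }
}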

Empty confusion matrix in Weka with test data

I am classifying the iris data using a decision tree (C4.5), random forest, and naive Bayes. I am using the datasets downloaded from iris-train and iris-test. When I train the classifiers, everything is fine, with proper results in 'Classifier output', 'Detailed Accuracy By Class', and the 'Confusion Matrix'. But when I select 'Supplied test set' in the Explorer's Classify tab, choose the iris-test file, set 'Output predictions' to CSV under 'More options', and click Start, I get the result shown in the figure below. The 'Classifier output' shows the classified samples correctly, but 'Detailed Accuracy By Class' and the 'Confusion Matrix' are all zeros. Any suggestion as to which option I am getting wrong? Thank you.
The confusion matrix shows you how well your trained classifier performs by comparing the actual class of the instances in the test set with the class that was predicted by the classifier. But you are supplying a test set with no class information, so there's nothing to compare against. This is why you see
Total Number of Instances 0
Ignored Class Unknown Instances 120
in the output in your screenshot.
Typically you would first evaluate the performance of your classifier using cross-validation, or a test set that has class information. Then you can use the trained classifier to classify unknown data, for example using the Re-evaluate model on current test set right-click option as described in the help.
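For instance, a minimal sketch of that usual evaluation with the Java API (file names are placeholders; the test ARFF must contain the class column):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateOnTestSet {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("iris-train.arff");        // placeholder
        Instances test = DataSource.read("iris-test-labelled.arff"); // placeholder
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        J48 tree = new J48();  // C4.5 decision tree
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        // Instances whose class value is unknown are ignored here, which is
        // exactly why an unlabelled test set produces an empty confusion matrix.
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}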

StringToWordVector in Weka

What is StringToWordVector? All I know about it is that it converts a string attribute into multiple attributes. But what is the advantage of doing so, and how does an object of the StringToWordVector class serve as a filter for FilteredClassifier? How has it become a filter?
StringToWordVector is a filter class in Weka that converts string attributes into word-based attributes using a tokenizer class (WordTokenizer by default; an NGramTokenizer is also available). This lets us feed text to a classifier as a word vector. Besides tokenizing, it also provides other functionality such as removing stopwords, weighting words with TF-IDF, outputting word counts rather than just indicating whether a word is present, pruning, stemming, lowercase conversion of words, etc. A detailed explanation of this class can be found at http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html So basically it provides the functionality needed to fine-tune the training set to your requirements before training.
However, anyone who wants to perform testing along with training must use batch filtering or a FilteredClassifier to ensure compatibility of the train and test sets. This is because passing the train and test sets separately through StringToWordVector would generate a different vocabulary for each. To decide between batch filtering and FilteredClassifier, follow the post by Nihil Obstat at http://jmgomezhidalgo.blogspot.in/2013/01/text-mining-in-weka-chaining-filters.html
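As an illustration, here is a minimal FilteredClassifier sketch (file names and the choice of NaiveBayesMultinomial are assumptions), where the filter's vocabulary is built on the training data only and then reused for the test instances:

import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TextClassification {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("text-train.arff"); // placeholder
        Instances test = DataSource.read("text-test.arff");   // placeholder
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setLowerCaseTokens(true);  // one of the extra options mentioned above

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(s2wv);                            // filter is fitted on the training data
        fc.setClassifier(new NaiveBayesMultinomial());
        fc.buildClassifier(train);

        // The same vocabulary is applied automatically when classifying test instances.
        System.out.println(fc.classifyInstance(test.instance(0)));
    }
}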
Hope this helps.

RapidMiner: Can I use a wildcard as an attribute value for training a decision tree model?

I am working on a fairly simple process in RapidMiner 5.3.013, which reads a CSV file and uses it as a training set to train the decision tree classifier. The result of the process is the model. A second CSV is read and used as the unlabeled set. The model (calculated earlier) is applied to the unlabeled test set, in an effort to label it properly.
Each line of the CSVs contains a few attributes, for example:
15, 0, 1555, abc*15, label1
but some lines of the training set may be like this:
15, 0, *, abc*15, label2
This is done because the third value may take various values, so the creator of the training set used a star as a wildcard in the place of the value.
What I would like to do is let the decision tree know that the star there means "match anything", so that it does not literally only match a star.
Notes:
the star in the 4th field (abc*15) should be matched literally and not as a wildcard.
if the 3rd field always contained stars, I could just not include it in the attributes, but that's not the case. Sometimes the 3rd field contains integer values, which should be matched literally.
I tried leaving the field blank, but it doesn't work
So, is there a way to use regular expressions, or at least a simple wildcard while training the classifier or using the model?
A different way to put it is: Can I instruct the classifier to not use some of the attributes in some of the entries (lines in the CSV)?
Thanks!
I would process the data so the missing value is valid in its own right and I would discretize the valid numbers to be in ranges.
In more detail, what I meant by missing is the situation where the value of an attribute is something like *. I would simply allow this to be one valid value that the attribute takes. For all the other values of this attribute, these are numerical so they need to be converted to a nominal value to be compatible with the now valid *.
It's fairly fiddly to do and I haven't tried this exact flow, but I would start with the Declare Missing Value operator to detect the * values and mark them as missing. From there, I would use the Discretize by Binning operator to convert the numbers into nominal values. Finally, I would use Replace Missing Values to change the missing values to a nominal value like Missing. You might ask why bother with the first Declare Missing Value step: the reason is that it lets the discretizing operation work on numbers alone, since the non-numbers have already been marked as missing.
The resulting example set can then be passed to a model in the normal way. Obviously, the model has to be able to cope with nominal attributes (decision trees do).
It occurred to me that some modelling operators are more tolerant of missing data. I think k-nearest-neighbours may be one. In this case, you could simply mark the missing ones as above and not bother with the discretizing step.
The whole area of missing data does need care because it's important to understand the source of missingness. If missing data is correlated with other attributes or with the label itself, handling it inappropriately can skew results.