In Weka, how can I use GridSearch to fine-tune 5 algorithms?

In Weka, I want to apply 5 algorithms (NB, SVM, LR, KNN, and RF). Further, I want to fine-tune their parameters by using GridSearch.
My question is: how can I know which values to use for their parameters?

Instead of the GridSearch package, I recommend using MultiSearch.
MultiSearch allows you to optimize not just two parameters, but also a single one or more than two. Of course, the more parameters (and individual parameter values) you want to explore, the larger the search space and the longer the evaluation will take.
Furthermore, it also supports non-numeric parameters (e.g., boolean ones; use ListParameter in that case instead of MathParameter).
In terms of which parameters to optimize, you have to look through the parameters in the Weka GUI, e.g., the Weka Explorer, and make a decision.
Clicking on a classifier's panel brings up its options in the GenericObjectEditor; clicking the More button in that dialog shows the descriptions of the available options.
E.g., the options for NaiveBayes are as follows:
OPTIONS
useKernelEstimator -- Use a kernel estimator for numeric attributes rather than a normal distribution.
numDecimalPlaces -- The number of decimal places to be used for the output of numbers in the model.
batchSize -- The preferred number of instances to process if batch prediction is being performed. More or fewer instances may be provided, but this gives implementations a chance to specify a preferred batch size.
debug -- If set to true, classifier may output additional info to the console.
displayModelInOldFormat -- Use old format for model output. The old format is better when there are many class values. The new format is better when there are fewer classes and many attributes.
doNotCheckCapabilities -- If set, classifier capabilities are not checked before classifier is built (Use with caution to reduce runtime).
useSupervisedDiscretization -- Use supervised discretization to convert numeric attributes to nominal ones.
Only useKernelEstimator and useSupervisedDiscretization will influence the model; all other options merely influence the display of the model or output debugging information.
You can then use these names (aka properties) in MultiSearch to reference the options that you want to evaluate.
E.g., here are the two ListParameter setups for NaiveBayes that will make up the search space:
weka.core.setupgenerator.ListParameter -property useKernelEstimator -list "false true"
weka.core.setupgenerator.ListParameter -property useSupervisedDiscretization -list "false true"
However, since these two parameters cannot be used in conjunction, you will have to wrap each in a ParameterGroup object to have them explored separately:
weka.core.setupgenerator.ParameterGroup -search "weka.core.setupgenerator.ListParameter -property useSupervisedDiscretization -list \"false true\""
weka.core.setupgenerator.ParameterGroup -search "weka.core.setupgenerator.ListParameter -property useKernelEstimator -list \"false true\""
The best setup (as a command-line) will be output with the classifier model, e.g.:
weka.classifiers.meta.MultiSearch:
Classifier: weka.classifiers.bayes.NaiveBayes -K
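If you would rather assemble this setup in Java than on the command line, here is a minimal sketch using the multisearch-weka-package API. The method names follow the package's README as I recall it (setProperty, setList, setParameters, setSearchParameters), so please verify them against your installed version:

import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.MultiSearch;
import weka.core.setupgenerator.AbstractParameter;
import weka.core.setupgenerator.ListParameter;
import weka.core.setupgenerator.ParameterGroup;

public class MultiSearchNB {
  public static void main(String[] args) throws Exception {
    // The two boolean properties of NaiveBayes that actually influence the model.
    ListParameter kernel = new ListParameter();
    kernel.setProperty("useKernelEstimator");
    kernel.setList("false true");

    ListParameter discretize = new ListParameter();
    discretize.setProperty("useSupervisedDiscretization");
    discretize.setList("false true");

    // One group per parameter, so the two flags are explored separately
    // rather than in combination.
    ParameterGroup groupKernel = new ParameterGroup();
    groupKernel.setParameters(new AbstractParameter[]{kernel});
    ParameterGroup groupDiscretize = new ParameterGroup();
    groupDiscretize.setParameters(new AbstractParameter[]{discretize});

    MultiSearch multi = new MultiSearch();
    multi.setClassifier(new NaiveBayes());
    multi.setSearchParameters(new AbstractParameter[]{groupKernel, groupDiscretize});
    // multi.buildClassifier(data);  // train on your Instances object
  }
}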


Vector embeddings to mimic a ranking algorithm

Consider a search system where the user submits a query 'query' and retrieves products based on some ranking algorithm. Assume that these products are ordered according to their quality, as p_0, p_1, …, p_10 and so on.
I would like to generate vector embeddings that mimic this ranking algorithm. The closest product vector to a query vector should ideally be p_0, the next one should be p_1 and so on.
I have tried building word2vec embeddings for products by feeding products that appeared in the same search session as sentences. I then calculated a weighted average of product vectors to obtain query vectors, so that the query vector sits closer to the top result. Although the closest result is usually the best result for a given query, the subsequent results include some results that would never appear as a top result.
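For concreteness, the query-vector step I describe is essentially the following (a plain-Java sketch; the 1/(rank+1) weights are an illustrative stand-in for my actual weighting):

public class QueryVector {
  // Build a query vector as a rank-weighted average of product vectors.
  // productVecs[0] is the top-ranked product for the query.
  static float[] weightedAverage(float[][] productVecs) {
    int dim = productVecs[0].length;
    float[] q = new float[dim];
    float totalWeight = 0f;
    for (int rank = 0; rank < productVecs.length; rank++) {
      float w = 1f / (rank + 1);  // higher-ranked products contribute more
      for (int d = 0; d < dim; d++)
        q[d] += w * productVecs[rank][d];
      totalWeight += w;
    }
    for (int d = 0; d < dim; d++)
      q[d] /= totalWeight;        // normalise by the total weight
    return q;
  }
}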
Is there a trick by which word2vec can learn the ranking algorithm, or any other techniques I can try? I have looked into multi-dimensional scaling with non-metric distances, but it did not seem scalable to me for more than 100Ks of products.
There's no one trick – just iteratively improving your representations, & training set, & ranking methods to better meet your goals.
Word2vec-based representations can often help, but are still fairly simple & centered on individual words – whose senses may vary based on context & position in ways that a simple weighted-average-of-tokens fails to capture.
You may want to represent 'products' by more than just a string-of-word-tokens – to include other properties, as well. These could be scalar values like prices or various other kinds of ratings/properties, or extra synthetic labels, such as the result of other salient groupings (whether hand-edited or learned).
And even if just working with natural-language text – like product names, descriptions, or reviews – there are other, more sophisticated text representations that can be trained or used, such as sentence/document embeddings using deeper networks than plain word2vec.
Most generically, if you have a bunch of quantitative representations of candidate results, and a query, and want to use some initial examples of "good" results to bootstrap more generalizable rules for scoring top results, you are attempting a "learning-to-rank" process:
https://en.wikipedia.org/wiki/Learning_to_rank
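As a toy illustration of the pairwise flavour of learning-to-rank (everything here is hypothetical, not any specific library's API): train a linear scorer w with perceptron-style hinge updates, so that results known to be better score above results known to be worse.

public class PairwiseRanker {
  // pairs[i][0] is the feature vector of a result that should outrank pairs[i][1].
  static double[] train(double[][][] pairs, int dim, int epochs, double lr) {
    double[] w = new double[dim];
    for (int e = 0; e < epochs; e++) {
      for (double[][] pair : pairs) {
        double margin = dot(w, pair[0]) - dot(w, pair[1]);
        if (margin < 1.0) {               // update only on violated pairs
          for (int d = 0; d < dim; d++)
            w[d] += lr * (pair[0][d] - pair[1][d]);
        }
      }
    }
    return w;
  }

  static double dot(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++)
      s += a[i] * b[i];
    return s;
  }
}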
To suggest more specific steps would require a more specific description of inputs/outputs/goals, & what's been tried, and how what's been tried has failed.
For example, are your queries always just textual product names? In such a case, maybe plain keyword search is the central technology required, with things like word-vector modelling just a tweak for handling some tough cases, like expanding the results, or adding more contrast to the rankings, when results are too few or too many.
Or, can you detect key gaps in the modeling related to exactly those cases where "results include some results that would [ideally] never appear as a top result"? If certain things like rare (poorly-modeled) words, or important qualities not yet captured in the model, seem to be to blame for such cases, that will guide the potential set of corrective changes.

Training and Test Set in Weka Incompatible in Text Classification

I have two datasets indicating whether a sentence contains a mention of a drug adverse event or not. Both the training and the test set have only two fields: the text and the label {Adverse Event, No Adverse Event}. I have used Weka with the StringToWordVector filter to build a Random Forest model on the training set.
I want to test the model by removing the class labels from the test dataset, applying the StringToWordVector filter to it, and testing the model with that. When I try to do that, I get an error saying the training and test sets are not compatible, probably because the filter identifies a different set of attributes for the test dataset. How do I fix this and output the predictions for the test set?
The easiest way to do this for a one-off test is not to pre-filter the training set, but to use Weka's FilteredClassifier, configured with the StringToWordVector filter and your chosen classifier. This is explained well in this video from the More Data Mining with Weka online course.
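A minimal sketch of that setup (the file names and the RandomForest choice are assumptions based on your question):

import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class AdverseEventModel {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train.arff");  // hypothetical file names
    Instances test = DataSource.read("test.arff");
    train.setClassIndex(train.numAttributes() - 1);   // label is the last attribute
    test.setClassIndex(test.numAttributes() - 1);

    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(new StringToWordVector()); // fitted on the training data only
    fc.setClassifier(new RandomForest());
    fc.buildClassifier(train);

    // The test set keeps its original two-attribute structure;
    // the wrapped filter transforms each instance on the fly.
    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(fc, test);
    System.out.println(eval.toSummaryString());
  }
}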
For a more general solution, where you build the model once and then evaluate it on different test sets in the future, you need to use InputMappedClassifier:
Wrapper classifier that addresses incompatible training and test data
by building a mapping between the training data that a classifier has
been built with and the incoming test instances' structure. Model
attributes that are not found in the incoming instances receive
missing values, so do incoming nominal attribute values that the
classifier has not seen before. A new classifier can be trained or an
existing one loaded from a file.
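Hypothetically, wiring it around the trained setup from the sketch above looks like this (setClassifier and buildClassifier are the standard Weka meta-classifier calls):

import weka.classifiers.misc.InputMappedClassifier;

// Wrap the trained setup so future test sets with a slightly different
// attribute structure are mapped onto the training structure first.
InputMappedClassifier imc = new InputMappedClassifier();
imc.setClassifier(fc);        // the FilteredClassifier sketched above
imc.buildClassifier(train);   // records the training header used for the mapping
// imc.distributionForInstance(inst) now maps inst's attributes before predicting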
Weka requires a label even for the test data. It uses the labels, or "ground truth", of the test data to compare the model's results against and to measure the model's performance. How would you tell whether a model is performing well if you don't know whether its predictions are right or wrong? Thus, the test data needs to have the very same structure as the training data in Weka, including the labels. No worries, the labels are not used to help the model with its predictions.
The best way to go is to select cross-validation (e.g., 10-fold cross-validation), which automatically splits your data into 10 parts, using 9 for training and the remaining 1 for testing. This procedure is repeated 10 times, so that each of the 10 parts has been used as test data exactly once. The final performance verdict is the average over all 10 rounds. Cross-validation gives you a quite realistic estimate of the model's performance on new, unseen data.
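Programmatically, a 10-fold cross-validation run takes only a few lines; here is a sketch reusing the text-classification setup from the first answer (the ARFF file name is an assumption):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class CrossValidate {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("all_labelled.arff"); // hypothetical file
    data.setClassIndex(data.numAttributes() - 1);

    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(new StringToWordVector());
    fc.setClassifier(new RandomForest());

    // Weka splits the data into 10 folds, trains on 9 and tests on 1,
    // rotating the test fold; the summary reports the averaged performance.
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(fc, data, 10, new Random(1)); // fixed seed for repeatable folds
    System.out.println(eval.toSummaryString());
  }
}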
What you were trying to do, namely using the exact same data for training and testing, is a bad idea, because the measured performance you end up with is way too optimistic. This means you'll get very impressive figures like 98% accuracy during testing, but as soon as you use the model against new, unseen data, your accuracy might drop to a much worse level.

Output to Table instead of Graph in Stata

In Stata, is there a way to redirect the data that a command does into a table instead of a graph?
Example: if someone created a normal probability distribution of data with the pnorm var_name command, is there a way to redirect the data so that instead of appearing in a graph, it appears in a table?
To add to @Noobie's answer:
Different commands work in different ways. There's no better short summary.
What you can look out for includes
generate() options that produce new variables. (There is no absolute rule that the options have this name, but that or a similar name is the most common single variety.)
Options that allow saving results to new datasets.
Saved results, especially those visible after return list or ereturn list. These can be quite elaborate, e.g. saving of matrices of counts after tabulate.
More broadly, Stata commands aren't functions! One characteristic of a function, as so named in many languages or programs, is that there is a result, with special cases where the result is void or null. There clearly are statistical programs which in broad terms hinge on calling functions that have results, and what you see displayed is often a side-effect of that. Stata commands don't work like that, in the sense that the results of a program can take various forms. In the case of commands designed just to show something, the "result" may be a display. It's worth noting that Mata, which underlies and underpins Stata, is more recognisably a C-like language, with (e.g.) many matrix extensions, and is based on functions (and much else).
Yes and no. It really depends on the command you are using. You should look at the help files first.
For instance, pnorm does not allow that. You can create the data yourself using the formula for pnorm described in the help file, where the cumulative distribution at some point is plotted against the so-called plotting position.
Other Stata commands allow you to generate the points directly. This is the case for kdensity for instance.

Weka: Classifier and ReplaceMissingValues

I am relatively new to the data mining area and have been experimenting with Weka.
I have a dataset which consists of almost 8000 records related to customers and items they have purchased. 58% of this data set has missing values for the "Gender" attribute.
I want to find the missing gender values based on the other data I do have.
I first thought I could do this using a classifier algorithm in Weka, building a model from a training set. Based on examples I saw online, I tried this with pretty much all the algorithms available in Weka, using a training set that consisted of 60-80% of the data without missing values. This gave me a lower accuracy rate than I wanted (80-86%, depending on the algorithm used).
Did I go about this correctly? Is there a way to improve this accuracy? I experimented with using different attributes, different pre-processing of the data etc.
I also tried using the ReplaceMissingValues filter on the complete dataset to see how that would handle the missing values. However, it just changed all the missing values to "Female", which obviously cannot be the case. So I'm also wondering whether I need to use this filter in my situation or not.
It sounds like you went about it in the correct way. The ReplaceMissingValues filter replaces the missing values with the most frequent of the non-missing values I think, so it is not what you want in this case.
A better way to get an idea of the true accuracy of your gender predictor would be to use cross-validation instead of the training/test split (Weka has a separate option for that). 80-86% may seem low, but keep in mind that random guessing will only get you about 50%, so it's still a lot better than that. To try to get better performance, pick a classifier that performs well and then play with its parameters until you get better results. This is likely to be quite labour-intensive (although you could of course use automated methods for tuning, see e.g. Auto-WEKA), but it is the only way to improve the performance.
You can also combine the algorithm you choose with a separate feature-selection step (Weka has a special meta-classifier for this). This may improve performance, but again you'll have to experiment to find the particular configuration that works for you.
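The meta-classifier in question is presumably AttributeSelectedClassifier; a hypothetical configuration (CFS subset evaluation with best-first search, wrapped here around a J48 tree, so swap in whatever classifier worked best for you) looks like this:

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;

// Feature selection happens inside the classifier, so it is re-applied
// correctly within each cross-validation fold.
AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
asc.setEvaluator(new CfsSubsetEval());
asc.setSearch(new BestFirst());
asc.setClassifier(new J48());
// asc.buildClassifier(train);  // then evaluate with cross-validation as usual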

Optimization run assistance

I am running an optimisation of two sets of data against each other and am after some assistance with looking up the settings of a run based on the calculated results. I'll explain...
I run two data lines against each other (think graph lines): Line A and Line B. These lines have crossing points, upward and downward, based on the direction of each line. E.g., Line A going up while Line B goes down is an 'upward cross', and Line A going down while Line B goes up is a 'downward cross'. The program performs financial analysis.
I analyze the crossing points and gain a resultant 'Rank' from the analysis based on a set of rules. The rank is a single integer.
Line A has a number of settings for the optimisation run, e.g. Window 1 with values from 10 to 20 and Window 2 with values from 30 to 40. Line B also has settings.
When I run the optimisation, I iterate through the parameters available for each line and calculate the rank. The result of the optimisation run is a list of ranks whose size is the number of permutations available.
So my question is this:
What is the best way to look up the line settings from a calculated rank, using a position (index) in the rank list? The optimisation settings used to create the run will be stored for that rank run and can be used for the look-up.
I will also be adding additional line parameters to the system in the future, so I want the program to take future line settings into account without affecting any rank files created before the new parameter was added.
In addition, I want to be able to find an index based on a particular setting included in the optimisation run (the reverse of the previous look-up).
I want to avoid versioning for backward compatibility if at all possible, so that the lookup algorithm is self-sufficient.
Is a hash table suitable for this purpose, or do you have any implementation techniques that would fit better? Do you have any examples of this type of operation in action in C++?
Thanks,
Chris.
If I understand correctly, you have a bunch of associated data (settings + rank) on which you would like to perform lookups with different key types. If so, then Boost.MultiIndex sounds like what you're looking for.
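Boost.MultiIndex itself is C++-only; as a language-neutral sketch of the same pattern (one store, several lookup keys), here is the idea with two containers in Java. Keeping the settings as sorted name-to-value pairs rather than a fixed-position tuple also covers your forward-compatibility requirement, since runs recorded before a parameter existed simply lack that key:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

class RankRun {
  final SortedMap<String, String> settings; // sorted, so toString() is canonical
  final int rank;
  RankRun(SortedMap<String, String> settings, int rank) {
    this.settings = settings;
    this.rank = rank;
  }
}

class RankIndex {
  private final List<RankRun> runs = new ArrayList<>();       // position -> run
  private final Map<String, Integer> byKey = new HashMap<>(); // settings -> position

  int add(RankRun run) {
    int index = runs.size();
    runs.add(run);
    byKey.put(run.settings.toString(), index); // canonical string as the reverse key
    return index;
  }

  RankRun runAt(int index) {                   // index -> settings and rank
    return runs.get(index);
  }

  Integer indexOf(SortedMap<String, String> settings) { // settings -> index
    return byKey.get(settings.toString());
  }
}

A lookup by a single setting (rather than the full set) would just be one more map, from "name=value" strings to lists of positions.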