Remove Missing Values in Weka - weka

I'm using a dataset in Weka for classification that includes missing values. As far as I understand, Weka replaces them automatically with the mode or mean of the training data (using the filter unsupervised/attribute/ReplaceMissingValues) when using a classifier like NaiveBayes.
I would like to try removing them instead, to see how this affects the quality of the classifier. Is there a filter to do that?

See this answer below for a better, modern approach.
My approach is not ideal, because it becomes quite cumbersome to apply once more than 5 or 6 attributes are involved. If only a few attributes have missing values, though, I suggest using MultiFilter for this purpose.
If you have missing values in 2 attributes, then you'll use RemoveWithValues twice in a MultiFilter.
1. Load your data in the Weka Explorer.
2. Select MultiFilter from the Filter area.
3. Click on MultiFilter and add RemoveWithValues.
4. Configure each RemoveWithValues filter with the attribute index and set matchMissingValues to True.
5. Save the filter settings and click Apply in the Explorer.

Use the removeIf() method on weka.core.Instances and pass it a method reference to hasMissingValue from weka.core.Instance, which returns true if a given Instance has any missing values.
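// Requires Java 8+ and the imports weka.core.Instance and weka.core.Instances
Instances dataset = source.getDataSet(); // for some source, e.g. a ConverterUtils.DataSource
dataset.removeIf(Instance::hasMissingValue); // drops every instance with at least one missing value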
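To try the removal semantics without Weka on the classpath, the same removeIf idiom can be run on a plain list. This is only a sketch: Row and RemoveMissingSketch are hypothetical stand-ins (Row plays the role of weka.core.Instance), and marking a missing value with NaN is an assumption made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class RemoveMissingSketch {
    /** Hypothetical stand-in for weka.core.Instance; NaN marks a missing value here. */
    static final class Row {
        final double[] values;

        Row(double... values) {
            this.values = values;
        }

        /** True if at least one value in this row is missing (NaN). */
        boolean hasMissingValue() {
            for (double v : values) {
                if (Double.isNaN(v)) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Removes, in place, every row that has at least one missing value. */
    static List<Row> dropIncomplete(List<Row> rows) {
        rows.removeIf(Row::hasMissingValue);
        return rows;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(1.0, 2.0));
        rows.add(new Row(Double.NaN, 5.0)); // incomplete, will be dropped
        rows.add(new Row(3.0, 4.0));
        dropIncomplete(rows);
        System.out.println(rows.size() + " complete rows remain");
    }
}
```

Note that this drops a whole row as soon as any single value is missing, so a dataset with many sparsely filled attributes can shrink a lot.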

Related

How can I change the order of the attributes in Weka?

I was doing a machine learning task in Weka and the dataset has 486 attributes. So, I wanted to do attribute selection using chi-square and it provides me ranked attributes like below:
Now, I also have a test dataset that I have to make compatible. How can I reorder the test attributes in the same way, so that the test set is compatible with the training set?
Changing the order of attributes (e.g., when using the Ranker in conjunction with an attribute evaluator) will probably not have much influence on the performance of your classifier model (since all the attributes will stay in the dataset). Removing attributes, on the other hand, will more likely have an impact (for that, use subset evaluators).
If you want the ordering to get applied to the test set as well, then simply define your attribute selection search and evaluation schemes in the AttributeSelectedClassifier meta-classifier, instead of using the Attribute selection panel (that panel is more for exploration).

Nominal to binary conversion in weka tool

I was trying to preprocess the Leukemia dataset, which has two classes, ALL and AML. I need to convert it into binary values. I used the "nominal to binary" filter, but it does not convert the values to binary. My Weka version is 3.6.11.
Well, on my 3.6 version of Weka, it is working.
1. Load the file in the Explorer.
2. Go to Filter -> Weka filters -> unsupervised -> attribute -> NominalToBinary.
3. In attributeIndices, enter the index of the nominal attribute that you are trying to change to binary.
4. Leave all other options at their defaults. Click OK.
5. Click Apply.
To get the NominalToBinary filter to work on the class attribute, temporarily change the attribute selected in the class dropdown to another attribute. You can switch back after applying the filter.
Weka apparently does not let you apply the NominalToBinary filter to the currently selected class attribute.

Infragistics UltraGrid - How to use displayed values in group by headers when using an IEditorDataFilter?

I have a situation where I'm using the IEditorDataFilter interface within a custom UltraGrid editor control to automatically map values from a bound data source when they're displayed in the grid cells. In this case it's converting guid-based key values into user-friendly values, and it works well by displaying what I need in the cell, but retaining the GUID values as the 'value' behind the scenes.
My issue is what happens when I enable the built-in group by functionality and the user groups by a column using my editor. In that case the group by headers default to using the cell's value, which is the guid in my case, so I end up with headers like this:
Column A: 7F720CE8-123A-4A5D-95A7-6DC6EFFE5009 (10 items)
What I really want is the cell's display value to be used instead so it's something like this:
Column A: Item 1 (10 items)
What I've tried so far
Infragistics provides a couple of mechanisms for modifying what's shown in group by rows:
1. The GroupByRowDescriptionMask property of the grid (http://bit.ly/1g72t1b)
2. Manually setting the row description via the InitializeGroupByRow event (http://bit.ly/1ix1CbK)
Option 1 doesn't appear to give me what I need because the cell's display value is not exposed in the set of tokens they provide. Option 2 looks promising but it's not clear to me how to get at the cell's display value. The event argument only appears to contain the cell's backing value, which in my case is the GUID.
Is there a proper approach for using the group by functionality when you're also using an IEditorDataFilter implementation to convert values?
This may be frowned upon, but I asked my question on the Infragistics forums as well, and a complete answer is available there (along with an example solution demonstrating the problem):
http://www.infragistics.com/community/forums/p/88541/439210.aspx
In short, I was applying my custom editors at the cell level, which made them unavailable when the rows were grouped together. A better approach would be to apply the editor at the column level, which would make the editor available at the time of grouping, and would provide the expected behavior.

what does the attribute selection in preprocess tab do in weka?

I can't seem to find out what the attribute selection filter in the Preprocess tab does. Could someone please explain it in simple language? I'm new to Weka.
When I apply it to my dataset, it seems to remove a couple of attributes, but I'm unsure why.
A real data set may contain many attributes. Applying any data mining process to such a data set (e.g. finding clusters, generating a classification model ...) may take a very long time.
Instead, we can select a subset of the attributes (dimensions), namely the most discriminative ones. These attributes can describe the data set almost as well with a lower number of attributes, and this will speed up any process done on the data.
The Attribute selection tab contains many different methods for selecting these attributes. One of them is CFS subset evaluation (CfsSubsetEval in Weka). It gives you the attributes that have a higher correlation with the class label, which makes them discriminative attributes.
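The idea of keeping attributes that correlate strongly with the class label can be sketched in plain Java. This is a deliberate simplification and not Weka's CFS implementation: it only ranks numeric attributes by the absolute Pearson correlation with a numerically coded class, whereas CFS also takes redundancy between attributes into account. The class name CorrelationRankingSketch, its methods, and the toy data are all made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

class CorrelationRankingSketch {
    /** Pearson correlation of two equally long arrays; returns 0 if either one is constant. */
    static double pearson(double[] x, double[] y) {
        double meanX = Arrays.stream(x).average().orElse(0.0);
        double meanY = Arrays.stream(y).average().orElse(0.0);
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0.0 || syy == 0.0) {
            return 0.0; // a constant column carries no information about the class
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Returns attribute (column) indices sorted by decreasing |correlation| with the class. */
    static Integer[] rankByCorrelation(double[][] columns, double[] classValues) {
        Integer[] order = new Integer[columns.length];
        double[] score = new double[columns.length];
        for (int j = 0; j < columns.length; j++) {
            order[j] = j;
            score[j] = Math.abs(pearson(columns[j], classValues));
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer j) -> -score[j]));
        return order;
    }

    public static void main(String[] args) {
        double[] classValues = {0, 0, 1, 1};
        double[][] columns = {
            {5, 5, 5, 5}, // constant attribute
            {1, 2, 3, 4}, // partly related to the class
            {0, 0, 1, 1}  // identical to the class
        };
        System.out.println(Arrays.toString(rankByCorrelation(columns, classValues)));
    }
}
```

Keeping only the first few indices of the returned ranking is what makes the data set smaller, which would be consistent with the filter removing a couple of attributes.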

Add new attribute calculated based on other attributes

I'm starting with WEKA and want to achieve the following.
I have a file with 2 attributes: user_id, user_age.
I can successfully load data using WEKA API and get Instances object.
Now I want to calculate new attribute user_age_range - like (0-18) - 0, (19-25) - 1, etc.
Is there a way to calculate this attribute using WEKA Filters?
Also I would like not to iterate manually through all instances, but to define method that operates on single Instance and use some filter (or other abstraction) that'll apply corresponding "transformation" to all instances.
Please advise how I could achieve this.
Thanks in advance.
After looking through the docs I found one or two filters that you could use in conjunction to achieve what you want.
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/Copy.html
Use Copy to create a duplicate of the attribute, which you will then transform.
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/NumericTransform.html
The NumericTransform filter takes a class option and a method option. You could write your own class that bins the ages into the ranges you want, and supply this class and its method as the options.
Hope this helps
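A helper class of the kind described could look roughly like the sketch below. The class name AgeRanges and the method name toRange are made up for illustration, and the sketch assumes that NumericTransform expects a public static method that takes a double and returns a double (like the methods of java.lang.Math). Only the first two ranges come from the question; the third bucket is a placeholder.

```java
public final class AgeRanges {
    private AgeRanges() {
        // static helper, not meant to be instantiated
    }

    /**
     * Maps an age to a range code: 0-18 -> 0 and 19-25 -> 1, as in the question.
     * The last bucket (above 25 -> 2) is a hypothetical placeholder.
     */
    public static double toRange(double age) {
        if (Double.isNaN(age)) {
            return Double.NaN; // leave an undefined input undefined
        }
        if (age <= 18) {
            return 0;
        }
        if (age <= 25) {
            return 1;
        }
        return 2;
    }

    public static void main(String[] args) {
        for (double age : new double[] {7, 18, 19, 25, 40}) {
            System.out.println(age + " -> " + toRange(age));
        }
    }
}
```

With the compiled class on Weka's classpath, the class name and method name would then be supplied as the filter's class and method options, applied to the copied attribute so that user_age itself stays unchanged.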
If your data is in a CSV file, you can do that in Excel.
If you are using ARFF files, convert the file to CSV first. Then add as many columns as you need for the new attributes, enter the formula that derives each new value from one or more existing attributes in the first row, and extend it to all rows.