What does the attribute selection filter in the Preprocess tab do in Weka?

I can't seem to find out what the attribute selection filter does in the Preprocess tab. Could someone please explain it in simple language, as I'm new to Weka?
When I apply it to my dataset it seems to remove a couple of attributes, but I'm unsure why.

A real data set may contain many attributes. Applying any data mining process to this data set (e.g. finding clusters, generating a classification model, ...) may take a very long time.
Instead, we can select a subset of attributes (dimensions), the so-called most discriminative attributes. These attributes describe the data set almost as well with a lower number of attributes, which speeds up any process run on the data.
The attribute selection filter offers many different methods for choosing these attributes. One of them is CFS (Correlation-based Feature Subset) evaluation. It keeps the attributes that have a higher correlation with the class label, which makes them the discriminative attributes, and removes the rest; that is why some attributes disappear from your dataset after you apply the filter.
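If you prefer to do the same thing programmatically rather than in the Explorer, here is a minimal sketch using the Weka Java API. It assumes a hypothetical ARFF file data.arff with the class as the last attribute, and uses the filter's defaults (CfsSubsetEval with BestFirst search):

// Minimal sketch of the Preprocess-tab attribute selection filter, applied in code.
// The file name and class index are assumptions; adjust them for your data.
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class AttributeSelectionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection filter = new AttributeSelection(); // the Preprocess-tab filter
        filter.setEvaluator(new CfsSubsetEval());             // default evaluator
        filter.setSearch(new BestFirst());                    // default search method
        filter.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, filter);   // unselected attributes are dropped
        System.out.println("Attributes before: " + data.numAttributes()
                + ", after: " + reduced.numAttributes());
    }
}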

Related

Gremlin load data format

I am having difficulty understanding the Gremlin data load format (for use with Amazon Neptune).
Say I have a CSV with the following columns:
date_order_created
customer_no
order_no
zip_code
item_id
item_short_description
The requirements for the Gremlin load format are that the data is in an edge file and a vertex file.
The edge file must have the following columns: id, label, from and to.
The vertex file must have: id and label columns.
I have been referring to this page for guidance: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-gremlin.html
It states that in the edge file, the from column must equate to "the vertex ID of the from vertex."
And that (in the edge file) the to column must equate to "the vertex ID of the to vertex."
My questions:
Which columns need to be renamed to id, label, from and to? Or, should I add new columns?
Do I only need one vertex file or multiple?
You can have one or more of each CSV file (nodes, edges) but it is recommended to use fewer large files rather than many smaller ones. This allows the bulk loader to split the file up and load it in a parallel fashion.
As to the column headers, let's say you had a node (vertex) file of the form:
~id,~label,name,breed,age:Int
dog-1,Dog,Toby,Retriever,11
dog-2,Dog,Scamp,Spaniel,12
The edge file (for dogs that are friends) might look like this:
~id,~label,~from,~to
e-1,FRIENDS_WITH,dog-1,dog-2
In Amazon Neptune, so long as they are unique, any user-provided string can be used as a node or edge ID. So in your example, if customer_no is guaranteed to be unique, then rather than storing it as a property called customer_no you could instead make it the ~id. This can help later with efficient lookups. You can think of the ID as being a bit like a primary key in a relational database.
So in summary, you always need to provide the required fields such as ~id and ~label. Once the data is loaded, they are accessed using Gremlin steps such as hasLabel and hasId. Columns with names from your domain, like order_no, will become properties on the node or edge they are defined with, and will be accessed using Gremlin steps such as has('order_no', 'ABC-123').
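For example, here is a minimal Java/TinkerPop sketch of how those fields might be queried once the data is loaded; the Neptune endpoint and the order_no value are just placeholders:

import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;
import java.util.List;
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class NeptuneLookupDemo {
    public static void main(String[] args) throws Exception {
        // placeholder endpoint; replace with your Neptune cluster endpoint
        GraphTraversalSource g = traversal().withRemote(
                DriverRemoteConnection.using("your-neptune-endpoint", 8182, "g"));

        // look up a vertex by ID (the ~id column)
        Vertex toby = g.V().hasId("dog-1").next();

        // count vertices by label (the ~label column)
        Long dogCount = g.V().hasLabel("Dog").count().next();

        // filter on a domain property column such as order_no
        List<Vertex> orders = g.V().has("order_no", "ABC-123").toList();

        System.out.println(toby.id() + ", " + dogCount + " dogs, " + orders.size() + " orders");
        g.close();
    }
}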
To follow on from Kelvin's response and provide some further detail around data modeling...
Before getting to the point of loading the data into a graph database, you need to determine what the graph data model will look like. This is done by first deriving a "naive" approach of how you think the entities in the data are connected and then validating this approach by asking the relevant questions (which will turn into queries) that you want to ask of the data.
By way of example, I notice that your dataset has information related to customers, orders, and items, along with some relevant attributes for each. Knowing nothing about your use case, I might derive a "naive" model in which a Customer vertex is connected to the Orders they placed, and each Order vertex is connected to the Items it contains.
What you have with your original dataset appears similar to what you might see in a relational database as a join table. This is a table that contains multiple foreign keys (the id/no fields) and maybe some related properties for those relationships. In a graph, relationships are materialized through the use of edges. So in this case, you are expanding this join table into the original set of entities and the relationships between them.
To validate that we have the correct model, we then want to look at it and see if we can answer the relevant questions that we would want to ask of this data. For example, if we wanted to know all items purchased by a customer, we could trace our finger from a customer vertex to the item vertex. Being able to see how to get from point A to point B ensures that we will be able to easily write graph queries for these questions later on.
After you derive this model, you can then determine how best to transform the original source data into the CSV bulk load format. So in this case, you would take each row in your original dataset and convert that to:
For your vertices:
~id, ~label, zip_code, date_order_created, item_short_description
customer001, Customer, 90210, ,
order001, Order, , 2023-01-10,
item001, Item, , , "A small, non-descript black box"
Note that I'm reusing the numbers/ids for the customer, item, and order as the IDs of their related vertices. This is always good practice, as you can then easily look up a customer, order, or item by that ID. Also note that the CSV becomes a sparse 2-dimensional array of related entities and their properties. I'm only providing the properties related to each type of vertex. By leaving the others blank, they will not be created.
For your edges, you then need to materialize the relationships between each entity based on the fact that they are related by being in the same row of your source "join table". These relationships did not previously have a unique identifier, so we can create one (it can be arbitrary or based on other parts of the data; it just needs to be unique). I like using the vertex IDs of the two related vertices and the label of the relationship when possible. For the ~from and ~to fields, we include the vertex the relationship starts from and the vertex it points to, respectively:
~id, ~label, ~from, ~to
customer001-has_ordered-order001, has_ordered, customer001, order001
order001-contains-item001, contains, order001, item001
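To check the model against the earlier question ("all items purchased by a customer"), a hypothetical Gremlin traversal over the rows above might look like this, assuming a traversal source g as in the earlier sketch:

// all items purchased by customer001, following has_ordered and then contains edges
List<Object> items = g.V("customer001")
        .out("has_ordered")
        .out("contains")
        .values("item_short_description")
        .toList();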
I hope that adds some further color and reasoning around how to get from your source data into the format that Kelvin shows above.

How can I change the order of the attributes in Weka?

I was doing a machine learning task in Weka and the dataset has 486 attributes. So I wanted to do attribute selection using chi-square, and it gives me a ranked list of attributes.
Now, I also have a testing dataset and I have to make it compatible. How can I reorder the test attributes in the same way so that they are compatible with the train set?
Changing the order of attributes (e.g., when using the Ranker in conjunction with an attribute evaluator) will probably not have much influence on the performance of your classifier model (since all the attributes will stay in the dataset). Removing attributes, on the other hand, will more likely have an impact (for that, use subset evaluators).
If you want the ordering to get applied to the test set as well, then simply define your attribute selection search and evaluation schemes in the AttributeSelectedClassifier meta-classifier, instead of using the Attribute selection panel (that panel is more for exploration).
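For instance, here is a minimal sketch of that setup via the Weka Java API, assuming hypothetical train.arff/test.arff files with the class as the last attribute. InfoGainAttributeEval is used here because ChiSquaredAttributeEval may require an additional package in newer Weka versions; swap in whichever evaluator you ranked with:

// Minimal sketch: attribute selection wrapped inside the classifier, so the same
// selection learned on the train set is applied to the test set automatically.
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankedSelectionDemo {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff"); // hypothetical paths
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(50);               // keep the top 50 ranked attributes, for example
        asc.setSearch(ranker);
        asc.setClassifier(new J48());

        asc.buildClassifier(train);              // selection is learned from the train set only
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(asc, test);           // the same mapping is applied to the test set
        System.out.println(eval.toSummaryString());
    }
}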

how to classify using j48 weka with information gain and random attribute selection?

I know that the J48 decision tree uses gain ratio to select attributes when building the tree.
But I want to use information gain and random selection instead of gain ratio. In the Select attributes tab in Weka Explorer, I choose InfoGainAttributeEval and press the Start button. After that I see the list of attributes sorted by information gain. But I don't know how to use this list to run J48 in Weka. Moreover, I don't know how to select attributes randomly in J48.
Please help me if you can.
If you want to perform feature selection on the data before running the algorithm you have two options:
In the Classify tab, use AttributeSelectedClassifier (under the meta folder). There you can configure the feature selection algorithm you want. (The default is J48 with CfsSubsetEval.)
In the Preprocess tab, find and apply the AttributeSelection filter (located in the supervised\attribute folder). The default here is also the CfsSubsetEval algorithm.
Notice that the first method will apply the algorithm only on the training set when you evaluate the classifier, while the second method will use the entire dataset and will remove features that were not selected (you can use Undo to bring them back).
Notice that the way J48 selects features during the training process will remain the same. To change it you would need to implement your own algorithm or modify the current implementation.

Infragistics UltraGrid - How to use displayed values in group by headers when using an IEditorDataFilter?

I have a situation where I'm using the IEditorDataFilter interface within a custom UltraGrid editor control to automatically map values from a bound data source when they're displayed in the grid cells. In this case it's converting guid-based key values into user-friendly values, and it works well by displaying what I need in the cell, but retaining the GUID values as the 'value' behind the scenes.
My issue is what happens when I enable the built-in group by functionality and the user groups by a column using my editor. In that case the group by headers default to using the cell's value, which is the guid in my case, so I end up with headers like this:
Column A: 7F720CE8-123A-4A5D-95A7-6DC6EFFE5009 (10 items)
What I really want is the cell's display value to be used instead so it's something like this:
Column A: Item 1 (10 items)
What I've tried so far
Infragistics provides a couple of mechanisms for modifying what's shown in group by rows:
GroupByRowDescriptionMask property of the grid (http://bit.ly/1g72t1b)
Manually set the row description via the InitializeGroupByRow event (http://bit.ly/1ix1CbK)
Option 1 doesn't appear to give me what I need because the cell's display value is not exposed in the set of tokens they provide. Option 2 looks promising but it's not clear to me how to get at the cell's display value. The event argument only appears to contain the cell's backing value, which in my case is the GUID.
Is there a proper approach for using the group by functionality when you're also using an IEditorDataFilter implementation to convert values?
This may be frowned upon, but I asked my question on the Infragistics forums as well, and a complete answer is available there (along with an example solution demonstrating the problem):
http://www.infragistics.com/community/forums/p/88541/439210.aspx
In short, I was applying my custom editors at the cell level, which made them unavailable when the rows were grouped together. A better approach would be to apply the editor at the column level, which would make the editor available at the time of grouping, and would provide the expected behavior.

Remove Missing Values in Weka

I'm using a dataset in Weka for classification that includes missing values. As far as I understand, Weka replaces them automatically with the modes or means of the training data (using the filter unsupervised/attribute/ReplaceMissingValues) when using a classifier like NaiveBayes.
I would like to try removing them, to see how this affects the quality of the classifier. Is there a filter to do that?
See this answer below for a better, modern approach.
My approach is not perfect, because if you have more than 5 or 6 attributes it becomes quite cumbersome to apply, but I can suggest using MultiFilter for this purpose if only a few attributes have missing values.
If you have missing values in 2 attributes, then you'll use RemoveWithValues twice inside a MultiFilter:
Load your data in Weka Explorer
Select MultiFilter from the Filter area
Click on MultiFilter and Add RemoveWithValues
Then configure each RemoveWithValues filter with the attribute index and select True in matchMissingValues
Save the filter settings and click Apply in Explorer (a programmatic equivalent is sketched below).
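Here is a rough programmatic sketch of the same MultiFilter setup, assuming the attributes with missing values are at (1-based) indices 3 and 7 in a hypothetical data.arff; adjust the indices and path for your data:

// Chains one RemoveWithValues filter per attribute with missing values, mirroring
// the Explorer steps above (matchMissingValues = true for each).
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class RemoveMissingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");

        RemoveWithValues rm1 = new RemoveWithValues();
        rm1.setAttributeIndex("3");          // first attribute with missing values
        rm1.setMatchMissingValues(true);     // rows where this attribute is missing count as a match

        RemoveWithValues rm2 = new RemoveWithValues();
        rm2.setAttributeIndex("7");          // second attribute with missing values
        rm2.setMatchMissingValues(true);

        MultiFilter multi = new MultiFilter();
        multi.setFilters(new Filter[] { rm1, rm2 });
        multi.setInputFormat(data);

        Instances cleaned = Filter.useFilter(data, multi);
        System.out.println("Instances before: " + data.numInstances()
                + ", after: " + cleaned.numInstances());
    }
}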
Use the removeIf() method on weka.core.Instances with a method reference to the hasMissingValue method of weka.core.Instance, which returns true if a given Instance has any missing values.
// Instances extends java.util.AbstractList<Instance>, so Collection.removeIf applies
Instances dataset = source.getDataSet(); // for some DataSource
dataset.removeIf(Instance::hasMissingValue);