How to remove 4000 instances in WEKA, evenly spread across the class attribute? - weka

I'm trying to remove 4000 instances from my data set (there are 12660 instances, with 10 values of the class attribute).
When I use the RemovePercentage filter, it takes the first percentage of instances. I'd like to remove instances equally across the class values.
How can I do this?

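Use the Resample filter, found under the supervised instance filters (weka.filters.supervised.instance.Resample). Set noReplacement to true and sampleSizePercent to about 68.4 (8660 of 12660) to keep roughly 8660 instances while preserving the class distribution.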
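To make the idea concrete, here is a minimal Python sketch of removing a fixed number of rows in proportion to class size. It is not WEKA code and not how Resample is implemented; the function name `remove_stratified` and the `class_of` callback are made up for illustration.

```python
import random
from collections import defaultdict

def remove_stratified(rows, class_of, n_remove, seed=0):
    """Remove n_remove rows, taking from each class in proportion to its size.

    Hypothetical helper for illustration only; assumes n_remove <= len(rows).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, row in enumerate(rows):
        by_class[class_of(row)].append(idx)

    total = len(rows)
    # Proportional quota per class, rounded down first.
    exact = {c: n_remove * len(ix) / total for c, ix in by_class.items()}
    quota = {c: int(e) for c, e in exact.items()}
    # Hand the rows still missing to the classes with the largest remainders.
    leftover = n_remove - sum(quota.values())
    ranked = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in ranked[:leftover]:
        quota[c] += 1

    drop = set()
    for c, ix in by_class.items():
        drop.update(rng.sample(ix, quota[c]))
    # Keep the original order of the surviving rows.
    return [row for idx, row in enumerate(rows) if idx not in drop]
```

With 10 equally sized classes and 12660 rows, removing 4000 takes 400 rows from each class. With unequal classes, larger classes lose proportionally more rows.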

Related

Stata - repeated time values within panel

I have a dataset that has the following format:
Company|Dependent var|Independent vars|Company ID|Date|dummy1|dummy2
A|Values|Values|1|01/01/2015|0|1
A|Values|Values|1|01/01/2015|1|0
A|Values|Values|1|01/01/2014|1|0
B|Values|Values|2|01/01/2015|0|1
B|Values|Values|2|01/01/2014|0|1
As you can see, a company can have multiple values for the same period (because it is rated by 2 different agencies). The problem arises when I use xtset to define my panel data: it throws the "repeated time values within panel" error. I wish to cluster errors by company, so I define the panel using "xtset CompanyID Date". Is there a way to get around the error?
I wish to distinguish between the two entries that Stata perceives as the same (they are not the same, since the dummy variables differentiate between them) but still cluster errors based on company (using the company ID). Do I need to create a new ID? Will this lose clustering by company?
Any help would be appreciated.
Laurence
Follow up: Basically, I found that I am dealing with what is known as a multidimensional panel (e.g. y_i_j_k), not a two-dimensional panel (y_i_j), and you can't use two-dimensional commands on a panel with more than two dimensions. I therefore needed to reframe the panel into two dimensions by creating a new ID (egen newID = group(companyID Dummy1 Dummy2)). This then allows you to use two-dimensional commands. I think you can then cluster by company later using the cluster option (vce(cluster clustervar)). Thanks
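For readers who want to see what the new ID does, here is a small Python sketch (not Stata) that numbers each distinct combination of company ID and dummies from 1 upward in sorted order, using the five rows from the question. The function name `group_ids` and the dictionary keys are illustrative choices.

```python
def group_ids(rows, keys):
    """Number each distinct combination of key values 1..K in sorted order."""
    combos = sorted({tuple(row[k] for k in keys) for row in rows})
    lookup = {combo: i for i, combo in enumerate(combos, start=1)}
    return [lookup[tuple(row[k] for k in keys)] for row in rows]

# The five rows from the question (value columns omitted).
rows = [
    {"companyID": 1, "date": "01/01/2015", "dummy1": 0, "dummy2": 1},
    {"companyID": 1, "date": "01/01/2015", "dummy1": 1, "dummy2": 0},
    {"companyID": 1, "date": "01/01/2014", "dummy1": 1, "dummy2": 0},
    {"companyID": 2, "date": "01/01/2015", "dummy1": 0, "dummy2": 1},
    {"companyID": 2, "date": "01/01/2014", "dummy1": 0, "dummy2": 1},
]

new_ids = group_ids(rows, ["companyID", "dummy1", "dummy2"])
# Each (new ID, date) pair should now occur only once.
pairs = [(i, row["date"]) for i, row in zip(new_ids, rows)]
```

Because the original company ID stays in the data, it remains available as the clustering variable even though the panel is declared on the new ID.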

Removing instances in weka to normalize result classes

I am trying to construct a classifier, but one class is highly overrepresented.
I have tried weighting, but the data set is large and I need to subsample it.
So I thought I might be able to remove some instances from the overrepresented class instead.
Are there any filters I could use to remove a subset of only one of the classes?
Take a look at the SpreadSubsample filter among the supervised instance filters. It should do the job. Good luck.
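As a rough illustration of the idea behind this kind of subsampling (not WEKA's actual implementation), the Python sketch below caps every class at a multiple of the smallest class size. The names `spread_subsample` and `max_spread` are chosen for this example.

```python
import random
from collections import defaultdict

def spread_subsample(rows, class_of, max_spread=1.0, seed=0):
    """Randomly trim classes so none exceeds max_spread times the smallest one.

    Illustrative helper; note that the output is grouped by class,
    so the original row order is not preserved.
    """
    if max_spread < 1:
        raise ValueError("max_spread must be at least 1")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[class_of(row)].append(row)

    smallest = min(len(group) for group in by_class.values())
    cap = int(smallest * max_spread)

    kept = []
    for group in by_class.values():
        # Classes already under the cap are kept whole.
        kept.extend(group if len(group) <= cap else rng.sample(group, cap))
    return kept
```

With `max_spread=1.0` all classes end up the same size as the smallest one; larger values leave some imbalance but keep more data.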

Add new attribute calculated based on other attributes

I'm starting with WEKA and want to achieve the following.
I have a file with 2 attributes: user_id, user_age.
I can successfully load data using WEKA API and get Instances object.
Now I want to calculate new attribute user_age_range - like (0-18) - 0, (19-25) - 1, etc.
Is there a way to calculate this attribute using WEKA Filters?
Also I would like not to iterate manually through all instances, but to define method that operates on single Instance and use some filter (or other abstraction) that'll apply corresponding "transformation" to all instances.
Please advise how I could achieve this.
Thanks in advance.
After looking through the docs I found one or two filters that you could use in conjunction to achieve what you want.
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/Copy.html
Use Copy to create a copy of the attribute, which you will then transform.
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/NumericTransform.html
NumericTransform takes a class option and a method option. You could write your own class with a method that maps ages into the ranges you want, then supply that class and method as the options.
Hope this helps
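The binning logic itself is short. Here is a Python sketch of the mapping only; for NumericTransform the same logic would have to live in a method of a Java class. The bounds 18 and 25 come from the question, while 40 and 65 are hypothetical placeholders for the unspecified "etc.".

```python
from bisect import bisect_left

# Inclusive upper bound of each range. 18 and 25 are from the question;
# 40 and 65 are hypothetical stand-ins for the ranges it leaves unspecified.
UPPER_BOUNDS = [18, 25, 40, 65]

def age_range(age):
    """Map an age to its range index: 0-18 gives 0, 19-25 gives 1, and so on."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_left returns the index of the first bound >= age,
    # which is exactly the index of the range the age falls into.
    return bisect_left(UPPER_BOUNDS, age)
```

Ages above the last bound fall into one extra, open-ended range.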
If you have a CSV file, you can do that in Excel.
If you are using ARFF files, convert the file to CSV. Then add as many columns as you need new attributes, write the formula that computes each one from the existing attributes in the first row, and extend it to all rows.

Assigning unique IDs to String using MapReduce

I want to run a MapReduce job that scans multiple columns of a given file and assigns a unique ID (index number) to each distinct value in each column. The main challenge is to share the same ID for the same value when it is encountered on different nodes or in different Reducer instances.
Currently I am using ZooKeeper to share the unique IDs, but that has a performance impact. I have even kept the information in local caches at the reducer level to avoid multiple trips to ZooKeeper for the same value. I wanted to explore whether there is a better mechanism for this.
I can suggest two possible solutions for your problem:
1. Create the unique ID based on your value. This might be a hash function with a low collision rate.
2. Use faster storage than ZooKeeper. You can try simple key-value storage like Redis to store the value-to-ID mapping.
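The hash-based option can be sketched as follows in Python. The function name `value_id`, the choice of SHA-256, and the 64-bit width are assumptions made for this example. Note that the result is a stable identifier, not a dense index number, and collisions are unlikely but not impossible.

```python
import hashlib

def value_id(column, value, bits=64):
    """Derive the same ID for the same (column, value) pair on any node.

    Illustrative helper: no coordination is needed because the ID depends
    only on the input. bits must be a multiple of 8, at most 256.
    """
    # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    payload = f"{column}\x00{value}".encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[: bits // 8], "big")
```

A cryptographic digest is used here rather than Python's built-in `hash()`, because `hash()` on strings is salted per process by default and would give different IDs on different workers.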

Changing a nominal variable to remove one particular label with zero instances in weka

I have a very simple question, but I am confused by the user interface and could not find the answer in the documentation.
I have a feature in my dataset that is nominal. It used to have 4 classes but I deleted the instances of one class. Now I want to classify based on this feature.
BUT, in the preprocess window, the attribute is still listed as having 4 classes, of which one has 0 instances. It performs the classification as it should, but in the result, there is a column/row in the confusion matrix and accuracy table for the zero class.
Is there a way to remove the label with zero instances, so that WEKA thinks the feature only has three values?
Thanks!
OK, got it. I opened the WEKA file in a text editor, where I could change the feature definition.
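If the file is in ARFF format, the same manual edit can be scripted. The Python sketch below assumes a simple header with an unquoted attribute name and labels that contain no commas or quotes. The function name `drop_label` and the example attribute `grade` are hypothetical.

```python
import re

def drop_label(arff_text, attribute, label):
    """Remove one label from a nominal @attribute declaration.

    Illustrative helper; it only rewrites the header line and assumes
    no data rows still use the removed label.
    """
    pattern = re.compile(
        r"^(@attribute\s+%s\s+)\{(.*)\}\s*$" % re.escape(attribute),
        re.IGNORECASE,
    )
    out = []
    for line in arff_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            labels = [item.strip() for item in match.group(2).split(",")]
            labels = [item for item in labels if item != label]
            line = match.group(1) + "{" + ",".join(labels) + "}"
        out.append(line)
    return "\n".join(out)

# Hypothetical four-label attribute with label "d" no longer in use.
example = "@relation demo\n@attribute grade {a,b,c,d}\n@attribute score numeric\n@data\na,1\nb,2"
cleaned = drop_label(example, "grade", "d")
```

All other lines, including the data section, pass through unchanged.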