Add new attribute calculated based on other attributes - weka

I'm starting with WEKA and want to achieve the following.
I have a file with 2 attributes: user_id, user_age.
I can successfully load the data using the WEKA API and get an Instances object.
Now I want to calculate new attribute user_age_range - like (0-18) - 0, (19-25) - 1, etc.
Is there a way to calculate this attribute using WEKA Filters?
Also, I would like not to iterate manually through all instances, but to define a method that operates on a single Instance and use some filter (or other abstraction) that will apply the corresponding "transformation" to all instances.
Please advise how I could achieve this.
Thanks in advance.

After looking through the docs I found two filters that you could use together to achieve what you want.
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/Copy.html
Use Copy to create a copy of the attribute that you will then transform.
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/NumericTransform.html
NumericTransform takes a class name and a method name as options. You could write your own class with a static method that takes a double and returns a double, mapping each age to the range you want, and supply this class and method as your options.
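A minimal sketch of such a class follows. The class name AgeRange, the method name toRange, and every range above 25 are assumptions made for illustration; only the first two ranges come from the question.

```java
// Hypothetical helper for NumericTransform; all names and the bins above 25 are assumptions.
// Declare the class public, in its own file on Weka's classpath, when using it for real;
// it is package-private here only to keep the snippet in a single file.
final class AgeRange {
    // Inclusive upper bounds: 0-18 -> 0, 19-25 -> 1, 26-40 -> 2, 41-65 -> 3, older -> 4.
    private static final int[] UPPER_BOUNDS = {18, 25, 40, 65};

    private AgeRange() {}

    // Static, takes a double, returns a double: the shape NumericTransform looks up by name.
    public static double toRange(double age) {
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (age <= UPPER_BOUNDS[i]) {
                return i;
            }
        }
        return UPPER_BOUNDS.length;
    }
}
```

The filter would then be pointed at the copied attribute, with AgeRange as the class name and toRange as the method name. Note that the resulting attribute is still numeric, not nominal.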
Hope this helps

If you are using a CSV file, you can do that in Excel.
If you are using ARFF files, convert them to CSV first. Then add as many columns as you need new attributes, write the formula that derives each new attribute from one or more existing attributes in the first row, and extend it to all rows.

Related

Update multiple field values matching a condition in InfluxDB

In an InfluxDB measurement, how can the field values of points matching a query be updated? Is this still not easily doable as of v1.6?
As the example in that GitHub ticket suggested, what's the cleanest way of achieving something like this?
UPDATE access_log SET username='something' WHERE mac='xxx'
Anything better than driving it all from the client by updating individual points?
Q: How can the field values of points matching a query be updated? Is this still not easily doable as of v1.4?
A: To the best of my knowledge, there isn't an easy way to accomplish an update in version 1.4 yet.
The field value of a point can only be updated by overwriting it. That is, to overwrite its value you need to know the details of the point: its timestamp and its series information, which is the measurement it resides in and its corresponding tags.
Note: This "update" strategy can only be used for changing a field value, not a tag value. To update a tag value you need to DELETE the point first and then rewrite the entire point with the updated tag and value.
Q: Anything better than driving it all from the client by updating individual points?
A: InfluxDB supports multi-point writes. So if you can build a filter to pre-select a small dataset of points, you can modify their field values and then overwrite them in bulk.
Update is possible and would take the format:
INSERT measurement,tag_name=tag_value_no_quotes value_key_1=value_value_1,value_key_2=value_value_2 time
For example, to update the point with tag box_name=my_box at time 1526988768877018669 in the box measurement (string field values must be double-quoted in line protocol):
INSERT box,box_name=my_box item_1="apple",item_2="melon" 1526988768877018669
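For the bulk overwrite, the client has to regenerate one line per point with the same measurement, tags and timestamp as the original. A rough sketch of that step is below. The class and method names are hypothetical, only string fields are handled, and no escaping is done, so it assumes keys and values contain no spaces, commas, equals signs or quotes.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper; not part of any InfluxDB client library.
final class LineProtocol {
    private LineProtocol() {}

    // Builds one line: measurement,tag=value field="value" timestamp
    // Assumes nothing needs escaping and that all fields are strings.
    static String toLine(String measurement, Map<String, String> tags,
                         Map<String, String> stringFields, long timestampNanos) {
        StringBuilder line = new StringBuilder(measurement);
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            line.append(',').append(tag.getKey()).append('=').append(tag.getValue());
        }
        StringJoiner fields = new StringJoiner(",");
        for (Map.Entry<String, String> field : stringFields.entrySet()) {
            // String field values are double-quoted in line protocol.
            fields.add(field.getKey() + "=\"" + field.getValue() + "\"");
        }
        return line.append(' ').append(fields).append(' ').append(timestampNanos).toString();
    }

    // Joins several lines with newlines so they can go into a single write request.
    static String toBatch(Iterable<String> lines) {
        return String.join("\n", lines);
    }
}
```

Pass maps with a stable iteration order (for example LinkedHashMap) if the order of tags and fields in the output matters to you.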

Removing instances in weka to normalize result classes

I am trying to construct a classifier, but one class is highly overrepresented.
I have tried weighting, but the data set is large and I need to subsample it.
So I thought I might be able to remove some instances from the overrepresented class instead.
Are there any filters I could use to remove a subset of only one of the classes?
Take a look at the SpreadSubsample filter among the supervised instance filters (weka.filters.supervised.instance.SpreadSubsample). It should do the job. Good luck.

Remove Missing Values in Weka

I'm using a dataset in Weka for classification that includes missing values. As far as I understand, Weka replaces them automatically with the modes or means of the training data (using the filter unsupervised/attribute/ReplaceMissingValues) when using a classifier like NaiveBayes.
I would like to try removing them, to see how this affects the quality of the classifier. Is there a filter to do that?
See this answer below for a better, modern approach.
My approach is not perfect, because it becomes quite cumbersome to apply if you have more than 5 or 6 attributes, but if only a few attributes have missing values I suggest using MultiFilter for this purpose.
If you have missing values in 2 attributes, then you'll use RemoveWithValues 2 times in a MultiFilter.
Load your data in Weka Explorer
Select MultiFilter from the Filter area
Click on MultiFilter and Add RemoveWithValues
Then configure each RemoveWithValues filter with the attribute index and select True in matchMissingValues
Save the filter settings and click Apply in Explorer.
Use the removeIf() method on weka.core.Instances with a method reference to hasMissingValue from weka.core.Instance, which returns true if a given Instance has any missing values.
Instances dataset = source.getDataSet(); // for some source
dataset.removeIf(Instance::hasMissingValue);

Custom Date Aggregate Function

I want to sort my Store models by their opening times. The Store model has an is_open function that checks the store's opening time ranges and returns a boolean indicating whether it is open. The problem is that I don't want to sort my queryset manually, for efficiency reasons. I thought that if I wrote a custom annotate function, I could filter the query more efficiently.
So I googled and found that I can extend Django's aggregate class. From what I understand, I have to use predefined SQL functions like MAX, AVG, etc. The thing is, I want to check whether today's date falls within a given list of time intervals. Can anyone tell me which SQL function name I should use?
Edit
I'd like to put the code here, but it's real spaghetti: a page-long piece of code that only generates the time intervals and checks for the matching one.
I want to avoid :
alg = lambda s: not (s.is_open() and s.reachable)
sorted(stores, key=alg)
and replace with :
Store.objects.annotate(is_open = CheckOpen(datetime.today())).order_by('is_open')
But I'm totally lost at how to write CheckOpen...
Have a look at the docs for QuerySet.extra().

How can i query to get the multiple values in SimpleDB (AWS)

In the picture I have highlighted one part. I have an attribute called "deviceModel" that contains more than one value. I want to query my domain for the items (by itemName()) whose deviceModel attribute has more than one value.
Thanks,
Senthil Raja
There is no direct way to get what you are asking for. You need to handle it in your own code. Running a SELECT query gives you each item's attribute-value pairs, so you need to traverse each itemName() and count the values of the attribute you are interested in.
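A rough sketch of that counting step is below. It works on a plain in-memory map standing in for the SELECT result, so the class name, the method name and the data shape are all assumptions; how you fill the map depends on the SDK you use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the map is a stand-in for whatever your SDK returns from SELECT.
final class MultiValueFinder {
    private MultiValueFinder() {}

    // items maps each itemName() to its (attribute name, value) pairs.
    // Returns the names of the items that hold more than one value for attributeName.
    static List<String> itemsWithMultipleValues(
            Map<String, List<Map.Entry<String, String>>> items, String attributeName) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Map.Entry<String, String>>> item : items.entrySet()) {
            long count = item.getValue().stream()
                    .filter(attribute -> attribute.getKey().equals(attributeName))
                    .count();
            if (count > 1) {
                result.add(item.getKey());
            }
        }
        return result;
    }
}
```

Calling itemsWithMultipleValues(items, "deviceModel") would give the item names you are after.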
I think what you are referring to is called multi-valued attributes. When you put a value into an attribute without replacing the existing value, the values accumulate, giving you an array of values attached to that attribute name.
How you create them will depend on the SDK/language you are using for your REST calls, but look for the Replace=true/false option when you set the attribute's value.
Here is the documentation page on retrieving them: http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/ (look under Using Amazon SimpleDB -> Using Select to Create Amazon SimpleDB Queries -> Queries on Attributes with Multiple Values)