How does weka handle overfitting in the information gain calculation? - weka

I was just experimenting with the Information Gain Attribute Evaluator of WEKA when I stumbled over an inconsistency in how it handles overfitting attributes. My relation uses 3 numeric attributes and a class attribute that divides the instances into failing and correct. The class only depends on attr3, while the other two attributes just hold random values. This is the ARFF file:
@relation TEST
@attribute attr1 numeric
@attribute attr2 numeric
@attribute attr3 numeric
@attribute class {failing,correct}
@data
1, 12.714813, 5000, failing
26, -10000, 0, correct
10, 9.547521, 5000, failing
-10000, 3.699694, 0, correct
38, 6.1541, 5000, failing
After I executed the Information Gain Attribute Evaluator with the default settings, this is the output ranking:
=== Attribute Selection on all input data ===
Search Method:
Attribute ranking.
Attribute Evaluator (supervised, Class (nominal): 4 class):
Information Gain Ranking Filter
Ranked attributes:
0.971 3 attr3
0.971 2 attr2
0 1 attr1
Selected attributes: 3,2,1 : 3
Now, attribute 3 rightfully has an information gain of 0.971, which equals the entropy of the relation. Evaluating the other two attributes with information gain does not make much sense, since their values are random and any dependence they show on the class is pure overfitting. But what I do not understand is why one overfitting attribute gets the maximum information gain while the other gets the minimum. How does the Information Gain Attribute Evaluator handle such overfitting attributes?
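For reference, InfoGainAttributeEval cannot score numeric attributes directly; as far as I understand, WEKA first discretizes them (supervised, MDL-based discretization) and then computes the gain on the resulting bins. That is also consistent with the ranking here: sorted by attr2, both correct instances (-10000 and 3.699694) happen to fall below all three failing ones, so a single cut separates the classes perfectly just like attr3 does, whereas no single cut separates the classes on attr1, leaving it with zero gain. A minimal sketch of the arithmetic in plain Python (not WEKA code):

from math import log2

def entropy(labels):
    # Shannon entropy of a class distribution, in bits
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

classes = ['failing', 'correct', 'failing', 'correct', 'failing']
print(round(entropy(classes), 3))   # 0.971 -- the entropy of the relation

# attr3 (and, by chance, attr2) admit a cut that separates the classes
# perfectly, so the conditional entropy is 0 and the gain is maximal:
left, right = ['correct', 'correct'], ['failing', 'failing', 'failing']
gain = entropy(classes) - (2/5) * entropy(left) - (3/5) * entropy(right)
print(round(gain, 3))               # 0.971 -- the maximum possible gain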


Filtering on a Window Annotation in Django

I realize this is a well-known 'limitation', in that you cannot straightforwardly filter on a window function annotation, but I am looking for ways around it. I have seen an answer suggesting a subquery, but I cannot work out what the syntax should be without hitting django.db.utils.ProgrammingError: more than one row returned by a subquery used as an expression.
The data represents a timeseries with discrete states. These states change over time. I need to get a list of timestamps (or indices) where the 'state' changes from one to the next. Using a Window Function, I can get a queryset with the state change, but cannot then filter on that to get the index/timestamp of when the change occurs.
Very simplified (non-working) code is:
models.py:
from django.db import models

class MyModel(models.Model):
    # on_delete is required on ForeignKey since Django 2.0
    parent = models.ForeignKey('MyParent', on_delete=models.CASCADE)
    timestamp = models.DateTimeField()
    state = models.IntegerField()

view:
from django.db.models import F, Window
from django.db.models.functions import Lag

deltas = MyModel.objects.filter(parent=parent).annotate(
    previous_state=Window(
        expression=Lag('state', offset=1, default=0),
        order_by=F('timestamp').asc(),
    ),
    delta=F('state') - F('previous_state'),
).values('delta')
This gives a queryset of records whose delta value indicates a state change (e.g. where state goes from 0 -> 1, delta = 1; where state goes from 3 -> 2, delta = -1, etc.); however, I cannot filter on it to find where those state changes occur. Ideally I could do deltas.filter(delta__gte=1) to find any increases in state, but of course this is not allowed.
Is there any other way to achieve this using Subqueries or similar? (db is PSQL)
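One workaround that should fit here, sketched under a couple of assumptions: push the window expression into a raw subquery and filter in the outer SELECT, since the ORM (before Django 4.2, which I believe added support for filtering against window functions) rejects filters on window annotations. The table name myapp_mymodel is an assumption; substitute your own app label.

# A sketch: compute delta in an inner subquery, filter on it outside.
# Plain window-function SQL, so it runs fine on PostgreSQL.
increases = MyModel.objects.raw(
    """
    SELECT * FROM (
        SELECT id, timestamp, state,
               state - LAG(state, 1, 0) OVER (ORDER BY timestamp) AS delta
        FROM myapp_mymodel
        WHERE parent_id = %s
    ) sub
    WHERE sub.delta >= 1
    """,
    [parent.pk],
)
for row in increases:
    print(row.timestamp, row.delta)  # raw() exposes extra columns as attributes

On Django 4.2+ the original deltas.filter(delta__gte=1) should work as written.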

Default values for Binary fields in Odoo

In brief, what is the default value for a binary field in Odoo?
More specifically, I'm trying to construct a computed field based on whether or not certain documents have been included in a record (i.e. a sort of status bar on the number of completed fields in the record).
As a toy example, if bin1 and bin2 are binary fields and bool is a boolean, my progress would be computed as
progress = 100.0 * (1*bool + 1*(bin1 is not None) + 1*(bin2 is not None)) / 3
Fortunately, this computation works fine after the record is saved. However, while in Edit mode the progress is shown as if it were 2/3.
This brings me to the question of default values for binary fields, or any ideas on how to determine whether or not a binary field is filled.
An empty binary field is False; a populated one contains a base64-encoded string.
So, before you do your computation, you must do something like:
if item.bin_field:
    bin_val = item.bin_field.decode('base64')  # Python 2; use base64.b64decode() on Python 3
Your check is failing because you are doing an "identity comparison": you are asking "is my value identical to None?" instead of checking whether it is falsy.
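Putting that together, a minimal sketch of the computed field based on truthiness; the new-style Odoo API and the field names from the toy example are assumptions:

from odoo import api, fields, models

class MyRecord(models.Model):
    _name = 'my.record'

    bool_field = fields.Boolean()
    bin1 = fields.Binary()
    bin2 = fields.Binary()
    progress = fields.Float(compute='_compute_progress')

    @api.depends('bool_field', 'bin1', 'bin2')
    def _compute_progress(self):
        for rec in self:
            # Empty binary and boolean fields are falsy (False, not None),
            # so plain truthiness is the right test.
            filled = sum(1 for value in (rec.bool_field, rec.bin1, rec.bin2) if value)
            rec.progress = 100.0 * filled / 3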

Select objects by several enum values in Rails

I use enum to store statuses of my model:
enum status: [ :fresh, :approved, :rejected, :returned, :completed, :removed ]
Now I want to select objects with several status values, something like this:
Documents.find_by_status(:fresh, :returned)
How should I do it correctly in Rails 4?
Every enum attribute is stored as an integer in its table column. By default your statuses will have the values fresh: 0, approved: 1, rejected: 2, and so on.
The simplest way to get instances with one value or another is to call something like:
Document.where(status: [0, 1])
To improve readability you can implement a scope in your model:
class Document < ActiveRecord::Base
  enum status: %i(fresh approved rejected returned completed removed)

  scope :find_by_status, ->(*args) { where(status: self.statuses.values_at(*args)) }
end
And use it in a more humanized way:
Document.find_by_status(:fresh, :returned)
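For what it's worth, newer Rails versions (5+) resolve enum names in where clauses, so as far as I know the lookup can be written directly, without the scope:

# Rails 5+ only -- enum names are resolved by where itself:
Document.where(status: [:fresh, :returned])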

Deduplication / matching in CouchDB?

I have documents in CouchDB. The schema looks like this:
userId
email
personal_blog_url
telephone
I assume two users are actually the same person as long as their email, personal_blog_url, or telephone is identical.
I have created 3 views, which basically map email/blog_url/telephone to userIds and then group the userIds under the same key, e.g.:
_view/by_email:
----------------------------------
key values
a_email@gmail.com [123, 345]
b_email@gmail.com [23, 45, 333]
_view/by_blog_url:
----------------------------------
key values
http://myblog.com [23, 45]
http://mysite.com/ss [2, 123, 345]
_view/by_telephone:
----------------------------------
key values
232-932-9088 [2, 123]
000-111-9999 [45, 1234]
999-999-0000 [1]
My questions:
How can I merge the results from the 3 different views into a final user table/view which contains no duplicates?
Or is it even good practice to do such deduplication in CouchDB?
Or what would be a good way to do deduplication in Couch, then?
P.S. In the final view, suppose that for all dupes we only keep the smallest userId.
Thanks.
Good question. Perhaps you could listen to _changes and, for each incoming document, look up the fields that must be unique for a real user in the views you suggested (by_*).
Merge the views into one (emit different fields in one map):
function (doc) {
  // Emit each identifying field that is present; the 1/2/3 tag keeps
  // the three key spaces apart within a single view.
  if (doc.email) emit([1, doc.email], [doc._id]);
  if (doc.personal_blog_url) emit([2, doc.personal_blog_url], [doc._id]);
  if (doc.telephone) emit([3, doc.telephone], [doc._id]);
}
Merge the lists of ids in a reduce function.
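A sketch of such a reduce; it just concatenates the emitted id arrays (fine for small groups, but beware CouchDB's reduce output size limits if a key can match many users):

function (keys, values, rereduce) {
  // Concatenate the [_id] arrays emitted for the same key. The same code
  // covers the rereduce step, since intermediate results are flat arrays.
  var merged = [];
  values.forEach(function (ids) {
    merged = merged.concat(ids);
  });
  return merged;
}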
When a new doc arrives in the changes feed, you can query the view with keys=[[1, email], [2, personal_blog_url], ...] and merge the three lists. If the minimal id among them is smaller than the changed doc's, update the field realId; otherwise update the documents in the list with the changed doc's id.
I suggest using a separate document to store the { userId, realId } relation.
You can't create new documents by just using a view. You'd need a task of some sort to do the actual merging.
Here's one idea.
Instead of creating 3 views, you could create one view (that indexes the data if it exists):
Key                  Values
---                  ------
[userId, 'phone']    777-555-1212
[userId, 'email']    username@example.com
[userId, 'url']      favorite.url.example.com
I wouldn't store anything else except the raw value, as you'd end up with lots of unnecessary duplication of data (if you stored the full object for example).
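A sketch of a map function that would produce that layout; the field names follow the question's schema:

function (doc) {
  // One row per identifying field, keyed by [userId, kind]; only the raw
  // value is stored to keep the index small.
  if (doc.telephone) emit([doc.userId, 'phone'], doc.telephone);
  if (doc.email) emit([doc.userId, 'email'], doc.email);
  if (doc.personal_blog_url) emit([doc.userId, 'url'], doc.personal_blog_url);
}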
Then, to query, you could do something like:
...startkey=[userId]&endkey=[userId,{}]
That would give you all of the duplicate information as a series of docs for that user Id. You'd still need to parse it apart to see if there were duplicates. But, this way, the results would be nicely merged into a single CouchDB call.
Here's a nice example of using arrays as keys on StackOverflow.
You'd still probably load the original "user" document if it had other data that wasn't part of the de-duplication process.
Once discovered, you could consider cleaning up the data on the fly and prevent new duplicates from occurring as new data is entered into your application.

Django object filter - price behaving strangely, e.g. 170 treated as 17

I have a simple object filter that uses price__lt and price__gt. These operate on a property of my product model called price, which is a CharField, i.e. a string (a decimal field showed the same errors and caused trouble with aggregation, so I reverted to a string).
It seems that when these values are passed to the filter they are treated strangely, e.g. 10 is treated as 100. For example:
/products/price/10-200/ returns products priced 100-200. The filters are passed in as FILTER ARGS: {'price__lt': '200', 'price__gt': '10'}. This also breaks in the sense that price/0-170 will NOT return products priced at 18.50; it is treating the 170 as 'less than 18' for some reason.
Any idea what would cause this, and how to fix it? Thanks!
The problem, as Jeff suggests, is that price is a CharField and is therefore compared character by character as a string: any string starting with '1' sorts before any string starting with '2', regardless of length.
I'm curious what problems you had with making price an IntegerField, as that would seem to be the straightforward solution, but if you need to keep price as a CharField, here's a (hacky) way to make the query work:
lt = 200
gt = 10
qs = Product.objects.extra(
    select={'int_price': 'cast(price as int)'},
    # the SELECT alias is not visible in WHERE, so repeat the cast here
    where=['cast(price as int) < %s', 'cast(price as int) > %s'],
    params=[lt, gt],
)
qs.all()  # the result
This uses the extra method of Django's QuerySet class, which you can read about in the docs. In a nutshell, it computes an integer version of the string price using SQL's CAST expression and then filters on the result as an integer.
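On newer Django (1.10+), the same query can be written without extra by annotating with Cast, which is a less hacky variant; a sketch using the model from the question:

from django.db.models import IntegerField
from django.db.models.functions import Cast

# Annotate an integer version of the string price and filter on it --
# the same CAST, but through the documented ORM API.
qs = (Product.objects
      .annotate(int_price=Cast('price', IntegerField()))
      .filter(int_price__lt=200, int_price__gt=10))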