'DataFrame' object has no attribute 'value_counts' in pandas 0.25 - django

I am using
pandas==0.25.0
django-pandas==0.6.1
And I am using value_counts() to count the unique values in two columns:
charges_mean_provinces = whatever.objects.filter(whatever=whatever).values('origin_province', 'destination_province')
df_charges_mean = pd.DataFrame(charges_mean_provinces)
df_charges_mean = df_charges_mean.value_counts().to_frame('cantidad').reset_index()
Locally (in development) it works correctly, but in production (on Heroku) it returns this error:
'DataFrame' object has no attribute 'value_counts'
Is there another way to count unique values from two columns without using value_counts, considering that I cannot change my pandas version on Heroku?
Anyway, value_counts is in the pandas 0.25 documentation, so I do not understand the error.

What version of pandas are you using? value_counts was originally a Series method, not a DataFrame method: you can call value_counts on a specific column, but not on the frame itself.
As of pandas 1.1.0 that was updated, and value_counts is now also a DataFrame method. I recall seeing posts on here about this error from before pandas was updated.
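On pandas 0.25 you can get the same counts with groupby, which is effectively what DataFrame.value_counts does. A minimal sketch against the columns from the question:

import pandas as pd

df_charges_mean = pd.DataFrame(charges_mean_provinces)

# Count each unique (origin, destination) pair; sort to match
# the descending order that value_counts would produce.
df_charges_mean = (
    df_charges_mean
    .groupby(['origin_province', 'destination_province'])
    .size()
    .to_frame('cantidad')
    .reset_index()
    .sort_values('cantidad', ascending=False)
)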

Related

Django manager with datetime.timedelta object inside F query combined with annotate and filter

I am trying to create a manager method inside my app to filter email objects that were created 5/10/15 minutes (or whatever) ago, counting exactly from now.
I thought I would use annotate to create a new parameter, a boolean whose state depends on a simple subtraction with division, checking whether the result is bigger than 0.
from django.db.models import F
from django.utils import timezone
delta = 60 * 1 * 5  # five minutes, in seconds
current_date = timezone.now()
qs = self.annotate(passed=((current_date - F('created_at')).seconds // delta > 0)).filter(passed=True)
At the moment my error says:
AttributeError: 'CombinedExpression' object has no attribute 'seconds'
It is clearly happening due to the fact that (current_date - F('created_at')) does not evaluate to a datetime.timedelta object but to a CombinedExpression object.
I see more problems ahead too, e.g. how to compare the expression to 0?
Anyway, I would appreciate any tips on whether I am somewhere close to achieving my goal, or whether the entire logic behind this query is incorrect.
Well, I managed to find a solution; even though it might not be the most elegant one, it works:
qs = self.annotate(foo=Sum(current_date - F('created_at'))).filter(foo__gt=Sum(timezone.timedelta(seconds=delta)))
Why not something like this:
time_cut_off = timezone.now() - timezone.timedelta(minutes=delta)
qs = self.filter(created_at__gte=time_cut_off)
This will get you the messages created in the last delta minutes. Or were you looking for messages created exactly 5 minutes ago (and how would you define that, if that is the question)?
The documentation provides a simple and elegant solution if your timedelta is a constant:
For date and date/time fields, you can add or subtract a timedelta object. The following would return all entries that were modified more than 3 days after they were published:
>>> from datetime import timedelta
>>> Entry.objects.filter(mod_date__gt=F('pub_date') + timedelta(days=3))
In your case I don't think you even need the F() objects.
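Wrapped up as a queryset method, the whole thing reduces to one filter. A minimal sketch (the method name older_than is made up for illustration):

from django.db import models
from django.utils import timezone

class EmailQuerySet(models.QuerySet):
    def older_than(self, minutes):
        # Anything created at or before this cut-off has existed
        # for at least `minutes` minutes.
        cut_off = timezone.now() - timezone.timedelta(minutes=minutes)
        return self.filter(created_at__lte=cut_off)

Usage would then be something like Email.objects.older_than(5), assuming the queryset is wired up with as_manager().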

Pandas dataframe merge issue

I am learning Python and pandas via Wes McKinney's Python for Data Analysis. One of the examples in Chapter 2 is a merge of MovieLens data on movie_id that is not working. I think the issue is that in ratings the movie_id is an int64, while in movies it is an object. The merge returns an empty data frame.
I have read some of the previous posts on pandas and automatic data type assignment, and found the dtype argument in the pandas.io.parsers.read_table documentation, but I can't get the type to change.
The original code:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ch02/movielens/movies.dat', sep='::', header=None, names=mnames)
And what my research indicated should work:
import numpy as np

movies = pd.read_table('ch02/movielens/movies.dat', sep='::', header=None, names=mnames, dtype={'movie_id': np.int64})
Unfortunately, the type isn't changed and the merge still returns an empty set. I am running pandas 0.10.1.
(Note I haven't looked up the book code, just your post)
First confirm the dtypes:
print ratings_df.dtypes
print movies_df.dtypes
If you find they're different types, you could try this (let's assume ratings_df.movie_id is object instead of int):
ratings_df.movie_id = ratings_df.movie_id.astype(int)
See if your merge now works.
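If dtype= has no visible effect (the python parser that sep='::' forces appears not to have applied dtype in old pandas versions), converting right after the read is a reliable fallback. A minimal sketch, assuming the ratings and movies frames from the book:

# A merge on int64 vs. object keys silently matches nothing,
# so force both key columns to the same integer dtype first.
movies['movie_id'] = movies['movie_id'].astype('int64')
ratings['movie_id'] = ratings['movie_id'].astype('int64')

data = pd.merge(ratings, movies, on='movie_id')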

Aggregate difference between DateTime fields in Django

I have a table containing a series of entries which relate to time periods (specifically, time worked for a client):
task_time:
id | start_time       | end_time         | client (fk)
 1 | 08/12/2011 14:48 | 08/12/2011 14:50 | 2
I am trying to aggregate all the time worked for a given client, from my Django app:
time_worked_aggregate = models.TaskTime.objects.\
    filter(client=some_client_id).\
    extra(select={'elapsed': 'SUM(task_time.end_time - task_time.start_time)'}).\
    values('elapsed')

if len(time_worked_aggregate) > 0:
    time_worked = time_worked_aggregate[0]['elapsed'].total_seconds()
else:
    time_worked = 0
This seems inelegant, but it does work. Or at least so I thought: it turns out that it works fine on a PostgreSQL database, but when I move over to SQLite, everything dies.
A bit of digging suggests that the reason for this is that DateTimes aren't first-class data in SQLite. The following raw SQLite query will do my job:
SELECT SUM(strftime('%s', end_time) - strftime('%s', start_time)) FROM task_time WHERE ...;
My question is as follows:
The Python sample above seems roundabout. Can we do this more elegantly?
More importantly at this stage, can we do it in a way that will work on both Postgres and SQLite? Ideally, I'd like not to be writing raw SQL queries and switching on the database backend that happens to be in place; in general, Django is extremely good at protecting us from this. Does Django have a reasonable abstraction for this operation? If not, what's a sensible way for me to do a conditional switch on the backend?
I should mention for context that the dataset is many thousands of entries; the following is not really practical:
sum([task_time.end_date - task_time.start_date for task_time in models.TaskTime.objects.filter(...)])
Almost the same solution as the one #andri proposed; the final result gives you the same data.
ExpressionWrapper - New in Django 1.8.
from datetime import timedelta
from django.db.models import ExpressionWrapper, F, fields
from app.models import MyModel
duration = ExpressionWrapper(F('closed_at') - F('opened_at'), output_field=fields.DurationField())
objects = MyModel.objects.closed().annotate(duration=duration).filter(duration__gt=timedelta(seconds=2))
for obj in objects:
    print obj.id, obj.duration, obj.duration.seconds
# sample output
# 807 0:00:57.114017 57
# 800 0:01:23.879478 83
# 804 3:40:06.797188 13206
# 801 0:02:06.786300 126
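The same building block also answers the aggregation question directly. A minimal sketch (Django 1.8+; DurationField arithmetic works on both PostgreSQL and SQLite):

from django.db.models import ExpressionWrapper, F, Sum, fields

elapsed = ExpressionWrapper(
    F('end_time') - F('start_time'),
    output_field=fields.DurationField(),
)
total = models.TaskTime.objects.filter(client=some_client_id).aggregate(
    time_worked=Sum(elapsed),
)['time_worked']  # a datetime.timedelta, or None if nothing matched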
I think since Django 1.8 we can do better:
I will just sketch the annotation part; the aggregation step after it should be straightforward:
from django.db.models import F, Func
SomeModel.objects.annotate(
    duration=Func(F('end_date'), F('start_date'), function='age')
)
(More about the Postgres age function here: http://www.postgresql.org/docs/8.4/static/functions-datetime.html)
Each instance of SomeModel will be annotated with a duration field containing the time difference, which in Python will be a datetime.timedelta object (more about datetime.timedelta here: https://docs.python.org/2/library/datetime.html#timedelta-objects).
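To finish with the aggregation step on PostgreSQL, the annotated interval can be summed directly; a minimal sketch (the explicit DurationField saves Django from having to guess the output type):

from django.db.models import DurationField, F, Func, Sum

SomeModel.objects.annotate(
    duration=Func(
        F('end_date'), F('start_date'),
        function='age',
        output_field=DurationField(),
    ),
).aggregate(total_duration=Sum('duration'))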
I will do it step by step:
First step: annotate the timedelta.
Second step: group by and sum the timedelta.
The code looks like this:
from django.db.models import Count, F, Sum

times_obj_list = models.TaskTime.objects.annotate(times=F("end_time") - F("start_time"))
groupby_obj_list = times_obj_list.values("client").annotate(cnt=Count("id"), seconds=Sum("times")).order_by()
Django currently only supports the Min, Max, Avg and Count aggregates, so using raw SQL is the only way to achieve what you want. And when you use raw SQL, database independence goes out the window, so unfortunately you're out of luck: you'll have to detect the database and alter the SQL appropriately.
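If you do end up writing per-backend SQL, django.db.connection.vendor is the usual switch. A minimal sketch (the Postgres branch relies on native timestamp subtraction; the SQLite branch on the strftime trick shown above):

from django.db import connection

if connection.vendor == 'postgresql':
    # Subtracting timestamps yields an interval, returned as a timedelta.
    elapsed_sql = 'SUM(task_time.end_time - task_time.start_time)'
else:  # assume sqlite
    # strftime('%s', ...) converts to Unix seconds, so this yields a plain number.
    elapsed_sql = ("SUM(strftime('%s', task_time.end_time) - "
                   "strftime('%s', task_time.start_time))")

qs = models.TaskTime.objects.filter(client=some_client_id).extra(
    select={'elapsed': elapsed_sql},
).values('elapsed')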

How to aggregate computed field with django ORM? (without raw SQL)

I'm trying to find the cumulative duration of some events; the 'start' and 'end' fields are both django.db.models.DateTimeField fields.
What I would like to do should be expressible like this:
from django.db.models import F, Sum
from my.models import Event
Event.objects.aggregate(anything=Sum(F('start') - F('end')))
# this first example return:
# AttributeError: 'ExpressionNode' object has no attribute 'split'
# Ok I'll try more SQLish:
Event.objects.extra(select={
    'extra_field': 'start - end'
}).aggregate(Sum('extra_field'))
# this time:
# FieldError: Cannot resolve keyword 'extra_field' into field.
I can't aggregate (Sum) start and end separately and then subtract in Python, because the DB can't Sum DateTime objects.
Is there a good way to do this without raw SQL?
Can't help Christophe without a DeLorean, but I was hitting this error and was able to solve it in Django 1.8 like this:
total_sum = Event.objects\
.annotate(anything=Sum(F('start') - F('end')))\
.aggregate(total_sum=Sum('anything'))['total_sum']
When I couldn't upgrade all my dependencies to 1.8, I found this to work with Django 1.7.9 on top of MySQL:
totals = self.object_list.extra(select={
    'extra_field': 'SUM(start - end)'
})[0]
If you are on Postgres, then you can use the django-pg-utils package and compute it in the database: cast the duration into seconds and then take the sum.
from pg_utils import Seconds
from django.db.models import F, Sum
Event.objects.aggregate(anything=Sum(Seconds(F('start') - F('end'))))
These answers don't really satisfy me yet; my current workaround works, but it is not computed in the DB...
reduce(lambda h, e: h + (e.end - e.start).total_seconds(), events, 0)
It returns the duration of all events in the queryset, in seconds.
Any better SQL-less solutions?
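For later readers: from Django 1.8 onward, a single aggregate with an explicit output_field does this entirely in the database. A minimal sketch against the Event model from the question:

from django.db.models import DurationField, F, Sum

total = Event.objects.aggregate(
    anything=Sum(F('end') - F('start'), output_field=DurationField()),
)['anything']  # a datetime.timedelta, or None on an empty queryset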

Django: order by position ignoring NULL

I have a problem with Django queryset ordering.
My model contains a field named position, a PositiveSmallIntegerField which I'd like to use to order query results.
I use order_by('position'), which works great.
Problem: my position field is nullable (null=True, blank=True), because I don't want to specify a position for every one of the 50,000 instances of my model. When some instances have a NULL position, order_by returns them at the top of the list; I'd like them to be at the end.
In raw SQL, I used to write things like:
IF(position IS NULL or position='', 1, 0)
(see http://www.shawnolson.net/a/730/mysql-sort-order-with-null.html). Is it possible to get the same result using Django, without writing raw SQL?
You can use annotate() from Django aggregation to do the trick:
from django.db.models import Count

items = Item.objects.all().annotate(null_position=Count('position')).order_by('-null_position', 'position')
(Count over a NULL column yields 0, so rows with a NULL position sort last when ordering by -null_position.)
As of Django 1.8 you can use Coalesce() to convert NULL to 0.
Sample:
import datetime
from django.db.models import Value
from django.db.models.functions import Coalesce
from app import models
# Coalesce works by taking the first non-null value. So we give it
# a date far before any non-null values of last_active. Then it will
# naturally sort behind instances of Box with a non-null last_active value.
the_past = datetime.datetime.now() - datetime.timedelta(days=10*365)
boxes = models.Box.objects.all().annotate(
    new_last_active=Coalesce('last_active', Value(the_past))
).order_by('-new_last_active')
It's a shame there are a lot of questions like this on SO that are not marked as duplicate. See (for example) this answer for the native solution for Django 1.11 and newer. Here is a short excerpt:
Added the nulls_first and nulls_last parameters to Expression.asc() and desc() to control the ordering of null values.
Example usage (from a comment on that answer):
from django.db.models import F
MyModel.objects.all().order_by(F('price').desc(nulls_last=True))
Credit goes to the original answer author and commenter.
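Applied to the position field from the question, that becomes (Django 1.11+):

from django.db.models import F

items = Item.objects.order_by(F('position').asc(nulls_last=True))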
Using extra() as Ignacio said optimizes the final query a lot. In my application I've saved more than 500 ms (which is a lot for a query) of database processing by using extra() instead of annotate().
Here is how it would look in your case:
items = Item.objects.all().extra(
    select={
        'null_position': 'CASE WHEN {tablename}.position IS NULL THEN 0 ELSE 1 END'
    }
).order_by('-null_position', 'position')
{tablename} should be something like {Item's app}_item, following Django's default table naming.
I found that the syntax in Pablo's answer needed to be updated to the following on my 1.7.1 install:
items = Item.objects.all().extra(select={'null_position': "CASE WHEN {name of Item's table}.position IS NULL THEN 0 ELSE 1 END"}).order_by('-null_position', 'position')
QuerySet.extra() can be used to inject expressions into the query and order by them.