Django queryset aggregate by time interval

Hi, I am writing a Django view which outputs data for graphing on the client side (Highcharts). The data is climate data, with a given parameter recorded once per day.
My query is this:
import datetime

date_format = '%Y-%m-%d'
sd = datetime.datetime.strptime(startdate, date_format)
ed = datetime.datetime.strptime(enddate, date_format)
data = Climate.objects.filter(recorded_on__range=(sd, ed)).order_by('recorded_on')
Now, as the range is increased the dataset obviously gets larger and this does not present well on the graph (aside from slowing things down considerably).
Is there a way to group my data as averages over time periods, specifically the average for each month or for each year?
I realize this could be done in SQL as mentioned here: django aggregation to lower resolution using grouping by a date range
But I would like to know if there is a handy way in Django itself.
Or is it perhaps better to modify the db directly and use a script to populate month and year fields from the timestamp?
Any help much appreciated.

Have you tried using django-qsstats-magic (https://github.com/kmike/django-qsstats-magic)?
It makes things very easy for charting; here is a timeseries example from their docs:
from django.contrib.auth.models import User
import datetime, qsstats
qs = User.objects.all()
qss = qsstats.QuerySetStats(qs, 'date_joined')
today = datetime.date.today()
seven_days_ago = today - datetime.timedelta(days=7)
time_series = qss.time_series(seven_days_ago, today)
print('New users in the last 7 days: %s' % [t[1] for t in time_series])
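If you'd rather stay inside stock Django (1.10+), the ORM itself can group by month or year with Trunc and Avg. A minimal sketch against the Climate model from the question; the measured field name (parameter here) is an assumption, since the question doesn't name it:
from django.db.models import Avg
from django.db.models.functions import Trunc

# One row per month with the average measurement for that month;
# pass 'year' instead of 'month' for yearly averages.
monthly_avg = (Climate.objects
               .filter(recorded_on__range=(sd, ed))
               .annotate(period=Trunc('recorded_on', 'month'))
               .values('period')
               .annotate(avg=Avg('parameter'))  # 'parameter' is a placeholder field name
               .order_by('period'))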

Related

How to add weeks to a datetime column, depending on a django model/dictionary?

Context
There is a dataframe of customer invoices and their due dates (identified by customer code).
Week(s) need to be added depending on the customer code.
A model is created to persist the list of customers and the week(s) to be added.
What is done so far:
models.py
class BpShift(models.Model):
    bp_name = models.CharField(max_length=50, default='')
    bp_code = models.CharField(max_length=15, primary_key=True, default='')
    weeks = models.IntegerField(default=0)
helper.py
import datetime

from .models import BpShift

# used in views later
def week_shift(self, df):
    df['DueDateRange'] = df['DueDate'] + datetime.timedelta(
        weeks=BpShift.objects.get(pk=df['BpCode']).weeks)
I realised my understanding of DataFrames is seriously flawed.
df['A'] and df['B'] return Series, so of course timedelta wouldn't work like this: weeks=BpShift.objects.get(pk=df['BpCode']).weeks.
Dataframe
d = {'BpCode':['customer1','customer2'],'DueDate':['2020-05-30','2020-04-30']}
df = pd.DataFrame(data=d)
Customer List csv
BP Name,BP Code,Week(s)
Customer1,CA0023MY,1
Customer2,CA0064SG,1
Error
BpShift matching query does not exist.
Commentary
I used these methods in the hope that I would be able to change the dataframe at once, instead of using df.iterrows(). I have recently been avoiding for loops like the plague and am wondering if this is the "correct" mentality. Is there any recommended way of doing this? Thanks in advance for any guidance!
This question, Python & Pandas: series to timedelta, will help take you from Series to timedelta. And although
pandas.Series(
    BpShift.objects.filter(
        pk__in=df['BpCode'].tolist()
    ).values_list('weeks', flat=True)
)
will give you a Series of integers, I doubt the order is the same as in df['BpCode'], because that depends on the Django model's ordering and the database backend.
So you might be better off explicitly creating not a Series but a DataFrame with pk and weeks columns, so you can use df.join. Something like this
pandas.DataFrame(
    BpShift.objects.filter(
        pk__in=df['BpCode'].tolist()
    ).values_list('pk', 'weeks'),
    columns=['BpCode', 'weeks'],
)
should give you a DataFrame that you can join with.
So combined this should be the gist of your code:
import pandas as pd

django_response = [('customer1', 1), ('customer2', '2')]
d = {'BpCode':['customer1','customer2'],'DueDate':['2020-05-30','2020-04-30']}
df = pd.DataFrame(data=d).set_index('BpCode').join(
pd.DataFrame(django_response, columns=['BpCode', 'weeks']).set_index('BpCode')
)
df['DueDate'] = pd.to_datetime(df['DueDate'])
df['weeks'] = pd.to_numeric(df['weeks'])
df['new_duedate'] = df['DueDate'] + df['weeks'] * pd.Timedelta('1W')
print(df)
DueDate weeks new_duedate
BpCode
customer1 2020-05-30 1 2020-06-06
customer2 2020-04-30 2 2020-05-14
You were right to want to avoid looping. This approach gets all the data in one SQL query from your Django model by using filter, then does a left join with the DataFrame you already have, casts the dates and weeks to the right types, and computes the new due date using whole columns instead of loops over them.
NB: the left join will give NaN and NaT for customers that don't exist in your Django database. You can either avoid those rows by passing how='inner' to df.join or handle them however you like.
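Putting the queryset and the join together, a vectorised replacement for the week_shift helper from the question might look roughly like this (a sketch; it drops the unused self and uses the inner join so missing customers are skipped rather than producing NaT):
import pandas as pd

from .models import BpShift

def week_shift(df):
    # One query for all the customers in the frame, then a single join - no iterrows().
    weeks_df = pd.DataFrame(
        list(BpShift.objects.filter(pk__in=df['BpCode'].tolist())
                            .values_list('pk', 'weeks')),
        columns=['BpCode', 'weeks'],
    ).set_index('BpCode')
    out = df.set_index('BpCode').join(weeks_df, how='inner')
    out['DueDate'] = pd.to_datetime(out['DueDate'])
    out['DueDateRange'] = out['DueDate'] + out['weeks'] * pd.Timedelta('1W')
    return out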

Sum greatest values of each day from period with Django Query

My project has a model that goes like this:
class Data(Model):
    data = FloatField(verbose_name='Data', null=True, blank=True)
    created_at = DateTimeField(verbose_name='Created at')
And my app creates a few hundred logs of this model per day.
I'm trying to sum only the greatest value of each day, without having to iterate over the rows (using only Django queries).
Is it possible without writing raw SQL?
PS: I'm able to get the greatest 'data' of each day, so the current logic iterates over days and sums the greatest value of each. But that solution is becoming too slow, and I'd like to solve it directly at the db level.
Annotations and aggregates to the rescue:
from django.db.models import Sum, Max
from django.db.models.functions import Trunc
report = (Data.objects
          .annotate(day=Trunc('created_at', 'day'))
          .values('day')
          .annotate(greatest=Max('data'))
          .values('greatest')
          .aggregate(total=Sum('greatest')))
print(report['total'])
The resulting SQL is almost simpler than the code:
SELECT SUM("greatest")
FROM (
    SELECT MAX("app_data"."data") AS "greatest"
    FROM "app_data"
    GROUP BY DATE_TRUNC('day', "app_data"."created_at")
) subquery
If you are using a database backend that supports distinct on fields (as PostgreSQL does), you can do:
Data.objects.order_by('created_at__date', '-data').distinct('created_at__date')
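Note that the distinct-on queryset gives you one row per day (the row holding that day's greatest data); the final sum then still happens outside the database. A minimal sketch of the client-side step:
daily_max = (Data.objects
             .order_by('created_at__date', '-data')
             .distinct('created_at__date'))
total = sum(row.data or 0 for row in daily_max)  # one max per day, summed in Python (None treated as 0)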

Using Django ORM to retrieve recent rows

In SQL, if I wanted to query a table for data from the most recent 10 minutes (regardless of timezones and such), I'd simply do (using postgresql parlance):
select * from table where creation_time > now() - interval '10 mins';
Is there an equivalent way to do something like this using the Django ORM, disregarding what timezone settings one has set for the app? Would be great to get an illustrative example here.
Try this:
Data within the last 10 minutes:
from datetime import datetime, timedelta

time_threshold = datetime.now() - timedelta(minutes=10)
results = Table.objects.filter(createdOn__gte=time_threshold)
Last 10 rows based on the createdOn value:
recentData = Table.objects.all().order_by('-createdOn')[:10]
Last 10 rows if you don't have a createdOn column to filter on:
recentData = Table.objects.all().order_by('-id')[:10]
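One caveat: if the project runs with USE_TZ = True, comparing the naive datetime.now() against an aware column raises warnings and can be off by the UTC offset. A variant using django.utils.timezone.now(), which works in both configurations:
from datetime import timedelta
from django.utils import timezone

# timezone.now() returns an aware datetime when USE_TZ=True, a naive one otherwise
time_threshold = timezone.now() - timedelta(minutes=10)
results = Table.objects.filter(createdOn__gte=time_threshold)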

Get the total values in each month using Django QuerySet

I have this model for employee overtime hours:
class Overtime(IncomeBase):
    day = models.DateField(verbose_name="Date")
    value = models.FloatField(default=1)
I need to extract the total value for each month. Now I am using a daily QuerySet in the manager.
class OvertimeManager(models.Manager):
    def daily_report(self):
        return self.values('day').annotate(hours=models.Sum('value')).order_by('-day')
However now I need a monthly report that will get the Sum of value for each month.
I tried to extract the month first but then I lose the values.
Note: a month's total should not include all years, so specifically I need to group by month and year.
If you are using PostgreSQL you can do this (other backends have similar functions):
Overtime.objects.extra(
    select={'month': "to_char(day, 'Mon')", 'year': "extract(year from day)"}
).values('month', 'year').annotate(Sum('value'))
More info:
http://www.postgresql.org/docs/7.4/static/functions-formatting.html
http://www.postgresql.org/docs/9.1/static/functions-datetime.html
Or the Django way:
from django.db import connection

truncate_month = connection.ops.date_trunc_sql('month', 'day')
Overtime.objects.extra(select={'month': truncate_month}).values('month').annotate(Sum('value'))
I think this will help you.
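On Django 1.10+ you can skip extra() entirely and use TruncMonth, which keeps year and month together, so totals are never merged across years. A minimal sketch mirroring the daily_report manager method:
from django.db.models import Sum
from django.db.models.functions import TruncMonth

monthly_report = (Overtime.objects
                  .annotate(month=TruncMonth('day'))  # truncates each date to the first of its month
                  .values('month')
                  .annotate(hours=Sum('value'))
                  .order_by('-month'))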

App Engine GQL: querying a date range

What would be the App Engine equivalent of this Django statement?
return Post.objects.get(created_at__year=bits[0],
                        created_at__month=bits[1],
                        created_at__day=bits[2],
                        slug__iexact=bits[3])
I've ended up writing this:
Post.gql('WHERE created_at > DATE(:1, :2, :3) AND created_at < DATE(:1, :2, :4) AND slug = :5',
         int(bits[0]), int(bits[1]), int(bits[2]), int(bits[2]) + 1, bits[3])
But it's pretty horrific compared to Django. Any other more Pythonic/Django-magic way, e.g. with Post.filter() or created_at.day/month/year attributes?
How about:
from datetime import datetime, timedelta
created_start = datetime(year, month, day)
created_end = created_start + timedelta(days=1)
slug_value = 'my-slug-value'
posts = Post.all()
posts.filter('created_at >=', created_start)
posts.filter('created_at <', created_end)
posts.filter('slug =', slug_value)
# You can iterate over this query set just like a list
for post in posts:
    print(post.key())
You don't need 'relativedelta' - what you describe is a datetime.timedelta. Otherwise, your answer looks good.
As far as processing time goes, the nice thing about App Engine is that nearly all queries have the same cost-per-result - and all of them scale proportionally to the records returned, not the total datastore size. As such, your solution works fine.
Alternately, if you need your one inequality filter for something else, you could add a 'created_day' DateProperty, and do a simple equality check on that.
Ended up using the relativedelta library plus chaining the filters jQuery-style, which, although not too Pythonic yet, is a tad more comfortable to write and much DRYer. :) Still not sure if it's the best way to do it, as it'll probably require more database processing time?
date = datetime(int(year), int(month), int(day))
... # then
queryset = (Post.objects_published()
            .filter('created_at >=', date)
            .filter('created_at <', date + relativedelta(days=+1)))
...
and passing slug to the object_detail view or yet another filter.
By the way, you could use datetime.timedelta. That lets you find date ranges or date deltas.
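For completeness, a sketch of the timedelta version of the one-day range bound, with no third-party library needed:
from datetime import datetime, timedelta

date = datetime(int(year), int(month), int(day))
next_day = date + timedelta(days=1)  # replaces relativedelta(days=+1)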