Django query aggregate function is slow?

I am working with Django to see how it handles large databases. I use a database with the fields name, age, date of birth (dob), and height. The database has about 500,000 entries. I have to find the average height of persons (1) of the same age and (2) born in the same year. The aggregate query on the table takes about 10 s. Is that usual, or am I missing something?
For age:
age = [i[0] for i in Data.objects.values_list('age').distinct()]
ht = []
for each in age:
    aggr = Data.objects.filter(age=each).aggregate(ag_ht=Avg('height'))
    ht.append(aggr)
For dob:
age = [i[0].year for i in Data.objects.values_list('dob').distinct()]
for each in age:
    aggr = Data.objects.filter(dob__contains=each).aggregate(ag_ht=Avg('height'))
    ht.append(aggr)
The year has to be extracted from dob; the database is SQLite and I cannot use the __year lookup (it requires a join).

For these queries to be efficient, you have to create indexes on the age and dob columns.
You will get a small additional speedup by using covering indexes, i.e., two-column indexes that also include the height column.
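For illustration, a minimal sketch of how such indexes could be declared on the model (assuming the Data model from the question and a Django version with Meta.indexes, 1.11+; the field types are guesses):

from django.db import models

class Data(models.Model):
    name = models.CharField(max_length=100)  # assumed field type
    age = models.IntegerField()              # assumed field type
    dob = models.DateField()                 # assumed field type
    height = models.FloatField()             # assumed field type

    class Meta:
        indexes = [
            # two-column indexes: the filter column first, plus height,
            # so the Avg('height') aggregate can be served from the index
            models.Index(fields=['age', 'height']),
            models.Index(fields=['dob', 'height']),
        ]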

Full version, with timing code that compares your loop against a queryset version (the queryset does the grouping in a single query via values('age').annotate(...) instead of one query per age):
import time
from dd.models import Data
from django.db.models import Avg
from django.db.models.functions import ExtractYear
# for age
start = time.time()
age = [i[0] for i in Data.objects.values_list('age').distinct()]
ht = []
for each in age:
    aggr = Data.objects.filter(age=each).aggregate(ag_ht=Avg('height'))
    ht.append(aggr)
end = time.time()
loop_time = end - start
start = time.time()
qs = Data.objects.values('age').annotate(ag_ht=Avg('height')).order_by('age')
ht_qs = qs.values_list('age', 'ag_ht')
end = time.time()
qs_time = end - start
print(loop_time / qs_time)
# for dob year, with a small refactor of your version (wrap the years in a set)
start = time.time()
years = set([i[0].year for i in Data.objects.values_list('dob').distinct()])
ht_year_loop = []
for each in years:
    aggr = Data.objects.filter(dob__contains=each).aggregate(ag_ht=Avg('height'))
    ht_year_loop.append((each, aggr.get('ag_ht')))
end = time.time()
loop_time = end - start
start = time.time()
qs = Data.objects.annotate(dob_year=ExtractYear('dob')).values('dob_year').annotate(ag_ht=Avg('height'))
ht_qs = qs.values_list('dob_year', 'ag_ht')
end = time.time()
qs_time = end - start
print(loop_time / qs_time)

Related

average spending per day - django model

I have a model that looks something like this:
class Payment(TimeStampModel):
    timestamp = models.DateTimeField(auto_now_add=True)
    amount = models.FloatField()
    creator = models.ForeignKey(to='Payer')
What is the correct way to calculate average spending per day?
I can aggregate by day, but then the days when a payer does not spend anything won't be counted, which is not correct.
UPDATE:
So, let's say I have only two records in my db, one from March 1 and one from January 1. The average spending per day should be something like
(sum of all spendings) / (March 1 - January 1)
that is, divided by roughly 60 days. However, the following of course gives me just the average spending per item, and the number of days comes out as 2:
for p in Payment.objects.all():
    print(p.timestamp, p.amount)
p = Payment.objects.all().dates('timestamp', 'day').aggregate(Sum('amount'), Avg('amount'))
print(p)
Output:
2019-03-05 17:33:06.490560+00:00 456.0
2019-01-05 17:33:06.476395+00:00 123.0
{'amount__sum': 579.0, 'amount__avg': 289.5}
You can aggregate min and max timestamp and the sum of amount:
from django.db.models import Min, Max, Sum

def average_spending_per_day():
    aggregate = Payment.objects.aggregate(Min('timestamp'), Max('timestamp'), Sum('amount'))
    min_datetime = aggregate.get('timestamp__min')
    if min_datetime is not None:
        min_date = min_datetime.date()
        max_date = aggregate.get('timestamp__max').date()
        total_amount = aggregate.get('amount__sum')
        days = (max_date - min_date).days + 1
        return total_amount / days
    return 0
If there is a min_datetime, then there is some data in the table, so there is also a max date and a total amount; otherwise we return 0 (or whatever you prefer).
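A minimal usage sketch, using the two example payments from the question (2019-01-05 and 2019-03-05, amounts 123.0 and 456.0; the exact figure depends on your data):

avg = average_spending_per_day()
# (123.0 + 456.0) / 60 days = 9.65
print(avg)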
It depends on your backend, but you want to divide the sum of amount by the difference in days between your max and min timestamp. In Postgres you can simply subtract two dates to get the number of days between them; in MySQL there is a DATEDIFF function that takes two dates and returns the number of days between them.
from django.db.models import (
    DecimalField, ExpressionWrapper, Func, IntegerField, Max, Min, Sum, Value,
)

class Date(Func):
    function = 'DATE'

class MySQLDateDiff(Func):
    function = 'DATEDIFF'

    def __init__(self, *expressions, **extra):
        expressions = [Date(exp) for exp in expressions]
        extra['output_field'] = extra.get('output_field', IntegerField())
        super().__init__(*expressions, **extra)

class PgDateDiff(Func):
    template = "%(expressions)s"
    arg_joiner = ' - '

    def __init__(self, *expressions, **extra):
        expressions = [Date(exp) for exp in expressions]
        extra['output_field'] = extra.get('output_field', IntegerField())
        super().__init__(*expressions, **extra)

agg = {
    'avg_spend': ExpressionWrapper(
        Sum('amount') / (PgDateDiff(Max('timestamp'), Min('timestamp')) + Value(1)),
        output_field=DecimalField(),
    )
}
avg_spend = Payment.objects.aggregate(**agg)
That looks roughly right to me, although I haven't tested it. Use MySQLDateDiff instead if MySQL is your backend.

Python script | long running | Need suggestions to optimize

I have written this script to generate a dataset containing 15-minute time intervals, based on the operational hours provided for every day of the week, for 365 days.
Example: let us say Store 1 opens at 9 AM and closes at 9 PM on all days. That is 12 hours every day: 12 * 4 = 48 (15-minute periods per day), and 48 * 365 = 17520 (15-minute periods for a year).
The sample dataset only contains 5 sites, but there are about 9000 sites that this script needs to generate data for.
The script runs fine for a handful of sites (100) and a couple of days (2), but it needs to run for 9000 sites and 365 days.
Looking for suggestions to make this run faster. This will be running on a local machine.
input data: https://drive.google.com/open?id=1uLYRUsJ2vM-TIGPvt5RhHDhTq3vr4V2y
output data: https://drive.google.com/open?id=13MZCQXfVDLBLFbbmmVagIJtm6LFDOk_T
Please let me know if I can provide anything more to get this answered.
def datetime_range(start, end, delta):
    current = start
    while current < end:
        yield current
        current += delta

import pandas as pd
import numpy as np
import cProfile
from datetime import timedelta, date, datetime

# inputs
empty_data = pd.DataFrame(columns=['store', 'timestamp'])
start_dt = date(2019, 1, 1)
days = 365
data = "input data | attached to the post"

for i in range(days):
    for j in range(len(data.store)):
        curr_date = start_dt + timedelta(days=i)
        curr_date_year = curr_date.year
        curr_date_month = curr_date.month
        curr_date_day = curr_date.day
        weekno = curr_date.weekday()
        if weekno < 5:
            dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in
                   datetime_range(datetime(curr_date_year, curr_date_month, curr_date_day, data['m_f_open_hrs'].iloc[j], data['m_f_open_min'].iloc[j]),
                                  datetime(curr_date_year, curr_date_month, curr_date_day, data['m_f_close_hrs'].iloc[j], data['m_f_close_min'].iloc[j]),
                                  timedelta(minutes=15))]
            vert = pd.DataFrame(dts, columns=['timestamp'])
            vert['store'] = data['store'].iloc[j]
            empty_data = pd.concat([vert, empty_data])
        elif weekno == 5:
            dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in
                   datetime_range(datetime(curr_date_year, curr_date_month, curr_date_day, data['sat_open_hrs'].iloc[j], data['sat_open_min'].iloc[j]),
                                  datetime(curr_date_year, curr_date_month, curr_date_day, data['sat_close_hrs'].iloc[j], data['sat_close_min'].iloc[j]),
                                  timedelta(minutes=15))]
            vert = pd.DataFrame(dts, columns=['timestamp'])
            vert['store'] = data['store'].iloc[j]
            empty_data = pd.concat([vert, empty_data])
        else:
            dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in
                   datetime_range(datetime(curr_date_year, curr_date_month, curr_date_day, data['sun_open_hrs'].iloc[j], data['sun_open_min'].iloc[j]),
                                  datetime(curr_date_year, curr_date_month, curr_date_day, data['sun_close_hrs'].iloc[j], data['sun_close_min'].iloc[j]),
                                  timedelta(minutes=15))]
            vert = pd.DataFrame(dts, columns=['timestamp'])
            vert['store'] = data['store'].iloc[j]
            empty_data = pd.concat([vert, empty_data])

final_data = empty_data
I think the most time-consuming part of your script is the datetime handling.
You should try to do all of those calculations using UNIX time, which represents time as an integer counting seconds, so you can take two UNIX timestamps and get their difference with a simple subtraction.
In my opinion you should perform all the operations like that, and only once the processing has finished convert everything back to a more readable date format.
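A minimal sketch of that idea (the 15-minute step and the output format come from the question; the opening and closing times here are made-up example values):

from datetime import datetime, timezone

STEP = 15 * 60  # 15 minutes, in seconds

# hypothetical open/close datetimes for one store on one day
open_ts = int(datetime(2019, 1, 1, 9, 0, tzinfo=timezone.utc).timestamp())
close_ts = int(datetime(2019, 1, 1, 21, 0, tzinfo=timezone.utc).timestamp())

# all the interval arithmetic stays in plain integers ...
slots = range(open_ts, close_ts, STEP)

# ... and the conversion back to a readable format happens only once, at the end
timestamps = [datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d %H:%M')
              for ts in slots]
print(len(timestamps))  # 48 slots for a 12-hour day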
The other thing you should change in your script is all the nearly identical, repeated code. It won't improve performance, but it improves readability, debugging, and your skills as a programmer. As a simple example, I have refactored some of the code (you can probably do better than what I did, but this is just an example).
def datetime_range(start, end, delta):
    current = start
    while current < end:
        yield current
        current += delta

from datetime import timedelta, date, datetime
import numpy as np
import cProfile
import pandas as pd

# inputs
empty_data = pd.DataFrame(columns=['store', 'timestamp'])
start_dt = date(2019, 1, 1)
days = 365
data = "input data | attached to the post"

for i in range(days):
    for j in range(len(data.store)):
        curr_date = start_dt + timedelta(days=i)
        curr_date_year = curr_date.year
        curr_date_month = curr_date.month
        curr_date_day = curr_date.day
        weekno = curr_date.weekday()

        # pick the column prefix that matches the day of the week
        week_range = 'sun'
        if weekno < 5:
            week_range = 'm_f'
        elif weekno == 5:
            week_range = 'sat'

        first_time = datetime(curr_date_year, curr_date_month, curr_date_day,
                              data[week_range + '_open_hrs'].iloc[j], data[week_range + '_open_min'].iloc[j])
        second_time = datetime(curr_date_year, curr_date_month, curr_date_day,
                               data[week_range + '_close_hrs'].iloc[j], data[week_range + '_close_min'].iloc[j])
        dts = [dt.strftime('%Y-%m-%d %H:%M') for dt in datetime_range(first_time, second_time, timedelta(minutes=15))]
        vert = pd.DataFrame(dts, columns=['timestamp'])
        vert['store'] = data['store'].iloc[j]
        empty_data = pd.concat([vert, empty_data])

final_data = empty_data
Good luck!

unsupported operand types in django

First date:
d = datetime.datetime.utcnow().replace(tzinfo=utc)
Second date:
checkin = models.DateTimeField(default=timezone.now)
e = Checkin.objects.all().values()
t = the last value in 'e'
co = d.time()
ci = t.time()
I want the difference between 'co' and 'ci'.
It looks like you probably need to make both of your datetime objects time zone aware.
I have a local Page model that has a DateTimeField called first_published_at. Here's how I handle it:
target_tz = datetime.timezone.utc
now_dt = datetime.datetime.utcnow().replace(tzinfo=target_tz)
inst = m.Page.objects.get(pk=18)
model_dt = inst.first_published_at.replace(tzinfo=target_tz)
print(now_dt - model_dt)  # 734 days, 12:46:53.059321
Or you could make both of them timezone naive:
target_tz = None
now_dt = datetime.datetime.utcnow().replace(tzinfo=target_tz)
inst = m.Page.objects.get(pk=18)
model_dt = inst.first_published_at.replace(tzinfo=target_tz)
print(now_dt - model_dt) # Same as above
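Applied to the question's co/ci case, note that two datetime.time objects cannot be subtracted at all (that is what triggers the unsupported-operand error), so a sketch like the following subtracts the full aware datetimes instead (assuming USE_TZ is enabled, and that "last value in 'e'" means the newest Checkin; the latest() call is my assumption):

from django.utils import timezone

d = timezone.now()                      # aware "now"
t = Checkin.objects.latest('checkin')   # newest Checkin record
diff = d - t.checkin                    # a datetime.timedelta
print(diff)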

Find object in datetime range

Let's assume that I have a model:
class MyModel(...):
    start = models.DateTimeField()
    stop = models.DateTimeField(null=True, blank=True)
And I also have two records:
start=2012-01-01 7:00:00 stop=2012-01-01 14:00:00
start=2012-01-01 7:00:03 stop=2012-01-01 23:59:59
Now I want to find the second record, so its start datetime should be between the other record's start and stop, and its stop should have the time 23:59:59. How do I build such a query?
Some more info:
I think this requires an F object. I want to find all records whose start time is between another record's start time and stop time, whose stop time is 23:59:59, and whose stop date is the same as its start date.
You can use range and extra:
from django.db.models import Q
q1 = Q(start__range=(start_date_1, end_date_1))
q2 = Q(start__range=(start_date_2, end_date_2))
query = (''' EXTRACT(hour from stop) = %i
             and EXTRACT(minute from stop) = %i
             and EXTRACT(second from stop) = %i''' %
         (23, 59, 59)
         )
MyModel.objects.filter(q1 | q2).extra(where=[query])
Note: this was posted before the requirement was tightened to 'time is 23:59:59, and date is the same as in start'.
To perform the query: "start datetime should be between start and stop"
MyModel.objects.filter(start__gte=obj1.start, start__lte=obj1.stop)
I don't quite understand your second condition, though. Do you want it to match only objects with hour 23:59:59, but for any day?
dt = '2012-01-01 8:00:00'
stop_hour = '23'
stop_minute = '59'
stop_sec = '59'
where = 'HOUR(stop) = %(hour)s AND MINUTE(stop) = %(minute)s AND SECOND(stop) = %(second)s' \
    % {'hour': stop_hour, 'minute': stop_minute, 'second': stop_sec}
objects = MyModel.objects.filter(start__gte=dt, stop__lte=dt) \
.extra(where=[where])
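As a side note, recent Django versions support __hour, __minute and __second lookups on DateTimeField, so the same condition can be expressed without raw SQL; a minimal sketch, assuming the MyModel from the question (range_start and range_stop are hypothetical bounds for the start field):

objects = MyModel.objects.filter(
    start__range=(range_start, range_stop),
    stop__hour=23,
    stop__minute=59,
    stop__second=59,
)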

Django: Total birthdays each day for the next 30 days

I've got a model similar to this:
class Person(models.Model):
    name = models.CharField(max_length=40)
    birthday = models.DateTimeField()  # their next birthday
I would like to get a list of the total birthdays for each day for the next 30 days. So for example, the list would look like this:
[[9, 0], [10, 3], [11, 1], [12, 1], [13, 5], ...]  # 30 entries in the list
Each list entry in the list is a date number followed by the number of birthdays on that day. So for example on the 9th of May there are 0 birthdays.
UPDATES
My db is sqlite3 - will be moving to postgres in the future.
from django.db.models import Count
import datetime
today = datetime.date.today()
thirty_days = today + datetime.timedelta(days=30)
birthdays = dict(Person.objects.filter(
birthday__range=[today, thirty_days]
).values_list('birthday').annotate(Count('birthday')))
for day in range(30):
    date = today + datetime.timedelta(day)
    print("[%s, %s]" % (date, birthdays.get(date, 0)))
I would get the list of days and birthday count this way:
from datetime import date, timedelta
today = date.today()
thirty_days = today + timedelta(days=30)
# get everyone with a birthday
people = Person.objects.filter(birthday__range=[today, thirty_days])
birthday_counts = []
for date in [today + timedelta(x) for x in range(30)]:
    # keep only the birthdays falling on the given date's day, and count them
    birthdays = [date.day, len([p for p in people if p.birthday.day == date.day])]
    birthday_counts.append(birthdays)
Something like this --
from datetime import date, timedelta

class Person(models.Model):
    name = models.CharField(max_length=40)
    birthday = models.DateField()

    @staticmethod
    def upcoming_birthdays(days=30):
        today = date.today()
        where = 'DATE_ADD(birthday, INTERVAL (YEAR(NOW()) - YEAR(birthday)) YEAR) BETWEEN DATE(NOW()) AND DATE_ADD(NOW(), INTERVAL %s DAY)'
        birthdays = Person.objects.extra(where=[where], params=[days]).values_list('birthday', flat=True)
        data = []
        for offset in range(0, days):
            i = 0
            d = today + timedelta(days=offset)
            for b in birthdays:
                if b.day == d.day and b.month == d.month:
                    i += 1
            data.append((d.day, i))
        return data

print(Person.upcoming_birthdays())
(Queryset of people with a birthday in the next X days)
Found a cool solution for this!
It works for me!
from datetime import datetime, timedelta
from functools import reduce
import operator

from django.db.models import Q

def birthdays_within(days):
    now = datetime.now()
    then = now + timedelta(days)
    # Build the list of month/day tuples.
    monthdays = [(now.month, now.day)]
    while now <= then:
        monthdays.append((now.month, now.day))
        now += timedelta(days=1)
    # Transform each into queryset keyword args.
    monthdays = (dict(zip(("birthday__month", "birthday__day"), t))
                 for t in monthdays)
    # Compose the django.db.models.Q objects together for a single query.
    query = reduce(operator.or_, (Q(**d) for d in monthdays))
    # Run the query.
    return Person.objects.filter(query)
But it gets a list of persons that have a birthday in the date range; you would need to change it a bit to get the counts per day.
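A minimal sketch of that change, counting birthdays per day with a Counter so the result matches the [day, count] pairs from the question (this assumes the birthdays_within helper above):

from collections import Counter
from datetime import date, timedelta

people = birthdays_within(30)
counts = Counter((p.birthday.month, p.birthday.day) for p in people)

today = date.today()
result = []
for offset in range(30):
    d = today + timedelta(days=offset)
    result.append([d.day, counts.get((d.month, d.day), 0)])
print(result)  # e.g. [[9, 0], [10, 3], ...]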