My proj has a model that goes like:
class Data(Model):
data = FloatField(verbose_name='Data', null=True, blank=True)
created_at = DateTimeField(verbose_name='Created at')
And my app creates a few hundred logs of this model per day.
I'm trying to sum only the greatest values of each day, without having to iterate over them (using only Django queries).
Is it possible without writing SQL queries?
PS: I'm able to get the greatest 'data' of each day, so the current logic iterates over days and sums the greatest values of each day. But that solution is becoming too slow and I'd like to solve it directly into db level.
Annotations and aggregates to the rescue:
from django.db.models import Sum, Max
from django.db.models.functions import Trunc
report = (Data.objects
.annotate(day=Trunc('created_at', 'day'))
.values('day')
.annotate(greatest=Max('data'))
.values('greatest')
.aggregate(total=Sum('greatest'))
)
print(report['total'])
The resulting SQL is almost simpler than the code:
SELECT SUM("greatest")
FROM
(SELECT MAX("app_data"."data_id") AS "greatest"
FROM "app_data"
GROUP BY DATE_TRUNC('day', "app_data"."created_at")) subquery
If you are using a database backed that supports distinct on fields (like postgres does) you can do.
Data.objects.order_by('created_at__date', '-data').distinct('created_at__date')
Related
Context
There is a dataframe of customer invoices and their due dates.(Identified by customer code)
Week(s) need to be added depending on customer code
Model is created to persist the list of customers and week(s) to be added
What is done so far:
Models.py
class BpShift(models.Model):
bp_name = models.CharField(max_length=50, default='')
bp_code = models.CharField(max_length=15, primary_key=True, default='')
weeks = models.IntegerField(default=0)
helper.py
from .models import BpShift
# used in views later
def week_shift(self, df):
df['DueDateRange'] = df['DueDate'] + datetime.timedelta(
weeks=BpShift.objects.get(pk=df['BpCode']).weeks)
I realised my understanding of Dataframes is seriously flawed.
df['A'] and df['B'] would return Series. Of course, timedelta wouldn't work like this(weeks=BpShift.objects.get(pk=df['BpCode']).weeks).
Dataframe
d = {'BpCode':['customer1','customer2'],'DueDate':['2020-05-30','2020-04-30']}
df = pd.DataFrame(data=d)
Customer List csv
BP Name,BP Code,Week(s)
Customer1,CA0023MY,1
Customer2,CA0064SG,1
Error
BpShift matching query does not exist.
Commentary
I used these methods in hope that I would be able to change the dataframe at once, instead of
using df.iterrows(). I have recently been avoiding for loops like a plague and wondering if this
is the "correct" mentality. Is there any recommended way of doing this? Thanks in advance for any guidance!
This question Python & Pandas: series to timedelta will help to take you from Series to timedelta. And although
pandas.Series(
BpShift.objects.filter(
pk__in=df['BpCode'].tolist()
).values_list('weeks', flat=True)
)
will give you a Series of integers, I doubt the order is the same as in df['BpCode']. Because it depends on the django Model and database backend.
So you might be better off to explicitly create not a Series, but a DataFrame with pk and weeks columns so you can use df.join. Something like this
pandas.DataFrame(
BpShift.objects.filter(
pk__in=df['BpCode'].tolist()
).values_list('pk', 'weeks'),
columns=['BpCode', 'weeks'],
)
should give you a DataFrame that you can join with.
So combined this should be the gist of your code:
django_response = [('customer1', 1), ('customer2', '2')]
d = {'BpCode':['customer1','customer2'],'DueDate':['2020-05-30','2020-04-30']}
df = pd.DataFrame(data=d).set_index('BpCode').join(
pd.DataFrame(django_response, columns=['BpCode', 'weeks']).set_index('BpCode')
)
df['DueDate'] = pd.to_datetime(df['DueDate'])
df['weeks'] = pd.to_numeric(df['weeks'])
df['new_duedate'] = df['DueDate'] + df['weeks'] * pd.Timedelta('1W')
print(df)
DueDate weeks new_duedate
BpCode
customer1 2020-05-30 1 2020-06-06
customer2 2020-04-30 2 2020-05-14
You were right to want to avoid looping. This approach gets all the data in one SQL query from your Django model, by using filter. Then does a left join with the DataFrame you already have. Casts the dates and weeks to the right types and then computes a new due date using the whole columns instead of loops over them.
NB the left join will give NaN and NaT for customers that don't exist in your Django database. You can either avoid those rows by passing how='inner' to df.join or handle them whatever way you like.
This is a bleeding-edge feature that I'm currently skewered upon and quickly bleeding out. I want to annotate a subquery-aggregate onto an existing queryset. Doing this before 1.11 either meant custom SQL or hammering the database. Here's the documentation for this, and the example from it:
from django.db.models import OuterRef, Subquery, Sum
comments = Comment.objects.filter(post=OuterRef('pk')).values('post')
total_comments = comments.annotate(total=Sum('length')).values('total')
Post.objects.filter(length__gt=Subquery(total_comments))
They're annotating on the aggregate, which seems weird to me, but whatever.
I'm struggling with this so I'm boiling it right back to the simplest real-world example I have data for. I have Carparks which contain many Spaces. Use Book→Author if that makes you happier but —for now— I just want to annotate on a count of the related model using Subquery*.
spaces = Space.objects.filter(carpark=OuterRef('pk')).values('carpark')
count_spaces = spaces.annotate(c=Count('*')).values('c')
Carpark.objects.annotate(space_count=Subquery(count_spaces))
This gives me a lovely ProgrammingError: more than one row returned by a subquery used as an expression and in my head, this error makes perfect sense. The subquery is returning a list of spaces with the annotated-on total.
The example suggested that some sort of magic would happen and I'd end up with a number I could use. But that's not happening here? How do I annotate on aggregate Subquery data?
Hmm, something's being added to my query's SQL...
I built a new Carpark/Space model and it worked. So the next step is working out what's poisoning my SQL. On Laurent's advice, I took a look at the SQL and tried to make it more like the version they posted in their answer. And this is where I found the real problem:
SELECT "bookings_carpark".*, (SELECT COUNT(U0."id") AS "c"
FROM "bookings_space" U0
WHERE U0."carpark_id" = ("bookings_carpark"."id")
GROUP BY U0."carpark_id", U0."space"
)
AS "space_count" FROM "bookings_carpark";
I've highlighted it but it's that subquery's GROUP BY ... U0."space". It's retuning both for some reason. Investigations continue.
Edit 2: Okay, just looking at the subquery SQL I can see that second group by coming through ☹
In [12]: print(Space.objects_standard.filter().values('carpark').annotate(c=Count('*')).values('c').query)
SELECT COUNT(*) AS "c" FROM "bookings_space" GROUP BY "bookings_space"."carpark_id", "bookings_space"."space" ORDER BY "bookings_space"."carpark_id" ASC, "bookings_space"."space" ASC
Edit 3: Okay! Both these models have sort orders. These are being carried through to the subquery. It's these orders that are bloating out my query and breaking it.
I guess this might be a bug in Django but short of removing the Meta-order_by on both these models, is there any way I can unsort a query at querytime?
*I know I could just annotate a Count for this example. My real purpose for using this is a much more complex filter-count but I can't even get this working.
Shazaam! Per my edits, an additional column was being output from my subquery. This was to facilitate ordering (which just isn't required in a COUNT).
I just needed to remove the prescribed meta-order from the model. You can do this by just adding an empty .order_by() to the subquery. In my code terms that meant:
from django.db.models import Count, OuterRef, Subquery
spaces = Space.objects.filter(carpark=OuterRef('pk')).order_by().values('carpark')
count_spaces = spaces.annotate(c=Count('*')).values('c')
Carpark.objects.annotate(space_count=Subquery(count_spaces))
And that works. Superbly. So annoying.
It's also possible to create a subclass of Subquery, that changes the SQL it outputs. For instance, you can use:
class SQCount(Subquery):
template = "(SELECT count(*) FROM (%(subquery)s) _count)"
output_field = models.IntegerField()
You then use this as you would the original Subquery class:
spaces = Space.objects.filter(carpark=OuterRef('pk')).values('pk')
Carpark.objects.annotate(space_count=SQCount(spaces))
You can use this trick (at least in postgres) with a range of aggregating functions: I often use it to build up an array of values, or sum them.
I just bumped into a VERY similar case, where I had to get seat reservations for events where the reservation status is not cancelled. After trying to figure the problem out for hours, here's what I've seen as the root cause of the problem:
Preface: this is MariaDB, Django 1.11.
When you annotate a query, it gets a GROUP BY clause with the fields you select (basically what's in your values() query selection). After investigating with the MariaDB command line tool why I'm getting NULLs or Nones on the query results, I've came to the conclusion that the GROUP BY clause will cause the COUNT() to return NULLs.
Then, I started diving into the QuerySet interface to see how can I manually, forcibly remove the GROUP BY from the DB queries, and came up with the following code:
from django.db.models.fields import PositiveIntegerField
reserved_seats_qs = SeatReservation.objects.filter(
performance=OuterRef(name='pk'), status__in=TAKEN_TYPES
).values('id').annotate(
count=Count('id')).values('count')
# Query workaround: remove GROUP BY from subquery. Test this
# vigorously!
reserved_seats_qs.query.group_by = []
performances_qs = Performance.objects.annotate(
reserved_seats=Subquery(
queryset=reserved_seats_qs,
output_field=PositiveIntegerField()))
print(performances_qs[0].reserved_seats)
So basically, you have to manually remove/update the group_by field on the subquery's queryset in order for it to not have a GROUP BY appended on it on execution time. Also, you'll have to specify what output field the subquery will have, as it seems that Django fails to recognize it automatically, and raises exceptions on the first evaluation of the queryset. Interestingly, the second evaluation succeeds without it.
I believe this is a Django bug, or an inefficiency in subqueries. I'll create a bug report about it.
Edit: the bug report is here.
Problem
The problem is that Django adds GROUP BY as soon as it sees using an aggregate function.
Solution
So you can just create your own aggregate function but so that Django thinks it is not aggregate. Just like this:
total_comments = Comment.objects.filter(
post=OuterRef('pk')
).order_by().annotate(
total=Func(F('length'), function='SUM')
).values('total')
Post.objects.filter(length__gt=Subquery(total_comments))
This way you get the SQL query like this:
SELECT "testapp_post"."id", "testapp_post"."length"
FROM "testapp_post"
WHERE "testapp_post"."length" > (SELECT SUM(U0."length") AS "total"
FROM "testapp_comment" U0
WHERE U0."post_id" = "testapp_post"."id")
So you can even use aggregate subqueries in aggregate functions.
Example
You can count the number of workdays between two dates, excluding weekends and holidays, and aggregate and summarize them by employee:
class NonWorkDay(models.Model):
date = DateField()
class WorkPeriod(models.Model):
employee = models.ForeignKey(User, on_delete=models.CASCADE)
start_date = DateField()
end_date = DateField()
number_of_non_work_days = NonWorkDay.objects.filter(
date__gte=OuterRef('start_date'),
date__lte=OuterRef('end_date'),
).annotate(
cnt=Func('id', function='COUNT')
).values('cnt')
WorkPeriod.objects.values('employee').order_by().annotate(
number_of_word_days=Sum(F('end_date__year') - F('start_date__year') - number_of_non_work_days)
)
Hope this will help!
A solution which would work for any general aggregation could be implemented using Window classes from Django 2.0. I have added this to the Django tracker ticket as well.
This allows the aggregation of annotated values by calculating the aggregate over partitions based on the outer query model (in the GROUP BY clause), then annotating that data to every row in the subquery queryset. The subquery can then use the aggregated data from the first row returned and ignore the other rows.
Performance.objects.annotate(
reserved_seats=Subquery(
SeatReservation.objects.filter(
performance=OuterRef(name='pk'),
status__in=TAKEN_TYPES,
).annotate(
reserved_seat_count=Window(
expression=Count('pk'),
partition_by=[F('performance')]
),
).values('reserved_seat_count')[:1],
output_field=FloatField()
)
)
If I understand correctly, you are trying to count Spaces available in a Carpark. Subquery seems overkill for this, the good old annotate alone should do the trick:
Carpark.objects.annotate(Count('spaces'))
This will include a spaces__count value in your results.
OK, I have seen your note...
I was also able to run your same query with other models I had at hand. The results are the same, so the query in your example seems to be OK (tested with Django 1.11b1):
activities = Activity.objects.filter(event=OuterRef('pk')).values('event')
count_activities = activities.annotate(c=Count('*')).values('c')
Event.objects.annotate(spaces__count=Subquery(count_activities))
Maybe your "simplest real-world example" is too simple... can you share the models or other information?
"works for me" doesn't help very much. But.
I tried your example on some models I had handy (the Book -> Author type), it works fine for me in django 1.11b1.
Are you sure you're running this in the right version of Django? Is this the actual code you're running? Are you actually testing this not on carpark but some more complex model?
Maybe try to print(thequery.query) to see what SQL it's trying to run in the database. Below is what I got with my models (edited to fit your question):
SELECT (SELECT COUNT(U0."id") AS "c"
FROM "carparks_spaces" U0
WHERE U0."carpark_id" = ("carparks_carpark"."id")
GROUP BY U0."carpark_id") AS "space_count" FROM "carparks_carpark"
Not really an answer, but hopefully it helps.
I found some solutions here and in the django documentation, but I could not manage to make one query work the way I wanted.
I have the following model:
class Inventory(models.Model):
blindid = models.CharField(max_length=20)
massug = models.IntegerField()
I want to count the number of Blind_ID and then sum the massug after they were grouped.
My currently Django ORM
samples = Inventory.objects.values('blindid', 'massug').annotate(aliquots=Count('blindid'), total=Sum('massug'))
It's not counting correctly (it shows only one), thus it 's not summing correctly. It seems it is only getting the first result... I tried to use Count('blindid', distinct=True) and Count('blindid', distinct=False) as well.
This is the query result using samples.query. Django is grouping by the two columns...
SELECT "inventory"."blindid", "inventory"."massug", COUNT("inventory"."blindid") AS "aliquots", SUM("inventory"."massug") AS "total" FROM "inventory" GROUP BY "inventory"."blindid", "inventory"."massug"
This should be the raw sql
SELECT blindid,
Count(blindid) AS aliquots,
Sum(massug) AS total
FROM inventory
GROUP BY blindid
Try this:
samples = Inventory.objects.values('blindid').annotate(aliquots=Count('blindid'), total=Sum('massug'))
I have a little problem with getting latest foreign key value in my django app. Here are my two models:
class Stock(models.Model):
...
class Dividend(models.Model):
date = models.DateField('pay date')
stock = models.ForeignKey(Stock, related_name="dividends")
class Meta:
ordering = ["date"]
I would like to get latest dividend from stock object. So basically this - stock.dividends.latest('date'). However, everytime I call stock.dividends.latest('date'), it fires up sql query to get latest dividend. I have latest() method in for cycle for every stock I have. I would like to avoid these sql queries. May I somehow define new method in class Stock that would get latest dividend within sql query for stock object?
I cannot change default ordering from "date" to "-date".
Using select_related('dividends') loads dividends objects with stock, but latest probably uses order_by and it requires sql query anyway. :(
EDIT1: To make more clear what I want, here is an example. Let's say I have 100 symbols in shares.keys():
for stock in Stock.objects.filter(symbol__in=shares.keys()): # 1 sql query
latest_dividend = stock.dividends.latest('date') # 100 sql queries
... #do something with latest dividend
Well and in some cases I might have 500 symbols in shares.keys(). That is why I need to avoid making sql queries on getting latest dividend for stock.
I have the same problem with you, so I tested many Django queries. Finally, I found out that we can use this:
Stock.objects.all().annotate(latest_date=Max('dividends__date')).filter(dividends__date=F('latest_date')).values('dividends')
I'm not sure my solution is the best, but here it is (works only with PostgreSQL):
stocks = list(Stock.objects.filter(**something))
dividends = Dividend.objects.filter(
stock__in=stocks,
).order_by(
'stock_id',
'-date'
).distinct(
'stock_id',
)
dividends_dict = {d.stock_id: d for d in dividends}
for stock in stocks:
stock.latest_dividend = dividends_dict.get(stock.id)
I'm a little confused by your question, I'm assuming you are trying to access the dividends from your stock object in order to limit your queries to the database. I believe that is the least number queries of possible.
stock_options = stock.objects.get(pk=your_query)
order_options = stock.dividend_set.order_by('-date')[:5]
likeon: Thanks for your answer. But I think I can avoid initializing that large dictionary (I have 5000 stocks and 280 000 dividends). But your list gave me an idea. Your code requires 2 sql queries. Here is my example (EDIT1).
for stock in Stock.objects.filter(symbol__in=shares.keys())\
.prefetch_related('dividends'): # 2 sql queries
latest_dividend = list(stock.dividends.all())[-1] # 0 sql queries
... #do something with latest_dividend
My code also requires 2 sql queries, but I do not have to reorder it and create list from stocks and all 280 000 dividends (I only create dict from current stock dividends every cycle). May be creating one dict is quicker than creating len(shares.keys()) dicts, not sure.
I thought there would be easier solution (avoid creating list/dictionary from dividends), but this is good enough for now. Thanks for answers!
As long as I understood you can do it this way:
stock.dividends.last()
as implementation in Django is like this:
def first(self):
"""Return the first object of a query or None if no match is found."""
for obj in (self if self.ordered else self.order_by('pk'))[:1]:
return obj
Also, you can use .latest(*fields, field_name=None) too.
Hi I am writing a Django view which ouputs data for graphing on the client side (High Charts). The data is climate data with a given parameter recorded once per day.
My query is this:
format = '%Y-%m-%d'
sd = datetime.datetime.strptime(startdate, format)
ed = datetime.datetime.strptime(enddate, format)
data = Climate.objects.filter(recorded_on__range = (sd, ed)).order_by('recorded_on')
Now, as the range is increased the dataset obviously gets larger and this does not present well on the graph (aside from slowing things down considerably).
Is there an way to group my data as averages in time periods - specifically average for each month or average for each year?
I realize this could be done in SQL as mentioned here: django aggregation to lower resolution using grouping by a date range
But I would like to know if there is a handy way in Django itself.
Or is it perhaps better to modify the db directly and use a script to populate month and year fields from the timestamp?
Any help much appreciated.
Have you tried using django-qsstats-magic (https://github.com/kmike/django-qsstats-magic)?
It makes things very easy for charting, here is a timeseries example from their docs:
from django.contrib.auth.models import User
import datetime, qsstats
qs = User.objects.all()
qss = qsstats.QuerySetStats(qs, 'date_joined')
today = datetime.date.today()
seven_days_ago = today - datetime.timedelta(days=7)
time_series = qss.time_series(seven_days_ago, today)
print 'New users in the last 7 days: %s' % [t[1] for t in time_series]