Django annotation with nested filter - django

Is it possible to filter within an annotation?
In my mind something like this (which doesn't actually work)
Student.objects.all().annotate(Count('attendance').filter(type="Excused"))
The resultant table would have every student with the number of excused absences. Looking through documentation filters can only be before or after the annotation which would not yield the desired results.
A workaround is this
for student in Student.objects.all():
student.num_excused_absence = Attendance.objects.filter(student=student, type="Excused").count()
This works but does many queries, in a real application this can get impractically long. I think this type of statement is possible in SQL but would prefer to stay with ORM if possible. I even tried making two separate queries (one for all students, another to get the total) and combined them with |. The combination changed the total :(
Some thoughts after reading answers and comments
I solved the attendance problem using extra sql here.
Timmy's blog post was useful. My answer is based off of it.
hash1baby's answer works but seems equally complex as sql. It also requires executing sql then adding the result in a for loop. This is bad for me because I'm stacking lots of these filtering queries together. My solution builds up a big queryset with lots of filters and extra and executes it all at once.
If performance is no issue - I suggest the for loop work around. It's by far the easiest to understand.

As of Django 1.8 you can do this directly in the ORM:
students = Student.objects.all().annotate(num_excused_absences=models.Sum(
models.Case(
models.When(absence__type='Excused', then=1),
default=0,
output_field=models.IntegerField()
)))
Answer adapted from another SO question on the same topic
I haven't tested the sample above but did accomplish something similar in my own app.

You are correct - django does not allow you to filter the related objects being counted, without also applying the filter to the primary objects, and therefore excluding those primary objects with a no related objects after filtering.
But, in a bit of abstraction leakage, you can count groups by using a values query.
So, I collect the absences in a dictionary, and use that in a loop. Something like this:
# a query for students
students = Students.objects.all()
# a query to count the student attendances, grouped by type.
attendance_counts = Attendence(student__in=students).values('student', 'type').annotate(abs=Count('pk'))
# regroup that into a dictionary {student -> { type -> count }}
from itertools import groupby
attendance_s_t = dict((s, (dict(t, c) for (s, t, c) in g)) for s, g in groupby(attendance_counts, lambda (s, t, c): s))
# then use them efficiently:
for student in students:
student.absences = attendance_s_t.get(student.pk, {}).get('Excused', 0)

Maybe this will work for you:
excused = Student.objects.filter(attendance__type='Excused').annotate(abs=Count('attendance'))
You need to filter the Students you're looking for first to just those with excused absences and then annotate the count of them.
Here's a link to the Django Aggregation Docs where it discusses filtering order.

Related

How to get boolean result in annotate django?

I have a filter which should return a queryset with 2 objects, and should have one different field. for example:
obj_1 = (name='John', age='23', is_fielder=True)
obj_2 = (name='John', age='23', is_fielder=False)
Both the objects are of same model, but different primary key. I tried usign the below filter:
qs = Model.objects.filter(name='John', age='23').annotate(is_fielder=F('plays__outdoor_game_role')=='Fielder')
I used annotate first time, but it gave me the below error:
TypeError: QuerySet.annotate() received non-expression(s): False.
I am new to Django, so what am I doing wrong, and what should be the annotate to get the required objects as shown above?
The solution by #ktowen works well, quite straightforward.
Here is another solution I am using, hope it is helpful too.
queryset = queryset.annotate(is_fielder=ExpressionWrapper(
Q(plays__outdoor_game_role='Fielder'),
output_field=BooleanField(),
),)
Here are some explanations for those who are not familiar with Django ORM:
Annotate make a new column/field on the fly, in this case, is_fielder. This means you do not have a field named is_fielder in your model while you can use it like plays.outdor_game_role.is_fielder after you add this 'annotation'. Annotate is extremely useful and flexible, can be combined with almost every other expression, should be a MUST-KNOWN method in Django ORM.
ExpressionWrapper basically gives you space to wrap a more complecated combination of conditions, use in a format like ExpressionWrapper(expression, output_field). It is useful when you are combining different types of fields or want to specify an output type since Django cannot tell automatically.
Q object is a frequently used expression to specify a condition, I think the most powerful part is that it is possible to chain the conditions:
AND (&): filter(Q(condition1) & Q(condition2))
OR (|): filter(Q(condition1) | Q(condition2))
Negative(~): filter(~Q(condition))
It is possible to use Q with normal conditions like below:
(Q(condition1)|id__in=[list])
The point is Q object must come to the first or it will not work.
Case When(then) can be simply explained as if con1 elif con2 elif con3 .... It is quite powerful and personally, I love to use this to customize an ordering object for a queryset.
For example, you need to return a queryset of watch history items, and those must be in an order of watching by the user. You can do it with for loop to keep the order but this will generate plenty of similar queries. A more elegant way with Case When would be:
item_ids = [list]
ordering = Case(*[When(pk=pk, then=pos)
for pos, pk in enumerate(item_ids)])
watch_history = Item.objects.filter(id__in=item_ids)\
.order_by(ordering)
As you can see, by using Case When(then) it is possible to bind those very concrete relations, which could be considered as 1) a pinpoint/precise condition expression and 2) especially useful in a sequential multiple conditions case.
You can use Case/When with annotate
from django.db.models import Case, BooleanField, Value, When
Model.objects.filter(name='John', age='23').annotate(
is_fielder=Case(
When(plays__outdoor_game_role='Fielder', then=Value(True)),
default=Value(False),
output_field=BooleanField(),
),
)

Django 1.11 Annotating a Subquery Aggregate

This is a bleeding-edge feature that I'm currently skewered upon and quickly bleeding out. I want to annotate a subquery-aggregate onto an existing queryset. Doing this before 1.11 either meant custom SQL or hammering the database. Here's the documentation for this, and the example from it:
from django.db.models import OuterRef, Subquery, Sum
comments = Comment.objects.filter(post=OuterRef('pk')).values('post')
total_comments = comments.annotate(total=Sum('length')).values('total')
Post.objects.filter(length__gt=Subquery(total_comments))
They're annotating on the aggregate, which seems weird to me, but whatever.
I'm struggling with this so I'm boiling it right back to the simplest real-world example I have data for. I have Carparks which contain many Spaces. Use Book→Author if that makes you happier but —for now— I just want to annotate on a count of the related model using Subquery*.
spaces = Space.objects.filter(carpark=OuterRef('pk')).values('carpark')
count_spaces = spaces.annotate(c=Count('*')).values('c')
Carpark.objects.annotate(space_count=Subquery(count_spaces))
This gives me a lovely ProgrammingError: more than one row returned by a subquery used as an expression and in my head, this error makes perfect sense. The subquery is returning a list of spaces with the annotated-on total.
The example suggested that some sort of magic would happen and I'd end up with a number I could use. But that's not happening here? How do I annotate on aggregate Subquery data?
Hmm, something's being added to my query's SQL...
I built a new Carpark/Space model and it worked. So the next step is working out what's poisoning my SQL. On Laurent's advice, I took a look at the SQL and tried to make it more like the version they posted in their answer. And this is where I found the real problem:
SELECT "bookings_carpark".*, (SELECT COUNT(U0."id") AS "c"
FROM "bookings_space" U0
WHERE U0."carpark_id" = ("bookings_carpark"."id")
GROUP BY U0."carpark_id", U0."space"
)
AS "space_count" FROM "bookings_carpark";
I've highlighted it but it's that subquery's GROUP BY ... U0."space". It's retuning both for some reason. Investigations continue.
Edit 2: Okay, just looking at the subquery SQL I can see that second group by coming through ☹
In [12]: print(Space.objects_standard.filter().values('carpark').annotate(c=Count('*')).values('c').query)
SELECT COUNT(*) AS "c" FROM "bookings_space" GROUP BY "bookings_space"."carpark_id", "bookings_space"."space" ORDER BY "bookings_space"."carpark_id" ASC, "bookings_space"."space" ASC
Edit 3: Okay! Both these models have sort orders. These are being carried through to the subquery. It's these orders that are bloating out my query and breaking it.
I guess this might be a bug in Django but short of removing the Meta-order_by on both these models, is there any way I can unsort a query at querytime?
*I know I could just annotate a Count for this example. My real purpose for using this is a much more complex filter-count but I can't even get this working.
Shazaam! Per my edits, an additional column was being output from my subquery. This was to facilitate ordering (which just isn't required in a COUNT).
I just needed to remove the prescribed meta-order from the model. You can do this by just adding an empty .order_by() to the subquery. In my code terms that meant:
from django.db.models import Count, OuterRef, Subquery
spaces = Space.objects.filter(carpark=OuterRef('pk')).order_by().values('carpark')
count_spaces = spaces.annotate(c=Count('*')).values('c')
Carpark.objects.annotate(space_count=Subquery(count_spaces))
And that works. Superbly. So annoying.
It's also possible to create a subclass of Subquery, that changes the SQL it outputs. For instance, you can use:
class SQCount(Subquery):
template = "(SELECT count(*) FROM (%(subquery)s) _count)"
output_field = models.IntegerField()
You then use this as you would the original Subquery class:
spaces = Space.objects.filter(carpark=OuterRef('pk')).values('pk')
Carpark.objects.annotate(space_count=SQCount(spaces))
You can use this trick (at least in postgres) with a range of aggregating functions: I often use it to build up an array of values, or sum them.
I just bumped into a VERY similar case, where I had to get seat reservations for events where the reservation status is not cancelled. After trying to figure the problem out for hours, here's what I've seen as the root cause of the problem:
Preface: this is MariaDB, Django 1.11.
When you annotate a query, it gets a GROUP BY clause with the fields you select (basically what's in your values() query selection). After investigating with the MariaDB command line tool why I'm getting NULLs or Nones on the query results, I've came to the conclusion that the GROUP BY clause will cause the COUNT() to return NULLs.
Then, I started diving into the QuerySet interface to see how can I manually, forcibly remove the GROUP BY from the DB queries, and came up with the following code:
from django.db.models.fields import PositiveIntegerField
reserved_seats_qs = SeatReservation.objects.filter(
performance=OuterRef(name='pk'), status__in=TAKEN_TYPES
).values('id').annotate(
count=Count('id')).values('count')
# Query workaround: remove GROUP BY from subquery. Test this
# vigorously!
reserved_seats_qs.query.group_by = []
performances_qs = Performance.objects.annotate(
reserved_seats=Subquery(
queryset=reserved_seats_qs,
output_field=PositiveIntegerField()))
print(performances_qs[0].reserved_seats)
So basically, you have to manually remove/update the group_by field on the subquery's queryset in order for it to not have a GROUP BY appended on it on execution time. Also, you'll have to specify what output field the subquery will have, as it seems that Django fails to recognize it automatically, and raises exceptions on the first evaluation of the queryset. Interestingly, the second evaluation succeeds without it.
I believe this is a Django bug, or an inefficiency in subqueries. I'll create a bug report about it.
Edit: the bug report is here.
Problem
The problem is that Django adds GROUP BY as soon as it sees using an aggregate function.
Solution
So you can just create your own aggregate function but so that Django thinks it is not aggregate. Just like this:
total_comments = Comment.objects.filter(
post=OuterRef('pk')
).order_by().annotate(
total=Func(F('length'), function='SUM')
).values('total')
Post.objects.filter(length__gt=Subquery(total_comments))
This way you get the SQL query like this:
SELECT "testapp_post"."id", "testapp_post"."length"
FROM "testapp_post"
WHERE "testapp_post"."length" > (SELECT SUM(U0."length") AS "total"
FROM "testapp_comment" U0
WHERE U0."post_id" = "testapp_post"."id")
So you can even use aggregate subqueries in aggregate functions.
Example
You can count the number of workdays between two dates, excluding weekends and holidays, and aggregate and summarize them by employee:
class NonWorkDay(models.Model):
date = DateField()
class WorkPeriod(models.Model):
employee = models.ForeignKey(User, on_delete=models.CASCADE)
start_date = DateField()
end_date = DateField()
number_of_non_work_days = NonWorkDay.objects.filter(
date__gte=OuterRef('start_date'),
date__lte=OuterRef('end_date'),
).annotate(
cnt=Func('id', function='COUNT')
).values('cnt')
WorkPeriod.objects.values('employee').order_by().annotate(
number_of_word_days=Sum(F('end_date__year') - F('start_date__year') - number_of_non_work_days)
)
Hope this will help!
A solution which would work for any general aggregation could be implemented using Window classes from Django 2.0. I have added this to the Django tracker ticket as well.
This allows the aggregation of annotated values by calculating the aggregate over partitions based on the outer query model (in the GROUP BY clause), then annotating that data to every row in the subquery queryset. The subquery can then use the aggregated data from the first row returned and ignore the other rows.
Performance.objects.annotate(
reserved_seats=Subquery(
SeatReservation.objects.filter(
performance=OuterRef(name='pk'),
status__in=TAKEN_TYPES,
).annotate(
reserved_seat_count=Window(
expression=Count('pk'),
partition_by=[F('performance')]
),
).values('reserved_seat_count')[:1],
output_field=FloatField()
)
)
If I understand correctly, you are trying to count Spaces available in a Carpark. Subquery seems overkill for this, the good old annotate alone should do the trick:
Carpark.objects.annotate(Count('spaces'))
This will include a spaces__count value in your results.
OK, I have seen your note...
I was also able to run your same query with other models I had at hand. The results are the same, so the query in your example seems to be OK (tested with Django 1.11b1):
activities = Activity.objects.filter(event=OuterRef('pk')).values('event')
count_activities = activities.annotate(c=Count('*')).values('c')
Event.objects.annotate(spaces__count=Subquery(count_activities))
Maybe your "simplest real-world example" is too simple... can you share the models or other information?
"works for me" doesn't help very much. But.
I tried your example on some models I had handy (the Book -> Author type), it works fine for me in django 1.11b1.
Are you sure you're running this in the right version of Django? Is this the actual code you're running? Are you actually testing this not on carpark but some more complex model?
Maybe try to print(thequery.query) to see what SQL it's trying to run in the database. Below is what I got with my models (edited to fit your question):
SELECT (SELECT COUNT(U0."id") AS "c"
FROM "carparks_spaces" U0
WHERE U0."carpark_id" = ("carparks_carpark"."id")
GROUP BY U0."carpark_id") AS "space_count" FROM "carparks_carpark"
Not really an answer, but hopefully it helps.

Django QuerySet update performance

Which one would be better for performance?
We take a slice of products. which make us impossible to bulk update.
products = Product.objects.filter(featured=True).order_by("-modified_on")[3:]
for product in products:
product.featured = False
product.save()
or (invalid)
for product in products.iterator():
product.update(featured=False)
I have tried QuerySet's in statement too as following.
Product.objects.filter(pk__in=products).update(featured=False)
This line works fine on SQLite. But, it rises following exception on MySQL. So, I couldn't use that.
DatabaseError: (1235, "This version of MySQL doesn't yet support
'LIMIT & IN/ALL/ANY/SOME subquery'")
Edit: Also iterator() method causes re-evaluate the query. So, it is bad for performance.
As #Chris Pratt pointed out in comments, the second example is invalid because the objects don't have update methods. Your first example will require queries equal to results+1 since it has to update each object. That might really be costly if you have 1000 products. Ideally you do want to reduce this to a more fixed expense if possible.
This is a similar situation to another question:
Django: Cannot update a query once a slice has been taken
That being said, you would have to do it in at least 2 queries, but you have to be a bit sneaky on how to construct the LIMIT...
Using Q objects for complex queries:
# get the IDs we want to exclude
products = Product.objects.filter(featured=True).order_by("-modified_on")[:3]
# flatten them into just a list of ids
ids = products.values_list('id', flat=True)
# Now use the Q object to construct a complex query
from django.db.models import Q
# This builds a list of "AND id NOT EQUAL TO i"
limits = [~Q(id=i) for i in ids]
Product.objects.filter(featured=True, *limits).update(featured=False)
In some cases it's acceptable to cache QuerySet in array
products = list(products)
Product.objects.filter(pk__in=products).update(featured=False)
Small optimization with values_list
products_id = list(products.values_list('id', flat=True)
Product.objects.filter(pk__in=products_id).update(featured=False)

Select DISTINCT individual columns in django?

I'm curious if there's any way to do a query in Django that's not a "SELECT * FROM..." underneath. I'm trying to do a "SELECT DISTINCT columnName FROM ..." instead.
Specifically I have a model that looks like:
class ProductOrder(models.Model):
Product = models.CharField(max_length=20, promary_key=True)
Category = models.CharField(max_length=30)
Rank = models.IntegerField()
where the Rank is a rank within a Category. I'd like to be able to iterate over all the Categories doing some operation on each rank within that category.
I'd like to first get a list of all the categories in the system and then query for all products in that category and repeat until every category is processed.
I'd rather avoid raw SQL, but if I have to go there, that'd be fine. Though I've never coded raw SQL in Django/Python before.
One way to get the list of distinct column names from the database is to use distinct() in conjunction with values().
In your case you can do the following to get the names of distinct categories:
q = ProductOrder.objects.values('Category').distinct()
print q.query # See for yourself.
# The query would look something like
# SELECT DISTINCT "app_productorder"."category" FROM "app_productorder"
There are a couple of things to remember here. First, this will return a ValuesQuerySet which behaves differently from a QuerySet. When you access say, the first element of q (above) you'll get a dictionary, NOT an instance of ProductOrder.
Second, it would be a good idea to read the warning note in the docs about using distinct(). The above example will work but all combinations of distinct() and values() may not.
PS: it is a good idea to use lower case names for fields in a model. In your case this would mean rewriting your model as shown below:
class ProductOrder(models.Model):
product = models.CharField(max_length=20, primary_key=True)
category = models.CharField(max_length=30)
rank = models.IntegerField()
It's quite simple actually if you're using PostgreSQL, just use distinct(columns) (documentation).
Productorder.objects.all().distinct('category')
Note that this feature has been included in Django since 1.4
User order by with that field, and then do distinct.
ProductOrder.objects.order_by('category').values_list('category', flat=True).distinct()
The other answers are fine, but this is a little cleaner, in that it only gives the values like you would get from a DISTINCT query, without any cruft from Django.
>>> set(ProductOrder.objects.values_list('category', flat=True))
{u'category1', u'category2', u'category3', u'category4'}
or
>>> list(set(ProductOrder.objects.values_list('category', flat=True)))
[u'category1', u'category2', u'category3', u'category4']
And, it works without PostgreSQL.
This is less efficient than using a .distinct(), presuming that DISTINCT in your database is faster than a python set, but it's great for noodling around the shell.
Update:
This is answer is great for making queries in the Django shell during development. DO NOT use this solution in production unless you are absolutely certain that you will always have a trivially small number of results before set is applied. Otherwise, it's a terrible idea from a performance standpoint.

django's .extra(where= clauses are clobbered by table-renaming .filter(foo__in=... subselects

The short of it is, the table names of all queries that are inside a filter get renamed to u0, u1, ..., so my extra where clauses won't know what table to point to. I would love to not have to hand-make all the queries for every way I might subselect on this data, and my current workaround is to turn my extra'd queries into pk values_lists, but those are really slow and something of an abomination.
Here's what this all looks like. You can mostly ignore the details of what goes in the extra of this manager method, except the first sql line which points to products_product.id:
def by_status(self, *statii):
return self.extra(where=["""products_product.id IN
(SELECT recent.product_id
FROM (
SELECT product_id, MAX(start_date) AS latest
FROM products_productstatus
GROUP BY product_id
) AS recent
JOIN products_productstatus AS ps ON ps.product_id = recent.product_id
WHERE ps.start_date = recent.latest
AND ps.status IN (%s))""" % (', '.join([str(stat) for stat in statii]),)])
Which works wonderfully for all the situations involving only the products_product table.
When I want these products as a subselect, i do:
Piece.objects.filter(
product__in=Product.objects.filter(
pk__in=list(
Product.objects.by_status(FEATURED).values_list('id', flat=True))))
How can I keep the generalized abilities of a query set, yet still use an extra where clause?
At first: the issue is not totally clear to me. Is the second code block in your question the actual code you want to execute? If this is the case the query should work as expected since there is no subselect performed.
I assume so that you want to use the second code block without the list() around the subselect to prevent a second query being performed.
The django documentation refers to this issue in the documentation about the extra method. However its not very easy to overcome this issue.
The easiest but most "hakish" solution is to observe which table alias is produced by django for the table you want to query in the extra method. You can rely on the persistent naming of this alias as long as you construct the query always in the same fashion (you don't change the order of multiple extra methods or filter calls that cause a join).
You can inspect a query that will be execute in the DB queryset by using:
print Model.objects.filter(...).query
This will reveal the aliases that are used for the tables you want to query.
As of Django 1.11, you should be able to use Subquery and OuterRef to generate an equivalent query to your extra (using a correlated subquery rather than a join):
def by_status(self, *statii):
return self.filter(
id__in=Subquery(ProductStatus.values("product_id").filter(
status__in=statii,
product__in=Subquery(ProductStatus.objects.values(
"product_id",
).annotate(
latest=Max("start_date"),
).filter(
latest=OuterRef("start_date"),
).values("product_id"),
),
)
You could probably do it with Window expressions as well (as of Django 2.0).
Note that this is untested, so may need some tweaks.