multiple annotate Sum terms yields inflated answer - django

In the following setup, I'd like a QuerySet with a list of projects, each annotated with the sum of all its task durations (as tasks_duration) and the sum of all of its tasks' subtask durations (as subtasks_duration). My models (simplified) look like this:
class Project(models.Model):
    pass
class Task(models.Model):
    project = models.ForeignKey(Project)
    duration = models.IntegerField(blank=True, null=True)
class SubTask(models.Model):
    task = models.ForeignKey(Task)
    duration = models.IntegerField(blank=True, null=True)
I make my QuerySet like this:
Project.objects.annotate(tasks_duration=Sum('task__duration'), subtasks_duration=Sum('task__subtask__duration'))
Related to the behaviour explained in "Django annotate() multiple times causes wrong answers", I get a tasks_duration that is much higher than it should be. The multiple annotate(Sum()) clauses yield multiple LEFT OUTER JOINs in the resultant SQL. With only a single annotate(Sum()) term for tasks_duration, the result is correct. However, I'd like to have both tasks_duration and subtasks_duration.
What would be a suitable way to do this query? I have a working solution that does it per-project, but that's expectedly unusably slow. I also have something similar working with an extra() call, but I'd really like to know if what I want is possible with pure Django.
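The inflation described above can be reproduced outside Django. The following is only a sketch using Python's built-in sqlite3 module; the table and column names are hypothetical stand-ins for what the ORM would generate, and the data is made up. Each task row is repeated once per subtask by the second join, so its duration is summed several times.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project (id INTEGER PRIMARY KEY);
    CREATE TABLE task (id INTEGER PRIMARY KEY, project_id INTEGER, duration INTEGER);
    CREATE TABLE subtask (id INTEGER PRIMARY KEY, task_id INTEGER, duration INTEGER);
    INSERT INTO project VALUES (1);
    INSERT INTO task VALUES (1, 1, 10), (2, 1, 20);
    -- task 1 has three subtasks, task 2 has one
    INSERT INTO subtask VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3), (4, 2, 4);
""")

# Both sums over one doubly joined result set: task 1 appears three times.
joined = conn.execute("""
    SELECT SUM(t.duration), SUM(s.duration)
    FROM project p
    LEFT OUTER JOIN task t ON t.project_id = p.id
    LEFT OUTER JOIN subtask s ON s.task_id = t.id
    GROUP BY p.id
""").fetchone()

# Only the task join: every task appears exactly once.
tasks_only = conn.execute("""
    SELECT SUM(t.duration)
    FROM project p
    LEFT OUTER JOIN task t ON t.project_id = p.id
    GROUP BY p.id
""").fetchone()

print(joined)      # task sum is inflated, subtask sum is not
print(tasks_only)  # the true task sum
```

With this data the true task total is 30, while the doubly joined query reports 50 for it.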

The bug is reported here but it's not solved yet even in Django 1.11. The issue is related to joining two tables in reverse relations.
Notice that the distinct parameter works well for Count but not for Sum. So you can use a trick and write an ORM query like the one below:
from django.db.models import Count, F, Sum
Project.objects.annotate(
    temp_tasks_duration=Sum('task__duration'),
    temp_subtasks_duration=Sum('task__subtask__duration'),
    tasks_count=Count('task'),
    tasks_count_distinct=Count('task', distinct=True),
    subtasks_count=Count('task__subtask'),
    subtasks_count_distinct=Count('task__subtask', distinct=True),
).annotate(
    tasks_duration=F('temp_tasks_duration')*F('tasks_count_distinct')/F('tasks_count'),
    subtasks_duration=F('temp_subtasks_duration')*F('subtasks_count_distinct')/F('subtasks_count'),
)
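A quick way to sanity-check the scaling used in this trick is to replay the arithmetic in plain Python. This is just a sketch with made-up numbers, and corrected_sum is a hypothetical helper, not part of Django. It suggests the correction recovers the true sum when every task is repeated the same number of times by the join, but not when the tasks have different numbers of subtasks.

```python
def corrected_sum(rows):
    """rows holds one (task_id, task_duration) pair per joined task/subtask row."""
    inflated = sum(duration for _, duration in rows)        # like Sum('task__duration')
    count = len(rows)                                       # like Count('task')
    count_distinct = len({task_id for task_id, _ in rows})  # like Count('task', distinct=True)
    return inflated * count_distinct / count

# Tasks of 10 and 20 (true sum 30), each with two subtasks.
even = [(1, 10), (1, 10), (2, 20), (2, 20)]
# Same tasks, but task 1 has two subtasks and task 2 has only one.
uneven = [(1, 10), (1, 10), (2, 20)]

print(corrected_sum(even))
print(corrected_sum(uneven))
```

The first call gives the true sum of 30; the second does not, so the trick is best treated as an approximation unless the data is known to be evenly shaped.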
Update:
I found that you need to use Subquery. In the following solution, you first filter the tasks by the OuterRef (OuterRef refers to the outer query, so the tasks are filtered for each Project). You then group the tasks by 'project', so that Sum applies to all the tasks of each project and returns exactly one row if the project has any task (you filtered by 'project' and then grouped by that same field, which is why there can be only one group), or no row otherwise. Because the result is None when the project has no tasks, we cannot use [0] to select the calculated sum; the queryset is sliced with [:1] instead.
from django.db.models import OuterRef, Subquery, Sum
Project.objects.annotate(
    tasks_duration=Subquery(
        Task.objects.filter(
            project=OuterRef('pk')
        ).values(
            'project'
        ).annotate(
            # The queryset is already on Task, so sum its own duration field.
            the_sum=Sum('duration'),
        ).values('the_sum')[:1]
    ),
    subtasks_duration=Sum('task__subtask__duration')
)
Running this code will send just one query to the database, so the performance is great.
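To see why moving one of the sums into a correlated subquery avoids the inflation, here is a sketch of roughly the shape of SQL this approach aims for, run through Python's built-in sqlite3 module. The table names and data are hypothetical, and the SQL Django actually emits will differ in aliases and details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project (id INTEGER PRIMARY KEY);
    CREATE TABLE task (id INTEGER PRIMARY KEY, project_id INTEGER, duration INTEGER);
    CREATE TABLE subtask (id INTEGER PRIMARY KEY, task_id INTEGER, duration INTEGER);
    INSERT INTO project VALUES (1), (2);  -- project 2 has no tasks
    INSERT INTO task VALUES (1, 1, 10), (2, 1, 20);
    INSERT INTO subtask VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3), (4, 2, 4);
""")

rows = conn.execute("""
    SELECT p.id,
           -- correlated subquery: unaffected by the joins of the outer query
           (SELECT SUM(t.duration)
            FROM task t
            WHERE t.project_id = p.id
            GROUP BY t.project_id) AS tasks_duration,
           SUM(s.duration) AS subtasks_duration
    FROM project p
    LEFT OUTER JOIN task t2 ON t2.project_id = p.id
    LEFT OUTER JOIN subtask s ON s.task_id = t2.id
    GROUP BY p.id
    ORDER BY p.id
""").fetchall()

for row in rows:
    print(row)
```

The project without tasks comes back with NULL (None in Python) for both sums, which matches the remark about not being able to index the subquery with [0].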

I get this error as well. Exact same code. It works if I do the aggregation separately, but once I try to get both sums at the same time, one of them gets a factor 2 higher, and the other a factor 3.
I have no idea why Django behaves this way. I have filed a bug report here:
https://code.djangoproject.com/ticket/19011
You might be interested in following it as well.

Related

Django 1.11 Annotating a Subquery Aggregate

This is a bleeding-edge feature that I'm currently skewered upon and quickly bleeding out. I want to annotate a subquery-aggregate onto an existing queryset. Doing this before 1.11 either meant custom SQL or hammering the database. Here's the documentation for this, and the example from it:
from django.db.models import OuterRef, Subquery, Sum
comments = Comment.objects.filter(post=OuterRef('pk')).values('post')
total_comments = comments.annotate(total=Sum('length')).values('total')
Post.objects.filter(length__gt=Subquery(total_comments))
They're annotating on the aggregate, which seems weird to me, but whatever.
I'm struggling with this so I'm boiling it right back to the simplest real-world example I have data for. I have Carparks which contain many Spaces. Use Book→Author if that makes you happier but —for now— I just want to annotate on a count of the related model using Subquery*.
spaces = Space.objects.filter(carpark=OuterRef('pk')).values('carpark')
count_spaces = spaces.annotate(c=Count('*')).values('c')
Carpark.objects.annotate(space_count=Subquery(count_spaces))
This gives me a lovely ProgrammingError: more than one row returned by a subquery used as an expression and in my head, this error makes perfect sense. The subquery is returning a list of spaces with the annotated-on total.
The example suggested that some sort of magic would happen and I'd end up with a number I could use. But that's not happening here? How do I annotate on aggregate Subquery data?
Hmm, something's being added to my query's SQL...
I built a new Carpark/Space model and it worked. So the next step is working out what's poisoning my SQL. On Laurent's advice, I took a look at the SQL and tried to make it more like the version they posted in their answer. And this is where I found the real problem:
SELECT "bookings_carpark".*, (SELECT COUNT(U0."id") AS "c"
FROM "bookings_space" U0
WHERE U0."carpark_id" = ("bookings_carpark"."id")
GROUP BY U0."carpark_id", U0."space"
)
AS "space_count" FROM "bookings_carpark";
The culprit is that subquery's GROUP BY ... U0."space". It's returning both columns for some reason. Investigations continue.
Edit 2: Okay, just looking at the subquery SQL I can see that second group by coming through ☹
In [12]: print(Space.objects_standard.filter().values('carpark').annotate(c=Count('*')).values('c').query)
SELECT COUNT(*) AS "c" FROM "bookings_space" GROUP BY "bookings_space"."carpark_id", "bookings_space"."space" ORDER BY "bookings_space"."carpark_id" ASC, "bookings_space"."space" ASC
Edit 3: Okay! Both these models have sort orders. These are being carried through to the subquery. It's these orders that are bloating out my query and breaking it.
I guess this might be a bug in Django but, short of removing the Meta ordering on both these models, is there any way I can unsort a query at query time?
*I know I could just annotate a Count for this example. My real purpose for using this is a much more complex filter-count but I can't even get this working.
Shazaam! Per my edits, an additional column was being output from my subquery. This was to facilitate ordering (which just isn't required in a COUNT).
I just needed to remove the prescribed meta-order from the model. You can do this by just adding an empty .order_by() to the subquery. In my code terms that meant:
from django.db.models import Count, OuterRef, Subquery
spaces = Space.objects.filter(carpark=OuterRef('pk')).order_by().values('carpark')
count_spaces = spaces.annotate(c=Count('*')).values('c')
Carpark.objects.annotate(space_count=Subquery(count_spaces))
And that works. Superbly. So annoying.
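The effect of the extra ordering column is easy to reproduce without Django. This is a sketch with Python's built-in sqlite3 module; the table mirrors the bookings_space name from the SQL above, but the data is invented. Grouping by the extra column splits the count into one row per space, which is what a scalar subquery cannot accept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookings_space (id INTEGER PRIMARY KEY, carpark_id INTEGER, space TEXT);
    INSERT INTO bookings_space VALUES (1, 1, 'A1'), (2, 1, 'A2'), (3, 1, 'A3');
""")

# What the subquery looked like with the model ordering carried through.
with_ordering_column = conn.execute("""
    SELECT COUNT(*) FROM bookings_space
    WHERE carpark_id = 1
    GROUP BY carpark_id, space
""").fetchall()

# What it looks like once the ordering is cleared.
grouped_by_carpark = conn.execute("""
    SELECT COUNT(*) FROM bookings_space
    WHERE carpark_id = 1
    GROUP BY carpark_id
""").fetchall()

print(with_ordering_column)  # several rows, one per space
print(grouped_by_carpark)    # a single row with the total
```

SQLite happens to be lenient about multi-row scalar subqueries, so the queries are run standalone here just to show the row counts.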
It's also possible to create a subclass of Subquery, that changes the SQL it outputs. For instance, you can use:
class SQCount(Subquery):
    template = "(SELECT count(*) FROM (%(subquery)s) _count)"
    output_field = models.IntegerField()
You then use this as you would the original Subquery class:
spaces = Space.objects.filter(carpark=OuterRef('pk')).values('pk')
Carpark.objects.annotate(space_count=SQCount(spaces))
You can use this trick (at least in postgres) with a range of aggregating functions: I often use it to build up an array of values, or sum them.
I just bumped into a VERY similar case, where I had to get seat reservations for events where the reservation status is not cancelled. After trying to figure the problem out for hours, here's what I've seen as the root cause of the problem:
Preface: this is MariaDB, Django 1.11.
When you annotate a query, it gets a GROUP BY clause with the fields you select (basically what's in your values() query selection). After investigating with the MariaDB command line tool why I was getting NULLs or Nones in the query results, I came to the conclusion that the GROUP BY clause causes the COUNT() to return NULLs.
Then, I started diving into the QuerySet interface to see how I could manually, forcibly remove the GROUP BY from the DB queries, and came up with the following code:
from django.db.models import Count, OuterRef, Subquery
from django.db.models.fields import PositiveIntegerField
reserved_seats_qs = SeatReservation.objects.filter(
    performance=OuterRef(name='pk'), status__in=TAKEN_TYPES
).values('id').annotate(
    count=Count('id')).values('count')
# Query workaround: remove GROUP BY from subquery. Test this
# vigorously!
reserved_seats_qs.query.group_by = []
performances_qs = Performance.objects.annotate(
    reserved_seats=Subquery(
        queryset=reserved_seats_qs,
        output_field=PositiveIntegerField()))
print(performances_qs[0].reserved_seats)
So basically, you have to manually remove/update the group_by field on the subquery's queryset in order for it to not have a GROUP BY appended on it on execution time. Also, you'll have to specify what output field the subquery will have, as it seems that Django fails to recognize it automatically, and raises exceptions on the first evaluation of the queryset. Interestingly, the second evaluation succeeds without it.
I believe this is a Django bug, or an inefficiency in subqueries. I'll create a bug report about it.
Edit: the bug report is here.
Problem
The problem is that Django adds GROUP BY as soon as it sees an aggregate function being used.
Solution
So you can just create your own aggregate function in such a way that Django does not treat it as an aggregate. Just like this:
from django.db.models import F, Func, OuterRef, Subquery
total_comments = Comment.objects.filter(
    post=OuterRef('pk')
).order_by().annotate(
    total=Func(F('length'), function='SUM')
).values('total')
Post.objects.filter(length__gt=Subquery(total_comments))
This way you get the SQL query like this:
SELECT "testapp_post"."id", "testapp_post"."length"
FROM "testapp_post"
WHERE "testapp_post"."length" > (SELECT SUM(U0."length") AS "total"
FROM "testapp_comment" U0
WHERE U0."post_id" = "testapp_post"."id")
So you can even use aggregate subqueries in aggregate functions.
Example
You can count the number of workdays between two dates, excluding weekends and holidays, and aggregate and summarize them by employee:
class NonWorkDay(models.Model):
    date = models.DateField()
class WorkPeriod(models.Model):
    employee = models.ForeignKey(User, on_delete=models.CASCADE)
    start_date = models.DateField()
    end_date = models.DateField()
number_of_non_work_days = NonWorkDay.objects.filter(
date__gte=OuterRef('start_date'),
date__lte=OuterRef('end_date'),
).annotate(
cnt=Func('id', function='COUNT')
).values('cnt')
WorkPeriod.objects.values('employee').order_by().annotate(
    number_of_work_days=Sum(F('end_date__year') - F('start_date__year') - number_of_non_work_days)
)
Hope this will help!
A solution which would work for any general aggregation could be implemented using Window classes from Django 2.0. I have added this to the Django tracker ticket as well.
This allows the aggregation of annotated values by calculating the aggregate over partitions based on the outer query model (in the GROUP BY clause), then annotating that data to every row in the subquery queryset. The subquery can then use the aggregated data from the first row returned and ignore the other rows.
Performance.objects.annotate(
reserved_seats=Subquery(
SeatReservation.objects.filter(
performance=OuterRef(name='pk'),
status__in=TAKEN_TYPES,
).annotate(
reserved_seat_count=Window(
expression=Count('pk'),
partition_by=[F('performance')]
),
).values('reserved_seat_count')[:1],
output_field=FloatField()
)
)
If I understand correctly, you are trying to count Spaces available in a Carpark. Subquery seems overkill for this, the good old annotate alone should do the trick:
Carpark.objects.annotate(Count('spaces'))
This will include a spaces__count value in your results.
OK, I have seen your note...
I was also able to run your same query with other models I had at hand. The results are the same, so the query in your example seems to be OK (tested with Django 1.11b1):
activities = Activity.objects.filter(event=OuterRef('pk')).values('event')
count_activities = activities.annotate(c=Count('*')).values('c')
Event.objects.annotate(spaces__count=Subquery(count_activities))
Maybe your "simplest real-world example" is too simple... can you share the models or other information?
"works for me" doesn't help very much. But.
I tried your example on some models I had handy (the Book -> Author type), it works fine for me in django 1.11b1.
Are you sure you're running this in the right version of Django? Is this the actual code you're running? Are you actually testing this not on carpark but some more complex model?
Maybe try to print(thequery.query) to see what SQL it's trying to run in the database. Below is what I got with my models (edited to fit your question):
SELECT (SELECT COUNT(U0."id") AS "c"
FROM "carparks_spaces" U0
WHERE U0."carpark_id" = ("carparks_carpark"."id")
GROUP BY U0."carpark_id") AS "space_count" FROM "carparks_carpark"
Not really an answer, but hopefully it helps.

Mix Count, Case and Distinct on a specific field

I have the following model:
class NoteLink(models.Model):
    note_source = models.ForeignKey(Note, related_name="links_sourced")
    note_target = models.ForeignKey(Note, related_name="links_targeting")
    author = models.ForeignKey(User, related_name="links_created")
    is_public = models.BooleanField(default=False)
I would like to count the amounts of links a note has that are public, so I have the following annotate:
Note.objects.annotate(public_links=Count(Case(
When(links_sourced__is_public=True, then=1),
output_field=IntegerField()))
)
The issue is that several users can create a link with the same source and target, and this counts links with the same source and target several times if the author is different. I would like to only count the links with a distinct note_source and note_target.
I know that Count has a distinct=True option. But how can I mix it with my Case to consider links not distinct if all but the author is same? Or in other words, how to only count the ones with a different note_target?
N.B: I am not using PostgreSQL but MySQL, so I cannot run distinct() on a specific field.
Edit: I am not interested in having the count in a separate variable or query. I need this value to be annotated onto all my Notes.
Edit2: My goal is to annotate the values into a query that I will use further in my code, not just to count the number of distinct notes. I already know different ways to do that. What I need is to annotate Note.objects.all() with the field "public_links" and use the same query later in my code. A separate query containing the number of distinct public links would be unusable for me. The same goes for a query that wouldn't contain all my Notes.
This may help you:-
from django.db.models import Count
distinctNoteCount = Note.objects.filter(is_public=True).annotate(the_count=Count('note_source', distinct=True))
You can use distinct on either note_source or note_target; I think it doesn't matter in your case, since you just want to count the rows that contain a distinct note_source and note_target.
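At the SQL level, one way to combine the condition with a per-target distinct count is to let the CASE return the target id rather than a constant 1, and count the distinct values. The sketch below uses Python's built-in sqlite3 module with hypothetical table names and invented data; whether the ORM produces the same SQL from Count(Case(When(..., then='links_sourced__note_target')), distinct=True) is an assumption that is not tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE note (id INTEGER PRIMARY KEY);
    CREATE TABLE notelink (
        id INTEGER PRIMARY KEY,
        note_source_id INTEGER,
        note_target_id INTEGER,
        author_id INTEGER,
        is_public INTEGER
    );
    INSERT INTO note VALUES (1), (2), (3);
    -- Note 1 -> 2 is linked publicly by two authors; 1 -> 3 is private.
    INSERT INTO notelink VALUES
        (1, 1, 2, 10, 1),
        (2, 1, 2, 11, 1),
        (3, 1, 3, 10, 0);
""")

rows = conn.execute("""
    SELECT n.id,
           COUNT(CASE WHEN l.is_public = 1 THEN 1 END) AS per_link,
           COUNT(DISTINCT CASE WHEN l.is_public = 1
                               THEN l.note_target_id END) AS per_target
    FROM note n
    LEFT OUTER JOIN notelink l ON l.note_source_id = n.id
    GROUP BY n.id
    ORDER BY n.id
""").fetchall()

for row in rows:
    print(row)  # (note id, public links, distinct public targets)
```

Every note stays in the result, including those without any link, and the two public links to the same target are counted once in the last column.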

Django Equivalent of SqlAlchemy `any` to filter `WHERE EXISTS`

I have two models, Sample and Run. A Sample can belong to multiple Runs. The Run model has name that I would like to use to filter Samples on; I would like to find all Samples that have a run with a given name filter. In SqlAlchemy, I write this like:
Sample.query.filter(Sample.runs.any(Run.name.like('%test%'))).all()
In Django, I start with:
Sample.objects.filter(run__in=Run.objects.filter(name__icontains='test'))
or
Sample.objects.filter(run__name__icontains='test')
However both of these produce duplicates so I must add .distinct() to the end.
The Django approach of using distinct has terrible performance when there are a large number of predicates (because the distinct operation must run over a large number of possible rows) whereas the SqlAlchemy version runs fine. The repeated rows come from the repeated left outer joins, one for each predicate.
For example:
Sample.objects.filter(Q(**{'run__name__icontains': 'alex'}) |
Q(**{'run__name__icontains': 'baz'}) | ...)
EDIT: To make this a little more complicated, I do want the ability to have filters like:
((Q(**{'run__name__icontains': 'alex'}) | Q(**{'name__icontains': 'alex'}))
& (Q(**{'run__name__icontains': 'baz'}) | Q(**{'name__icontains': 'baz'})))
which has a SQLAlchemy query like:
clause1 = Sample.runs.any(Run.name.like('%alex%')) | Sample.name.like('%alex%')
clause2 = Sample.runs.any(Run.name.like('%baz%')) | Sample.name.like('%baz%')
Sample.query.filter(clause1 & clause2)
Assuming this is your models.py:
from django.db import models
class Sample(models.Model):
    name = models.CharField(max_length=255)
class Run(models.Model):
    name = models.CharField(max_length=255)
    sample = models.ForeignKey(Sample)
Since I wasn't able to figure out how to do this without using "distinct", or without using "raw" (in which, if you're forming your own SQL code, and can't rely on the ORM, then what's the point :p), I recommend to try replacing the Django ORM with SQLAlchemy, or use them along-side each other, since theoretically that would work. Sorry I couldn't be of much help :(
Here is a fairly-recent blog post that can help you do that:
http://rodic.fr/blog/sqlalchemy-django/
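For reference, the difference between the join-based filter and the WHERE EXISTS form the question asks about can be seen in a small sketch with Python's built-in sqlite3 module. The table names and rows are hypothetical. Django 1.11 added an Exists expression alongside Subquery and OuterRef, which may be the closest ORM counterpart, but that mapping is not tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE run (id INTEGER PRIMARY KEY, name TEXT, sample_id INTEGER);
    INSERT INTO sample VALUES (1, 'first'), (2, 'second');
    INSERT INTO run VALUES (1, 'test-a', 1), (2, 'test-b', 1), (3, 'other', 2);
""")

# Join-based filter: sample 1 is returned once per matching run.
via_join = conn.execute("""
    SELECT s.id
    FROM sample s
    JOIN run r ON r.sample_id = s.id
    WHERE r.name LIKE '%test%'
""").fetchall()

# EXISTS-based filter: each sample is returned at most once, no DISTINCT needed.
via_exists = conn.execute("""
    SELECT s.id
    FROM sample s
    WHERE EXISTS (
        SELECT 1 FROM run r
        WHERE r.sample_id = s.id AND r.name LIKE '%test%'
    )
""").fetchall()

print(via_join)
print(via_exists)
```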

How do I perform a filter on the FK I'm aggregating on in a QuerySet?

I have a (working) query that looks like
authors = Authors.objects.complicated_queryset()
with_scores = authors.annotate(total_book_score=Sum('books__score'))
It finds all authors who are returned by a complicated_queryset method, and then sums up the total of the scores of their books. However, I wish to amend this QuerySet such that it only includes the scores from the books published the last year. In pretend syntax:
with_scores = authors.annotate(total_book_score=Sum('books__score'),
filter=Q(books__published=2015))
Is this possible with QuerySets or do I have to write raw SQL (or, I guess, two separate queries) to get that behaviour?
You could try using Case if you're using Django 1.8+
DISCLAIMER: The following code is an approximation. I haven't tested it, so it may not work exactly as written.
# You will need these imports:
from django.db.models import Case, F, IntegerField, Sum, Value, When
with_scores = authors.annotate(total_book_score=Sum(
    Case(When(books__published=2015, then=F('books__score')),
         default=Value(0), output_field=IntegerField())  # Or FloatField if it fits your needs.
    )
)
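The SQL idea behind the Case/When approach is a SUM over a CASE expression, so that books outside the wanted year contribute zero. Here is a sketch using Python's built-in sqlite3 module, with hypothetical tables and made-up scores; it only illustrates the technique and is not the SQL Django is guaranteed to generate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY);
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        author_id INTEGER,
        published INTEGER,
        score INTEGER
    );
    INSERT INTO author VALUES (1), (2);
    INSERT INTO book VALUES
        (1, 1, 2015, 5),
        (2, 1, 2014, 7),
        (3, 1, 2015, 3),
        (4, 2, 2013, 9);
""")

rows = conn.execute("""
    SELECT a.id,
           SUM(CASE WHEN b.published = 2015 THEN b.score ELSE 0 END) AS total_book_score
    FROM author a
    LEFT OUTER JOIN book b ON b.author_id = a.id
    GROUP BY a.id
    ORDER BY a.id
""").fetchall()

for row in rows:
    print(row)  # (author id, sum of 2015 scores)
```

Authors with no books from 2015 are kept in the result with a total of 0, which is the behaviour the default=Value(0) branch is meant to give.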

Select DISTINCT individual columns in django?

I'm curious if there's any way to do a query in Django that's not a "SELECT * FROM..." underneath. I'm trying to do a "SELECT DISTINCT columnName FROM ..." instead.
Specifically I have a model that looks like:
class ProductOrder(models.Model):
    Product = models.CharField(max_length=20, primary_key=True)
    Category = models.CharField(max_length=30)
    Rank = models.IntegerField()
where the Rank is a rank within a Category. I'd like to be able to iterate over all the Categories doing some operation on each rank within that category.
I'd like to first get a list of all the categories in the system and then query for all products in that category and repeat until every category is processed.
I'd rather avoid raw SQL, but if I have to go there, that'd be fine. Though I've never coded raw SQL in Django/Python before.
One way to get the list of distinct column names from the database is to use distinct() in conjunction with values().
In your case you can do the following to get the names of distinct categories:
q = ProductOrder.objects.values('Category').distinct()
print(q.query) # See for yourself.
# The query would look something like
# SELECT DISTINCT "app_productorder"."category" FROM "app_productorder"
There are a couple of things to remember here. First, this will return a ValuesQuerySet which behaves differently from a QuerySet. When you access say, the first element of q (above) you'll get a dictionary, NOT an instance of ProductOrder.
Second, it would be a good idea to read the warning note in the docs about using distinct(). The above example will work but all combinations of distinct() and values() may not.
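To make the two points above concrete, here is a sketch of the underlying SELECT DISTINCT using Python's built-in sqlite3 module, with a hypothetical table and invented rows. The row factory is set so that each result is turned into a dictionary, similar in spirit to what values() hands back instead of model instances.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be converted to a dict
conn.executescript("""
    CREATE TABLE app_productorder (product TEXT PRIMARY KEY, category TEXT);
    INSERT INTO app_productorder VALUES
        ('apple', 'fruit'),
        ('pear', 'fruit'),
        ('hammer', 'tools');
""")

cursor = conn.execute(
    "SELECT DISTINCT category FROM app_productorder ORDER BY category"
)
categories = [dict(row) for row in cursor]

print(categories)  # one dict per distinct category, not per product
```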
PS: it is a good idea to use lower case names for fields in a model. In your case this would mean rewriting your model as shown below:
class ProductOrder(models.Model):
    product = models.CharField(max_length=20, primary_key=True)
    category = models.CharField(max_length=30)
    rank = models.IntegerField()
It's quite simple actually if you're using PostgreSQL, just use distinct(columns) (documentation).
ProductOrder.objects.all().distinct('category')
Note that this feature has been included in Django since 1.4
Use order_by with that field, and then do distinct.
ProductOrder.objects.order_by('category').values_list('category', flat=True).distinct()
The other answers are fine, but this is a little cleaner, in that it only gives the values like you would get from a DISTINCT query, without any cruft from Django.
>>> set(ProductOrder.objects.values_list('category', flat=True))
{u'category1', u'category2', u'category3', u'category4'}
or
>>> list(set(ProductOrder.objects.values_list('category', flat=True)))
[u'category1', u'category2', u'category3', u'category4']
And, it works without PostgreSQL.
This is less efficient than using a .distinct(), presuming that DISTINCT in your database is faster than a python set, but it's great for noodling around the shell.
Update:
This answer is great for making queries in the Django shell during development. DO NOT use this solution in production unless you are absolutely certain that you will always have a trivially small number of results before set is applied. Otherwise, it's a terrible idea from a performance standpoint.