Django conditional Subquery aggregate - django

A simplified example of my model structure would be
class Corporation(models.Model):
    ...
class Division(models.Model):
    corporation = models.ForeignKey(Corporation)
class Department(models.Model):
    division = models.ForeignKey(Division)
    type = models.IntegerField()
Now I want to display a table of corporations in which one column contains the number of departments of a certain type, e.g. type=10. Currently, this is implemented with a helper method on the Corporation model that retrieves the count, e.g.
class Corporation(models.Model):
    ...
    def get_departments_type_10(self):
        return (
            Department.objects
            .filter(division__corporation=self, type=10)
            .count()
        )
The problem here is that this absolutely murders performance due to the N+1 problem.
I have tried to approach this problem with select_related, prefetch_related, annotate, and Subquery, but I haven't been able to get the results I need.
Ideally, each Corporation in the queryset should be annotated with an integer type_10_count which reflects the number of departments of that type.
I'm sure I could do something with raw sql in .extra(), but the docs announce that it is going to be deprecated (I'm on Django 1.11)
EDIT: Example of raw sql solution
corps = Corporation.objects.raw("""
    SELECT
        *,
        (
            SELECT COUNT(*)
            FROM foo_division div
            JOIN foo_department dept ON dept.division_id = div.id
            WHERE div.corporation_id = c.id AND dept.type = 10
        ) as type_10_count
    FROM foo_corporation c
""")

I think with Subquery we can get SQL similar to one you have provided, with this code
from django.db.models import Count, IntegerField, OuterRef, Subquery
# Count departments of type 10, grouped by division__corporation (GROUP BY) [1]
# .order_by() removes any default ordering so no extra GROUP BY columns are added [2]
departments = Department.objects.filter(type=10).values(
    'division__corporation'
).annotate(count=Count('id')).order_by()
# Attach departments as a Subquery to Corporation, correlated on Corporation.id.
# Departments are already grouped by division__corporation,
# so .values('count') returns at most one row with a single column, count [3]
departments_subquery = departments.filter(division__corporation=OuterRef('id'))
corporations = Corporation.objects.annotate(
    departments_of_type_10=Subquery(
        departments_subquery.values('count'), output_field=IntegerField()
    )
)
The generated SQL is
SELECT "corporation"."id", ... (other fields) ...,
(
SELECT COUNT("division"."id") AS "count"
FROM "department"
INNER JOIN "division" ON ("department"."division_id" = "division"."id")
WHERE (
"department"."type" = 10 AND
"division"."corporation_id" = ("corporation"."id")
) GROUP BY "division"."corporation_id"
) AS "departments_of_type_10"
FROM "corporation"
One concern here is that a correlated subquery can be slow on large tables. However, database query optimizers may be smart enough to rewrite the subquery as an OUTER JOIN; at least I've heard PostgreSQL does this.
1. GROUP BY using .values and .annotate
2. order_by() problems
3. Subquery
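To see what the annotated subquery computes without setting up a Django project, here is a minimal sketch using Python's built-in sqlite3 module. The table names (corporation, division, department) are simplified stand-ins for the real Django table names, and the sample rows are made up. It compares the per-corporation loop of the helper method with a single query that uses a correlated subquery, in the shape of the raw SQL from the question.

```python
import sqlite3

# Hypothetical, simplified schema and data; real Django tables would be
# named <app>_corporation and so on.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE corporation (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE division (id INTEGER PRIMARY KEY, corporation_id INTEGER);
    CREATE TABLE department (id INTEGER PRIMARY KEY, division_id INTEGER, type INTEGER);

    INSERT INTO corporation VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
    INSERT INTO division VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO department VALUES
        (1, 1, 10), (2, 1, 10), (3, 2, 10), (4, 2, 20), (5, 3, 20);
""")


def count_per_corporation_n_plus_1(conn):
    """One extra query per corporation, like the model helper method."""
    counts = {}
    corp_ids = conn.execute("SELECT id FROM corporation ORDER BY id").fetchall()
    for (corp_id,) in corp_ids:
        (n,) = conn.execute(
            """
            SELECT COUNT(*)
            FROM department dept
            JOIN division dv ON dept.division_id = dv.id
            WHERE dv.corporation_id = ? AND dept.type = 10
            """,
            (corp_id,),
        ).fetchone()
        counts[corp_id] = n
    return counts


def count_per_corporation_single_query(conn):
    """A single query with a correlated subquery in the SELECT list."""
    rows = conn.execute(
        """
        SELECT c.id,
               (SELECT COUNT(*)
                FROM department dept
                JOIN division dv ON dept.division_id = dv.id
                WHERE dv.corporation_id = c.id AND dept.type = 10) AS type_10_count
        FROM corporation c
        ORDER BY c.id
        """
    ).fetchall()
    return dict(rows)


n_plus_1 = count_per_corporation_n_plus_1(conn)
single = count_per_corporation_single_query(conn)
```

Both functions return the same mapping of corporation id to count; the second one does it in one round trip to the database.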

You should be able to do this with a Case() expression to query the count of departments that have the type you are looking for:
from django.db.models import Case, IntegerField, Sum, When, Value
Corporation.objects.annotate(
    type_10_count=Sum(
        Case(
            When(division__department__type=10, then=Value(1)),
            default=Value(0),
            output_field=IntegerField()
        )
    )
)
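As a rough illustration of the SQL idea behind this annotation (a conditional sum over joined rows), here is a sketch with the standard-library sqlite3 module. The schema, the table names and the rows are assumptions made for the example, and the exact SQL that Django emits may differ in detail.

```python
import sqlite3

# Hypothetical, simplified schema and data for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE corporation (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE division (id INTEGER PRIMARY KEY, corporation_id INTEGER);
    CREATE TABLE department (id INTEGER PRIMARY KEY, division_id INTEGER, type INTEGER);

    INSERT INTO corporation VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
    INSERT INTO division VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO department VALUES
        (1, 1, 10), (2, 1, 10), (3, 2, 10), (4, 2, 20), (5, 3, 20);
""")

# Each joined row contributes 1 if the department has type 10, else 0.
# LEFT OUTER JOINs keep corporations that have no divisions or departments.
rows = conn.execute(
    """
    SELECT c.id,
           SUM(CASE WHEN dept.type = 10 THEN 1 ELSE 0 END) AS type_10_count
    FROM corporation c
    LEFT OUTER JOIN division dv ON dv.corporation_id = c.id
    LEFT OUTER JOIN department dept ON dept.division_id = dv.id
    GROUP BY c.id
    ORDER BY c.id
    """
).fetchall()
type_10_counts = dict(rows)
```

Because the default branch yields 0, corporations without any matching department end up with a count of 0 rather than NULL.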

I like the following way of doing it:
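from django.db.models import F, Func, OuterRef, Subquery
departments = Department.objects.filter(
    type=10,
    division__corporation=OuterRef('id')
).annotate(
    # Func is not treated as an aggregate, so no GROUP BY is added
    count=Func(F('id'), function='Count')
).values('count').order_by()
corporations = Corporation.objects.annotate(
    departments_of_type_10=Subquery(departments)
)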
More details on this method can be found in this answer: https://stackoverflow.com/a/69020732/10567223

Related

Count annotation adds unwanted group by statement for all fields

I want to generate the following query:
select id, (select count(*) from B where B.x = A.x) as c from A
Which should be simple enough with the Subquery expression. Except I get a group by statement added to my count query which I can't get rid of:
from django.contrib.contenttypes.models import ContentType
str(ContentType.objects.annotate(c=F('id')).values('c').query)
# completely fine query with annotated field
'SELECT "django_content_type"."id" AS "c" FROM "django_content_type"'
str(ContentType.objects.annotate(c=Count('*')).values('c').query)
# gets group by for every single field out of nowhere
'SELECT COUNT(*) AS "c" FROM "django_content_type" GROUP BY "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model"'
Which makes the result be [{'c': 1}, {'c': 1}, {'c': 1}, {'c': 1},...] instead of [{c:20}]. But subqueries have to have only one row of result to be usable.
Since the query is supposed to be used in a subquery I can't use .count() or .aggregate() either since those evaluate instantly and complain about the usage of OuterRef expression.
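The effect of the unwanted GROUP BY can be reproduced in plain SQL. The following sketch uses Python's sqlite3 module with a made-up content_type table (not the real django_content_type table) to show why grouping by every column, including the primary key, produces one row per record with a count of 1.

```python
import sqlite3

# Hypothetical table standing in for django_content_type.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content_type (id INTEGER PRIMARY KEY, app_label TEXT, model TEXT);
    INSERT INTO content_type (app_label, model) VALUES
        ('auth', 'user'), ('auth', 'group'), ('meta', 'field');
""")

# Grouping by the primary key puts every row into its own group.
grouped_by_pk = conn.execute(
    "SELECT COUNT(*) AS c FROM content_type GROUP BY id, app_label, model"
).fetchall()

# Without GROUP BY the whole table is a single group.
ungrouped = conn.execute("SELECT COUNT(*) AS c FROM content_type").fetchall()
```

Here grouped_by_pk contains three rows of 1, while ungrouped contains a single row holding 3, which is the shape a scalar subquery needs.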
Example with subquery:
str(ContentType.objects.annotate(fields=Subquery(
Field.objects.filter(model_id=OuterRef('pk')).annotate(c=Count('*')).values('c')
)).query)
Generates
SELECT "django_content_type"."id",
"django_content_type"."app_label",
"django_content_type"."model",
(SELECT COUNT(*) AS "c"
FROM "meta_field" U0
WHERE U0."model_id" = ("django_content_type"."id")
GROUP BY U0."id", U0."model_id", U0."module", U0."name", U0."label", U0."widget", U0."visible", U0."readonly",
U0."desc", U0."type", U0."type_model_id", U0."type_meta_id", U0."is_type_meta", U0."multi",
U0."translatable", U0."conditions") AS "fields"
FROM "django_content_type"
Expected query:
SELECT "django_content_type"."id",
"django_content_type"."app_label",
"django_content_type"."model",
(SELECT COUNT(*) AS "c"
FROM "meta_field" U0
WHERE U0."model_id" = ("django_content_type"."id")) AS "fields"
FROM "django_content_type"
Update: (to add models from real app requested in comments):
class Translation(models.Model):
field = models.ForeignKey(MetaField, models.CASCADE)
ref_id = models.IntegerField()
# ... other fields
class Choice(models.Model):
meta = models.ForeignKey(MetaField, on_delete=models.PROTECT)
# ... other fields
I need a query to get number of Translations available for each choice where Translation.field_id refers to Choice.meta_id and Translation.ref_id refers to Choice.id.
The reason there are no foreign keys is that not all meta fields are choice fields (e.g. text fields may also have translations). I could make a separate table for each translatable entity, but this setup should be easy to use with a count subquery that doesn't have a group by statement in it.
UPDATE Here's a query using subquery that should come close to what you want:
str(ContentType.objects.annotate(fields=Subquery(
Field.objects.filter(model_id=OuterRef('pk')).values('model').annotate(c=Count('pk')).values('c')
)).query)
The only thing I did was add the values('model') call, which acts as the GROUP BY clause; this makes Count('pk') actually work, since all matching rows are aggregated into one.
It will return null instead of 0 when there are no related rows, which you can probably transform to 0 using a Coalesce function or a Case ... When ... then.
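The NULL behaviour and the COALESCE fix can be checked in plain SQL. This is a small sketch with sqlite3, using hypothetical content_type and field tables; in the ORM, the corresponding wrapper would be Coalesce from django.db.models.functions around the Subquery.

```python
import sqlite3

# Hypothetical tables: content type 1 has two fields, content type 2 has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content_type (id INTEGER PRIMARY KEY);
    CREATE TABLE field (id INTEGER PRIMARY KEY, model_id INTEGER);
    INSERT INTO content_type VALUES (1), (2);
    INSERT INTO field VALUES (1, 1), (2, 1);
""")

rows = conn.execute(
    """
    SELECT ct.id,
           -- grouped subquery: no matching rows means no group, hence NULL
           (SELECT COUNT(f.id) FROM field f
            WHERE f.model_id = ct.id
            GROUP BY f.model_id) AS raw_count,
           -- COALESCE replaces that NULL with 0
           COALESCE((SELECT COUNT(f.id) FROM field f
                     WHERE f.model_id = ct.id
                     GROUP BY f.model_id), 0) AS safe_count
    FROM content_type ct
    ORDER BY ct.id
    """
).fetchall()
```

The second content type has no fields, so its raw_count comes back as None in Python while safe_count is 0.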
The exact query you want isn't possible with the Django ORM, although you can achieve the same result with
Choice.objects.annotate(c=Count(
'meta__translation',
distinct=True,
filter=Q(meta__translation__ref_id=F('id'))
))
Alternatively look at the django-sql-utils package, as also mentioned in this post.
It is a bit of a dirty hack, but after diving inside Django's ORM code, I found the following works wonderfully for me (I am trying to use your own example's subquery):
counting_subquery = Subquery(
    Field.objects
    .filter(model_id=OuterRef('pk'))
    .annotate(c=Count('*'))
    .values('c')
)
# Note: the next line fixes a bug in the Django ORM, where the subquery defined above
# triggers an unwanted group_by clause in the generated SQL which ruins the count operation.
counting_subquery.query.group_by = True
results = ContentType.objects.annotate(fields_count=Subquery(counting_subquery))
...
The key is setting group_by to True. That gets rid of the unwanted group_by clause in your SQL.
I am not happy about it, as it relies on Django's undocumented behaviour to work. But I can live with it; I am even less happy about the maintainability of using direct SQL in the subquery...

Using Annotate & Arithmetic in a Django subquery

I am trying to improve my understanding of the Django queryset syntax and am hoping that someone could help me check my understanding.
Could this:
total_packed = (
    PackingRecord.objects.filter(
        product=OuterRef('pk'), fifolink__sold_out=False
    ).values('product')  # Group by product
    .annotate(total=Sum('qty'))  # Sum qty for 'each' product
    .values('total')
)
total_sold = (
    FifoLink.objects.filter(
        packing_record__product=OuterRef('pk'), sold_out=False
    ).values('packing_record__product')
    .annotate(total=Sum('sale__qty'))
    .values('total')
)
output = obj_set.annotate(
    sold=Subquery(total_sold[:1]),
    packed=Subquery(total_packed[:1]),
).annotate(
    in_stock=F('packed') - F('sold')
)
be safely reduced to this:
in_stock = (
    FifoLink.objects.filter(
        packing_record__product=OuterRef('pk'), sold_out=False
    ).values('packing_record__product')
    .annotate(total=Sum(F('sale__qty')-F('packing_record__qty')))
    .values('total')
)
output = obj_set.annotate(
    in_stock=Subquery(in_stock[:1]),
)
Basically, I am trying to move the math being completed in the outer .annotate() into the queryset itself by using the fk relationship instead of running two separate querysets. I think this is allowed, but I am not sure if I am understanding it correctly.

using Filtered Count in django over joined tables returns wrong values

To keep it simple, I have four tables (A, B, Category and Relation). The Relation table stores the intensity of A in B, and Category stores the type of B.
A <--- Relation ---> B ---> Category
(So the relation between A and B is n to n, where the relation between B and Category is n to 1)
What I need is to calculate the occurrence rate of A in Category which is obtained using:
A.objects.values(
'id', 'relation_set__B__Category_id'
).annotate(
ANum = Count('id', distinct=False)
)
Please notice that If I use 'distinct=True' instead every and each 'Anum' would be equal to 1 which is not the desired outcome. The problem is that I have to filter the calculation based on the dates on which B occurred (and some other fields in the B table).
I am using the Django 2.0 feature that allows a filter argument in aggregations.
Let's assume:
kwargs= {}
kwargs['relation_set__B__BDate__gte'] = the_start_limit
I could use it in my code like:
A.objects.values(
'id', 'relation_set__B__Category_id'
).annotate(
Anum = Count('id', distinct=False, filter=Q(**kwargs))
)
However the result I get is duplicated due to the table joins and I cannot use distinct=True as I explained. (querying A is also a must since I have to aggregate some other fields on this table as explained in my question here)
I am using Postgres and django 2.0.1 .
Are there any workarounds to achieve what I have in mind?
Update
Got it done using another Subquery:
# subquery
annotation = {
    'ANum': Count('relation_set__A_id', distinct=False,
                  filter=Q(**Bkwargs)),
}
sub_filter = (
    Q(relation_set__A_id=OuterRef('id')) &
    Q(Category_id=OuterRef('relation_set__B__Category_id'))
)
# you could annotate 'relation_set__B__Category_id' onto the A query and set the field here.
subquery = B.objects.filter(
    sub_filter
).values(
    'relation_set__A_id'
).annotate(**annotation).values('ANum')[:1]
# main query
A.objects.values(
    'id', 'relation_set__B__Category_id'
).annotate(
    Anum=Subquery(subquery)
)
I'm still not sure if I understood what you want. You write
Please notice that If I use 'distinct=True' instead every and each 'Anum' would be equal to 1
Of course. You count the associated A-object to each A-object. Each counts itself. So I still think you don't want to annotate A-objects with Anum, but probably Categories. This one should give you the desired number of As in each Category.
Category.objects.annotate(
    Anum=Count(
        'b__relation__a',
        filter=Q(b__BDate__gte=the_start_limit),
        distinct=True
    )
)
'b__relation__a' follows the relations backwards and picks all A-objects that are related to the Category. However the filter limits the counted relations to certain Bs. The distinct=True is needed to avoid a query bug.
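To make the effect of distinct=True across joined tables concrete, here is a sketch in plain SQL via sqlite3. The table layout (a, b, relation, category), the column names and the rows are assumptions for illustration. The filter is written as a CASE expression inside COUNT, which is one common way a filtered aggregate is expressed in SQL.

```python
import sqlite3

# Hypothetical schema: relation links a to b (many-to-many), b belongs to a category.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY);
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, category_id INTEGER, bdate TEXT);
    CREATE TABLE relation (id INTEGER PRIMARY KEY, a_id INTEGER, b_id INTEGER);

    INSERT INTO category VALUES (1), (2);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES
        (1, 1, '2018-01-10'), (2, 1, '2018-02-10'), (3, 1, '2017-06-01');
    INSERT INTO relation VALUES (1, 1, 1), (2, 1, 2), (3, 3, 2), (4, 2, 3);
""")

rows = conn.execute(
    """
    SELECT cat.id,
           -- counts every joined row that passes the date filter
           COUNT(CASE WHEN b.bdate >= :start THEN r.a_id END) AS with_duplicates,
           -- counts each A only once per category
           COUNT(DISTINCT CASE WHEN b.bdate >= :start THEN r.a_id END) AS distinct_as
    FROM category cat
    LEFT JOIN b ON b.category_id = cat.id
    LEFT JOIN relation r ON r.b_id = b.id
    GROUP BY cat.id
    ORDER BY cat.id
    """,
    {"start": "2018-01-01"},
).fetchall()
```

In this data, A 1 is related to two Bs of category 1 that pass the date filter, so the plain count reports 3 rows while the distinct count reports 2 different As.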
If you really want "a list of A objects grouped by its id" (and not only the aggregated Anum-count), as you stated in your comment, I don't see an easy way to do that in a single query.

Django ORM: is it possible to inject subqueries?

I have a Django model that looks something like this:
class Result(models.Model):
    date = models.DateTimeField()
    subject = models.ForeignKey('myapp.Subject')
    test_type = models.ForeignKey('myapp.TestType')
    summary = models.PositiveSmallIntegerField()
    # more fields about the result like its location, tester ID and so on
Sometimes we want to retrieve all the test results, other times we only want the most recent result of a particular test type for each subject. This answer has some great options for SQL that will find the most recent result.
Also, we sometimes want to bucket the results into different chunks of time so that we can graph the number of results per day / week / month.
We also want to filter on various fields, and for elegance I'd like a QuerySet that I can then make all the filter() calls on, and annotate for the counts, rather than making raw SQL calls.
I have got this far:
qs = Result.objects.extra(select = {
    'date_range': "date_trunc('{0}', time)".format("day"), # Chunking into time buckets
    'rn' : "ROW_NUMBER() OVER(PARTITION BY subject_id, test_type_id ORDER BY time DESC)"})
qs = qs.values('date_range', 'result_summary', 'rn')
qs = qs.order_by('-date_range')
which results in the following SQL:
SELECT (ROW_NUMBER() OVER(PARTITION BY subject_id, test_type_id ORDER BY time DESC)) AS "rn", (date_trunc('day', time)) AS "date_range", "myapp_result"."result_summary" FROM "myapp_result" ORDER BY "date_range" DESC
which is kind of approaching what I'd like, but now I need to somehow filter to only get the rows where rn = 1. I tried using the 'where' field in extra(), which gives me the following SQL and error:
SELECT (ROW_NUMBER() OVER(PARTITION BY subject_id, test_type_id ORDER BY time DESC)) AS "rn", (date_trunc('day', time)) AS "date_range", "myapp_result"."result_summary" FROM "myapp_result" WHERE "rn"=1 ORDER BY "date_range" DESC ;
ERROR: column "rn" does not exist
So I think the query that finds "rn" needs to be a subquery - but is it possible to do that somehow, perhaps using extra()?
I know I could do this with raw SQL but it just looks ugly! I'd love to find a nice neat way where I have a filterable QuerySet.
I guess the other option is to have a field in the model that indicates whether it is actually the most recent result of that test type for that subject...
I've found a way!
qs = Result.objects.extra(where = ["NOT EXISTS(SELECT * FROM myapp_result as T2 WHERE (T2.test_type_id = myapp_result.test_type_id AND T2.subject_id = myapp_result.subject_id AND T2.time > myapp_result.time))"])
This is based on a different option from the answer I referenced earlier. I can filter or annotate qs with whatever I want.
As an aside, on the way to this solution I tried this:
qq = Result.objects.extra(where = ["NOT EXISTS(SELECT * FROM myapp_result as T2 WHERE (T2.test_type_id = myapp_result.test_type_id AND T2.subject_id = myapp_result.subject_id AND T2.time > myapp_result.time))"])
qs = Result.objects.filter(id__in=qq)
Django embeds the subquery just as you want it to:
SELECT ...some fields... FROM "myapp_result"
WHERE ("myapp_result"."id" IN (SELECT "myapp_result"."id" FROM "myapp_result"
WHERE (NOT EXISTS(SELECT * FROM myapp_result as T2
WHERE (T2.subject_id = myapp_result.subject_id AND T2.test_type_id = myapp_result.test_type_id AND T2.time > myapp_result.time)))))
I realised this had more subqueries than I need, but I note it here as I can imagine it being useful to know that you can filter one queryset with another and Django does exactly what you'd hope for in terms of embedding the subquery (rather than, say, executing it and embedding the returned values, which would be horrid.)
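The NOT EXISTS filter used above can be tried outside Django. This sketch uses sqlite3 with a hypothetical result table and invented rows; it keeps only the most recent row per (subject, test type) and also runs the nested "id IN (subquery)" form to show that both select the same rows.

```python
import sqlite3

# Hypothetical table standing in for myapp_result.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE result (
        id INTEGER PRIMARY KEY,
        subject_id INTEGER,
        test_type_id INTEGER,
        time TEXT,
        summary INTEGER
    );
    INSERT INTO result VALUES
        (1, 1, 1, '2020-01-01', 5),
        (2, 1, 1, '2020-02-01', 7),
        (3, 1, 2, '2020-01-15', 3),
        (4, 2, 1, '2020-01-20', 4),
        (5, 2, 1, '2020-03-01', 9);
""")

# A row is the latest if no other row for the same subject and test type is newer.
LATEST_ONLY = """
    SELECT r.id FROM result r
    WHERE NOT EXISTS (
        SELECT 1 FROM result t2
        WHERE t2.test_type_id = r.test_type_id
          AND t2.subject_id = r.subject_id
          AND t2.time > r.time
    )
"""

latest_ids = [row[0] for row in conn.execute(LATEST_ONLY + " ORDER BY r.id")]

# The same selection embedded as a subquery, as in filter(id__in=qq).
nested_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM result WHERE id IN (" + LATEST_ONLY + ") ORDER BY id"
    )
]
```

Note that if two rows in a group share the same latest time, both survive the NOT EXISTS test.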

Django query aggregate upvotes in backward relation

I have two models:
Base_Activity:
some fields
User_Activity:
user = models.ForeignKey(settings.AUTH_USER_MODEL)
activity = models.ForeignKey(Base_Activity)
rating = models.IntegerField(default=0) #Will be -1, 0, or 1
Now I want to query Base_Activity, and sort the items that have the most corresponding user activities with rating=1 on top. I want to do something like the query below, but the =1 part is obviously not working.
activities = Base_Activity.objects.all().annotate(
up_votes = Count('user_activity__rating'=1),
).order_by(
'up_votes'
)
How can I solve this?
You cannot use Count like that, as the error message says:
SyntaxError: keyword can't be an expression
The argument of Count must be a simple string, like user_activity__rating.
I think a good alternative can be to use Avg and Count together:
activities = Base_Activity.objects.all().annotate(
a=Avg('user_activity__rating'), c=Count('user_activity__rating')
).order_by(
'-a', '-c'
)
The items with the most rating=1 activities should have the highest average, and among the items with the same average the ones with the most activities will be listed higher.
If you want to exclude items that have downvotes, make sure to add the appropriate filter or exclude operations after annotate, for example:
activities = Base_Activity.objects.all().annotate(
a=Avg('user_activity__rating'), c=Count('user_activity__rating')
).filter(user_activity__rating__gt=0).order_by(
'-a', '-c'
)
UPDATE
To get all the items, ordered by their upvotes, disregarding downvotes, I think the only way is to use raw queries, like this:
from django.db import connection
sql = '''
    SELECT o.id, SUM(v.rating > 0) s
    FROM user_activity o
    JOIN rating v ON o.id = v.user_activity_id
    GROUP BY o.id ORDER BY s DESC
'''
with connection.cursor() as cursor:
    cursor.execute(sql)
    rows = cursor.fetchall()
Note: instead of hard-coding the table names of your models, get the table names from the models, for example if your model is called Rating, then you can get its table name with Rating._meta.db_table.
I tested this query on an sqlite3 database, I'm not sure the SUM expression there works in all DBMS. Btw I had a perfect Django site to test, where I also use upvotes and downvotes. I use a very similar model for counting upvotes and downvotes, but I order them by the sum value, stackoverflow style. The site is open-source, if you're interested.
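On the portability doubt: summing a boolean comparison relies on the database treating true as 1, whereas a CASE expression is standard SQL. The sketch below, using sqlite3 with hypothetical base_activity and user_activity tables and invented ratings, compares the two forms. It uses a LEFT JOIN so that activities without any votes are kept.

```python
import sqlite3

# Hypothetical tables and ratings for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base_activity (id INTEGER PRIMARY KEY);
    CREATE TABLE user_activity (
        id INTEGER PRIMARY KEY, activity_id INTEGER, rating INTEGER
    );
    INSERT INTO base_activity VALUES (1), (2), (3);
    INSERT INTO user_activity VALUES
        (1, 1, 1), (2, 1, 1), (3, 1, -1),
        (4, 2, 1), (5, 2, 0);
""")

# Boolean form, as in the raw query above (works in SQLite).
boolean_sum = conn.execute(
    """
    SELECT a.id, SUM(ua.rating > 0) AS up_votes
    FROM base_activity a
    LEFT JOIN user_activity ua ON ua.activity_id = a.id
    GROUP BY a.id ORDER BY a.id
    """
).fetchall()

# Standard CASE form; the ELSE branch turns "no votes" into 0 instead of NULL.
case_sum = conn.execute(
    """
    SELECT a.id, SUM(CASE WHEN ua.rating > 0 THEN 1 ELSE 0 END) AS up_votes
    FROM base_activity a
    LEFT JOIN user_activity ua ON ua.activity_id = a.id
    GROUP BY a.id ORDER BY a.id
    """
).fetchall()
```

The two forms agree wherever votes exist; for the activity without votes, the boolean form yields NULL (None in Python) and the CASE form yields 0, which matters when ordering by the result.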