Annotating Django query sets through reverse foreign keys - django

Given a simple set of models as follows:
class A(models.Model):
pass
class B(models.Model):
parent = models.ForeignKey(A, related_name='b_set')
class C(models.Model):
parent = models.ForeignKey(B, related_name='c_set')
I am looking to create a query set of the A model with two annotations. One annotation should be the number of B rows that have the A row in question as their parent. The other annotation should denote the number of B rows, again with the A object in question as parent, which have at least n objects of type C in their c_set.
As an example, consider the following database and n = 3:
Table A
id
0
1
Table B
id parent
0 0
1 0
Table C
id parent
0 0
1 0
2 1
3 1
4 1
I'd like to be able to get a result of the form [(0, 2, 1), (1, 0, 0)] as the A object with id 0 has two B objects of which one has at least three related C objects. The A object with id 1 has no B objects and therefore also no B objects with at least three C rows.
The first annotation is trivial:
A.objects.annotate(annotation_1=Count('b_set'))
What I am trying to design now is the second annotation. I have managed to count the number of B rows per A where the B object has at least a single C object as follows:
A.objects.annotate(annotation_2=Count('b_set__c_set__parent', distinct=True))
But I cannot figure out a way to do it with a minimum related set size other than one. Hopefully someone here can point me in the right direction. One method I was thinking of was somehow annotating the B objects in the query instead of the A rows as is the default of the annotate method but I could not find any resources on this.

This is a complicated query at limits of Django 1.11. I decided to do it by two queries and to combine results to one list that can be used by a view like a queryset:
from django.db.models import Count
sub_qs = (
C.objects
.values('parent')
.annotate(c_count=Count('id'))
.order_by()
.filter(c_count__gte=n)
.values('parent')
)
qs = B.objects.filter(id__in=sub_qs).values('parent_id').annotate(cnt=Count('id'))
qs_map = {x['parent_id']: x['cnt'] for x in qs}
rows = list(A.objects.annotate(annotation_1=Count('b_set')))
for row in rows:
row.annotation_2 = qs_map.get(row.id, 0)
The list rows is the result. The more complicated qs.query is compiled to a relative simple SQL:
>>> print(str(qs.query))
SELECT app_b.parent_id, COUNT(app_b.id) AS cnt
FROM app_b
WHERE app_b.id IN (
SELECT U0.parent_id AS Col1 FROM app_c U0
GROUP BY U0.parent_id HAVING COUNT(U0.id) >= 3
)
GROUP BY app_b.parent_id; -- (added white space and removed double quotes)
This simple solution can be easier modified and tested.
Note: A solution by one query also exists, but doesn't seem useful. Why: It would require Subquery and OuterRef(). They are great, however in general Count() from aggregation is not supported by queries that are compiled together with join resolution. A subquery can be separated by lookup ...__in=... to can be compiled by Django, but then it is not possible to use OuterRef(). If it is written without OuterRef() then it is a so complicated not optimal nested SQL that the time complexity would be probably O(n2) by size of A table for many (or all) database backends. Not tested.

Related

Django Ordering First Column Groupings by Max Value in Second Column Per Group

Given a Model in Django with fields/values as shown:
Group
Votes
A
5
B
2
C
6
B
7
A
3
C
4
it's easy to use
class Meta:
ordering = ['group', '-votes']
to return
Group
Votes
A
5
A
3
B
7
B
2
C
6
C
4
How could this be modified so that the Group is also ordered by the max Votes value within each Group as shown:
Group
Votes
B
7
B
2
C
6
C
4
A
5
A
3
You may use Window functions which will let you to calculate maximum per group for each record and later use it for sorting, like this:
YourModel.objects.annotate(
max_per_group = Window(
expression = Max('votes'),
partition_by = F('group')
)
).order_by(
'-max_per_group','group','-votes'
)
But if you need to place all this inside your model you should use Custom Manager, override get_queryset add Window annotation to queryset and use the annotated value in Meta's ordering, like this:
class YourModelObjectsManager(models.Manager):
def get_queryset(self):
return models.Manager.get_queryset(self).annotate(
max_per_group = Window(
expression = Max('votes'),
partition_by = F('group')
)
)
class YourModel(models.Model):
....
objects = YourModelObjectsManager()
class Meta:
ordering = ['-max_per_group','group','-votes']
But, please, pay attention that including this kind of annotation in your model will lead to additional calculations on the database side each time you access your model (and this will be more significant if your model contains a lot of records). Thus, if ordering is a special case i.e. you need it for example in only one view or just in Django's admin, I would recommend to use annotate and order_by for those special cases.

Django Effecient way to Perform Query on M2M

class A(models.Model)
results = models.TextField()
class B(models.Model)
name = models.CharField(max_length=20)
res = models.ManyToManyField(A)
Let's suppose we have above 2 models. A model has millions of objects.
I would like to know what would be the best efficient/fastest way to get all the results objects of a particular B object.
Let's suppose we have to retrieve all results for object number 5 of B
Option 1 : A.objects.filter(b__id=5)
(OR)
Option 2 : B.objects.get(id=5).res.all()
Option 1: My Question is filtering by id on A model objects would take lot of time? since there are millions of A model objects.
Option 2: Question: does res field on B model stores the id value of A model objects?
The reason why I'm assuming the option 2 would be a faster way since it stores the reference of A model objects & directly getting those object values first and making the second query to fetch the results. whereas in the first option filtering by id or any other field would take up a lot of time
The first expression will result in one database query. Indeed, it will query with:
SELECT a.*
FROM a
INNER JOIN a_b ON a_b.a_id = a.id
WHERE a_b.b_id = 5
The second expression will result in two queries. Indeed, first Django will query to fetch that specific B object with a query like:
SELECT b.*
FROM b
WHERE b.id = 5
then it will make exactly the same query to retrieve the related A objects.
But retrieving the A object is here not necessary (unless you of course need it somewhere else). You thus make a useless database query.
My Question is filtering by id on A model objects would take lot of time? since there are millions of A model objects.
A database normally stores an index on foreign key fields. This thus means that it will filter effectively. The total number of A objects is usually not (that) relevant (since it uses a datastructure to accelerate search like a B-tree [wiki]). The wiki page has a section named An index speeds the search that explains how this works.

Django queryset union appears not to be working when combined with .annotate()

I have the following queryset:
photos = Photo.objects.all()
I filter out two queries:
a = photos.filter(gallery__name='NGL')
b = photos.filter(gallery__name='NGA')
I add them together, and they form one new, bigger queryset:
c = a | b
Indeed, the length of a + b equals c:
a.count() + b.count() == c.count()
>>> True
So far so good. Yet, if I introduce a .annotate(), the | no longer seems to work:
a = photos.annotate(c=Count('label').exclude(c__lte=4)
b = photos.filter(painting=True)
c = a | b
a.count() + b.count() == c.count()
>>> False
How do I combine querysets, even when .annotate() is being used? Note that query one and two both work as intended in isolation, only when combining them using | does it seem to go wrong.
the pipe | or ampersand & to combine querysets actually puts OR or AND to SQL query so it looks like combined.
one = Photo.objects.filter(id=1)
two = Photo.objects.filter(id=2)
combined = one | two
print(combined.query)
>>> ... WHERE ("photo_photo"."id" = 1 OR "photo_photo"."id" = 2)...
But when you combine more filters and excludes you may notice it will give you strange results due to this. So that is why it doesn't match when you compare counts.
If you use .union() you have to have same columns with same data type, so you have to annotate both querysets. Info about .union()
SELECT statement within .UNION() must have the same number of columns
The columns must also have similar data types
The columns in each SELECT statement must also be in the same order
You have to keep in mind, that pythons argument kwargs for indefinite number of arguments are dictionary, so if you want to use annotate with multiple annotations, you can't ensure correct order of columns. Fortunatelly you can solve this with chaining annotate commands.
# printed query of this won't be consistent
photo_queryset.annotate(label_count=Count('labels'), tag_count=Count('tags'))
# this will always have same order of columns
photo_queryset.annotate(label_count=Count('labels')).annotate(tag_count=Count('tags'))
Then you can use .union() and it won't mess up results of annotation. Also .union() should be last method, because after .union() you can't use filter like methods. If you want to preserve duplicates, you use .union(qs, all=True) since .union() has default all=False and calls DISTINCT on queryset
photos = Photo.objects.annotate(c=Count('labels'))
one = photos.exclude(c__lte=4)
two = photos.filter(painting=True)
all = one.union(two, all=True)
one.count() + two.count() == all.count()
>>> True
then it should work like you described in question

Django Query - Multiple Inner Joins

we currently have some issues with building complex Q-object queries with multiple inner joins with django.
The model we want to get (called 'main' in the example) is referenced by another model with a foreign key. The back-reference is called 'related' in the example below. There are many objects of the second model that all refer to the same 'main' object all having ids and values.
We want to get all 'main' objects for wich a related object with id 7113 exists that has the value 1 AND a related object with id 7114 that has the value 0 exists.
This is our current query:
(Q(related__id=u'7313') & Q(related__value=1)) & (Q(related__id=u'7314') & Q(related__value=0))
This code evaluates to
FROM `prefix_main` INNER JOIN `prefix_related` [...] WHERE (`prefix_related`.`id` = 7313 AND `prefix_related`.`value` = 1 AND `prefix_related`.`id` = 7314 AND `prefix_related`.`value` = 0)
What we would need is quite different:
FROM `prefix_main` INNER JOIN `prefix_related` a INNER JOIN `prefix_related` b [...] WHERE (a.`id` = 7313 AND a.`value` = 1 AND b.`id` = 7314 AND b.`value` = 0)
How can I rewrite the ORM query to use two INNER JOINS / use different related instances for the q-objects? Thanks in advance.
i don't think you even need Q-objects for this. You can just use multiple filters like this:
Mainmodel.objects.filter(related__id = 7114, related__value=1).filter(related__id = 7113, related__value=0)
the first filter matches all objects that have a related object with the id 7114 and value 1. The returned objects are filtered once again with the id 7113 and the value 0.

Django: alternative to using annotate(Count()) for speed

There are two models with a one to many relationship, A->{B}. I am counting how many records of A I have with the same B after using a filter(). Then I need to extract the top X records of A in terms of the most B records connected to them.
The current code:
class A(models.Model):
code = models.IntegerField()
...
class B(models.Model):
a = models.ForeignKey(A)
...
data = B.objects.all().filter(...)
top = data.values('a',...).annotate(n=Count('a')).distinct().order_by('-n')[:X];
I have ~300k B records and with my laptop this is taking ~2s for one query. I dissected the query into parts and timed it and it seems the main bottleneck is the annotate().
Is there any way whatsoever to do this faster with Django?
You should add .select_related('a') before annotate in the queryset. This will force django to join the models before counting them.
https://docs.djangoproject.com/en/1.9/ref/models/querysets/#select-related
I suspect the slow down is actually in the DISTINCT, rather than the count.
The way django builds up a query when using queryset.values(x).annotate(...) tells it to group by the first values, and then perform the aggregate.
B.objects.filter(...).values('a').annotate(n=Count('*')).order_by('-n')[:10]
That should generate SQL that looks something like:
SELECT b.a,
count(*) AS n
FROM b
GROUP BY (b.a)
ORDER BY count(*) DESC
LIMIT 10