Django annotate and LEFT OUTER JOIN with desired WHERE Clause

Django annotate and LEFT OUTER JOIN with desired WHERE Clause - django

Django 1.10.6
Asset.objects.annotate(
coupon_saved=Count(
Q(coupons__device_id='8ae83c6fa52765061360f5459025cb85e6dc8905')
)
).all().query
produces the following query:
SELECT
"assets_asset"."id",
"assets_asset"."title",
"assets_asset"."description",
"assets_asset"."created",
"assets_asset"."modified",
"assets_asset"."uid",
"assets_asset"."org_id",
"assets_asset"."subtitle",
"assets_asset"."is_active",
"assets_asset"."is_generic",
"assets_asset"."file_standalone",
"assets_asset"."file_ios",
"assets_asset"."file_android",
"assets_asset"."file_preview",
"assets_asset"."json_metadata",
"assets_asset"."file_icon",
"assets_asset"."file_image",
"assets_asset"."video_mobile",
"assets_asset"."video_standalone",
"assets_asset"."file_coupon",
"assets_asset"."where_to_buy",
COUNT("games_coupon"."device_id" = 8ae83c6fa52765061360f5459025cb85e6dc8905) AS "coupon_saved"
FROM
"assets_asset"
LEFT OUTER JOIN
"games_coupon"
ON ("assets_asset"."id" = "games_coupon"."asset_id")
GROUP BY
"assets_asset"."id"
I need to get that device_id=X into LEFT OUTER JOIN definition below.
How to achieve?

TL;DR:
The condition should be in filter.
qs = (
Asset.objects
.filter(coupons__device_id='8ae83c6fa52765061360f5459025cb85e6dc8905')
.annotate(coupon_saved=Count('coupons'))
)
If you want only count > 0 then it can be filtered.
qs = qs.filter(coupon_saved__gt=0)
Footnotes: A one to many query is compiled to LEFT OUTER JOIN in order to be possible to get also base objects (Asset) with zero children. JOINs in Django are based every times on a ForeignKey to the primary key or similarly on OneToOne or ManyToMany, other conditions are compiled to WHERE.
Conditions in annotation (that you used) are possible e.g. as part of Conditional Expressions but it is more complicated to be used correctly and useful e.g. if you want to get many aggregations with many conditions by one query without subqueries and if a full scan is acceptable. This is probably not a subject of a question.

Related

Improve Django queryset performance when using annotate Exists

I have a queryset that returns a lot of data, it can be filtered by year which will return around 100k lines, or show all which will bring around 1 million lines.
The objective of this annotate is to generate a xlsx spreadsheet.
Models representation, RelatedModel is manytomany between Model and AnotherModel
Model:
id
field1
field2
field3
RelatedModel:
foreign_key_model (Model)
foreign_key_another (AnotherModel)
Queryset, if the relation exists it will annotate, this annotate is very slow and can take several minutes.
Model.objects.all().annotate(
related_exists=Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id'))),
related_column=Case(
When(related_exists=True, then=Value('The relation exists!')),
When(related_exists=False, then=Value('The relation doesn't exist!')),
default=Value('This is the default value!'),
output_field=CharField(),
)
).values_list(
'related_column',
'field1',
'field2',
'field3'
)

If only thing needed is to change how True / False is displayed in xlsx - one option is to just have one related_exists BooleanField annotation and later customize how it will be converted when creating xlsx document - i.e. in serializer. Database should store raw / unformatted values, and app prepare them to be shown to user.
Other things to consider:
Indexes to speed-up filtering.
If you have millions of records after filtering, in one table - maybe table partitioning could be considered.
But let's look into raw sql of original query. It will be like this:
SELECT [model_fields],
EXISTS([CLIENT_SELECT]) AS related_exists,
CASE
WHEN EXISTS([CLIENT_SELECT]) = true THEN 'The relation exists!'
WHEN EXISTS([CLIENT_SELECT]) = true THEN 'The relation does not exist!'
ELSE 'The relation exists!'
END AS related_column
FROM model;
And right away we can see nested query for Exists CLIENT_SELECT is there 3 times. Even though it is exactly the same, it may be executed minimum 2 times and up to 3 times. Database may optimize it to be faster than 3x, but it still is not optimal as 1x.
First, EXISTS returns either True or False, we can leave just one check that it is True, making 'The relation does not exist!' the default value.
related_column=Case(
When(related_exists=True, then=Value('The relation exists!')),
default=Value('The relation does not exist!')
Why related_column performs same select again and not takes the value of related_exists?
Because we cannot reference calculated columns while calculating another columns - and this is database level constraint django knows about and duplicates expression.
Wait, then we actually do not need related_exists column, lets just leave related_column with CASE statement and 1 exists subquery.
Here comes Django - we cannot (till 3.0) use expressions in filters without annotating them first.
So, it our case it is like: in order to use Exist in When, we first need to add it as annotation, but it won't be used as a reference, but a full copy of expression.
Good news!
Since Django 3.0 we can use expressions that output BooleanField directly in QuerySet filters, without having to first annotate. Exists is one of such BooleaField expressions.
Model.objects.all().annotate(
related_column=Case(
When(
Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id'))),
then=Value('The relation exists!'),
),
default=Value('The relation doesn't exist!'),
output_field=CharField(),
)
)
And only one nested select, and one annotated field.
Django 2.1, 2.2
Here's the commit that finalized allowance of boolean expressions although many pre-conditions for it were added earlier. One of them is presence of conditional attribute on expression object and check for this attribute.
So, although not recommended and not tested it seems quite working little hack for Django 2.1, 2.2 (before there was no conditional check, and it will require more intrusive changes):
create Exists expression instance
monkey patch it with conditional = True
use it as condition in When statement
related_model_exists = Exists(RelatedModel.objects.filter(foreign_key_model=OuterRef('id')))
setattr(related_model_exists, 'conditional', True)
Model.objects.all().annotate(
related_column=Case(
When(
relate_model_exists,
then=Value('The relation exists!'),
),
default=Value('The relation doesn't exist!'),
output_field=CharField(),
)
)
Related checks
relatedmodel_set__isnull=True check is not suitable for several reasons:
it performs LEFT OUTER JOIN - that is less efficient than EXISTS
it performs LEFT OUTER JOIN - it joins tables, this makes it ONLY suitable in filter() condition (not in annotate - When), and only for OneToOne or OneToMany (One is on relatedmodel side) relations

You can considerably simplify your query to:
from django.db.models import Count
Model.objects.all().annotate(
related_column=Case(
When(relatedmodel_set__isnull=True, then=Value("The relation doesn't exist!")),
default=Value("The relation exists!"),
output_field=CharField()
)
)
Where relatedmodel_set is the related_name on your foreign key.

Filtering together with annotate in Django - unexpected result [duplicate]

I always assumed that chaining multiple filter() calls in Django was always the same as collecting them in a single call.
# Equivalent
Model.objects.filter(foo=1).filter(bar=2)
Model.objects.filter(foo=1,bar=2)
but I have run across a complicated queryset in my code where this is not the case
class Inventory(models.Model):
book = models.ForeignKey(Book)
class Profile(models.Model):
user = models.OneToOneField(auth.models.User)
vacation = models.BooleanField()
country = models.CharField(max_length=30)
# Not Equivalent!
Book.objects.filter(inventory__user__profile__vacation=False).filter(inventory__user__profile__country='BR')
Book.objects.filter(inventory__user__profile__vacation=False, inventory__user__profile__country='BR')
The generated SQL is
SELECT "library_book"."id", "library_book"."asin", "library_book"."added", "library_book"."updated" FROM "library_book" INNER JOIN "library_inventory" ON ("library_book"."id" = "library_inventory"."book_id") INNER JOIN "auth_user" ON ("library_inventory"."user_id" = "auth_user"."id") INNER JOIN "library_profile" ON ("auth_user"."id" = "library_profile"."user_id") INNER JOIN "library_inventory" T5 ON ("library_book"."id" = T5."book_id") INNER JOIN "auth_user" T6 ON (T5."user_id" = T6."id") INNER JOIN "library_profile" T7 ON (T6."id" = T7."user_id") WHERE ("library_profile"."vacation" = False AND T7."country" = BR )
SELECT "library_book"."id", "library_book"."asin", "library_book"."added", "library_book"."updated" FROM "library_book" INNER JOIN "library_inventory" ON ("library_book"."id" = "library_inventory"."book_id") INNER JOIN "auth_user" ON ("library_inventory"."user_id" = "auth_user"."id") INNER JOIN "library_profile" ON ("auth_user"."id" = "library_profile"."user_id") WHERE ("library_profile"."vacation" = False AND "library_profile"."country" = BR )
The first queryset with the chained filter() calls joins the Inventory model twice effectively creating an OR between the two conditions whereas the second queryset ANDs the two conditions together. I was expecting that the first query would also AND the two conditions. Is this the expected behavior or is this a bug in Django?
The answer to a related question Is there a downside to using ".filter().filter().filter()..." in Django? seems to indicated that the two querysets should be equivalent.

The way I understand it is that they are subtly different by design (and I am certainly open for correction): filter(A, B) will first filter according to A and then subfilter according to B, while filter(A).filter(B) will return a row that matches A 'and' a potentially different row that matches B.
Look at the example here:
https://docs.djangoproject.com/en/dev/topics/db/queries/#spanning-multi-valued-relationships
particularly:
Everything inside a single filter() call is applied simultaneously to filter out items matching all those requirements. Successive filter() calls further restrict the set of objects
...
In this second example (filter(A).filter(B)), the first filter restricted the queryset to (A). The second filter restricted the set of blogs further to those that are also (B). The entries select by the second filter may or may not be the same as the entries in the first filter.`

These two style of filtering are equivalent in most cases, but when query on objects base on ForeignKey or ManyToManyField, they are slightly different.
Examples from the documentation.
model
Blog to Entry is a one-to-many relation.
from django.db import models
class Blog(models.Model):
...
class Entry(models.Model):
blog = models.ForeignKey(Blog)
headline = models.CharField(max_length=255)
pub_date = models.DateField()
...
objects
Assuming there are some blog and entry objects here.
queries
Blog.objects.filter(entry__headline_contains='Lennon',
entry__pub_date__year=2008)
Blog.objects.filter(entry__headline_contains='Lennon').filter(
entry__pub_date__year=2008)
For the 1st query (single filter one), it match only blog1.
For the 2nd query (chained filters one), it filters out blog1 and blog2.
The first filter restricts the queryset to blog1, blog2 and blog5; the second filter restricts the set of blogs further to blog1 and blog2.
And you should realize that
We are filtering the Blog items with each filter statement, not the Entry items.
So, it's not the same, because Blog and Entry are multi-valued relationships.
Reference: https://docs.djangoproject.com/en/1.8/topics/db/queries/#spanning-multi-valued-relationships
If there is something wrong, please correct me.
Edit: Changed v1.6 to v1.8 since the 1.6 links are no longer available.

As you can see in the generated SQL statements the difference is not the "OR" as some may suspect. It is how the WHERE and JOIN is placed.
Example1 (same joined table) :
(example from https://docs.djangoproject.com/en/dev/topics/db/queries/#spanning-multi-valued-relationships)
Blog.objects.filter(entry__headline__contains='Lennon', entry__pub_date__year=2008)
This will give you all the Blogs that have one entry with both (entry_headline_contains='Lennon') AND (entry__pub_date__year=2008), which is what you would expect from this query.
Result:
Book with {entry.headline: 'Life of Lennon', entry.pub_date: '2008'}
Example 2 (chained)
Blog.objects.filter(entry__headline__contains='Lennon').filter(entry__pub_date__year=2008)
This will cover all the results from Example 1, but it will generate slightly more result. Because it first filters all the blogs with (entry_headline_contains='Lennon') and then from the result filters (entry__pub_date__year=2008).
The difference is that it will also give you results like:
Book with {entry.headline: 'Lennon', entry.pub_date: 2000}, {entry.headline: 'Bill', entry.pub_date: 2008}
In your case
I think it is this one you need:
Book.objects.filter(inventory__user__profile__vacation=False, inventory__user__profile__country='BR')
And if you want to use OR please read: https://docs.djangoproject.com/en/dev/topics/db/queries/#complex-lookups-with-q-objects

From Django docs :
To handle both of these situations, Django has a consistent way of processing filter() calls. Everything inside a single filter() call is applied simultaneously to filter out items matching all those requirements. Successive filter() calls further restrict the set of objects, but for multi-valued relations, they apply to any object linked to the primary model, not necessarily those objects that were selected by an earlier filter() call.
It is clearly said that multiple conditions in a single filter() are applied simultaneously.
That means that doing :
objs = Mymodel.objects.filter(a=True, b=False)
will return a queryset with raws from model Mymodel where a=True AND b=False.
Successive filter(), in some case, will provide the same result. Doing :
objs = Mymodel.objects.filter(a=True).filter(b=False)
will return a queryset with raws from model Mymodel where a=True AND b=False too. Since you obtain "first" a queryset with records which have a=True and then it's restricted to those who have b=False at the same time.
The difference in chaining filter() comes when there are multi-valued relations, which means you are going through other models (such as the example given in the docs, between Blog and Entry models). It is said that in that case (...) they apply to any object linked to the primary model, not necessarily those objects that were selected by an earlier filter() call.
Which means that it applies the successives filter() on the target model directly, not on previous filter()
If I take the example from the docs :
Blog.objects.filter(entry__headline__contains='Lennon').filter(entry__pub_date__year=2008)
remember that it's the model Blog that is filtered, not the Entry. So it will treat the 2 filter() independently.
It will, for instance, return a queryset with Blogs, that have entries that contain 'Lennon' (even if they are not from 2008) and entries that are from 2008 (even if their headline does not contain 'Lennon')
THIS ANSWER goes even further in the explanation. And the original question is similar.

Sometimes you don't want to join multiple filters together like this:
def your_dynamic_query_generator(self, event: Event):
qs \
.filter(shiftregistrations__event=event) \
.filter(shiftregistrations__shifts=False)
And the following code would actually not return the correct thing.
def your_dynamic_query_generator(self, event: Event):
return Q(shiftregistrations__event=event) & Q(shiftregistrations__shifts=False)
What you can do now is to use an annotation count-filter.
In this case we count all shifts which belongs to a certain event.
qs: EventQuerySet = qs.annotate(
num_shifts=Count('shiftregistrations__shifts', filter=Q(shiftregistrations__event=event))
)
Afterwards you can filter by annotation.
def your_dynamic_query_generator(self):
return Q(num_shifts=0)
This solution is also cheaper on large querysets.
Hope this helps.

Saw this in a comment and I thought it was the simplest explanation.
filter(A, B) is the AND ; filter(A).filter(B) is OR
It's true if every linked model satisfies both conditions

What causes Django ORM to add duplicate tables?

I have a Django generic list view.
So it starts by looking at various request paramaters (list filter, sort) and applys the queryset.filter() method a number of times (or none at all) based on the request parameters.
It then does some aggregates, but the totals are coming out incorrectly. Looking at the query, it seems to be adding various tables to the query two or more times.
So a snippet from the FROM part of the query looks as follows:
INNER JOIN `sequencing_sample` ON (`sequencing_samplesubprojectstats`.`sample_id` = `sequencing_sample`.`id`)
LEFT OUTER JOIN `sequencing_library` ON (`sequencing_sample`.`id` = `sequencing_library`.`sample_id`)
LEFT OUTER JOIN `sequencing_loadedwith` ON (`sequencing_library`.`id` = `sequencing_loadedwith`.`library_id`)
LEFT OUTER JOIN `sequencing_passfail` ON (`sequencing_loadedwith`.`passfail_id` = `sequencing_passfail`.`id`)
LEFT OUTER JOIN `sequencing_passfail` T6 ON (`sequencing_library`.`passfail_id` = T6.`id`)
LEFT OUTER JOIN `sequencing_organism` ON (`sequencing_sample`.`organism_id` = `sequencing_organism`.`id`)
LEFT OUTER JOIN `sequencing_subproject` ON (`sequencing_samplesubprojectstats`.`subproject_id` = `sequencing_subproject`.`id`)
LEFT OUTER JOIN `sequencing_library` T9 ON (`sequencing_sample`.`id` = T9.`sample_id`)
The passfail table is a lookup table, and should be duplicated, but the library table is central to the schema, and should not be duplicated as T9
Is there any good documentation on what causes the ORM to add duplicate tables? There are various thing s happening in the view (various filters being optionally applied, annotations on the queryset).
I can use raw SQL, but I would prefer to use Django objects, as sorting and pagination are a lot easier with these.
I would like to know what part of the API is causing the library table to be added a second time so I can potentially avoid it (if that is possible).

Chaining multiple filter() in Django, is this a bug?

I always assumed that chaining multiple filter() calls in Django was always the same as collecting them in a single call.
# Equivalent
Model.objects.filter(foo=1).filter(bar=2)
Model.objects.filter(foo=1,bar=2)
but I have run across a complicated queryset in my code where this is not the case
class Inventory(models.Model):
book = models.ForeignKey(Book)
class Profile(models.Model):
user = models.OneToOneField(auth.models.User)
vacation = models.BooleanField()
country = models.CharField(max_length=30)
# Not Equivalent!
Book.objects.filter(inventory__user__profile__vacation=False).filter(inventory__user__profile__country='BR')
Book.objects.filter(inventory__user__profile__vacation=False, inventory__user__profile__country='BR')
The generated SQL is
SELECT "library_book"."id", "library_book"."asin", "library_book"."added", "library_book"."updated" FROM "library_book" INNER JOIN "library_inventory" ON ("library_book"."id" = "library_inventory"."book_id") INNER JOIN "auth_user" ON ("library_inventory"."user_id" = "auth_user"."id") INNER JOIN "library_profile" ON ("auth_user"."id" = "library_profile"."user_id") INNER JOIN "library_inventory" T5 ON ("library_book"."id" = T5."book_id") INNER JOIN "auth_user" T6 ON (T5."user_id" = T6."id") INNER JOIN "library_profile" T7 ON (T6."id" = T7."user_id") WHERE ("library_profile"."vacation" = False AND T7."country" = BR )
SELECT "library_book"."id", "library_book"."asin", "library_book"."added", "library_book"."updated" FROM "library_book" INNER JOIN "library_inventory" ON ("library_book"."id" = "library_inventory"."book_id") INNER JOIN "auth_user" ON ("library_inventory"."user_id" = "auth_user"."id") INNER JOIN "library_profile" ON ("auth_user"."id" = "library_profile"."user_id") WHERE ("library_profile"."vacation" = False AND "library_profile"."country" = BR )
The first queryset with the chained filter() calls joins the Inventory model twice effectively creating an OR between the two conditions whereas the second queryset ANDs the two conditions together. I was expecting that the first query would also AND the two conditions. Is this the expected behavior or is this a bug in Django?
The answer to a related question Is there a downside to using ".filter().filter().filter()..." in Django? seems to indicated that the two querysets should be equivalent.

The way I understand it is that they are subtly different by design (and I am certainly open for correction): filter(A, B) will first filter according to A and then subfilter according to B, while filter(A).filter(B) will return a row that matches A 'and' a potentially different row that matches B.
Look at the example here:
https://docs.djangoproject.com/en/dev/topics/db/queries/#spanning-multi-valued-relationships
particularly:
Everything inside a single filter() call is applied simultaneously to filter out items matching all those requirements. Successive filter() calls further restrict the set of objects
...
In this second example (filter(A).filter(B)), the first filter restricted the queryset to (A). The second filter restricted the set of blogs further to those that are also (B). The entries select by the second filter may or may not be the same as the entries in the first filter.`

These two style of filtering are equivalent in most cases, but when query on objects base on ForeignKey or ManyToManyField, they are slightly different.
Examples from the documentation.
model
Blog to Entry is a one-to-many relation.
from django.db import models
class Blog(models.Model):
...
class Entry(models.Model):
blog = models.ForeignKey(Blog)
headline = models.CharField(max_length=255)
pub_date = models.DateField()
...
objects
Assuming there are some blog and entry objects here.
queries
Blog.objects.filter(entry__headline_contains='Lennon',
entry__pub_date__year=2008)
Blog.objects.filter(entry__headline_contains='Lennon').filter(
entry__pub_date__year=2008)
For the 1st query (single filter one), it match only blog1.
For the 2nd query (chained filters one), it filters out blog1 and blog2.
The first filter restricts the queryset to blog1, blog2 and blog5; the second filter restricts the set of blogs further to blog1 and blog2.
And you should realize that
We are filtering the Blog items with each filter statement, not the Entry items.
So, it's not the same, because Blog and Entry are multi-valued relationships.
Reference: https://docs.djangoproject.com/en/1.8/topics/db/queries/#spanning-multi-valued-relationships
If there is something wrong, please correct me.
Edit: Changed v1.6 to v1.8 since the 1.6 links are no longer available.

As you can see in the generated SQL statements the difference is not the "OR" as some may suspect. It is how the WHERE and JOIN is placed.
Example1 (same joined table) :
(example from https://docs.djangoproject.com/en/dev/topics/db/queries/#spanning-multi-valued-relationships)
Blog.objects.filter(entry__headline__contains='Lennon', entry__pub_date__year=2008)
This will give you all the Blogs that have one entry with both (entry_headline_contains='Lennon') AND (entry__pub_date__year=2008), which is what you would expect from this query.
Result:
Book with {entry.headline: 'Life of Lennon', entry.pub_date: '2008'}
Example 2 (chained)
Blog.objects.filter(entry__headline__contains='Lennon').filter(entry__pub_date__year=2008)
This will cover all the results from Example 1, but it will generate slightly more result. Because it first filters all the blogs with (entry_headline_contains='Lennon') and then from the result filters (entry__pub_date__year=2008).
The difference is that it will also give you results like:
Book with {entry.headline: 'Lennon', entry.pub_date: 2000}, {entry.headline: 'Bill', entry.pub_date: 2008}
In your case
I think it is this one you need:
Book.objects.filter(inventory__user__profile__vacation=False, inventory__user__profile__country='BR')
And if you want to use OR please read: https://docs.djangoproject.com/en/dev/topics/db/queries/#complex-lookups-with-q-objects

From Django docs :
To handle both of these situations, Django has a consistent way of processing filter() calls. Everything inside a single filter() call is applied simultaneously to filter out items matching all those requirements. Successive filter() calls further restrict the set of objects, but for multi-valued relations, they apply to any object linked to the primary model, not necessarily those objects that were selected by an earlier filter() call.
It is clearly said that multiple conditions in a single filter() are applied simultaneously.
That means that doing :
objs = Mymodel.objects.filter(a=True, b=False)
will return a queryset with raws from model Mymodel where a=True AND b=False.
Successive filter(), in some case, will provide the same result. Doing :
objs = Mymodel.objects.filter(a=True).filter(b=False)
will return a queryset with raws from model Mymodel where a=True AND b=False too. Since you obtain "first" a queryset with records which have a=True and then it's restricted to those who have b=False at the same time.
The difference in chaining filter() comes when there are multi-valued relations, which means you are going through other models (such as the example given in the docs, between Blog and Entry models). It is said that in that case (...) they apply to any object linked to the primary model, not necessarily those objects that were selected by an earlier filter() call.
Which means that it applies the successives filter() on the target model directly, not on previous filter()
If I take the example from the docs :
Blog.objects.filter(entry__headline__contains='Lennon').filter(entry__pub_date__year=2008)
remember that it's the model Blog that is filtered, not the Entry. So it will treat the 2 filter() independently.
It will, for instance, return a queryset with Blogs, that have entries that contain 'Lennon' (even if they are not from 2008) and entries that are from 2008 (even if their headline does not contain 'Lennon')
THIS ANSWER goes even further in the explanation. And the original question is similar.

Sometimes you don't want to join multiple filters together like this:
def your_dynamic_query_generator(self, event: Event):
qs \
.filter(shiftregistrations__event=event) \
.filter(shiftregistrations__shifts=False)
And the following code would actually not return the correct thing.
def your_dynamic_query_generator(self, event: Event):
return Q(shiftregistrations__event=event) & Q(shiftregistrations__shifts=False)
What you can do now is to use an annotation count-filter.
In this case we count all shifts which belongs to a certain event.
qs: EventQuerySet = qs.annotate(
num_shifts=Count('shiftregistrations__shifts', filter=Q(shiftregistrations__event=event))
)
Afterwards you can filter by annotation.
def your_dynamic_query_generator(self):
return Q(num_shifts=0)
This solution is also cheaper on large querysets.
Hope this helps.

Saw this in a comment and I thought it was the simplest explanation.
filter(A, B) is the AND ; filter(A).filter(B) is OR
It's true if every linked model satisfies both conditions

Annotating a Django queryset with a left outer join?

Say I have a model:
class Foo(models.Model):
...
and another model that basically gives per-user information about Foo:
class UserFoo(models.Model):
user = models.ForeignKey(User)
foo = models.ForeignKey(Foo)
...
class Meta:
unique_together = ("user", "foo")
I'd like to generate a queryset of Foos but annotated with the (optional) related UserFoo based on user=request.user.
So it's effectively a LEFT OUTER JOIN on (foo.id = userfoo.foo_id AND userfoo.user_id = ...)

A solution with raw might look like
foos = Foo.objects.raw("SELECT foo.* FROM foo LEFT OUTER JOIN userfoo ON (foo.id = userfoo.foo_id AND foo.user_id = %s)", [request.user.id])
You'll need to modify the SELECT to include extra fields from userfoo which will be annotated to the resulting Foo instances in the queryset.

This answer might not be exactly what you are looking for but since its the first result in google when searching for "django annotate outer join" so I will post it here.
Note: tested on Djang 1.7
Suppose you have the following models
class User(models.Model):
name = models.CharField()
class EarnedPoints(models.Model):
points = models.PositiveIntegerField()
user = models.ForeignKey(User)
To get total user points you might do something like that
User.objects.annotate(points=Sum("earned_points__points"))
this will work but it will not return users who have no points, here we need outer join without any direct hacks or raw sql
You can achieve that by doing this
users_with_points = User.objects.annotate(points=Sum("earned_points__points"))
result = users_with_points | User.objects.exclude(pk__in=users_with_points)
This will be translated into OUTER LEFT JOIN and all users will be returned. users who has no points will have None value in their points attribute.
Hope that helps

Notice: This method does not work in Django 1.6+. As explained in tcarobruce's comment below, the promote argument was removed as part of ticket #19849: ORM Cleanup.
Django doesn't provide an entirely built-in way to do this, but it's not neccessary to construct an entirely raw query. (This method doesn't work for selecting * from UserFoo, so I'm using .comment as an example field to include from UserFoo.)
The QuerySet.extra() method allows us to add terms to the SELECT and WHERE clauses of our query. We use this to include the fields from UserFoo table in our results, and limit our UserFoo matches to the current user.
results = Foo.objects.extra(
select={"user_comment": "UserFoo.comment"},
where=["(UserFoo.user_id IS NULL OR UserFoo.user_id = %s)"],
params=[request.user.id]
)
This query still needs the UserFoo table. It would be possible to use .extras(tables=...) to get an implicit INNER JOIN, but for an OUTER JOIN we need to modify the internal query object ourself.
connection = (
UserFoo._meta.db_table, User._meta.db_table, # JOIN these tables
"user_id", "id", # on these fields
)
results.query.join( # modify the query
connection, # with this table connection
promote=True, # as LEFT OUTER JOIN
)
We can now evaluate the results. Each instance will have a .user_comment property containing the value from UserFoo, or None if it doesn't exist.
print results[0].user_comment
(Credit to this blog post by Colin Copeland for showing me how to do OUTER JOINs.)

I stumbled upon this problem I was unable to solve without resorting to raw SQL, but I did not want to rewrite the entire query.
Following is a description on how you can augment a queryset with an external raw sql, without having to care about the actual query that generates the queryset.
Here's a typical scenario: You have a reddit like site with a LinkPost model and a UserPostVote mode, like this:
class LinkPost(models.Model):
some fields....
class UserPostVote(models.Model):
user = models.ForeignKey(User,related_name="post_votes")
post = models.ForeignKey(LinkPost,related_name="user_votes")
value = models.IntegerField(null=False, default=0)
where the userpostvote table collect's the votes of users on posts.
Now you're trying to display the front page for a user with a pagination app, but you want the arrows to be red for posts the user has voted on.
First you get the posts for the page:
post_list = LinkPost.objects.all()
paginator = Paginator(post_list,25)
posts_page = paginator.page(request.GET.get('page'))
so now you have a QuerySet posts_page generated by the django paginator that selects the posts to display. How do we now add the annotation of the user's vote on each post before rendering it in a template?
Here's where it get's tricky and I was unable to find a clean ORM solution. select_related won't allow you to only get votes corresponding to the logged in user and looping over the posts would do bunch queries instead of one and doing it all raw mean's we can't use the queryset from the pagination app.
So here's how I do it:
q1 = posts_page.object_list.query # The query object of the queryset
q1_alias = q1.get_initial_alias() # This forces the query object to generate it's sql
(q1str, q1param) = q1.sql_with_params() #This gets the sql for the query along with
#parameters, which are none in this example
we now have the query for the queryset, and just wrap it, alias and left outer join to it:
q2_augment = "SELECT B.value as uservote, A.*
from ("+q1str+") A LEFT OUTER JOIN reddit_userpostvote B
ON A.id = B.post_id AND B.user_id = %s"
q2param = (request.user.id,)
posts_augmented = LinkPost.objects.raw(q2_augment,q1param+q2param)
voila! Now we can access post.uservote for a post in the augmented queryset.
And we just hit the database with a single query.

The two queries you suggest are as good as you're going to get (without using raw()), this type of query isn't representable in the ORM at present time.

You could do this using simonw's django-queryset-transform to avoid hard-coding a raw SQL query - the code would look something like this:
def userfoo_retriever(qs):
userfoos = dict((i.pk, i) for i in UserFoo.objects.filter(foo__in=qs))
for i in qs:
i.userfoo = userfoos.get(i.pk, None)
for foo in Foo.objects.filter(…).tranform(userfoo_retriever):
print foo.userfoo
This approach has been quite successful for this need and to efficiently retrieve M2M values; your query count won't be quite as low but on certain databases (cough MySQL cough) doing two simpler queries can often be faster than one with complex JOINs and many of the cases where I've most needed it had additional complexity which would have been even harder to hack into an ORM expression.

As for outerjoins:
Once you have a queryset qs from foo that includes a reference to columns from userfoo, you can promote the inner join to an outer join with
qs.query.promote_joins(["userfoo"])

You shouldn't have to resort to extra or raw for this.
The following should work.
Foo.objects.filter(
Q(userfoo_set__user=request.user) |
Q(userfoo_set=None) # This forces the use of LOUTER JOIN.
).annotate(
comment=F('userfoo_set__comment'),
# ... annotate all the fields you'd like to see added here.
)

The only way I see to do this without using raw etc. is something like this:
Foo.objects.filter(
Q(userfoo_set__isnull=True)|Q(userfoo_set__isnull=False)
).annotate(bar=Case(
When(userfoo_set__user_id=request.user, then='userfoo_set__bar')
))
The double Q trick ensures that you get your left outer join.
Unfortunately you can't set your request.user condition in the filter() since it may filter out successful joins on UserFoo instances with the wrong user, hence filtering out rows of Foo that you wanted to keep (which is why you ideally want the condition in the ON join clause instead of in the WHERE clause).
Because you can't filter out the rows that have an unwanted user value, you have to select rows from UserFoo with a CASE.
Note also that one Foo may join to many UserFoo records, so you may want to consider some way to retrieve distinct Foos from the output.

maparent's comment put me on the right way:
from django.db.models.sql.datastructures import Join
for alias in qs.query.alias_map.values():
if isinstance(alias, Join):
alias.nullable = True
qs.query.promote_joins(qs.query.tables)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Django annotate and LEFT OUTER JOIN with desired WHERE Clause - django

Related

Improve Django queryset performance when using annotate Exists

Filtering together with annotate in Django - unexpected result [duplicate]

What causes Django ORM to add duplicate tables?

Chaining multiple filter() in Django, is this a bug?

Annotating a Django queryset with a left outer join?

Categories

Resources