Django ORM join many to many relation in one query - django

Suppose we have two models, A and B, with a many-to-many relation.
I want to obtain a SQL query similar to this:
SELECT *
FROM a LEFT JOIN ab_relation
ON ab_relation.a_id = a.id
LEFT JOIN b ON ab_relation.b_id = b.id;
So in django when I try:
A.objects.prefetch_related('bees')
I get 2 queries similar to:
SELECT * FROM a;
SELECT ab_relation.a_id AS prefetch_related_val_a_id, b.*
FROM b JOIN ab_relation ON b.id = ab_relation.b_id
WHERE ab_relation.a_id IN (123, 456... list of all a.id);
Given that A and B have moderately big tables, I find the way django does it too slow for my needs.
The question is: Is it possible to obtain the left join manually written query through the ORM?
Edits to answer some clarifications:
Yes a LEFT OUTER JOIN would be preferable to get all A's in the queryset, not only those with a relation with B (updated sql).
Moderately big means ~4k rows each, and too slow means ~3 seconds (on first load, before the Redis cache). Keep in mind there are other queries on the page.
Actually yes, we only need B.one_field, but when I tried Prefetch('bees', queryset=B.objects.values('one_field')), an error said that values() can't be used in a prefetch.
The queryset will be used as options for a multi-select form-field, where we will need to represent A objects that have a relation with B with an extra string from the B.field.

For the direct answer, skip to point 6).
Let's go step by step.
1) N:M select. You say you want a query like this:
SELECT *
FROM a JOIN ab_relation ON ab_relation.a_id = a.id
JOIN b ON ab_relation.b_id = b.id;
But this is not a real N:M query, because it returns only the A objects that are related to a B. The query should use outer joins, at least like this:
SELECT *
FROM a left outer JOIN
ab_relation ON ab_relation.a_id = a.id left outer JOIN
b ON ab_relation.b_id = b.id;
Otherwise you get only the A models that have a related B.
2) Read big tables. You say "moderately big tables". Are you sure you want to read the whole table from the database? Reading a lot of data at once is unusual in a web environment, and in that case you can paginate the data. Maybe it is not a web app? Why do you need to read these big tables? We need context to answer your question.
3) Select * from. Are you sure you need all fields from both tables? The query may run faster if you read only some values:
A.objects.values("some_a_field", "another_a_field", "bees__some_b_field")
4) Summary. The ORM is a powerful tool, and two single read operations are "fast". I have written down some ideas, but we probably need more context to answer your question: what "moderately big tables" means, what "slow" means, what you are doing with this data, how many fields or bytes each row of each table has, and so on.
Edited because the OP has edited the question.
5) Use right UI controls. You say:
The queryset will be used as options for a multi-select form-field, where we will need to represent A objects that have a relation with B with an extra string from the B.field.
Sending 4k rows to the client for a form looks like an anti-pattern. I suggest moving to a live control that loads only the data it needs, for example by filtering on some text. Take a look at the awesome django-select2 project.
6) You say
The question is: Is it possible to obtain the left join manually written query through the ORM?
The answer is: yes, you can do it using values(), as I said in point 3. Sample: Material and ResultatAprenentatge have an N:M relation:
>>> print(
...     Material.objects
...     .values("titol", "resultats_aprenentatge__codi")
...     .query
... )
The query:
SELECT "material_material"."titol",
"ufs_resultataprenentatge"."codi"
FROM "material_material"
LEFT OUTER JOIN "material_material_resultats_aprenentatge"
ON ( "material_material"."id" =
"material_material_resultats_aprenentatge"."material_id" )
LEFT OUTER JOIN "ufs_resultataprenentatge"
ON (
"material_material_resultats_aprenentatge"."resultataprenentatge_id" =
"ufs_resultataprenentatge"."id" )
ORDER BY "material_material"."data_edicio" DESC
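A values() query like this returns one flat row per (A, B) pair, and an A without any B shows up once with None in the B column. The rows therefore still have to be regrouped per A before they can serve as form options. The following is only a minimal sketch of that regrouping: it uses plain sqlite3 instead of Django so that it runs standalone, and the table contents, the name column and the label format are all invented for illustration.

```python
import sqlite3

# In-memory stand-in for the a / ab_relation / b tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, one_field TEXT);
    CREATE TABLE ab_relation (a_id INTEGER, b_id INTEGER);
    INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO b VALUES (10, 'x'), (11, 'y');
    INSERT INTO ab_relation VALUES (1, 10), (1, 11), (3, 11);  -- a2 has no b
""")

# Same join shape as the ORM-generated query: two LEFT OUTER JOINs.
rows = conn.execute("""
    SELECT a.id, a.name, b.one_field
    FROM a
    LEFT OUTER JOIN ab_relation ON ab_relation.a_id = a.id
    LEFT OUTER JOIN b ON ab_relation.b_id = b.id
    ORDER BY a.id, b.id
""").fetchall()

# Regroup the flat rows: one entry per A, collecting the B strings.
grouped = {}
for a_id, name, one_field in rows:
    extras = grouped.setdefault((a_id, name), [])
    if one_field is not None:  # None marks an A without any related B
        extras.append(one_field)

# Build (value, label) pairs in the shape a choice field expects.
options = [
    (a_id, f"{name} ({', '.join(extras)})" if extras else name)
    for (a_id, name), extras in grouped.items()
]
print(options)
```

With the ORM, the same loop would iterate over the dictionaries returned by values() rather than over tuples.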

Related

How to join with sub-table using Django's ORM?

I have this query, which orders posts by the date of the last comment on each post. It works well with small tables. However, I have filled the database with random data, approximately 2M rows in the comment table. I analyzed the query with EXPLAIN and saw that a sequential scan is performed on the post table.
Post.objects.extra(select={'last_update': 'select max(c.create_date) from comment_comment c where c.post_id = post_post.id'}).order_by('-last_update')
I have rewritten the query in SQL (below), and it is faster than the current one, but I could not find a way to express it with Django's ORM. How can I rewrite it? If possible, I want to avoid raw queries as much as I can.
Thanks for any help.
select
p.*,
t.last_update
from
post_post p
join
( select c.post_id as pid, max(c.create_date) as last_update from comment_comment c group by pid) t
on p.id = t.pid
order by t.last_update desc
limit 50;
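If I make some assumptions about your Django models, it will look something like this:
from django.db.models import Max

# Assumes Comment has a ForeignKey to Post with related_name='comments'.
(
    Post.objects
    .annotate(last_update=Max('comments__create_date'))
    .order_by('-last_update')[:50]
)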
In Django, annotate is your friend.
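Note that annotate() with Max() typically compiles to a LEFT OUTER JOIN with GROUP BY rather than to a join against a grouped subquery, so the result set is not identical to the hand-written SQL: posts without comments are kept, with a NULL last_update. A small sketch with sqlite3 standing in for the real database shows the difference. The table contents are invented, and the first statement is only an approximation of what the ORM emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post_post (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comment_comment (
        id INTEGER PRIMARY KEY, post_id INTEGER, create_date TEXT
    );
    INSERT INTO post_post VALUES (1, 'old'), (2, 'fresh'), (3, 'silent');
    INSERT INTO comment_comment VALUES
        (1, 1, '2020-01-01'), (2, 2, '2020-03-01'), (3, 1, '2020-02-01');
""")

# Roughly the shape produced by annotate(last_update=Max(...)).
annotate_style = conn.execute("""
    SELECT p.id, MAX(c.create_date) AS last_update
    FROM post_post p
    LEFT OUTER JOIN comment_comment c ON c.post_id = p.id
    GROUP BY p.id
    ORDER BY last_update DESC
    LIMIT 50
""").fetchall()

# The shape of the hand-written query: inner join on a grouped subquery.
subquery_style = conn.execute("""
    SELECT p.id, t.last_update
    FROM post_post p
    JOIN (SELECT c.post_id AS pid, MAX(c.create_date) AS last_update
          FROM comment_comment c
          GROUP BY c.post_id) t
      ON p.id = t.pid
    ORDER BY t.last_update DESC
    LIMIT 50
""").fetchall()

print(annotate_style)   # includes post 3, which has no comments
print(subquery_style)   # only posts with at least one comment
```

Where the NULL rows land in a descending sort depends on the database (PostgreSQL puts them first by default), so it may be worth filtering them out or ordering them explicitly.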

Django annotate and LEFT OUTER JOIN with desired WHERE Clause

Django 1.10.6
Asset.objects.annotate(
coupon_saved=Count(
Q(coupons__device_id='8ae83c6fa52765061360f5459025cb85e6dc8905')
)
).all().query
produces the following query:
SELECT
"assets_asset"."id",
"assets_asset"."title",
"assets_asset"."description",
"assets_asset"."created",
"assets_asset"."modified",
"assets_asset"."uid",
"assets_asset"."org_id",
"assets_asset"."subtitle",
"assets_asset"."is_active",
"assets_asset"."is_generic",
"assets_asset"."file_standalone",
"assets_asset"."file_ios",
"assets_asset"."file_android",
"assets_asset"."file_preview",
"assets_asset"."json_metadata",
"assets_asset"."file_icon",
"assets_asset"."file_image",
"assets_asset"."video_mobile",
"assets_asset"."video_standalone",
"assets_asset"."file_coupon",
"assets_asset"."where_to_buy",
COUNT("games_coupon"."device_id" = 8ae83c6fa52765061360f5459025cb85e6dc8905) AS "coupon_saved"
FROM
"assets_asset"
LEFT OUTER JOIN
"games_coupon"
ON ("assets_asset"."id" = "games_coupon"."asset_id")
GROUP BY
"assets_asset"."id"
I need to move that device_id=X condition into the LEFT OUTER JOIN definition.
How can I achieve this?
TL;DR:
The condition should be in filter.
qs = (
Asset.objects
.filter(coupons__device_id='8ae83c6fa52765061360f5459025cb85e6dc8905')
.annotate(coupon_saved=Count('coupons'))
)
If you want only count > 0 then it can be filtered.
qs = qs.filter(coupon_saved__gt=0)
Footnotes: A one-to-many query is compiled to a LEFT OUTER JOIN so that base objects (Asset) with zero children can also be returned. JOINs in Django are always based on a ForeignKey to the primary key, or similarly on a OneToOne or ManyToMany relation; other conditions are compiled to WHERE.
Conditions inside an annotation (as you used) are possible, e.g. as part of Conditional Expressions, but they are more complicated to use correctly. They are useful e.g. if you want many aggregations with many conditions in one query without subqueries, and if a full scan is acceptable. This is probably beyond the scope of the question.
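To make the difference between these variants concrete, here is a sketch in plain SQL run through sqlite3 (not Django; the data and the device ids are invented). It compares counting the boolean expression as in the question, filtering in WHERE as in the TL;DR, and a conditional aggregate of the kind that Conditional Expressions are generally compiled to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assets_asset (id INTEGER PRIMARY KEY);
    CREATE TABLE games_coupon (
        id INTEGER PRIMARY KEY, asset_id INTEGER, device_id TEXT
    );
    INSERT INTO assets_asset VALUES (1), (2), (3);
    INSERT INTO games_coupon (asset_id, device_id) VALUES
        (1, 'dev-x'), (1, 'dev-y'), (2, 'dev-y');  -- asset 3 has no coupons
""")

BASE = """
    SELECT a.id, COUNT({counted}) AS coupon_saved
    FROM assets_asset a
    LEFT OUTER JOIN games_coupon c ON a.id = c.asset_id
    {where}
    GROUP BY a.id
    ORDER BY a.id
"""

def run(counted, where=""):
    # Each variant contains exactly one "?" placeholder for the device id.
    sql = BASE.format(counted=counted, where=where)
    return conn.execute(sql, ("dev-x",)).fetchall()

# COUNT of a comparison counts every non-NULL result, true or false.
boolean_count = run("c.device_id = ?")
# A WHERE condition drops assets that have no matching coupon.
where_count = run("c.id", where="WHERE c.device_id = ?")
# CASE WHEN yields NULL for non-matching rows, so only matches are counted.
case_count = run("CASE WHEN c.device_id = ? THEN 1 END")

print(boolean_count, where_count, case_count)
```

Only the last variant keeps every asset and counts just the matching coupons. In Django 2.0 and later the same idea can be written as Count('coupons', filter=Q(coupons__device_id=...)); on 1.10 it would need Case and When inside the aggregate.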

Django: alternative to using annotate(Count()) for speed

There are two models with a one to many relationship, A->{B}. I am counting how many records of A I have with the same B after using a filter(). Then I need to extract the top X records of A in terms of the most B records connected to them.
The current code:
class A(models.Model):
    code = models.IntegerField()
    ...

class B(models.Model):
    a = models.ForeignKey(A)
    ...

data = B.objects.all().filter(...)
top = data.values('a', ...).annotate(n=Count('a')).distinct().order_by('-n')[:X]
I have ~300k B records and with my laptop this is taking ~2s for one query. I dissected the query into parts and timed it and it seems the main bottleneck is the annotate().
Is there any way whatsoever to do this faster with Django?
You should add .select_related('a') before annotate in the queryset. This will force django to join the models before counting them.
https://docs.djangoproject.com/en/1.9/ref/models/querysets/#select-related
I suspect the slow down is actually in the DISTINCT, rather than the count.
The way django builds up a query when using queryset.values(x).annotate(...) tells it to group by the first values, and then perform the aggregate.
B.objects.filter(...).values('a').annotate(n=Count('*')).order_by('-n')[:10]
That should generate SQL that looks something like:
SELECT b.a,
count(*) AS n
FROM b
GROUP BY (b.a)
ORDER BY count(*) DESC
LIMIT 10
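To see why the DISTINCT adds nothing once the rows are grouped, here is a small sketch with sqlite3 standing in for the real database. The table and its data are invented, and the column is called a_id here, which is the name Django would normally give the foreign key column. Each value of a_id already appears exactly once after GROUP BY, so both statements return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER)")
conn.executemany(
    "INSERT INTO b (a_id) VALUES (?)",
    [(1,), (2,), (2,), (3,), (3,), (3,)],
)

GROUPED_SQL = """
    SELECT {distinct} a_id, COUNT(*) AS n
    FROM b
    GROUP BY a_id
    ORDER BY n DESC
    LIMIT 2
"""

# Same query with and without DISTINCT; only the keyword differs.
plain = conn.execute(GROUPED_SQL.format(distinct="")).fetchall()
with_distinct = conn.execute(GROUPED_SQL.format(distinct="DISTINCT")).fetchall()

print(plain)
print(with_distinct)
```

Whether dropping DISTINCT actually speeds up the real query depends on the database and its planner, so it is worth comparing the two with EXPLAIN.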

What causes Django ORM to add duplicate tables?

I have a Django generic list view.
It starts by looking at various request parameters (list filter, sort) and applies the queryset.filter() method a number of times (or not at all) based on those parameters.
It then does some aggregates, but the totals are coming out incorrectly. Looking at the query, it seems to be adding various tables to the query two or more times.
So a snippet from the FROM part of the query looks as follows:
INNER JOIN `sequencing_sample` ON (`sequencing_samplesubprojectstats`.`sample_id` = `sequencing_sample`.`id`)
LEFT OUTER JOIN `sequencing_library` ON (`sequencing_sample`.`id` = `sequencing_library`.`sample_id`)
LEFT OUTER JOIN `sequencing_loadedwith` ON (`sequencing_library`.`id` = `sequencing_loadedwith`.`library_id`)
LEFT OUTER JOIN `sequencing_passfail` ON (`sequencing_loadedwith`.`passfail_id` = `sequencing_passfail`.`id`)
LEFT OUTER JOIN `sequencing_passfail` T6 ON (`sequencing_library`.`passfail_id` = T6.`id`)
LEFT OUTER JOIN `sequencing_organism` ON (`sequencing_sample`.`organism_id` = `sequencing_organism`.`id`)
LEFT OUTER JOIN `sequencing_subproject` ON (`sequencing_samplesubprojectstats`.`subproject_id` = `sequencing_subproject`.`id`)
LEFT OUTER JOIN `sequencing_library` T9 ON (`sequencing_sample`.`id` = T9.`sample_id`)
The passfail table is a lookup table and is legitimately joined twice, but the library table is central to the schema and should not be duplicated as T9.
Is there any good documentation on what causes the ORM to add duplicate tables? There are various things happening in the view (various filters being optionally applied, annotations on the queryset).
I can use raw SQL, but I would prefer to use Django objects, as sorting and pagination are a lot easier with these.
I would like to know what part of the API is causing the library table to be added a second time so I can potentially avoid it (if that is possible).
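One documented ORM behaviour that can produce a second alias such as T9: when conditions on a multi-valued relation are placed in separate filter() calls, each call gets its own join, whereas conditions inside a single filter() call share one join. Whether that is what happens in this view cannot be told from the snippet alone. The sketch below only illustrates why a duplicated join makes totals come out wrong. It uses sqlite3 with simplified, invented tables, and the reads column is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY);
    CREATE TABLE library (
        id INTEGER PRIMARY KEY, sample_id INTEGER, reads INTEGER
    );
    INSERT INTO sample VALUES (1);
    INSERT INTO library VALUES (1, 1, 100), (2, 1, 50);
""")

# One join: each library row contributes once to the sum.
single_join = conn.execute("""
    SELECT s.id, SUM(l.reads)
    FROM sample s
    LEFT OUTER JOIN library l ON s.id = l.sample_id
    GROUP BY s.id
""").fetchall()

# The same table joined again as T9: every library row is now paired
# with every library row of the same sample, so the sum is multiplied.
double_join = conn.execute("""
    SELECT s.id, SUM(l.reads)
    FROM sample s
    LEFT OUTER JOIN library l ON s.id = l.sample_id
    LEFT OUTER JOIN library T9 ON s.id = T9.sample_id
    GROUP BY s.id
""").fetchall()

print(single_join)
print(double_join)
```

If separate filter() calls turn out to be the cause, collecting the conditions into one dictionary or Q object and applying them in a single filter() call is one way to try to avoid the extra alias.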

Can I filter on multiple constraints with Django aggregate functionality?

How can I count records with multiple constraints using django's aggregate functionality?
Using django trunk I'm trying to replace a convoluted database-specific SQL statement with django aggregates. As an example, say I have a database structured with tables for blogs running on many domains (think .co.uk, .com, .etc), each taking many comments:
domains <- blog -> comment
The following SQL counts comments on a per-domain basis:
SELECT D.id, COUNT(C.id) as CommentCount FROM domain AS D
LEFT OUTER JOIN blog AS B ON D.blog_id = B.id
LEFT OUTER JOIN comment AS C ON B.id = C.blog_id
GROUP BY D.id
This is easily replicated with:
Domain.objects.annotate(Count('blogs__comments'))
Taking this a step further, I'd like to be able to add one or more constraints and replicate the following SQL:
SELECT D.id, COUNT(C.id) as CommentCount FROM domain AS D
LEFT OUTER JOIN blog AS B ON D.blog_id = B.id
LEFT OUTER JOIN comment AS C ON B.id = C.blog_id
AND C.active = True
GROUP BY D.id
This is much more difficult to replicate, as Django seems inclined to filter the whole lot with a WHERE clause:
(
    Domain.objects
    .filter(blogs__comments__active=True)
    .annotate(Count('blogs__comments'))
)
SQL comes out something like this:
SELECT ..., COUNT(comment.id) AS blog__comments__count FROM domain
LEFT OUTER JOIN blog ON domain.blog_id = blog.id
LEFT OUTER JOIN comment ON blog.id = comment.blog_id
WHERE comment.active = True
GROUP BY domain.id
ORDER BY NULL
How can I persuade django to pop the extra constraint on the appropriate LEFT OUTER JOIN? This is important as I want to include a count for those blogs with no comments.
I don't know how to do this using the Django query language, but you could always run a raw SQL query. In case you don't already know how to do that, here's an example:
from django.db import connection

def some_method(request, some_parameter):
    cursor = connection.cursor()
    cursor.execute('SELECT * FROM table WHERE somevar=%s', [some_parameter])
    rows = cursor.fetchall()
More detail is available in the Django book online: http://www.djangobook.com/en/2.0/chapter05/
Look for the section "The “Dumb” Way to Do Database Queries in Views". If you don't want to use the "dumb" way, I'm not sure what your options are.
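Applied to the question, the raw query would carry the extra constraint in the ON clause of the join. Below is a minimal sketch of that, using sqlite3 as a stand-in for django.db.connection so that it runs on its own. The schema is a simplified guess at the question's tables, the data is invented, and comment_counts is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blog (id INTEGER PRIMARY KEY);
    CREATE TABLE domain (id INTEGER PRIMARY KEY, blog_id INTEGER);
    CREATE TABLE comment (
        id INTEGER PRIMARY KEY, blog_id INTEGER, active INTEGER
    );
    INSERT INTO blog VALUES (1), (2);
    INSERT INTO domain VALUES (1, 1), (2, 2);
    INSERT INTO comment VALUES (1, 1, 1), (2, 1, 0);  -- blog 2 has none
""")

def comment_counts(cursor, active):
    """Count comments per domain, with the constraint in the ON clause."""
    # sqlite3 uses "?" placeholders; Django's cursor expects "%s" instead.
    cursor.execute("""
        SELECT D.id, COUNT(C.id) AS CommentCount
        FROM domain AS D
        LEFT OUTER JOIN blog AS B ON D.blog_id = B.id
        LEFT OUTER JOIN comment AS C
            ON B.id = C.blog_id AND C.active = ?
        GROUP BY D.id
        ORDER BY D.id
    """, [active])
    return cursor.fetchall()

counts = comment_counts(conn.cursor(), 1)
print(counts)
```

Because the condition sits in the join rather than in WHERE, the domain whose blog has no active comments still appears, with a count of 0.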