What causes Django ORM to add duplicate tables?

I have a Django generic list view.
It starts by looking at various request parameters (list filter, sort) and applies the queryset.filter() method a number of times (or not at all) based on those parameters.
It then does some aggregates, but the totals are coming out incorrectly. Looking at the query, it seems to be adding various tables to the query two or more times.
So a snippet from the FROM part of the query looks as follows:
INNER JOIN `sequencing_sample` ON (`sequencing_samplesubprojectstats`.`sample_id` = `sequencing_sample`.`id`)
LEFT OUTER JOIN `sequencing_library` ON (`sequencing_sample`.`id` = `sequencing_library`.`sample_id`)
LEFT OUTER JOIN `sequencing_loadedwith` ON (`sequencing_library`.`id` = `sequencing_loadedwith`.`library_id`)
LEFT OUTER JOIN `sequencing_passfail` ON (`sequencing_loadedwith`.`passfail_id` = `sequencing_passfail`.`id`)
LEFT OUTER JOIN `sequencing_passfail` T6 ON (`sequencing_library`.`passfail_id` = T6.`id`)
LEFT OUTER JOIN `sequencing_organism` ON (`sequencing_sample`.`organism_id` = `sequencing_organism`.`id`)
LEFT OUTER JOIN `sequencing_subproject` ON (`sequencing_samplesubprojectstats`.`subproject_id` = `sequencing_subproject`.`id`)
LEFT OUTER JOIN `sequencing_library` T9 ON (`sequencing_sample`.`id` = T9.`sample_id`)
The passfail table is a lookup table and is expected to be duplicated, but the library table is central to the schema and should not be duplicated as T9.
Is there any good documentation on what causes the ORM to add duplicate tables? There are various things happening in the view (various filters being optionally applied, annotations on the queryset).
I can use raw SQL, but I would prefer to use Django objects, as sorting and pagination are a lot easier with these.
I would like to know what part of the API is causing the library table to be added a second time so I can potentially avoid it (if that is possible).
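For reference, one documented behavior that can introduce a second alias like T9 is chaining separate .filter() calls that each span the same multi-valued (reverse or many-to-many) relationship: each call gets its own join, while putting the conditions in a single .filter() reuses one join. A minimal sketch of the difference, using hypothetical Sample/Library models and invented field names (the real trigger in the view could just as well be an annotation over the same relation):
# Hypothetical models and fields, for illustration only.
# Two chained filter() calls over the same reverse relation: the ORM adds a
# separate JOIN on sequencing_library (a new alias) for each call.
qs = Sample.objects.filter(library__is_ready=True).filter(library__qc_passed=True)

# Both conditions in one filter() call: a single JOIN is reused.
qs = Sample.objects.filter(library__is_ready=True, library__qc_passed=True)

# Inspecting the compiled SQL shows which aliases were added.
print(qs.query)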

Related

Forcing evaluation of prefetched results in django queries

I'm trying to use select_related/prefetch_related to optimize some queries. However, I have trouble "forcing" the queries to be evaluated all at once.
Say I'm doing the following:
fp_query = Fourprod.objects.filter(choisi=True).select_related("fk_fournis")
pf = Prefetch("fourprod", queryset=fp_query) #
products = Products.objects.filter(id__in=fp_query).prefetch_related(pf)
With models:
class Fourprod(models.Model):
    fk_produit = models.ForeignKey(to=Produit, related_name="fourprod")
    fk_fournis = models.ForeignKey(to=Fournis, related_name="fourprod")
    choisi = models.BooleanField(...)

class Produit(models.Model):
    ...  # ordinary fields

class Fournis(models.Model):
    ...  # ordinary fields
So essentially, Fourprod has an fk to both Fournis and Produit, and I want to prefetch those when I build the Produits queryset. I've checked in debug that the prefetch actually occurs, and it does.
I have a bunch of fields from different models I need to use to compute results. I don't really control the table structure, so I have to work with this. I can't come up with a reasonable query to do it all in the database (or using raw), so I want to compute things Python-side. It's a few thousand objects, so that is reasonable to do in memory. So I cast to a list to force the query evaluation:
products = list(products)
At this point, I would think that the Products and the related objects that I have pre-fetched should have been fetched from the DB. In the logs, just after the list() call, I get this:
02/08/22 15:21:08 DEBUG DEFAULT: (0.019) SELECT "products_fourprod"."id", "products_fourprod"."fk_produit_id", "products_fourprod"."fk_fournis_id", "products_fourprod"."choisi", "products_fourprod"."code_four", "products_fourprod"."prix", "products_fourprod"."comment", "products_fournis"."id", "products_fournis"."fk_user_create_id", "products_fournis"."nom", "products_fournis"."adresse", "products_fournis"."ville", "products_fournis"."tel", "products_fournis"."fax", "products_fournis"."contact", "products_fournis"."note", "products_fournis"."pays", "products_fournis"."province", "products_fournis"."postal", "products_fournis"."monnaie", "products_fournis"."tel_long", "products_fournis"."inactif", "products_fournis"."inuse", "products_fournis"."par", "products_fournis"."fk_langue", "products_fournis"."NOTE2" FROM "products_fourprod" LEFT OUTER JOIN "products_fournis" ON ("products_fourprod"."fk_fournis_id" = "products_fournis"."id") WHERE ("products_fourprod"."choisi" AND "products_fourprod"."fk_produit_id" IN (... all Product.id meeting the conditions...)
But then, the list comprehension using the products takes forever to complete:
rows = [[p.id, p.fourprod.first().id, p.desuet, p.no_prod, ... ] for p in products]
With each single call to p.fourprod apparently resulting in a DB hit:
02/08/22 15:26:19 DEBUG DEFAULT: (0.000) SELECT "products_fourprod"."id", "products_fourprod"."fk_produit_id", "products_fourprod"."fk_fournis_id", "products_fourprod"."choisi", "products_fourprod"."code_four", "products_fourprod"."prix", "products_fourprod"."comment", "products_fournis"."id", "products_fournis"."fk_user_create_id", "products_fournis"."nom", "products_fournis"."adresse", "products_fournis"."ville", "products_fournis"."tel", "products_fournis"."fax", "products_fournis"."contact", "products_fournis"."note", "products_fournis"."pays", "products_fournis"."province", "products_fournis"."postal", "products_fournis"."monnaie", "products_fournis"."tel_long", "products_fournis"."inactif", "products_fournis"."inuse", "products_fournis"."par", "products_fournis"."fk_langue", "products_fournis"."NOTE2" FROM "products_fourprod" LEFT OUTER JOIN "products_fournis" ON ("products_fourprod"."fk_fournis_id" = "products_fournis"."id") WHERE ("products_fourprod"."choisi" AND "products_fourprod"."fk_produit_id" = 1185) ORDER BY "products_fourprod"."id" ASC LIMIT 1; args=(1185,)
02/08/22 15:26:19 DEBUG DEFAULT: (0.000) SELECT "products_fourprod"."id", (.... more similar db hits... )
If I remove all the uses of related objects, then the list() call has indeed already forced the DB hit, and the query executes quickly.
So... if simply calling products = list(products) does not force the DB to be queried for the prefetched objects as well, is there any way I can make Django's ORM do so?
From the docs:
Remember that, as always with QuerySets, any subsequent chained methods which imply a different database query will ignore previously cached results, and retrieve data using a fresh database query.
first() implies a database query, so it will cause your query to not use the prefetched values.
Try using p.fourprod.all()[0] to access the first related fourprod instead.
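A minimal sketch of the difference, reusing the names from the snippets above:
from django.db.models import Prefetch

fp_query = Fourprod.objects.filter(choisi=True).select_related("fk_fournis")
products = list(
    Products.objects
    .filter(id__in=fp_query)
    .prefetch_related(Prefetch("fourprod", queryset=fp_query))
)

for p in products:
    first_fp = p.fourprod.all()[0]   # served from the prefetch cache, no extra query
    # p.fourprod.first()             # implies a fresh query and bypasses the cache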

Django inner join with search terms for tables across many-to-many relationships

I have a dynamic hierarchical advanced search interface I created. Basically, it allows you to search terms, from about 6 or 7 tables that are all linked together, that can be and'ed or or'ed together in any combination. The search entries in the form are all compiled into a complex Q expression.
I discovered a problem today. If I provide a search term for a field in a many-to-many related sub-table, the output table can include results from that table that don't match the term.
My problem can be reproduced in the shell with a simple query:
qs = PeakGroup.objects.filter(msrun__sample__animal__studies__id__exact=3)
sqs = qs[0].msrun.sample.animal.studies.all()
sqs.count()
#result: 2
Specifically:
In [3]: qs = PeakGroup.objects.filter(msrun__sample__animal__studies__id__exact=3)
In [12]: ss = qs[0].msrun.sample.animal.studies.all()
In [13]: ss[0].__dict__
Out[13]:
{'_state': <django.db.models.base.ModelState at 0x7fc12bfbf940>,
'id': 3,
'name': 'Small OBOB',
'description': ''}
In [14]: ss[1].__dict__
Out[14]:
{'_state': <django.db.models.base.ModelState at 0x7fc12bea81f0>,
'id': 1,
'name': 'obob_fasted',
'description': ''}
The ids in the sqs queryset include 1 and 3 even though I only searched on 3. I don't get back literally all studies, so it is filtering out some non-matching study records. I understand why I see this, but I don't know how to execute a query that behaves like a join I could perform in SQL, where I can restrict the results to only records that match the query, instead of getting back only records in the root model and then gathering everything left-joined to those root model records.
Is there a way to do such an inner join (as the result of a single complex Q expression in a filter) on the entire set of linked tables so that I only get back records that match the M:M field search term?
UPDATE:
By looking at the SQL:
In [3]: str(s.query)
Out[3]: 'SELECT "DataRepo_peakgroup"."id", "DataRepo_peakgroup"."name", "DataRepo_peakgroup"."formula", "DataRepo_peakgroup"."msrun_id", "DataRepo_peakgroup"."peak_group_set_id" FROM "DataRepo_peakgroup" INNER JOIN "DataRepo_msrun" ON ("DataRepo_peakgroup"."msrun_id" = "DataRepo_msrun"."id") INNER JOIN "DataRepo_sample" ON ("DataRepo_msrun"."sample_id" = "DataRepo_sample"."id") INNER JOIN "DataRepo_animal" ON ("DataRepo_sample"."animal_id" = "DataRepo_animal"."id") INNER JOIN "DataRepo_animal_studies" ON ("DataRepo_animal"."id" = "DataRepo_animal_studies"."animal_id") WHERE "DataRepo_animal_studies"."study_id" = 3 ORDER BY "DataRepo_peakgroup"."name" ASC'
...I can see that the query is as specific as I would like it to be, but in the template, how do I specify that I want what I would have seen in the SQL result, had I supplied all of the specific related table fields I wanted to see in the output? E.g.:
SELECT "DataRepo_peakgroup"."id", "DataRepo_peakgroup"."name", "DataRepo_peakgroup"."formula", "DataRepo_peakgroup"."msrun_id", "DataRepo_peakgroup"."peak_group_set_id", "DataRepo_animal_studies"."study_id" FROM "DataRepo_peakgroup" INNER JOIN "DataRepo_msrun" ON ("DataRepo_peakgroup"."msrun_id" = "DataRepo_msrun"."id") INNER JOIN "DataRepo_sample" ON ("DataRepo_msrun"."sample_id" = "DataRepo_sample"."id") INNER JOIN "DataRepo_animal" ON ("DataRepo_sample"."animal_id" = "DataRepo_animal"."id") INNER JOIN "DataRepo_animal_studies" ON ("DataRepo_animal"."id" = "DataRepo_animal_studies"."animal_id") WHERE "DataRepo_animal_studies"."study_id" = 3 ORDER BY "DataRepo_peakgroup"."name" ASC
All the Django ORM gives you back from a query whose filters use fields from many-to-many (henceforth "M:M") related tables is a set of records from the "root" table the query started from. It uses the same join logic you would use in an SQL query, following the foreign keys, so you are guaranteed to get back root table records that DO link to an M:M related record matching your search term. However, when you send those root table records (the queryset) to be rendered in a template and add a nested loop to access records from the M:M related table, that loop always retrieves everything linked to the root record, whether it matches your search term or not. So if a root table record links to multiple records in an M:M related table, at least one of them is guaranteed to match your search term, but the others may not.
In order to get the inner join (i.e. only including combined records that match the search terms in the table(s) queried), you simply have to roll your own, because Django doesn't support it. I accomplished this in the following way:
When rendering search results, wherever you want to include records from an M:M related table, you have to create a nested loop. At the inner-most loop, I essentially re-enforce that the record matches all of the search terms, i.e. I implement my own filter using conditionals on each combination of records (from the root table and from each of the M:M related tables). If any record combination does not match, I skip it.
Regarding the particulars as to how I did that, I won't get too into the weeds, but at each nested loop, I maintain a dict of its key path (e.g. msrun__sample__animal__studies) as the key and the current record as the value. I then use a custom simple_tag to re-run the search terms against the current record from each table (by checking the key path of the search term against those available in the dict).
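A hypothetical sketch of such a tag (the names, the shape of search_terms, and the record_map structure are my own assumptions, not from the original implementation):
from django import template

register = template.Library()

@register.simple_tag
def matches_search(record_map, search_terms):
    """Re-apply the search terms to the records currently in scope.

    record_map maps a key path (e.g. "msrun__sample__animal__studies") to the
    record the nested template loops are currently on; search_terms is assumed
    to be a list of (key_path, field_name, value) tuples from the search form.
    """
    for key_path, field_name, value in search_terms:
        record = record_map.get(key_path)
        if record is None:
            continue  # this term targets a table that is not in scope here
        if str(getattr(record, field_name, "")) != str(value):
            return False  # this record combination does not match; skip the row
    return True
The innermost template loop can then call the tag with {% matches_search record_map search_terms as ok %} and only render the row when ok is true.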
Alternatively, you could do this in the view by compiling all of the matching records and sending the combined large table to the template, but I opted not to because of the hurdles surrounding cached_properties, and because I already pass the search term form data to re-display the executed search form, so the structure of the search was already available.
Watch out #1 - Even without "refiltering" the combined records in the template, note that the record count can be inaccurate when dealing with a combined/composite/joined table. To ensure that the number of records I report in my header above the table was always correct/accurate (i.e. represented the number of rows in my html table), I kept track of the filtered count and used javascript under the table to update the record count reported at the top.
Watch out #2 - There is one other thing to keep in mind when querying using terms from M:M related tables. For every M:M related table record that matches the search term, you will get back a duplicate record from the root table, and there's no way to tell the difference between them (i.e. which M:M record/field value matched in each case). For example, if you matched 2 records from an M:M related table, you would get back 2 identical root table records, and when you pass that data to the template, you would end up with 4 rows in your results, each with the same data from the root table record. To avoid this, all you have to do is append the .distinct() filter to your results queryset.
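For instance, applied to the query from the question, a minimal sketch:
# Without distinct(), one PeakGroup row comes back per matching study record;
# distinct() collapses those duplicate root rows.
qs = (
    PeakGroup.objects
    .filter(msrun__sample__animal__studies__id__exact=3)
    .distinct()
)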

Django ORM join many to many relation in one query

Say we have 2 models, A and B, with a many-to-many relation.
I want to obtain a sql query similar to this:
SELECT *
FROM a LEFT JOIN ab_relation
ON ab_relation.a_id = a.id
JOIN b ON ab_relation.b_id = b.id;
So in Django, when I try:
A.objects.prefetch_related('bees')
I get 2 queries similar to:
SELECT * FROM a;
SELECT ab_relation.a_id AS prefetch_related_val_a_id, b.*
FROM b JOIN ab_relation ON b.id = ab_relation.b_id
WHERE ab_relation.a_id IN (123, 456... list of all a.id);
Given that A and B have moderately big tables, I find the way django does it too slow for my needs.
The question is: Is it possible to obtain the left join manually written query through the ORM?
Edits to answer some clarifications:
Yes, a LEFT OUTER JOIN would be preferable, to get all A's in the queryset, not only those with a relation to B (SQL updated above).
Moderately big means ~4k rows each, and too slow means ~3 seconds (on first load, before redis cache.) Keep in mind there are other queries on the page.
Actually yes, we only need B.one_field, but when I tried Prefetch('bees', queryset=B.objects.values('one_field')), an error said you can't use values in a prefetch.
The queryset will be used as options for a multi-select form-field, where we will need to represent A objects that have a relation with B with an extra string from the B.field.
For the direct answer, skip to point 6.
Let's go through this step by step.
1) N:M select. You say you want a query like this:
SELECT *
FROM a JOIN ab_relation ON ab_relation.a_id = a.id
JOIN b ON ab_relation.b_id = b.id;
But this is not a real N:M query, because you are getting only A objects that are related to a B. The query should use outer joins, at least like:
SELECT *
FROM a left outer JOIN
ab_relation ON ab_relation.a_id = a.id left outer JOIN
b ON ab_relation.b_id = b.id;
Otherwise you are getting only A models that have a related B.
2) Reading big tables. You say "moderately big tables". Are you sure you want to read the whole table from the database? Reading a lot of data is unusual in a web environment, and in that case you can paginate the data. Maybe this is not a web app? Why do you need to read these big tables? We need context to answer your question. Are you sure you need all fields from both tables?
3) SELECT * FROM. Are you sure you need all fields from both tables? If you read only some values, this query may run faster.
A.objects.values( "some_a_field", "anoter_a_field", "Bs__some_b_field" )
4) In summary. The ORM is a powerful tool, and two simple read operations are "fast". I have written down some ideas, but perhaps we need more context to answer your question: what does "moderately big tables" mean, what does "slow" mean, what are you doing with this data, how many fields or bytes does each row from each table have, ...
Edited, because the OP has edited the question.
5) Use the right UI controls. You say:
The queryset will be used as options for a multi-select form-field, where we will need to represent A objects that have a relation with B with an extra string from the B.field.
Sending 4k rows to the client for a form looks like an anti-pattern. I suggest moving to a live control that loads only the data it needs, for example by filtering on some text. Take a look at the awesome django-select2 project.
6) You say
The question is: Is it possible to obtain the left join manually written query through the ORM?
The answer is: yes, you can do it using values, as I said in point 3. Sample: Material and ResultatAprenentatge have an N:M relation:
>>> print(
...     Material
...     .objects
...     .values("titol", "resultats_aprenentatge__codi")
...     .query
... )
The query:
SELECT "material_material"."titol",
"ufs_resultataprenentatge"."codi"
FROM "material_material"
LEFT OUTER JOIN "material_material_resultats_aprenentatge"
ON ( "material_material"."id" =
"material_material_resultats_aprenentatge"."material_id" )
LEFT OUTER JOIN "ufs_resultataprenentatge"
ON (
"material_material_resultats_aprenentatge"."resultataprenentatge_id" =
"ufs_resultataprenentatge"."id" )
ORDER BY "material_material"."data_edicio" DESC

Django annotate and LEFT OUTER JOIN with desired WHERE Clause

Django 1.10.6
Asset.objects.annotate(
    coupon_saved=Count(
        Q(coupons__device_id='8ae83c6fa52765061360f5459025cb85e6dc8905')
    )
).all().query
produces the following query:
SELECT
"assets_asset"."id",
"assets_asset"."title",
"assets_asset"."description",
"assets_asset"."created",
"assets_asset"."modified",
"assets_asset"."uid",
"assets_asset"."org_id",
"assets_asset"."subtitle",
"assets_asset"."is_active",
"assets_asset"."is_generic",
"assets_asset"."file_standalone",
"assets_asset"."file_ios",
"assets_asset"."file_android",
"assets_asset"."file_preview",
"assets_asset"."json_metadata",
"assets_asset"."file_icon",
"assets_asset"."file_image",
"assets_asset"."video_mobile",
"assets_asset"."video_standalone",
"assets_asset"."file_coupon",
"assets_asset"."where_to_buy",
COUNT("games_coupon"."device_id" = 8ae83c6fa52765061360f5459025cb85e6dc8905) AS "coupon_saved"
FROM
"assets_asset"
LEFT OUTER JOIN
"games_coupon"
ON ("assets_asset"."id" = "games_coupon"."asset_id")
GROUP BY
"assets_asset"."id"
I need to get that device_id=X condition into the LEFT OUTER JOIN definition. How can I achieve this?
TL;DR:
The condition should be in filter.
qs = (
    Asset.objects
    .filter(coupons__device_id='8ae83c6fa52765061360f5459025cb85e6dc8905')
    .annotate(coupon_saved=Count('coupons'))
)
If you want only rows with count > 0, that can be filtered as well.
qs = qs.filter(coupon_saved__gt=0)
Footnotes: A one-to-many query is compiled to a LEFT OUTER JOIN so that it is also possible to get base objects (Asset) with zero children. JOINs in Django are always based on a ForeignKey to the primary key (or similarly on a OneToOne or ManyToMany); other conditions are compiled to the WHERE clause.
Conditions in an annotation (like the one you used) are possible, e.g. as part of Conditional Expressions, but they are more complicated to use correctly; they are useful e.g. if you want to get many aggregations with many conditions in one query without subqueries, and if a full scan is acceptable. That is probably not the subject of this question.
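For reference, on Django 1.10 such a conditional-expression variant might look like the following sketch (Count skips NULLs, so non-matching coupons contribute nothing while every Asset is still returned):
from django.db.models import Case, Count, IntegerField, When

qs = Asset.objects.annotate(
    coupon_saved=Count(
        Case(
            When(
                coupons__device_id='8ae83c6fa52765061360f5459025cb85e6dc8905',
                then=1,
            ),
            output_field=IntegerField(),
        )
    )
)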

Self join with django ORM

I have a model:
class Trades(models.Model):
    userid = models.PositiveIntegerField(null=True, db_index=True)
    positionid = models.PositiveIntegerField(db_index=True)
    tradeid = models.PositiveIntegerField(db_index=True)
    orderid = models.PositiveIntegerField(db_index=True)
    ...
and I want to execute the following query:
select *
from trades t1
inner join trades t2
ON t2.tradeid = t1.positionid and t1.tradeid = t2.positionid
Can it be done without hacks using the Django ORM?
Thanks!
A select * ... will take more work. If you can trim back to just the columns you want from the right-hand side table:
table = SomeModel._meta.db_table
join_column_1 = SomeModel._meta.get_field('field1').column
join_column_2 = SomeModel._meta.get_field('field2').column
join_queryset = SomeModel.objects.filter()

# Force evaluation of query
querystr = str(join_queryset.query)

# Add promote=True and nullable=True for a left outer join
rh_alias = join_queryset.query.join((table, table, join_column_1, join_column_2))

# Add the second join condition and the extra columns
join_queryset = join_queryset.extra(
    select=dict(rhs_col1='%s.%s' % (rh_alias, join_column_2)),
    where=['%s.%s = %s.%s' % (table, join_column_2, rh_alias, join_column_1)],
)
Add any additional columns you need to the select dict so they are available on the results.
The additional constraints are put together in a WHERE after the ON (), which your SQL engine may optimize poorly.
I believe Django's ORM doesn't support doing a join on anything that isn't specified as a ForeignKey (at least, last time I looked into it, that was a limitation; they're always adding features though, so maybe it snuck in).
So your options are to either restructure your tables so you can use proper foreign keys, or just do a raw SQL query.
I wouldn't consider a raw SQL query a "hack". Django has good documentation on how to do raw SQL queries.
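If you do go the raw route, a minimal sketch could look like this (the table name trades is copied from the SQL in the question; the model's actual db_table is more likely something like yourapp_trades, so adjust it):
qs = Trades.objects.raw(
    """
    SELECT t1.*
    FROM trades t1
    INNER JOIN trades t2
      ON t2.tradeid = t1.positionid
     AND t1.tradeid = t2.positionid
    """
)
for trade in qs:
    print(trade.tradeid, trade.positionid)  # each row maps onto a Trades instance built from t1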