Django Query - Multiple Inner Joins - django

We currently have some issues building complex Q-object queries with multiple inner joins in Django.
The model we want to get (called 'main' in the example) is referenced by another model through a foreign key. The back-reference is called 'related' in the example below. Many objects of the second model refer to the same 'main' object, and each of them has an id and a value.
We want to get all 'main' objects for which a related object with id 7313 and value 1 exists AND a related object with id 7314 and value 0 exists.
This is our current query:
(Q(related__id=u'7313') & Q(related__value=1)) & (Q(related__id=u'7314') & Q(related__value=0))
This code evaluates to
FROM `prefix_main` INNER JOIN `prefix_related` [...] WHERE (`prefix_related`.`id` = 7313 AND `prefix_related`.`value` = 1 AND `prefix_related`.`id` = 7314 AND `prefix_related`.`value` = 0)
What we would need is quite different:
FROM `prefix_main` INNER JOIN `prefix_related` a INNER JOIN `prefix_related` b [...] WHERE (a.`id` = 7313 AND a.`value` = 1 AND b.`id` = 7314 AND b.`value` = 0)
How can I rewrite the ORM query to use two INNER JOINS / use different related instances for the q-objects? Thanks in advance.

I don't think you even need Q objects for this. You can just chain multiple filters like this:
Mainmodel.objects.filter(related__id=7313, related__value=1).filter(related__id=7314, related__value=0)
The first filter matches all objects that have a related object with id 7313 and value 1. The returned objects are then filtered once again for a related object with id 7314 and value 0. Because 'related' is a multi-valued relation, each filter() call gets its own join, so the two conditions may be satisfied by different related objects.
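The difference between the two SQL shapes from the question can be checked with a minimal sketch using Python's built-in sqlite3 module. The table and column names (main, related, main_id) and the sample rows are made up for illustration and are not the asker's real schema. The first query mirrors the single-join WHERE clause, the second the two-alias version that the question asks for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main (id INTEGER PRIMARY KEY);
    CREATE TABLE related (
        id INTEGER PRIMARY KEY,
        main_id INTEGER REFERENCES main(id),
        value INTEGER
    );
    INSERT INTO main (id) VALUES (1), (2);
    -- main 1 has both required related rows, main 2 has only an unrelated one
    INSERT INTO related (id, main_id, value) VALUES
        (7313, 1, 1), (7314, 1, 0), (7400, 2, 1);
""")

# One join: all four conditions must hold for the SAME related row.
single_join = conn.execute("""
    SELECT m.id FROM main m
    INNER JOIN related r ON r.main_id = m.id
    WHERE r.id = 7313 AND r.value = 1 AND r.id = 7314 AND r.value = 0
""").fetchall()

# Two joins: each pair of conditions may match a DIFFERENT related row.
double_join = conn.execute("""
    SELECT m.id FROM main m
    INNER JOIN related a ON a.main_id = m.id
    INNER JOIN related b ON b.main_id = m.id
    WHERE a.id = 7313 AND a.value = 1 AND b.id = 7314 AND b.value = 0
""").fetchall()

print(single_join)
print(double_join)
```

The single-join query can never match, because one row cannot have two different ids; only the aliased version finds main object 1.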

Related

Django Efficient way to Perform Query on M2M

class A(models.Model):
    results = models.TextField()
class B(models.Model):
    name = models.CharField(max_length=20)
    res = models.ManyToManyField(A)
Let's suppose we have the two models above. Model A has millions of objects.
I would like to know the most efficient/fastest way to get all the results objects of a particular B object.
Let's suppose we have to retrieve all results for B object number 5.
Option 1 : A.objects.filter(b__id=5)
(OR)
Option 2 : B.objects.get(id=5).res.all()
Option 1: My question is: would filtering by id on the A model objects take a lot of time, since there are millions of A objects?
Option 2: My question is: does the res field on the B model store the id values of the A model objects?
I'm assuming option 2 would be faster, since it stores references to the A model objects: it gets the B object first and then makes a second query to fetch the results, whereas in the first option filtering by id or any other field would take a lot of time.
The first expression will result in one database query. Indeed, it will query with:
SELECT a.*
FROM a
INNER JOIN a_b ON a_b.a_id = a.id
WHERE a_b.b_id = 5
The second expression will result in two queries. Indeed, first Django will query to fetch that specific B object with a query like:
SELECT b.*
FROM b
WHERE b.id = 5
then it will make exactly the same query to retrieve the related A objects.
But retrieving the B object is not necessary here (unless of course you need it somewhere else). You thus make a useless database query.
My Question is filtering by id on A model objects would take lot of time? since there are millions of A model objects.
A database normally stores an index on foreign key fields, so it can filter efficiently. The total number of A objects is usually not (that) relevant, since the index uses a data structure that accelerates the search, such as a B-tree. The B-tree wiki page has a section named "An index speeds the search" that explains how this works.
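As a rough illustration of both points, the sketch below uses Python's built-in sqlite3 module with hand-written tables named after the SQL above (a, b, a_b). The index name a_b_b_id_idx and the sample data are assumptions for this example; Django chooses its own index names, and other database backends plan queries differently. It runs the single join query for B object 5 and then asks SQLite how it plans to execute it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, results TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE a_b (
        id INTEGER PRIMARY KEY,
        a_id INTEGER REFERENCES a(id),
        b_id INTEGER REFERENCES b(id)
    );
    -- hypothetical name for the index on the foreign key column
    CREATE INDEX a_b_b_id_idx ON a_b (b_id);
""")
conn.executemany("INSERT INTO a (id, results) VALUES (?, ?)",
                 [(i, "result %d" % i) for i in range(1, 1001)])
conn.executemany("INSERT INTO b (id, name) VALUES (?, ?)",
                 [(5, "five"), (6, "six")])
conn.executemany("INSERT INTO a_b (a_id, b_id) VALUES (?, ?)",
                 [(10, 5), (20, 5), (30, 6)])

query = """
    SELECT a.id, a.results
    FROM a
    INNER JOIN a_b ON a_b.a_id = a.id
    WHERE a_b.b_id = 5
"""
rows = sorted(conn.execute(query).fetchall())

# The last column of each EXPLAIN QUERY PLAN row is a readable description.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]

print(rows)
for step in plan:
    print(step)
```

The plan steps show a search through the index on b_id rather than a scan over every row of a, which is why the size of table a matters little here.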

Annotating Django query sets through reverse foreign keys

Given a simple set of models as follows:
class A(models.Model):
    pass
class B(models.Model):
    parent = models.ForeignKey(A, related_name='b_set')
class C(models.Model):
    parent = models.ForeignKey(B, related_name='c_set')
I am looking to create a query set of the A model with two annotations. One annotation should be the number of B rows that have the A row in question as their parent. The other annotation should denote the number of B rows, again with the A object in question as parent, which have at least n objects of type C in their c_set.
As an example, consider the following database and n = 3:
Table A
id
0
1
Table B
id | parent
0 | 0
1 | 0
Table C
id | parent
0 | 0
1 | 0
2 | 1
3 | 1
4 | 1
I'd like to be able to get a result of the form [(0, 2, 1), (1, 0, 0)] as the A object with id 0 has two B objects of which one has at least three related C objects. The A object with id 1 has no B objects and therefore also no B objects with at least three C rows.
The first annotation is trivial:
A.objects.annotate(annotation_1=Count('b_set'))
What I am trying to design now is the second annotation. I have managed to count the number of B rows per A where the B object has at least a single C object as follows:
A.objects.annotate(annotation_2=Count('b_set__c_set__parent', distinct=True))
But I cannot figure out a way to do it with a minimum related set size other than one. Hopefully someone here can point me in the right direction. One method I was thinking of was somehow annotating the B objects in the query instead of the A rows as is the default of the annotate method but I could not find any resources on this.
This is a complicated query at the limits of Django 1.11. I decided to do it with two queries and to combine the results into one list that can be used by a view like a queryset:
from django.db.models import Count
sub_qs = (
    C.objects
    .values('parent')
    .annotate(c_count=Count('id'))
    .order_by()
    .filter(c_count__gte=n)
    .values('parent')
)
qs = B.objects.filter(id__in=sub_qs).values('parent_id').annotate(cnt=Count('id'))
qs_map = {x['parent_id']: x['cnt'] for x in qs}
rows = list(A.objects.annotate(annotation_1=Count('b_set')))
for row in rows:
    row.annotation_2 = qs_map.get(row.id, 0)
The list rows is the result. The more complicated qs.query is compiled to relatively simple SQL:
>>> print(str(qs.query))
SELECT app_b.parent_id, COUNT(app_b.id) AS cnt
FROM app_b
WHERE app_b.id IN (
    SELECT U0.parent_id AS Col1 FROM app_c U0
    GROUP BY U0.parent_id HAVING COUNT(U0.id) >= 3
)
GROUP BY app_b.parent_id; -- (added white space and removed double quotes)
This simple solution is easier to modify and test.
Note: A single-query solution also exists, but doesn't seem useful. Why: it would require Subquery and OuterRef(). They are great, but in general Count() aggregation is not supported in queries that are compiled together with join resolution. A subquery can be separated out with a ...__in=... lookup so that Django can compile it, but then it is not possible to use OuterRef(). Written without OuterRef(), it becomes such a complicated, non-optimal nested SQL that the time complexity would probably be O(n^2) in the size of the A table for many (or all) database backends. Not tested.
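The two-query approach can be checked against the example data from the question with a small sketch that uses Python's built-in sqlite3 module instead of the ORM. The table layout is an assumption derived from the printed SQL (app_a, app_b, app_c with parent_id columns); the second statement is the SQL shown above with n passed as a parameter.

```python
import sqlite3

n = 3  # minimum number of C rows a B row must have

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_a (id INTEGER PRIMARY KEY);
    CREATE TABLE app_b (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES app_a(id));
    CREATE TABLE app_c (id INTEGER PRIMARY KEY, parent_id INTEGER REFERENCES app_b(id));
    INSERT INTO app_a (id) VALUES (0), (1);
    INSERT INTO app_b (id, parent_id) VALUES (0, 0), (1, 0);
    INSERT INTO app_c (id, parent_id) VALUES (0, 0), (1, 0), (2, 1), (3, 1), (4, 1);
""")

# First annotation: number of B rows per A row (LEFT JOIN keeps A rows without B).
b_counts = conn.execute("""
    SELECT a.id, COUNT(b.id)
    FROM app_a a
    LEFT OUTER JOIN app_b b ON b.parent_id = a.id
    GROUP BY a.id
    ORDER BY a.id
""").fetchall()

# Second annotation: B rows per A whose own C count is at least n.
qs_map = dict(conn.execute("""
    SELECT app_b.parent_id, COUNT(app_b.id) AS cnt
    FROM app_b
    WHERE app_b.id IN (
        SELECT U0.parent_id FROM app_c U0
        GROUP BY U0.parent_id HAVING COUNT(U0.id) >= ?
    )
    GROUP BY app_b.parent_id
""", (n,)).fetchall())

# Combine both results, defaulting to 0 where no B row qualifies.
rows = [(a_id, b_count, qs_map.get(a_id, 0)) for a_id, b_count in b_counts]
print(rows)
```

For the sample tables this yields the result the question asks for, [(0, 2, 1), (1, 0, 0)].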

How do I use django's Q with django taggit?

I have a Result object that is tagged with "one" and "two". When I try to query for objects tagged "one" and "two", I get nothing back:
q = Result.objects.filter(Q(tags__name="one") & Q(tags__name="two"))
print(len(q))
# prints zero, was expecting 1
Why does it not work with Q? How can I make it work?
The way django-taggit implements tagging is essentially through a ManyToMany relationship. In such cases there is a separate table in the database that holds these relations. It is usually called a "through" or intermediate model as it connects the two models. In the case of django-taggit this is called TaggedItem. So you have the Result model which is your model and you have two models Tag and TaggedItem provided by django-taggit.
When you make a query such as Result.objects.filter(Q(tags__name="one")) it translates to looking up rows in the Result table that have a corresponding row in the TaggedItem table that has a corresponding row in the Tag table that has the name="one".
Trying to match for two tag names would translate to looking up rows in the Result table that have a corresponding row in the TaggedItem table that has a corresponding row in the Tag table that has both name="one" AND name="two". You obviously never have that, as a row only holds one value: it's either "one" or "two".
These details are hidden away from you in the django-taggit implementation, but this is what happens whenever you have a ManyToMany relationship between objects.
To resolve this you can:
Option 1
Query tag after tag evaluating the results each time, as it is suggested in the answers from others. This might be okay for two tags, but will not be good when you need to look for objects that have 10 tags set on them. Here would be one way to do this that would result in two queries and get you the result:
# get the IDs of the Result objects tagged with "one"
query_1 = Result.objects.filter(tags__name="one").values('id')
# use this in a second query to filter the ID and look for the second tag.
results = Result.objects.filter(pk__in=query_1, tags__name="two")
You could achieve this with a single query so you only have one trip from the app to the database, which would look like this:
# create django subquery - this is not evaluated, but used to construct the final query
subquery = Result.objects.filter(pk=OuterRef('pk'), tags__name="one").values('id')
# perform a combined query using a subquery against the database
results = Result.objects.filter(Exists(subquery), tags__name="two")
This would only make one trip to the database. (Note: filtering on sub-queries requires django 3.0).
But you are still limited to two tags. If you need to check for 10 tags or more, the above is not really workable...
Option 2
Query the relationship table directly instead and aggregate the results in a way that gives you the object IDs.
# django-taggit uses Content Types so we need to pick up the content type from cache
result_content_type = ContentType.objects.get_for_model(Result)
tag_names = ["one", "two"]
tagged_results = (
    TaggedItem.objects.filter(tag__name__in=tag_names, content_type=result_content_type)
    .values('object_id')
    .annotate(occurence=Count('object_id'))
    .filter(occurence=len(tag_names))
    .values_list('object_id', flat=True)
)
TaggedItem is the hidden table in the django-taggit implementation that contains the relationships. The above will query that table and aggregate all the rows that refer either to the "one" or "two" tags, group the results by the ID of the objects and then pick those where the object ID had the number of tags you are looking for.
This is a single query and at the end gets you the IDs of all the objects that have been tagged with both tags. It is also the exact same query regardless if you need 2 tags or 200.
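The grouping idea behind Option 2 can be sketched in plain SQL with Python's built-in sqlite3 module. The two tables below (tag, tagged_item) are a deliberately simplified stand-in and not django-taggit's real schema (there is no content type column, for instance), and the helper objects_with_all_tags is a hypothetical name. The sketch also assumes that an object cannot carry the same tag twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE tagged_item (
        id INTEGER PRIMARY KEY,
        tag_id INTEGER REFERENCES tag(id),
        object_id INTEGER
    );
    INSERT INTO tag (id, name) VALUES (1, 'one'), (2, 'two'), (3, 'three');
    -- object 10 has "one" and "two"; object 11 only "one"; object 12 "two" and "three"
    INSERT INTO tagged_item (tag_id, object_id) VALUES
        (1, 10), (2, 10), (1, 11), (2, 12), (3, 12);
""")


def objects_with_all_tags(tag_names):
    """Return the ids of objects that carry every tag in tag_names."""
    placeholders = ", ".join("?" for _ in tag_names)
    sql = """
        SELECT ti.object_id
        FROM tagged_item ti
        INNER JOIN tag t ON t.id = ti.tag_id
        WHERE t.name IN ({})
        GROUP BY ti.object_id
        HAVING COUNT(ti.object_id) = ?
        ORDER BY ti.object_id
    """.format(placeholders)
    # One parameter per tag name, plus the number of tags that must match.
    params = list(tag_names) + [len(tag_names)]
    return [row[0] for row in conn.execute(sql, params)]


print(objects_with_all_tags(["one", "two"]))
print(objects_with_all_tags(["two"]))
```

The statement keeps the same shape whether two or two hundred tag names are passed in; only the number of placeholders changes.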
Please review this and let me know if anything needs clarification.
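First of all, within a single filter() call both conditions are applied to the same related tag row:
Result.objects.filter(Q(tags__name="one") & Q(tags__name="two"))
Writing this as filter(tags__name="one", tags__name="two") is not valid Python at all, because the keyword argument is repeated. Chaining two filter() calls is different: each call may match a different tag row, so this form finds objects that have both tags:
Result.objects.filter(tags__name__in=["one"]).filter(tags__name__in=["two"])
I think the name field is a CharField, and no single record can be equal to "one" and "two" at the same time.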
In Python terms the query looks like this (the condition is always false, which is why you are getting no result):
from random import choice
name = choice(["abtin", "shino"])
if name == "abtin" and name == "shino":
    print("never reached")
We use Q objects to implement OR or other complex queries.
In the example that works, you do an AND on two Python objects (query sets). That gets applied to any record, not necessarily to the same record that has "one" AND "two" as tag.
PS: Why do you use the in filter?
q = Result.objects.filter(tags__name__in=["one"]).filter(tags__name__in=["two"])
Add .distinct() to remove duplicates if you expect more than one unique object.

django orm - How to use select_related() on the Foreign Key of a Subclass from its Super Class

I've always found the Django orm's handling of subclassing models to be pretty spiffy. That's probably why I run into problems like this one.
Take three models:
class A(models.Model):
    field1 = models.CharField(max_length=255)
class B(A):
    fk_field = models.ForeignKey('C')
class C(models.Model):
    field2 = models.CharField(max_length=255)
So now you can query the A model and get all the B models, where available:
the_as = A.objects.all()
for a in the_as:
    print(a.b.fk_field.field2)  # Note that this throws an error if there is no B record
The problem with this is that you are looking at a huge number of database calls to retrieve all of the data.
Now suppose you wanted to retrieve a QuerySet of all A models in the database, but with all of the subclass records and the subclass's foreign key records as well, using select_related() to limit your app to a single database call. You would write a query like this:
the_as = A.objects.select_related("b", "b__fk_field").all()
One query returns all of the data needed! Awesome.
Except not. Because this version of the query is doing its own filtering, even though select_related is not supposed to filter any results at all:
set_1 = A.objects.select_related("b", "b__fk_field").all() #Only returns A objects with associated B objects
set_2 = A.objects.all() #Returns all A objects
len(set_1) > len(set_2) #Will always be False
I used the django-debug-toolbar to inspect the query and found the problem. The generated SQL query uses an INNER JOIN to join the C table to the query, instead of a LEFT OUTER JOIN like other subclassed fields:
SELECT "app_a"."field1", "app_b"."fk_field_id", "app_c"."field2"
FROM "app_a"
LEFT OUTER JOIN "app_b" ON ("app_a"."id" = "app_b"."a_ptr_id")
INNER JOIN "app_c" ON ("app_b"."fk_field_id" = "app_c"."id");
And it seems if I simply change the INNER JOIN to LEFT OUTER JOIN, then I get the records that I want, but that doesn't help me when using Django's ORM.
Is this a bug in select_related() in Django's ORM? Is there any work around for this, or am I simply going to have to do a direct query of the database and map the results myself? Should I be using something like Django-Polymorphic to do this?
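The filtering effect of the INNER JOIN can be reproduced outside Django with a minimal sketch using Python's built-in sqlite3 module. The tables follow the names in the SQL above; the sample rows are invented for illustration (one A row with a B record, one without).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_c (id INTEGER PRIMARY KEY, field2 TEXT);
    CREATE TABLE app_a (id INTEGER PRIMARY KEY, field1 TEXT);
    CREATE TABLE app_b (
        a_ptr_id INTEGER PRIMARY KEY REFERENCES app_a(id),
        fk_field_id INTEGER REFERENCES app_c(id)
    );
    INSERT INTO app_c (id, field2) VALUES (1, 'c-one');
    INSERT INTO app_a (id, field1) VALUES (1, 'has a B row'), (2, 'plain A row');
    INSERT INTO app_b (a_ptr_id, fk_field_id) VALUES (1, 1);
""")

# Same statement twice; only the join type for app_c differs.
template = """
    SELECT app_a.field1, app_b.fk_field_id, app_c.field2
    FROM app_a
    LEFT OUTER JOIN app_b ON (app_a.id = app_b.a_ptr_id)
    {join} app_c ON (app_b.fk_field_id = app_c.id)
    ORDER BY app_a.id
"""

inner_rows = conn.execute(template.format(join="INNER JOIN")).fetchall()
outer_rows = conn.execute(template.format(join="LEFT OUTER JOIN")).fetchall()

print(inner_rows)
print(outer_rows)
```

With the INNER JOIN the A row without a B record disappears, because its fk_field_id is NULL and matches no C row; with LEFT OUTER JOIN it is kept and its C columns come back as NULL.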
It looks like a bug. Specifically, it seems to be ignoring the nullable nature of the A->B relationship. If, for example, you had a foreign key reference to B in A instead of the subclassing, that foreign key would of course be nullable and Django would use a left join for it. You should probably raise this in the Django issue tracker. You could also try using prefetch_related instead of select_related; that might get around your issue.
I found a work around for this, but I will wait a while to accept it in hopes that I can get some better answers.
The INNER JOIN created by the select_related('b__fk_field') needs to be removed from the underlying SQL so that the results aren't filtered by the B records in the database. So the new query needs to leave the b__fk_field parameter in select_related out:
the_as = A.objects.select_related('b')
However, this forces us to call the database every time a C object is accessed from the A object.
for a in the_as:
    # Note that this throws a DoesNotExist error if a doesn't have an
    # associated b
    print(a.b.fk_field.field2)  # Hits the database every time.
The hack to work around this is to get all of the C objects we need from the database from one query and then have each B object reference them manually. We can do this because the database call that accesses the B objects retrieved will have the fk_field_id that references their associated C object:
c_ids = [a.b.fk_field_id for a in the_as]  # Get all the C ids
the_cs = C.objects.filter(pk__in=c_ids)  # Run a query to get all of the needed C records
for c in the_cs:
    for a in the_as:
        if a.b.fk_field_id == c.pk:  # Throws DoesNotExist if no b associated with a
            a.b.fk_field = c  # no break here: several B records may reference the same C
I'm sure there's a functional way to write that without the nested loop, but this illustrates what's happening. It's not ideal, but it provides all of the data with the absolute minimum number of database hits - which is what I wanted.
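One way to avoid the nested loop is to index the C objects by primary key first and then attach them in a single pass. The sketch below is plain Python with SimpleNamespace stand-ins rather than real model instances, so the object lists and their values are hypothetical; only the attribute names (pk, fk_field_id, fk_field, field2) follow the code above.

```python
from types import SimpleNamespace

# Stand-ins for model instances; in Django these would come from
# A.objects.select_related('b') and C.objects.filter(pk__in=c_ids).
the_bs = [
    SimpleNamespace(pk=1, fk_field_id=100, fk_field=None),
    SimpleNamespace(pk=2, fk_field_id=200, fk_field=None),
    SimpleNamespace(pk=3, fk_field_id=100, fk_field=None),  # shares a C with pk=1
]
the_cs = [
    SimpleNamespace(pk=100, field2="first"),
    SimpleNamespace(pk=200, field2="second"),
]

# Index the C objects by primary key once ...
cs_by_pk = {c.pk: c for c in the_cs}

# ... then attach them in a single pass over the B objects.
for b in the_bs:
    b.fk_field = cs_by_pk.get(b.fk_field_id)

attached = [(b.pk, b.fk_field.field2) for b in the_bs]
print(attached)
```

The dictionary lookup replaces the inner loop, so the work grows with the number of B plus C objects instead of their product. With real models, C.objects.in_bulk(c_ids) should return a comparable pk-to-object dictionary in one query.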

Sort by number of matches on queries based on m2m field

I hope the title is not misleading.
Anyway, I have two models, both have m2m relationships with a third model.
class Model1(models.Model):
    keywords = models.ManyToManyField(Keyword)
class Model2(models.Model):
    keywords = models.ManyToManyField(Keyword)
Given the keywords for a Model2 instance like this:
keywords2 = model2_instance.keywords.all()
I need to retrieve the Model1 instances which have at least a keyword that is in keywords2, something like:
Model1.objects.filter(keywords__in=keywords2)
and sort them by the number of keywords that match (I don't think that's possible via the 'in' field lookup). The question is, how do I do this?
I'm thinking of just manually iterating through each of the Model1 instances, appending them to a dictionary of results for every match, but I need this to scale to, say, tens of thousands of records. Here is how I imagined it:
result = {}
keywords2_ids = model2_instance.keywords.all().values_list('id', flat=True)
for model1 in Model1.objects.all():
    keywords_matched = model1.keywords.filter(id__in=keywords2_ids).count()
    # group the Model1 instances by their number of matching keywords
    result.setdefault(str(keywords_matched), []).append(model1)
There must be a faster way to do this. Any ideas?
You can just switch to raw SQL. Write a custom manager for Model1 that returns the sorted set of ids of Model1 objects based on the keyword match counts. The SQL is as simple as joining the two many-to-many tables (Django automatically creates a table to represent a many-to-many relationship) on keyword ids and then grouping on Model1 ids for the COUNT SQL function. An ORDER BY clause on those counts then produces the sorted Model1 id list you need. In MySQL:
SELECT appname_model1_keywords.model1_id, count(*) as match_count FROM appname_model1_keywords
JOIN appname_model2_keywords
ON (appname_model1_keywords.keyword_id = appname_model2_keywords.keyword_id)
WHERE appname_model2_keywords.model2_id = model2_object_id
GROUP BY appname_model1_keywords.model1_id
ORDER BY match_count
Here model2_object_id is the model2_instance id. This will definitely be faster and more scalable.
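The statement can be tried out with a minimal sketch using Python's built-in sqlite3 module instead of MySQL. The table names follow the SQL above, the sample rows are invented, and the helper ranked_model1_ids is a hypothetical name. Note one deliberate change: the sketch sorts with DESC so that the best match comes first, which is an assumption about the desired order; the original statement sorts ascending.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE appname_model1_keywords (
        id INTEGER PRIMARY KEY, model1_id INTEGER, keyword_id INTEGER
    );
    CREATE TABLE appname_model2_keywords (
        id INTEGER PRIMARY KEY, model2_id INTEGER, keyword_id INTEGER
    );
    -- Model2 instance 1 has keywords 1, 2 and 3; instance 2 has keyword 4
    INSERT INTO appname_model2_keywords (model2_id, keyword_id) VALUES
        (1, 1), (1, 2), (1, 3), (2, 4);
    -- Model1 instance 10 shares one keyword with it, 11 shares three, 12 none
    INSERT INTO appname_model1_keywords (model1_id, keyword_id) VALUES
        (10, 1), (10, 9),
        (11, 1), (11, 2), (11, 3),
        (12, 4);
""")


def ranked_model1_ids(model2_object_id):
    """Return (model1_id, match_count) pairs, best match first."""
    return conn.execute("""
        SELECT m1.model1_id, COUNT(*) AS match_count
        FROM appname_model1_keywords m1
        JOIN appname_model2_keywords m2 ON m1.keyword_id = m2.keyword_id
        WHERE m2.model2_id = ?
        GROUP BY m1.model1_id
        ORDER BY match_count DESC, m1.model1_id
    """, (model2_object_id,)).fetchall()


ranking = ranked_model1_ids(1)
print(ranking)
```

Model1 instances without any shared keyword do not appear in the result at all, which matches the "at least one keyword" requirement from the question.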