Django ORM: Optimizing queries involving many-to-many relations - django

I have the following model structure:
class Container(models.Model):
    pass

class Generic(models.Model):
    # max_length is required on CharField; 255 is an arbitrary choice
    name = models.CharField(max_length=255, unique=True)
    cont = models.ManyToManyField(Container, null=True)
    # It is possible to have a Generic object not associated with any container,
    # that's why null=True

class Specific1(Generic):
    ...

class Specific2(Generic):
    ...

...

class SpecificN(Generic):
    ...
Say, I need to retrieve all Specific-type models, that have a relationship with a particular Container.
The SQL for that is more or less trivial, but that is not the question. Unfortunately, I am not very experienced at working with ORMs (Django's ORM in particular), so I might be missing a pattern here.
When done in a brute-force manner,
c = Container.objects.get(name='somename')  # this gets me the container
items = c.generic_set.all()
# this gets me all Generic objects that are related to the container
# Now what? I need to get to the actual Specific objects, so I need to somehow
# get the type of the underlying Specific object and get it
for item in items:
    spec = getattr(item, item.get_my_specific_type())
this results in a ton of db hits (one for each Generic record, that relates to a Container), so this is obviously not the way to do it. Now, it could, perhaps, be done by getting the SpecificX objects directly:
s = Specific1.objects.filter(cont__name='somename')
# This gets me all Specific1 objects for the specified container
...
# do it for every Specific type
that way the db will be hit once for each Specific type (acceptable, I guess).
I know, that .select_related() doesn't work with m2m relationships, so it is not of much help here.
To reiterate, the end result has to be a collection of SpecificX objects (not Generic).

I think you've already outlined the two easy possibilities. Either you do a single filter query against Generic and then cast each item to its Specific subtype (results in n+1 queries, where n is the number of items returned), or you make a separate query against each Specific table (results in k queries, where k is the number of Specific types).
It's actually worth benchmarking to see which of these is faster in reality. The second seems better because it's (probably) fewer queries, but each one of those queries has to perform a join with the m2m intermediate table. In the former case you only do one join query, and then many simple ones. Some database backends perform better with lots of small queries than fewer, more complex ones.
If the second is actually significantly faster for your use case, and you're willing to do some extra work to clean up your code, it should be possible to write a custom manager method for the Generic model that "pre-fetches" all the subtype data from the relevant Specific tables for a given queryset, using only one query per subtype table; similar to how this snippet optimizes generic foreign keys with a bulk prefetch. This would give you the same queries as your second option, with the DRYer syntax of your first option.
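As a rough illustration of the "one query per subtype table" idea, here is a minimal sketch at the SQL level, using Python's built-in sqlite3 module as a stand-in for the ORM. The table and column names (generic, generic_cont, specific1, specific2, generic_ptr_id, extra1, extra2) are assumptions modelled on how multi-table inheritance is usually laid out, and specifics_for_container is a hypothetical helper, not an existing Django API.

```python
import sqlite3

# In-memory stand-in schema; all names here are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE container (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE generic (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE generic_cont (generic_id INTEGER, container_id INTEGER);
    CREATE TABLE specific1 (generic_ptr_id INTEGER PRIMARY KEY, extra1 TEXT);
    CREATE TABLE specific2 (generic_ptr_id INTEGER PRIMARY KEY, extra2 TEXT);
    INSERT INTO container VALUES (1, 'somename'), (2, 'other');
    INSERT INTO generic VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO generic_cont VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO specific1 VALUES (1, 'one'), (3, 'three');
    INSERT INTO specific2 VALUES (2, 'two');
""")

SUBTYPE_TABLES = ("specific1", "specific2")


def specifics_for_container(conn, container_name):
    """Return {generic_id: (subtype_table, row)} using 1 + k queries."""
    # Query 1: the single join against the m2m intermediate table.
    generic_ids = [
        row[0]
        for row in conn.execute(
            "SELECT g.id FROM generic g "
            "JOIN generic_cont gc ON gc.generic_id = g.id "
            "JOIN container c ON c.id = gc.container_id "
            "WHERE c.name = ?",
            (container_name,),
        )
    ]
    if not generic_ids:
        return {}
    placeholders = ",".join("?" * len(generic_ids))
    found = {}
    # Queries 2..k+1: one simple IN lookup per subtype table, no joins.
    for table in SUBTYPE_TABLES:
        query = f"SELECT * FROM {table} WHERE generic_ptr_id IN ({placeholders})"
        for row in conn.execute(query, generic_ids):
            found[row[0]] = (table, row)
    return found


result = specifics_for_container(conn, "somename")
```

The query count here is 1 + k regardless of how many Generic rows the container has, which is the property a custom manager method would be aiming for.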

Not a complete answer but you can avoid a great number of hits by doing this
items = list(items)
for item in items:
    spec = getattr(item, item.get_my_specific_type())
instead of this:
for item in items:
    spec = getattr(item, item.get_my_specific_type())
Indeed, by forcing a cast to a Python list, you force the Django ORM to load all elements in your queryset. It then does this in one query.

I accidentally stumbled upon the following post, which pretty much answers your question:
http://lazypython.blogspot.com/2008/11/timeline-view-in-django.html

Related

Django table or Dict: performance?

I have multiple small key/value tables in Django whose values never change,
e.g. 1->"Active", 2->"Down", 3->"Running"....
Sometimes I look them up by id and other times by name.
So I'm asking whether it would be more efficient to move them all into dicts (global or in models).
Thank you.
Generally, Django querysets are slower than dicts, so if you want a model with a single field holding these statuses (active, down, running), it's generally better to use a dict until the values need to be editable.
That said, I don't really understand this kind of question: the performance benefit is not large until you have ~10k+ records in a single queryset, and even then you can pull the whole table into a list using .values_list(). Execution will take a fraction of a second.
Also, if I understand correctly, these values should be stored in a models.CharField with the choices argument set anyway, rather than loaded by a fixture and referenced through a models.ForeignKey.
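A minimal sketch of the dict approach, assuming the three statuses from the question. The names STATUS_BY_ID, STATUS_BY_NAME and STATUS_CHOICES are hypothetical; the last one is shaped as a list of (value, label) pairs, which is the form a field's choices argument accepts.

```python
# Forward mapping: id -> label (values taken from the question).
STATUS_BY_ID = {1: "Active", 2: "Down", 3: "Running"}

# Reverse mapping built once at import time, so lookups by name are also O(1).
STATUS_BY_NAME = {label: pk for pk, label in STATUS_BY_ID.items()}

# (value, label) pairs; could be passed as choices=STATUS_CHOICES on a model field.
STATUS_CHOICES = list(STATUS_BY_ID.items())

label = STATUS_BY_ID[2]            # lookup by id
status_id = STATUS_BY_NAME["Running"]  # lookup by name
```

Both lookups are plain dictionary accesses and involve no database round trip.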

Which is a more efficient method, using a list comprehension or django's 'values_list' function?

When attempting to return a list of values from django objects, will performance be better using a list comprehension:
[x.value for x in Model.objects.all()]
or calling list() on django's values_list function:
list(Model.objects.values_list('value', flat=True))
and why?
The most efficient way is the second approach (using values_list()). The reason is that it modifies the SQL query sent to the database so that only the requested columns are selected.
The first approach first selects all columns from the database and only afterwards picks out the one value in Python. So you have already spent the resources to fetch all columns with that approach.
You can compare the queries generated by wrapping your QuerySet with str(queryset.query) and it will return the actual SQL query that gets executed.
See the example below:
class Model(models.Model):
    foo = models.CharField()
    bar = models.CharField()

str(Model.objects.all().query)
# SELECT "model"."id", "model"."foo", "model"."bar" FROM "model"
str(Model.objects.values_list("foo").query)
# SELECT "model"."foo" FROM "model"
I had also somewhat assumed the argument in the currently-accepted answer would be correct. Namely, that fetching fewer fields would lead to Model.objects.values_list('foo') taking less time to execute than Model.objects.all(). However, I didn't find this in practice when using %timeit.
I actually found that Model.objects.values_list('foo', flat=True) would take ~2-10x longer than just Model.objects.all(). I found this was the case for an empty Django table, a table with tens of rows, and a table with millions of rows.
Including/removing flat=True seemed to make no significant difference in executing time for values_list. I would be interested what others find as well?
So this makes me think from a pure "what SQL is executed" point of view, although the values_list ORM query fetches fewer field values from the db, I imagine there is more logic still within the source django code of .all() vs .values_list() which could lead to different additional execution times (including .all() taking less time).
However, to fully address the initial example code, we would also need to factor in any further considerations affecting the execution time due to using a list comprehension [] in the .all() case VS list() in the .values_list() case. The general discussion of list() VS a list comprehension is covered in other questions already.
TL;DR: I imagine it is a trade-off between two factors:
1. the apparent difference in execution time between .values_list() and .all() (my tests indicate we can't simply deduce that fetching fewer fields leads to faster execution; more investigation of the underlying Django source code is needed to find the cause)
2. any differences between using a list comprehension and list()
In my test cases, I generally found the .all() query was actually faster than the .values_list() query, but when also factoring in the transformation to a list, the .values_list scenario would overall take less time. So it may well depend on the scenario...
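To look at the list comprehension versus list() factor in isolation, a small timeit harness along these lines could be used. It is only a sketch: the rows are SimpleNamespace stand-ins rather than model instances, no database or ORM is involved, and the names rows, flat, via_comprehension and via_list are made up for the example.

```python
import timeit
from types import SimpleNamespace

# Stand-ins: objects with a .value attribute (like model instances) and a
# pre-flattened sequence (like what values_list(..., flat=True) yields).
rows = [SimpleNamespace(value=i) for i in range(10_000)]
flat = [r.value for r in rows]


def via_comprehension():
    # Attribute access on every object, as in [x.value for x in Model.objects.all()]
    return [x.value for x in rows]


def via_list():
    # Plain copy of already-flat values, as in list(values_list('value', flat=True))
    return list(flat)


t_comprehension = timeit.timeit(via_comprehension, number=50)
t_list = timeit.timeit(via_list, number=50)
print(f"comprehension: {t_comprehension:.4f}s  list(): {t_list:.4f}s")
```

The absolute numbers depend on the machine, and this deliberately leaves out the query and model-instantiation costs, so it says nothing on its own about which ORM call is faster.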

django restframework: how to efficiently search on many to many related fields?

Given these models:
from django.db import models

class B(models.Model):
    my_field = models.TextField()

class A(models.Model):
    b = models.ManyToManyField(B)
I have 50K+ rows in A. When searching for elements, I want to do full-text searches on my_field by traversing the many-to-many field b (i.e. b__my_field).
This works fine when the number of many-to-many B elements per A object is less than ~3. However, with anything greater than that, performance drops dramatically.
I'm wondering if I could do some sort of prefetch-related search. Is something like Haystack my only option?
When you loop through a queryset and access related objects on each item, Django can make a database request for each step of your loop. See this example on ORM pitfalls. One thing you should learn when using the Django ORM is to avoid database round trips as much as possible. One way to do that is with the values() function. Ideally you should also fetch only what you need.
Try this:
l = list(a.b.all().values('my_field'))  # a is an instance of A
This guarantees only one database query and returns a list that you can loop through at Python speed. It should be much faster.

How does Django go about filtering an evaluated queryset?

I've cached a common queryset, which I would like to filter based off of different fields depending on the situation. I'm wondering if by filtering an evaluated queryset if I lose the advantage of caching it in the first place; does Django just create another queryset from scratch that's an aggregate of the querysets involved in creating the cached queryset and the filter that I apply afterwards?
Yes, the results get thrown out.
You can see this from the source: filter() calls _filter_or_exclude(), which calls _clone() and then adds to its query. _clone, you can see, doesn't set the _result_cache attribute.
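The mechanism can be pictured with a toy model. This is not Django's implementation, just a minimal sketch in which ToyQuerySet and FakeTable are hypothetical classes that mimic the relevant behaviour: results are cached on first iteration, and filter() returns a clone whose cache starts empty.

```python
class FakeTable:
    """Stand-in for the database; counts how often it is read."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.fetches = 0


class ToyQuerySet:
    """Toy lazy queryset (hypothetical, not Django source)."""

    def __init__(self, table, predicates=()):
        self._table = table
        self._predicates = tuple(predicates)
        self._result_cache = None  # filled on first iteration

    def _clone(self):
        # The clone copies the query definition but not _result_cache.
        return ToyQuerySet(self._table, self._predicates)

    def filter(self, predicate):
        clone = self._clone()
        clone._predicates += (predicate,)
        return clone

    def __iter__(self):
        if self._result_cache is None:
            self._table.fetches += 1  # simulated database hit
            self._result_cache = [
                row for row in self._table.rows
                if all(p(row) for p in self._predicates)
            ]
        return iter(self._result_cache)


table = FakeTable([1, 2, 3, 4, 5])
base = ToyQuerySet(table)
list(base)
list(base)  # served from the cache, no second fetch
fetches_after_base = table.fetches

evens = base.filter(lambda row: row % 2 == 0)
cache_before_iteration = evens._result_cache  # the clone starts uncached
even_rows = list(evens)  # triggers a fresh fetch
```

In this toy, iterating base twice reads the table once, while the filtered clone reads it again even though base was already evaluated.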
In general, it's not really clear what it could possibly do to keep the common results. If it's a complicated query with a small result set, it could be replaced by just issuing SQL that checks that the primary key is one of the results you've found, but that's not always going to be more efficient, and in some situations it would confusingly mess with the semantics (if the DB changes in a way that affects the query results in the time between when it's cached and when you do the additional filter).
If you want to force this behavior of saving the IDs manually, you can do that:
pks = SomeObject.objects.filter(...).values_list('pk', flat=True)
some_of_them = SomeObject.objects.filter(pk__in=pks).filter(...)
others = SomeObject.objects.filter(pk__in=pks).filter(...)
You can also of course just do the filtering in Python, e.g. by
common = SomeObject.objects.filter(...)
some_of_them = [m for m in common if m.attribute == 'foo']
others = [m for m in common if m.other_attribute == 'bar']
(You could also use filter(lambda m: m.attribute == 'foo', common) if you preferred, or wrap the definition of common in list to be more explicit.)
Whether one of these or reissuing the query is faster depends a lot on the size of the sets involved, the complexity of the filters, and what indices are present.

How to limit columns returned by Django query?

That seems simple enough, but all Django queries seem to be 'SELECT *'.
How do I build a query returning only a subset of fields ?
In Django 1.1 onwards, you can use defer('col1', 'col2') to exclude columns from the query, or only('col1', 'col2') to only get a specific set of columns. See the documentation.
values does something slightly different - it only gets the columns you specify, but it returns a list of dictionaries rather than a set of model instances.
Append a .values("column1", "column2", ...) to your query
The accepted answer advises defer and only, which the docs discourage in most cases:
only use defer() when you cannot, at queryset load time, determine if you will need the extra fields or not. If you are frequently loading and using a particular subset of your data, the best choice you can make is to normalize your models and put the non-loaded data into a separate model (and database table). If the columns must stay in the one table for some reason, create a model with Meta.managed = False (see the managed attribute documentation) containing just the fields you normally need to load and use that where you might otherwise call defer(). This makes your code more explicit to the reader, is slightly faster and consumes a little less memory in the Python process.