I'm using a small tree/graph package (django_dag) that gives my model a many-to-many children field which refers to itself. The basic structure can be shown with the following models:
# models
class Foo(FooBase):
    class Meta:
        abstract = True

    children = models.ManyToManyField('self', symmetrical=False, through='Bar')


class Bar(models.Model):
    parent = models.ForeignKey(Foo)
    child = models.ForeignKey(Foo)
All is fine with the models and all the functionality of the package.
FooBase adds a variety of functions to the model, including a way of recursively finding all children of a Foo and the children's children and so forth.
My concern is with the following function within FooBase:
def descendants_tree(self):
    tree = {}
    for f in self.children.all():
        tree[f] = f.descendants_tree()
    return tree
It outputs something like {Foo1:{}, Foo2: {Child_of_Foo2: {Child_of_Child_of_Foo2:{}}}} where the progeny are in a nested dictionary.
The alert reader may notice that this method issues a new query for every child.
While these db hits are pretty quick, they can add up quickly when there might be 50+ children. And eventually, there will be tens of thousands of db entries. Right now, each query averages 0.6 msec with a row count of almost 2000.
Is there a more efficient way of doing this nested query?
In my mind, doing a select_related().all() beforehand would get it down to one query but that smells like trouble in the future. At what point is one large query better or worse than many small ones?
---Edit---
Here's what I'm trying to test the select_related().all() option with, but it's still hitting the database on every iteration:
all_foo = Foo.objects.select_related('children').all()

def loop(baz):
    tree = {}
    for f in all_foo.get(id=baz).children.all():
        tree[f] = loop(f.id)
    return tree
I assume the children.all() is causing the hit. Is there another way to get all of Many-to-Many children without using the callable attribute?
You'll have to test under your own environment with your own circumstances. select_related is generally recommended, but in cases where there are many recursive levels, that one large query is generally slower than the multiple small queries.
The number of children doesn't really matter; the depth of recursion is what matters most. If you're going 3 or so levels deep, select_related() might be better, but much more than that would likely result in a slowdown. The plugin author likely did it this way to allow for many, many levels of recursion, because the per-child queries only really hurt when there are just a few levels, and then it's only a few extra queries.
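If the tree really does need to come back in a fixed number of queries, one alternative (a sketch only, not part of django_dag) is to load all the parent/child edges from the through table in bulk and assemble the nested dict in Python. This assumes a concrete Foo subclass (called Foo here for brevity) and the Bar through model above, and costs two queries regardless of depth, at the price of pulling every edge:

from collections import defaultdict

def descendants_tree_bulk(root):
    # One query for every parent/child edge in the through table
    children_of = defaultdict(list)
    for parent_id, child_id in Bar.objects.values_list('parent_id', 'child_id'):
        children_of[parent_id].append(child_id)

    # One query to map ids back to model instances
    nodes = Foo.objects.in_bulk()

    def build(node_id):
        return {nodes[child_id]: build(child_id) for child_id in children_of[node_id]}

    return build(root.pk)

Whether this beats the recursive per-child queries depends on how much of the table a single tree actually touches, which is really the trade-off the question is asking about.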
Related
When attempting to return a list of values from django objects, will performance be better using a list comprehension:
[x.value for x in Model.objects.all()]
or calling list() on django's values_list function:
list(Model.objects.values_list('value', flat=True))
and why?
The most efficient way is to do the second approach (using values_list()). The reason for this is that this modifies the SQL query that is sent to the database to only select the values provided.
The first approach first selects all columns from the database and only then extracts the single value in Python, so you have already "spent" the resources to fetch everything with that approach.
You can compare the generated queries by wrapping your QuerySet with str(queryset.query), which returns the actual SQL query that gets executed.
See example below
class Model(models.Model):
    foo = models.CharField()
    bar = models.CharField()

str(Model.objects.all().query)
# SELECT "model"."id", "model"."foo", "model"."bar" FROM "model"

str(Model.objects.values_list("foo").query)
# SELECT "model"."foo" FROM "model"
I had also somewhat assumed the argument in the currently accepted answer would be correct, namely that fetching fewer fields would make Model.objects.values_list('foo') execute faster than Model.objects.all(). However, I didn't find this in practice when using %timeit.
I actually found that Model.objects.values_list('foo', flat=True) would take ~2-10x longer than just Model.objects.all(). I found this was the case for:
an empty Django table
a table with tens of rows
a table with millions of rows
Including or removing flat=True seemed to make no significant difference in execution time for values_list. I would be interested in what others find as well.
So this makes me think that, from a pure "what SQL is executed" point of view, although the values_list ORM query fetches fewer field values from the db, there is probably additional logic inside Django's source code for .all() vs .values_list() that could lead to different execution times (including .all() taking less time).
However, to fully address the initial example code, we would also need to factor in any further considerations affecting the execution time due to using a list comprehension [] in the .all() case VS list() in the .values_list() case. The general discussion of list() VS a list comprehension is covered in other questions already.
TL;DR: I imagine it is a trade-off between these two factors:
the apparent difference in execution time between .values_list() and .all() (which, from my tests, suggests we can't simply deduce that fetching fewer fields leads to faster execution; more investigation of the underlying Django source code is needed to explain this)
any differences between using a list comprehension and list()
In my test cases, I generally found the .all() query was actually faster than the .values_list() query, but when also factoring in the transformation to a list, the .values_list() scenario would overall take less time. So it may well depend on the scenario...
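For anyone who wants to reproduce this kind of comparison, here is a rough harness sketch (not the exact %timeit runs above); it assumes a populated Model table with a value field, run from something like manage.py shell:

import timeit

list_comp = lambda: [x.value for x in Model.objects.all()]
values_list = lambda: list(Model.objects.values_list('value', flat=True))

# Each call forces the queryset to evaluate, so this times the full round trip
print('list comprehension:', timeit.timeit(list_comp, number=100))
print('values_list:       ', timeit.timeit(values_list, number=100))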
I have a React front-end and a Django back-end (used as a REST back-end).
I've inherited the app, and it loads all the user data using many Models and Serializers. It loads very slowly.
It uses a filter to query for a single member, then passes that to a Serializer:
found_account = Accounts.objects.get(id='customer_id')
AccountDetailsSerializer(found_account, context={'request': request}).data
Then there are so many various nested Serializers:
class AccountDetailsSerializer(serializers.ModelSerializer):
    invoices = InvoiceSerializer(many=True)
    orders = OrderSerializer(many=True)
    ....
From looking at the log, it looks like the ORM issues so many queries it's crazy; for some endpoints we end up with 50-60 queries.
Should I attempt to look into using select_related and prefetch_related, or would you skip all of that and just try to write one SQL query that does multiple joins and fetches all the data at once as JSON?
How can I define the prefetch_related / select_related when I pass a single object (the result of get), and not a queryset, to the serializer?
Some db entities don't have links between them, meaning no FK or many-to-many relationships; they just hold a field containing the id of another row, but the relationship is not enforced in the database. Will this be an issue for me? Does it mean, once more, that I should skip the select_related approach and write custom SQL for fetching?
How would you suggest to approach performance tuning this nightmare of queries?
I recommend initially seeing what effect you get with prefetch_related. It can have a major impact on load time, and it is fairly trivial to implement. Going by your example above, something like this alone could reduce load time significantly:
class AccountDetailsSerializer(serializers.ModelSerializer):
    class Meta:
        model = AccountDetails
        fields = (
            'invoices',
            'orders',
        )

    invoices = serializers.SerializerMethodField()
    orders = serializers.SerializerMethodField()

    def get_invoices(self, obj):
        qs = (obj.invoices.all()
              .prefetch_related('invoice_sub_object_1')
              .prefetch_related('invoice_sub_object_2'))
        return InvoiceSerializer(qs, many=True, read_only=True).data

    def get_orders(self, obj):
        qs = (obj.orders.all()
              .prefetch_related('orders_sub_object_1')
              .prefetch_related('orders_sub_object_2'))
        return OrderSerializer(qs, many=True, read_only=True).data
As for your question of architecture, I think a lot of other factors play into whether, and to what degree, you should refactor the codebase. In general though, if you are married to Django and DRF, you'll have a better developer experience if you embrace the idioms and patterns of those frameworks, instead of trying to bypass them with your own fixes.
There's no silver bullet without looking at the code (and the profiling results) in detail.
The only thing that is a no-brainer is enforcing relationships in the models and in the database. This prevents a whole host of bugs, encourages the use of standardized, performant access (rather than concocting SQL on the spot which more often than not is likely to be buggy and slow) and makes your code both shorter and a lot more readable.
Other than that, 50-60 queries can be a lot (if you could do the same job with one or two) or it can be just right - it depends on what you achieve with them.
The use of prefetch_related and select_related is important, yes – but only if used correctly; otherwise it can slow you down instead of speeding you up.
Nested serializers are the correct approach if you need the data – but you need to set up your querysets properly in your viewset if you want them to be fast.
Time the main parts of slow views, inspect the SQL queries sent and check if you really need all data that is returned.
Then you can look at the sore spots and gain time where it matters. Asking specific questions on SO with complete code examples can also get you far fast.
If you have just one top-level object, you can refine the approach offered by @jensmtg, doing all the prefetches that you need at that level and then, for the lower levels, just using ModelSerializers (not SerializerMethodFields) that access the prefetched objects. Look into the Prefetch object, which allows nested prefetching.
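For instance, here is a rough sketch of that single-object refinement against the question's models (the related names and sub-relations are assumptions; adjust them to the real fields):

from django.db.models import Prefetch

account = (
    Accounts.objects
    .prefetch_related(
        Prefetch('invoices',
                 queryset=Invoice.objects.prefetch_related('invoice_sub_object_1')),
        Prefetch('orders',
                 queryset=Order.objects.prefetch_related('orders_sub_object_1')),
    )
    .get(id=customer_id)
)

# Plain nested ModelSerializers can now read the prefetched objects without extra queries
data = AccountDetailsSerializer(account, context={'request': request}).data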
But be aware that prefetch_related is not free; it involves quite some processing in Python. You may be better off using flat (db-view-like) joined queries with values() and values_list().
Given these models:
class B(models.Model):
    my_field = models.TextField()

class A(models.Model):
    b = models.ManyToManyField(B)
I have 50K+ rows in A. When searching for elements I want to do full-text searches on my_field by traversing the many-to-many field b (i.e. b__my_field).
This works fine when the number of many-to-many elements B per A object is less than ~3. However, if I have anything greater than that, performance drops dramatically.
Wondering if I could do some sort of prefetch related search? Is something like haystack my only option?
When you loop through a queryset, Django can end up making a database request for each step of your loop (for example, whenever you touch related objects inside it). See this, for example, on ORM pitfalls. A thing you should learn when using the Django ORM is to write queries that avoid database round trips as much as possible. One way to do that is with the values() function. Ideally you should also fetch only what you need.
Try this:
l = list(a.b.all().values('my_field'))  # where a is an instance of A
This guarantees only one database query and returns a list that you can loop through at Python speed. It should be much faster.
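If the related B objects themselves are needed for each matching A (not just the my_field values), prefetch_related can collapse the per-object m2m lookups into a single extra query. A sketch, assuming the models above, with a simple icontains lookup standing in for real full-text search:

results = (
    A.objects
    .filter(b__my_field__icontains=search_term)  # join on the m2m field
    .distinct()                                  # the join can duplicate A rows
    .prefetch_related('b')                       # one extra query for all related B rows
)

for a in results:
    # Served from the prefetch cache, so no per-object queries here
    matched = [b.my_field for b in a.b.all()]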
First of all, sorry if this isn't an appropriate question for StackOverflow. I've tried to make it as generalisable as possible.
I want to create a database (MySQL, site running Django) that has users, who can be allocated a certain number of points for various types of action - it's a collaborative game. My requirements are to obtain:
the number of points a user has
the user's ranking compared to all other users
and the overall leaderboard (i.e. all users ranked in order of points)
This is what I have so far, in my Django models.py file:
class SiteUser(models.Model):
    name = models.CharField(max_length=250)
    email = models.EmailField(max_length=250)
    date_added = models.DateTimeField(auto_now_add=True)

    def points_total(self):
        points_added = PointsAdded.objects.filter(user=self)
        points_total = 0
        for point in points_added:
            points_total += point.points
        return points_total


class PointsAdded(models.Model):
    user = models.ForeignKey('SiteUser')
    action = models.ForeignKey('Action')
    date_added = models.DateTimeField(auto_now_add=True)

    @property
    def points(self):
        # Look up the point value on the related Action
        return self.action.points


class Action(models.Model):
    points = models.IntegerField()
    action = models.CharField(max_length=36)
However it's rapidly becoming clear to me that it's actually quite complex (in Django query terms at least) to figure out the user's ranking and return the leaderboard of users. At least, I'm finding it tough. Is there a more elegant way to do something like this?
This question seems to suggest that I shouldn't even have a separate points table - what do people think? It feels more robust to have separate tables, but I don't have much experience with database design.
This is old, but I'm not sure exactly why you have two separate tables (PointsAdded & Action). It's late, so maybe my mind isn't ticking, but it seems like you just separated one table into two for some reason. It doesn't seem like you get any benefit out of it. It's not like there's a one-to-many relationship in it, right?
So first of all, I would combine those two tables. Secondly, you are probably better off storing points_total as a value in your SiteUser table. This is what I think Demitry is trying to allude to, but didn't say explicitly. This way, instead of doing the whole additional query (pulling everything a user has done in their history on the site is expensive) plus looping over the actions (going through them is even more expensive), you can just pull it as one field. It's denormalizing the data for the greater good.
Just be sure to update the value every time you add something that has points. You can use Django's post_save signal to do that.
It's a bit more difficult to have points saved in the same table, but it's totally worth it. You can do very simple ordering/filtering operations if you have computed points total on user model. And you can count totals only when something changes (not every time you want to show them). Just put some validation logic into post_save signals and make sure to cover this logic with tests and you're good.
p.s. denormalization on wiki.
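A minimal sketch of the denormalization both answers describe, assuming SiteUser gains a points_total = models.IntegerField(default=0) column and PointsAdded keeps its foreign keys to SiteUser and Action:

from django.db import models
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=PointsAdded)
def add_points(sender, instance, created, **kwargs):
    if created:
        # F() does the increment in the database, avoiding a read-modify-write race
        SiteUser.objects.filter(pk=instance.user_id).update(
            points_total=models.F('points_total') + instance.action.points
        )

Ranking and the leaderboard then become ordinary ordered queries, e.g. SiteUser.objects.order_by('-points_total') for the leaderboard and SiteUser.objects.filter(points_total__gt=user.points_total).count() + 1 for a single user's rank.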
I have the following model structure:
class Container(models.Model):
    pass

class Generic(models.Model):
    name = models.CharField(max_length=255, unique=True)
    cont = models.ManyToManyField(Container, null=True)
    # It is possible to have a Generic object not associated with any container,
    # that's why null=True

class Specific1(Generic):
    ...

class Specific2(Generic):
    ...

...

class SpecificN(Generic):
    ...
Say, I need to retrieve all Specific-type models, that have a relationship with a particular Container.
The SQL for that is more or less trivial, but that is not the question. Unfortunately, I am not very experienced at working with ORMs (Django's ORM in particular), so I might be missing a pattern here.
When done in a brute-force manner:
c = Container.objects.get(name='somename')  # this gets me the container
items = c.generic_set.all()
# this gets me all Generic objects that are related to the container

# Now what? I need to get to the actual Specific objects, so I need to somehow
# get the type of the underlying Specific object and get it
for item in items:
    spec = getattr(item, item.get_my_specific_type())
this results in a ton of db hits (one for each Generic record that relates to the Container), so this is obviously not the way to do it. Now, it could perhaps be done by getting the SpecificX objects directly:
s = Specific1.objects.filter(cont__name='somename')
# This gets me all Specific1 objects for the specified container
...
# do it for every Specific type
That way the db will be hit once for each Specific type (acceptable, I guess).
I know that .select_related() doesn't work with m2m relationships, so it is not of much help here.
To reiterate, the end result has to be a collection of SpecificX objects (not Generic).
I think you've already outlined the two easy possibilities. Either you do a single filter query against Generic and then cast each item to its Specific subtype (results in n+1 queries, where n is the number of items returned), or you make a separate query against each Specific table (results in k queries, where k is the number of Specific types).
It's actually worth benchmarking to see which of these is faster in reality. The second seems better because it's (probably) fewer queries, but each one of those queries has to perform a join with the m2m intermediate table. In the former case you only do one join query, and then many simple ones. Some database backends perform better with lots of small queries than fewer, more complex ones.
If the second is actually significantly faster for your use case, and you're willing to do some extra work to clean up your code, it should be possible to write a custom manager method for the Generic model that "pre-fetches" all the subtype data from the relevant Specific tables for a given queryset, using only one query per subtype table; similar to how this snippet optimizes generic foreign keys with a bulk prefetch. This would give you the same queries as your second option, with the DRYer syntax of your first option.
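A rough sketch of what such a manager/queryset method could look like (this is an assumption, not the linked snippet; it relies on multi-table inheritance children sharing their primary key with the parent row, and on all subtypes being direct subclasses of Generic):

from django.db import models

class SpecificQuerySet(models.QuerySet):
    def to_specific(self):
        generics = list(self)                 # one query for the Generic rows
        ids = [g.pk for g in generics]
        specific_by_pk = {}
        for subtype in self.model.__subclasses__():  # the SpecificN classes
            # one query per subtype table
            for obj in subtype.objects.filter(pk__in=ids):
                specific_by_pk[obj.pk] = obj
        # fall back to the Generic instance if no subtype row was found
        return [specific_by_pk.get(g.pk, g) for g in generics]

With objects = SpecificQuerySet.as_manager() on Generic, usage would then be roughly Generic.objects.filter(cont__name='somename').to_specific().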
Not a complete answer, but you can avoid a great number of hits by doing this:

items = list(items)
for item in items:
    spec = getattr(item, item.get_my_specific_type())

instead of this:

for item in items:
    spec = getattr(item, item.get_my_specific_type())

Indeed, by forcing a cast to a Python list, you force the Django ORM to load all elements in your queryset. It then does this in one query.
I accidentally stumbled upon the following post, which pretty much answers your question:
http://lazypython.blogspot.com/2008/11/timeline-view-in-django.html