Optimize Django Rest ORM queries - django

I a react front-end, django backend (used as REST back).
I've inherited the app, and it loads all the user data using many Models and Serializes. It loads very slow.
It uses a filter to query for a single member, then passes that to a Serializer:
found_account = Accounts.objects.get(id='customer_id')
AccountDetailsSerializer(member, context={'request': request}).data
Then there are so many various nested Serializers:
AccountDetailsSerializers(serializers.ModelSerializer):
Invoices = InvoiceSerializer(many=True)
Orders = OrderSerializer(many=True)
....
From looking at the log, looks like the ORM issues so many queries, it's crazy, for some endpoints we end up with like 50 - 60 queries.
Should I attempt to look into using select_related and prefetch or would you skip all of that and just try to write one sql query to do multiple joins and fetch all the data at once as json?
How can I define the prefetch / select_related when I pass in a single object (result of get), and not a queryset to the serializer?
Some db entities don't have links between them, meaning not fk or manytomany relationships, just hold a field that has an id to another, but the relationship is not enforced in the database? Will this be an issue for me? Does it mean once more that I should skip the select_related approach and write a customer sql for fetching?
How would you suggest to approach performance tuning this nightmare of queries?

I recommend initially seeing what effects you get with prefetch_related. It can have a major impact on load time, and is fairly trivial to implement. Going by your example above something like this could alone reduce load time significantly:
AccountDetailsSerializers(serializers.ModelSerializer):
class Meta:
model = AccountDetails
fields = (
'invoices',
'orders',
)
invoices = serializers.SerializerMethodField()
orders = serializers.SerializerMethodField()
def get_invoices(self, obj):
qs = obj.invoices.all()\
.prefetch_related('invoice_sub_object_1')\
.prefetch_related('invoice_sub_object_2')
return InvoiceSerializer(qs, many=True, read_only=True).data
def get_orders(self, obj):
qs = obj.orders.all()\
.prefetch_related('orders_sub_object_1')\
.prefetch_related('orders_sub_object_2')
return OrderSerializer(qs, many=True, read_only=True).data
As for your question of architecture, I think a lot of other factors play in as to whether and to which degree you should refactor the codebase. In general though, if you are married to Django and DRF, you'll have a better developer experience if you can embrace the idioms and patterns of those frameworks, instead of trying to buypass them with your own fixes.

There's no silver bullet without looking at the code (and the profiling results) in detail.
The only thing that is a no-brainer is enforcing relationships in the models and in the database. This prevents a whole host of bugs, encourages the use of standardized, performant access (rather than concocting SQL on the spot which more often than not is likely to be buggy and slow) and makes your code both shorter and a lot more readable.
Other than that, 50-60 queries can be a lot (if you could do the same job with one or two) or it can be just right - it depends on what you achieve with them.
The use of prefetch_related and select_related is important, yes – but only if used correctly; otherwise it can slow you down instead of speeding you up.
Nested serializers are the correct approach if you need the data – but you need to set up your querysets properly in your viewset if you want them to be fast.
Time the main parts of slow views, inspect the SQL queries sent and check if you really need all data that is returned.
Then you can look at the sore spots and gain time where it matters. Asking specific questions on SO with complete code examples can also get you far fast.
If you have just one top-level object, you can refine the approach offered by #jensmtg, doing all the prefetches that you need at that level and then for the lower levels just using ModelSerializers (not SerializerMethodFields) that access the prefetched objects. Look into the Prefetch object that allows nested prefetching.
But be aware that prefetch_related is not for free, it involves quite some processing in Python; you may be better off using flat (db-view-like) joined queries with values() and values_list.

Related

Optimising API queries using JSONField()

Initial opening: I am utilising postgresql JSONFields.
I have the following attribute (field) in my User model:
class User(AbstractUser):
...
benefits = JSONField(default=dict())
...
I essentially currently serialize benefits for each User on the front end with DRF:
benefits = UserBenefit.objects.filter(user=self)
serializer = UserBenefitSerializer(benefits, many=True)
As the underlying returned benefits changes little and slowly, I thought about "caching" the JSON in the database every time there is a change to improve the performance of the UserBenefit.objects.filter(user=user) QuerySet. Instead, becoming user.benefits and hopefully lightening DB load over 100K+ users.
1st Q:
Should I do this?
2nd Q:
Is there an efficient way to write the corresponding serializer.data <class 'rest_framework.utils.serializer_helpers.ReturnList'> to the JSON field?
I am currently using:
data = serializers.serialize("json", UserBenefit.objects.filter(user=self))
For your first question:
It's not a bad idea if you don't want to use caching alternatives.
If you have to query the database because of some changes or ... and you can't cache the hole request, then the idea of saving a JSON object can be a pretty good idea. This way you only retrieve the data and skip most parts of serializing and also terminate the need to query a pivot table to get the m2m data. But also note that this way, you are adding a whole bunch of extra data to your rows and unless you're going to need them most of the time, and you will get extra data that you don't really need which you can help it using values function on querysets but still it requires more coding. Basically, you're going to use more bandwidth for your first query and more storage to store the data instead of process power. Also, the pagination will be really hard to achieve on your benefits if you need it at some point.
Getting m2m relation data is usually pretty fast depending on the amount of data you have on your database but the ultimate way of getting better performance is caching the requests and reducing the database hits as much as possible.
And as you probably hear it a lot, you should test and benchmark to see which options really works for you the best depending on your requirements and limitations. It's really hard to suggest an optimization method without knowing the information about the whole scope and the current solution.
And for your second question:
I think I don't really get it. If you are storing a JSON object which is a field in User model, then why do you need data = serializers.serialize("json", UserBenefit.objects.filter(user=self)) ?
You don't need it since the serializer can just return the JSON field data.

Should I use JSONField over ForeignKey to store data?

I'm facing a dilemma, I'm creating a new product and I would not like to mess up the way I organise the informations in my database.
I have these two choices for my models, the first one would be to use foreign keys to link my them together.
Class Page(models.Model):
data = JsonField()
Class Image(models.Model):
page = models.ForeignKey(Page)
data = JsonField()
Class Video(models.Model):
page = models.ForeignKey(Page)
data = JsonField()
etc...
The second is to keep everything in Page's JSONField:
Class Page(models.Model):
data = JsonField() # videos and pictures, etc... are stored here
Is one better than the other and why? This would be a huge help on the way I would organize my databases in the futur.
I thought maybe the second option could be slower since everytime something changes all the json would be overridden, but does it make a huge difference or is what I am saying false?
A JSONField obfuscates the underlying data, making it difficult to write readable code and fully use Django's built-in ORM, validations and other niceties (ModelForms for example). While it gives flexibility to save anything you want to the db (e.g. no need to migrate the db when adding new fields), it takes away the clarity of explicit fields and makes it easy to introduce errors later on.
For example, if you start saving a new key in your data and then try to access that key in your code, older objects won't have it and you might find your app crashing depending on which object you're accessing. That can't happen if you use a separate field.
I would always try to avoid it unless there's no other way.
Typically I use a JSONField in two cases:
To save a response from 3rd party APIs (e.g. as an audit trail)
To save references to archived objects (e.g. when the live products in my db change but I still have orders referencing the product).
If you use PostgreSQL, as a relational database, it's optimised to be super-performant on JOINs so using ForeignKeys is actually a good thing. Use select_related and prefetch_related in your code to optimise the number of queries made, but the queries themselves will scale well even for millions of entries.

How to optimize this specific Queryset?

In a Django application, I have a set of companies with a history of actions that have been performed in the past. I retrieve those transactions with the following, scandalously expensive Queryset:
class Company(AbstractBaseUser, PermissionsMixin):
...
def get_transactions_history(self):
return Transaction.objects.filter(sponsorship__campaign__shop__company=self)
Obviously, this leads to a lot of JOIN instructions from the ORM and since the number of Transactions can increase rapidly, the throughput on the db is also exploding because of this Queryset.
Assuming there is no shortcut between Transaction and Company other than chaining through Sponsorship, Campaign and Shop as exposed above, how would you optimize the Queryset without touching at the database schema?
I don't think you really can but you can be smarter with how you use it.
Theres cached_property
#cached_property
def history(self):
Which does exactly what it says it is.
Otherwise you need to split this into a recent_history that splices the queryset
.filter(...)[:10]
or pagination
If you can't optimize it at the SQL level then there's not much you can do at the ORM level, obviously.

Using django prefetch_related() to get the time of last activity

I upgraded to Django 1.7 so I could get Prefetch objects, but I'm having a hard time getting them to behave as expected.
I have an Employee model like this:
class Employee(Human):
... additional Employee Fields ...
def get_last_activity_date(self):
try:
return self.activity_set.all().order_by('-when')[0:1].get().when
except Activity.DoesNotExist:
return None
and activities like this
class Activity(models.Model):
when = models.DateTimeField()
employee = models.ForeignKey(Employee, related_name='activity_set')
I want to use prefetch_related to get the last activity date for this employee. I've tried to express this many ways, but no matter how I do it, it ends up generating another query. My other 2 my prefetch_related parts work as expected, but this one does not ever seem to save me any queries.
I'm using this with Django Rest Framework, so I really need the prefetch_related part to work since I have no way of reaching inside DRF to do the mapping outside of the queryset.
Here is one of the ways that DOES NOT WORK
def get_queryset(self):
return super(EmployeeViewSet, self).get_queryset()\
.prefetch_related('phone_number_set', 'email_address_set')\
.prefetch_related(Prefetch('activity_set', Activity.objects.all().order_by('-when')))\
.order_by('last_name', 'first_name')
Notice that on the activity_set prefetch query that I can't slice to only get the latest entry either which is a concern in terms of how much memory this is going to eat up.
I do actually see the prefetch query take place, but then each employee gets a separate query for that piece of information, meaning I have one bigger wasted query and still get the ~200 querys I'm trying to prevent.
How do you get the prefetch_related to work for me in this case?
I suspect You are missing the point of prefetch_related. The docs state that this is the expected behavior: many queries and the 'joining' in python. If you want less queries you should use select_related Also I'm not sure It would work with your specific models (not stated in the question) since select_related does work well with many to many relations.
UPDATE - the docs:
prefetch_related, on the other hand, does a separate lookup for each relationship, and does the ‘joining’ in Python

How to retrieve values from Django ForeignKey -> ManyToMany fields?

I have a model (Realtor) with a ForeignKey field (BillingTier), which has a ManyToManyField (BillingPlan). For each logged in realtor, I want to check if they have a billing plan that offers automatic feedback on their listings. Here's what the models look like, briefly:
class Realtor(models.Model):
user = models.OneToOneField(User)
billing_tier = models.ForeignKey(BillingTier, blank=True, null=True, default=None)
class BillingTier(models.Model):
plans = models.ManyToManyField(BillingPlan)
class BillingPlan(models.Model):
automatic_feedback = models.BooleanField(default=False)
I have a permissions helper that checks the user permissions on each page load, and denies access to certain pages. I want to deny the feedback page if they don't have the automatic feedback feature in their billing plan. However, I'm not really sure the best way to get this information. Here's what I've researched and found so far, but it seems inefficient to be querying on each page load:
def isPermitted(user, url):
premium = [t[0] for t in user.realtor.billing_tier.plans.values_list('automatic_feedback') if t[0]]
I saw some solutions which involved using filter (ManyToMany field values from queryset), but I'm equally unsure of using the query for each page load. I would have to get the billing tier id from the realtor: bt_id = user.realtor.billing_tier.id and then query the model like so:
BillingTier.objects.filter(id = bt_id).filter(plans__automatic_feedback=True).distinct()
I think the second option reads nicer, but I think the first would perform better because I wouldn't have to import and query the BillingTier model.
Is there a better option, or are these two the best I can hope for? Also, which would be more efficient for every page load?
As per the OP's invitation, here's an answer.
The core question is how to define an efficient permission check based on a highly relational data model.
The first variant involves building a Python list from evaluating a Django query set. The suspicion must certainly be that it imposes unnecessary computations on the Python interpreter. Although it's not clear whether that's tolerable if at the same time it allows for a less complex database query (a tradeoff which is hard to assess), the underlying DB query is not exactly simple.
The second approach involves fetching additional 1:1 data through relational lookups and then checking if there is any record fulfilling access criteria in a different, 1:n relation.
Let's have a look at them.
bt_id = user.realtor.billing_tier.id: This is required to get the hook for the following 1:n query. It is indeed highly inefficient in itself. It can be optimized in two ways.
As per Django: Access Foreign Keys Directly, it can be written as bt_id = user.realtor.billing_tier_id because the id is of course present in billing_tier and needs not be found via a relational operation.
Assuming that the page in itself would only load a user object, Django can be told to fetch and cache relational data along with that via select_related. So if the page does not only fetch the user object but the required billing_tier_id as well, we have saved one additional DB hit.
BillingTier.objects.filter(id = bt_id).filter(plans__automatic_feedback=True).distinct() can be optimized using Django's exists because that will redurce efforts both in the database and regarding data traffic between the database and Python.
Maybe even Django's prefetch_related can be used to combine the 1:1 and 1:n queries into a single query, but it's much more difficult to judge whether that pays. Could be worth a try.
In any case, it's worth installing a gem called Django Debug Toolbar which will allow you to analyze how much time your implementation spends on database queries.