Using django prefetch_related() to get the time of last activity - django

I upgraded to Django 1.7 so I could get Prefetch objects, but I'm having a hard time getting them to behave as expected.
I have an Employee model like this:
class Employee(Human):
    # ... additional Employee fields ...

    def get_last_activity_date(self):
        try:
            return self.activity_set.all().order_by('-when')[0:1].get().when
        except Activity.DoesNotExist:
            return None
and activities like this
class Activity(models.Model):
    when = models.DateTimeField()
    employee = models.ForeignKey(Employee, related_name='activity_set')
I want to use prefetch_related to get the last activity date for each employee. I've tried to express this many ways, but no matter how I do it, it ends up generating another query. My other two prefetch_related parts work as expected, but this one never seems to save me any queries.
I'm using this with Django Rest Framework, so I really need the prefetch_related part to work since I have no way of reaching inside DRF to do the mapping outside of the queryset.
Here is one of the ways that DOES NOT WORK
def get_queryset(self):
    return super(EmployeeViewSet, self).get_queryset()\
        .prefetch_related('phone_number_set', 'email_address_set')\
        .prefetch_related(Prefetch('activity_set', Activity.objects.all().order_by('-when')))\
        .order_by('last_name', 'first_name')
Notice that on the activity_set prefetch I can't slice the queryset to get only the latest entry either, which is a concern in terms of how much memory this is going to eat up.
I do actually see the prefetch query take place, but then each employee still gets a separate query for that piece of information, meaning I have one bigger wasted query and still the ~200 queries I'm trying to prevent.
How do you get the prefetch_related to work for me in this case?

I suspect you are missing the point of prefetch_related. The docs state that this is the expected behavior: many queries, with the 'joining' done in Python. If you want fewer queries you should use select_related. Also, I'm not sure it would work with your specific models (not stated in the question), since select_related does not work with many-to-many relations.
UPDATE - the docs:
prefetch_related, on the other hand, does a separate lookup for each relationship, and does the ‘joining’ in Python
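For what it's worth, when only the latest date per employee is needed, an aggregate annotation can deliver it in a single query instead of prefetching the whole activity_set. A sketch using the models from the question (the alias last_activity_date is hypothetical):

```python
from django.db.models import Max

# One query: the database computes MAX(activity.when) per employee
# via a LEFT JOIN, so employees with no activities simply get None.
employees = (
    Employee.objects
    .annotate(last_activity_date=Max('activity_set__when'))
    .order_by('last_name', 'first_name')
)
```

Since the value lands as an attribute on each Employee instance, DRF serializers can read it without any per-row queries.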

Related

Optimize Django Rest ORM queries

I have a React front-end and a Django backend (used as a REST backend).
I've inherited the app, and it loads all the user data using many Models and Serializers. It loads very slowly.
It uses a filter to query for a single member, then passes that to a Serializer:
found_account = Accounts.objects.get(id='customer_id')
AccountDetailsSerializer(member, context={'request': request}).data
Then there are so many various nested Serializers:
class AccountDetailsSerializers(serializers.ModelSerializer):
    Invoices = InvoiceSerializer(many=True)
    Orders = OrderSerializer(many=True)
    ....
From looking at the log, it looks like the ORM issues a crazy number of queries; for some endpoints we end up with 50-60 queries.
Should I attempt to look into using select_related and prefetch or would you skip all of that and just try to write one sql query to do multiple joins and fetch all the data at once as json?
How can I define the prefetch / select_related when I pass in a single object (result of get), and not a queryset to the serializer?
Some db entities don't have links between them, meaning no FK or many-to-many relationships; they just hold a field containing the id of another row, but the relationship is not enforced in the database. Will this be an issue for me? Does it mean once more that I should skip the select_related approach and write custom SQL for fetching?
How would you suggest to approach performance tuning this nightmare of queries?
I recommend initially seeing what effects you get with prefetch_related. It can have a major impact on load time, and is fairly trivial to implement. Going by your example above something like this could alone reduce load time significantly:
class AccountDetailsSerializers(serializers.ModelSerializer):
    class Meta:
        model = AccountDetails
        fields = (
            'invoices',
            'orders',
        )

    invoices = serializers.SerializerMethodField()
    orders = serializers.SerializerMethodField()

    def get_invoices(self, obj):
        qs = obj.invoices.all()\
            .prefetch_related('invoice_sub_object_1')\
            .prefetch_related('invoice_sub_object_2')
        return InvoiceSerializer(qs, many=True, read_only=True).data

    def get_orders(self, obj):
        qs = obj.orders.all()\
            .prefetch_related('orders_sub_object_1')\
            .prefetch_related('orders_sub_object_2')
        return OrderSerializer(qs, many=True, read_only=True).data
As for your question of architecture, I think a lot of other factors play into whether, and to what degree, you should refactor the codebase. In general though, if you are married to Django and DRF, you'll have a better developer experience if you embrace the idioms and patterns of those frameworks, instead of trying to bypass them with your own fixes.
There's no silver bullet without looking at the code (and the profiling results) in detail.
The only thing that is a no-brainer is enforcing relationships in the models and in the database. This prevents a whole host of bugs, encourages the use of standardized, performant access (rather than concocting SQL on the spot which more often than not is likely to be buggy and slow) and makes your code both shorter and a lot more readable.
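Enforcing such a relationship is usually a small model change (hypothetical names; a data migration to backfill the column would also be needed):

```python
class Invoice(models.Model):
    # Before: account_id = models.IntegerField()  # bare, unenforced id
    # After: a real ForeignKey that the database enforces and the ORM
    # can traverse with select_related/prefetch_related.
    account = models.ForeignKey(
        'Accounts', on_delete=models.CASCADE, related_name='invoices'
    )
```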
Other than that, 50-60 queries can be a lot (if you could do the same job with one or two) or it can be just right - it depends on what you achieve with them.
The use of prefetch_related and select_related is important, yes – but only if used correctly; otherwise it can slow you down instead of speeding you up.
Nested serializers are the correct approach if you need the data – but you need to set up your querysets properly in your viewset if you want them to be fast.
Time the main parts of slow views, inspect the SQL queries sent and check if you really need all data that is returned.
Then you can look at the sore spots and gain time where it matters. Asking specific questions on SO with complete code examples can also get you far fast.
If you have just one top-level object, you can refine the approach offered by #jensmtg, doing all the prefetches that you need at that level and then for the lower levels just using ModelSerializers (not SerializerMethodFields) that access the prefetched objects. Look into the Prefetch object that allows nested prefetching.
But be aware that prefetch_related is not for free, it involves quite some processing in Python; you may be better off using flat (db-view-like) joined queries with values() and values_list.
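A sketch of that refinement, with hypothetical names: do all the prefetching at the top level in the viewset, then let plain nested ModelSerializers read the already-cached related objects:

```python
from django.db.models import Prefetch

class AccountViewSet(viewsets.ModelViewSet):
    serializer_class = AccountDetailsSerializer

    def get_queryset(self):
        return Accounts.objects.prefetch_related(
            # Nested Prefetch: invoices and orders arrive with their
            # sub-objects already loaded, so the nested serializers
            # issue no extra queries per account.
            Prefetch('invoices',
                     queryset=Invoice.objects.prefetch_related('invoice_sub_object_1')),
            Prefetch('orders',
                     queryset=Order.objects.prefetch_related('orders_sub_object_1')),
        )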

Should I use JSONField over ForeignKey to store data?

I'm facing a dilemma: I'm creating a new product and I would not like to mess up the way I organise the information in my database.
I have these two choices for my models. The first would be to use foreign keys to link them together:
class Page(models.Model):
    data = JSONField()

class Image(models.Model):
    page = models.ForeignKey(Page)
    data = JSONField()

class Video(models.Model):
    page = models.ForeignKey(Page)
    data = JSONField()

etc...
The second is to keep everything in Page's JSONField:
class Page(models.Model):
    data = JSONField()  # videos, pictures, etc. are stored here
Is one better than the other, and why? This would be a huge help for how I organize my databases in the future.
I thought the second option might be slower, since every time something changes the whole JSON would be overwritten, but does it make a huge difference, or is what I am saying false?
A JSONField obfuscates the underlying data, making it difficult to write readable code and fully use Django's built-in ORM, validations and other niceties (ModelForms for example). While it gives flexibility to save anything you want to the db (e.g. no need to migrate the db when adding new fields), it takes away the clarity of explicit fields and makes it easy to introduce errors later on.
For example, if you start saving a new key in your data and then try to access that key in your code, older objects won't have it and you might find your app crashing depending on which object you're accessing. That can't happen if you use a separate field.
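That failure mode can be illustrated with plain dicts, which is what a JSONField deserializes to (the key names here are hypothetical):

```python
# Older rows were saved before the 'videos' key existed, so the
# deserialized JSON is simply missing it.
old_page = {"title": "Home"}
new_page = {"title": "About", "videos": [{"url": "/v/1.mp4"}]}

# Direct access crashes on older data:
try:
    old_page["videos"]
except KeyError:
    pass  # this is the "app crashing" failure mode described above

# Every read site must remember to supply a default instead:
def video_count(page_data):
    return len(page_data.get("videos", []))

assert video_count(old_page) == 0
assert video_count(new_page) == 1
```

With a separate model field, the schema guarantees the attribute exists on every row, so no such defensive code is needed.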
I would always try to avoid it unless there's no other way.
Typically I use a JSONField in two cases:
To save a response from 3rd party APIs (e.g. as an audit trail)
To save references to archived objects (e.g. when the live products in my db change but I still have orders referencing the product).
If you use PostgreSQL, as a relational database, it's optimised to be super-performant on JOINs so using ForeignKeys is actually a good thing. Use select_related and prefetch_related in your code to optimise the number of queries made, but the queries themselves will scale well even for millions of entries.
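With the ForeignKey layout, the related media can still be loaded without an N+1 pattern. A sketch assuming the default reverse accessor names:

```python
# Three queries total (pages, images, videos), no matter how many pages.
pages = Page.objects.prefetch_related('image_set', 'video_set')
for page in pages:
    images = list(page.image_set.all())  # served from the prefetch cache
    videos = list(page.video_set.all())  # no per-page query
```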

How to optimize this specific Queryset?

In a Django application, I have a set of companies with a history of actions that have been performed in the past. I retrieve those transactions with the following, scandalously expensive Queryset:
class Company(AbstractBaseUser, PermissionsMixin):
    ...

    def get_transactions_history(self):
        return Transaction.objects.filter(sponsorship__campaign__shop__company=self)
Obviously, this leads to a lot of JOIN instructions from the ORM and since the number of Transactions can increase rapidly, the throughput on the db is also exploding because of this Queryset.
Assuming there is no shortcut between Transaction and Company other than chaining through Sponsorship, Campaign and Shop as exposed above, how would you optimize the Queryset without touching at the database schema?
I don't think you really can, but you can be smarter about how you use it.
There's cached_property:
from django.utils.functional import cached_property

@cached_property
def history(self):
    return Transaction.objects.filter(sponsorship__campaign__shop__company=self)
which does exactly what it says: the result is computed once per instance and then cached.
Otherwise you need to split this into a recent_history that slices the queryset:
.filter(...)[:10]
or use pagination.
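A sketch of the recent_history idea (the slice becomes a SQL LIMIT, so only a handful of rows ever leave the database):

```python
def recent_history(self, limit=10):
    # order_by makes the slice deterministic; [:limit] is translated
    # into LIMIT in the generated SQL rather than fetching everything.
    return (
        Transaction.objects
        .filter(sponsorship__campaign__shop__company=self)
        .order_by('-id')[:limit]
    )
```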
If you can't optimize it at the SQL level then there's not much you can do at the ORM level, obviously.

Why is the database used by Django subqueries not sticky?

I have a concern with Django subqueries when using the Django ORM. When we fetch a queryset or perform a DB operation, I have the option of bypassing all assumptions Django might make about which database to use, by forcing usage of the specific database that I want:
b_det = Book.objects.using('some_db').filter(book_name = 'Mark')
The above disregards any database routers I might have set and goes straight to 'some_db'.
But if my models approximately look like so :-
class Author(models.Model):
    author_name = models.CharField(max_length=255)
    author_address = models.CharField(max_length=255)

class Book(models.Model):
    book_name = models.CharField(max_length=255)
    author = models.ForeignKey(Author, null=True)
And I fetch a QuerySet representing all books that are called Mark like so:-
b_det = Book.objects.using('some_db').filter(book_name = 'Mark')
Then later if somewhere in the code I trigger a subquery by doing something like:-
if b_det:
    auth_address = b_det[0].author.author_address
Then this does not make use of the original database 'some_db' that I had specified early on for the main query. This again goes through the routers and picks up (possibly) the incorrect database.
Why does Django do this? IMHO, if I had forced usage of a database for the original query, then the same database should be used for the subquery as well. Why must the database routers come into the picture at all?
This is not a subquery in the strict SQL sense of the word. What you are actually doing here is executing one query and using its result to find related items.
You can chain filters and do lots of other operations on a queryset, but it will not be executed until you force evaluation, for example by taking an index or slice of it, iterating over it, or calling something like list() on it. Here you are taking an index:
auth_address = b_det[0].author.author_address
So you have a materialized query, and you are now trying to find the address of the related author. That requires another query, but since you are not calling using() on it, Django is free to choose which database to use. You can overcome this by using select_related.
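A sketch of that fix with the question's models: select_related pulls the author in via a JOIN in the original query, so no second query, and therefore no router decision, is ever needed:

```python
b_det = (Book.objects.using('some_db')
         .select_related('author')
         .filter(book_name='Mark'))
if b_det:
    # The author row was loaded by the JOIN on 'some_db'; accessing it
    # here reads from the cache instead of hitting the routers again.
    auth_address = b_det[0].author.author_address
```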

Django .order_by() with .distinct() using postgres

I have a Read model that is related to an Article model. What I would like to do is make a queryset where articles are unique and ordered by date_added. Since I'm using postgres, I'd prefer to use the .distinct() method and specify the article field. Like so:
articles = Read.objects.order_by('article', 'date_added').distinct('article')
However this doesn't give the desired effect and orders the queryset by the order they were created. I am aware of the note about .distinct() and .order_by() in Django's documentation, but I don't see that it applies here since the side effect it mentions is there will be duplicates and I'm not seeing that.
# To actually sort by date added I end up doing this
articles = sorted(articles, key=lambda x: x.date_added, reverse=True)
This executes the entire query before I actually need it and could potentially get very slow if there are lots of records. I've already optimized using select_related().
Is there a better, more efficient, way to create a query with uniqueness of a related model and order_by date?
UPDATE
The output would ideally be a queryset of Read instances where each related article is unique in the queryset, using only the Django ORM (i.e. without sorting in Python).
Is there a better, more efficient, way to create a query with uniqueness of a related model and order_by date?
Possibly. It's hard to say without the full picture, but my assumption is that you are using Read to track which articles have and have not been read, probably tied to a User instance to determine whether a particular user has read an article or not. If that's the case, your approach is flawed. Instead, you should do something like:
class Article(models.Model):
    ...
    read_by = models.ManyToManyField(User, related_name='read_articles')
Then, to get a particular user's read articles, you can just do:
user_instance.read_articles.order_by('date_added')
That takes the need to use distinct out of the equation, since there will not be any duplicates now.
UPDATE
To get all articles that are read by at least one user:
Article.objects.filter(read_by__isnull=False)
Or, if you want to set a threshold for popularity, you can use annotations:
from django.db.models import Count
Article.objects.annotate(read_count=Count('read_by')).filter(read_count__gte=10)
Which would give you only articles that have been read by at least 10 users.
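If restructuring to a ManyToManyField is not an option, one ORM-only pattern for the original Read model (assuming PostgreSQL, since it relies on DISTINCT ON) is to feed the distinct queryset into an outer query that re-orders it:

```python
# Inner query: one Read per article (the newest, thanks to '-date_added').
# Outer query: re-orders those rows by date_added, which a plain
# .distinct('article') queryset does not allow.
reads = Read.objects.filter(
    pk__in=Read.objects
               .order_by('article', '-date_added')
               .distinct('article')
               .values('pk')
).order_by('-date_added')
```

Both parts execute as a single SQL statement (the inner queryset becomes a subquery), so nothing is sorted in Python.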