I have come across a strange query performance issue that I am struggling to understand.
The following is a simplified version of my model structure; hopefully it is enough to illustrate the issue:
class Note(models.Model):
    ...
    name = models.CharField(max_length=50)
    parentNote = models.ForeignKey('self', null=True)
    form = models.ForeignKey('NoteForm', null=True)
    ...

class Event(Note):
    ...
    startDate = models.DateField()
    ...

class Activity(Event):
    ...
The Activity model is the source of the issue I am facing. It has an extensive inheritance hierarchy, none of which is abstract. I do not know if this contributes to the issue. Activity has ~280000 records and, obviously, its parents have at least that many, if not more.
The NoteForm model is not described above - it is only necessary to know that it is external to the Activity model's hierarchy and contains fewer than 100 records.
I am using Django version 1.3.
The problem occurs when querying for the latest "child" Activity of some parent Activity. The query filters by the parentNote field, orders by the 'startDate' field (descending) and uses Python's index notation to select the first result (which, by my understanding, simply adds LIMIT 1 to the generated SQL). See below for the code.
This query runs unexpectedly slowly when no results are found - 10+ seconds. If results are found, it runs as expected - well under 1 second.
Further investigation revealed the following:
It is the limit causing the issue. Just doing the filter, without limiting to the first result, is not slow - whether results are found or not.
Ordering is partially a culprit. Removing the ordering removes the issue.
The parentNote filter is partially a culprit. Changing the filter to use the form or name field removes the issue.
In code:
# Original - SLOW
try:
    latest = Activity.objects.filter(
        parentNote=activity.pk
    ).order_by('-startDate')[0]
except IndexError:
    latest = None

# FAST

# No limit
Activity.objects.filter(
    parentNote=activity.pk
).order_by('-startDate')

# No ordering
try:
    latest = Activity.objects.filter(
        parentNote=activity.pk
    )[0]
except IndexError:
    latest = None

# Different filter (form)
try:
    latest = Activity.objects.filter(
        form=activity.pk
    ).order_by('-startDate')[0]
except IndexError:
    latest = None

# Different filter (name)
try:
    latest = Activity.objects.filter(
        name=activity.pk
    ).order_by('-startDate')[0]
except IndexError:
    latest = None
If the issue is at the database level, I can't see it. I've run the "Original" and "No Limit" examples from above in the django-debug-toolbar's debugsqlshell. The "Original" took 16 seconds and "No Limit" took 59ms. I copied both queries printed by the debugsqlshell and ran them in pgAdmin. "Original" took 1375ms and "No Limit" took 94ms. So it is slower, but not by the amount I'm seeing using the ORM. EXPLAIN ANALYZE definitely shows the query analyzer taking different paths, which I completely understand. But I cannot reproduce the 16 second query using SQL directly.
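A timing harness along these lines can help separate ORM overhead from database time. A rough sketch (it assumes DEBUG=True, so that Django records each query's duration in connection.queries):

import time
from django.db import connection, reset_queries

reset_queries()
start = time.time()
try:
    latest = Activity.objects.filter(
        parentNote=activity.pk
    ).order_by('-startDate')[0]
except IndexError:
    latest = None
elapsed = time.time() - start

print 'wall clock: %.3fs' % elapsed
for q in connection.queries:
    # q['time'] is the duration reported back to Django for this query
    print q['time'], q['sql']

If the wall-clock time is much larger than the summed query times, the extra cost is being incurred inside the ORM rather than in the database.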
So, in summary:
I am seeing LIMIT queries running far slower than identical queries without the LIMIT, but only when no results are found.
Queries that return results do not run slowly - and they are identical apart from the values of the filters.
It appears to be a function of which fields are included in the filters, and whether or not the queryset is ordered.
It does NOT appear to be a database level issue as running the SQL directly does not run slowly.
Update:
While trying suggestions made in the comments, the above examples suddenly ceased suffering from this issue - before I found any evidence as to the cause, let alone implemented a fix. I still have no idea what the problem was, but now I do not have a means to reproduce it in order to further investigate.
Related
When attempting to return a list of values from django objects, will performance be better using a list comprehension:
[x.value for x in Model.objects.all()]
or calling list() on django's values_list function:
list(Model.objects.values_list('value', flat=True))
and why?
The most efficient way is the second approach (using values_list()), because it modifies the SQL query sent to the database to select only the values provided.
The first approach fetches complete model instances from the database and only extracts the single value afterwards in Python, so you have already spent the resources to fetch every field.
You can compare the generated queries by wrapping your QuerySet in str(queryset.query), which returns the actual SQL that gets executed.
See the example below:
class Model(models.Model):
    foo = models.CharField()
    bar = models.CharField()

str(Model.objects.all().query)
# SELECT "model"."id", "model"."foo", "model"."bar" FROM "model"

str(Model.objects.values_list("foo").query)
# SELECT "model"."foo" FROM "model"
I had also somewhat assumed the argument in the currently-accepted answer would be correct. Namely, that fetching fewer fields would lead to Model.objects.values_list('foo') taking less time to execute than Model.objects.all(). However, I didn't find this in practice when using %timeit.
I actually found that Model.objects.values_list('foo', flat=True) would take ~2-10x longer than just Model.objects.all(). I found this was the case for:
an empty django table
a table with 10s of rows
a table with millions of rows
Including/removing flat=True seemed to make no significant difference in execution time for values_list. I would be interested in what others find as well.
So, from a pure "what SQL is executed" point of view, although the values_list query fetches fewer field values from the db, I imagine there is additional logic in the Django source for .all() vs .values_list() that could lead to different execution times (including .all() taking less time).
However, to fully address the initial example code, we would also need to factor in anything affecting execution time due to using a list comprehension [] in the .all() case vs list() in the .values_list() case. The general discussion of list() vs a list comprehension is covered in other questions already.
TL;DR: I imagine it is a trade-off between these two factors:
the apparent difference in execution time between .values_list() and .all() (my tests indicate we can't simply deduce that fetching fewer fields leads to faster execution; more investigation of the underlying Django source is needed to explain this)
any differences between using a list comprehension and list()
In my test cases, I generally found the .all() query was actually faster than the .values_list() query, but when also factoring in the transformation to a list, the .values_list scenario would overall take less time. So it may well depend on the scenario...
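For anyone who wants to repeat the comparison, here is a minimal benchmark along the lines described above (run from a Django shell; myapp, Model and the value field are stand-ins for whatever you are measuring):

import timeit

setup = 'from myapp.models import Model'  # hypothetical app and model

t_all = timeit.timeit(
    '[x.value for x in Model.objects.all()]',
    setup=setup, number=100)
t_values_list = timeit.timeit(
    "list(Model.objects.values_list('value', flat=True))",
    setup=setup, number=100)

# Both timings include query execution AND list construction, so they
# compare the two complete approaches rather than just the SQL.
print('[x.value for x in .all()]: %.3fs' % t_all)
print("list(.values_list('value', flat=True)): %.3fs" % t_values_list)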
I am using Python to query the Reevoo API. As far as I can tell, the options for filtering are somewhat limited and the docs are an exhaustive list of what query parameters you can use. I was wondering if anybody had found a way to filter customer experience reviews with a date range?
Currently my hack solution is to use a generator which calls the API page by page and yields the review if its publish_date is after a certain date, which is obviously really inefficient. It doesn't help that the API returns the results slightly out of order, so I can't break/return as soon as I find one review that's out of range.
for i in range(number_of_pages, 0, -1):
    # API call wrapper
    page_of_reviews = self.reevoo.get_customer_experience_review_list(
        self.trkref, older_reviews=True, page=i, per_page=30)
    page_of_reviews = json.loads(page_of_reviews.text.replace('\r\n', ''))
    customer_experience_reviews = page_of_reviews.get('customer_experience_reviews')
    processed_reviews = self.process_customer_experience_reviews(customer_experience_reviews)
    for item in processed_reviews['review_list']:
        if from_dt:
            if datetime.strptime(item['publish_date'], '%Y-%m-%d') >= datetime.strptime(from_dt, '%Y-%m-%d'):
                yield item
        else:
            yield item
I've scoured the docs and Reevoo's GitHub page and haven't found anything, but in the hopes that some random person on the Internet has found a workaround... Does anyone have any ideas?
I emailed Reevoo to ask about date filtering and the short answer is that there is no way to filter or sort by date.
Explanation from the email:
Unfortunately, we cannot filter reviews by date because when we display the reviews, they are not necessarily in date order. For example, reviews with written content come before those which don't have written content, as they have more value to the consumer. We would also prefer that you refreshed everything at least once a day, because older reviews sometimes have to be renewed or customers may sometimes request that their reviews be amended.
I understand why you would like to do date filtering but at the moment, if you are caching reviews on your server, this is the way we prefer you to do it.
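Given that, the practical workaround seems to be caching all reviews locally (refreshing at least once a day, as the email suggests) and filtering the cached copy client-side. A minimal sketch of such a filter (the function name is mine; publish_date is assumed to be a 'YYYY-MM-DD' string, as in the question):

from datetime import datetime

def filter_reviews_by_date(reviews, from_dt=None, to_dt=None):
    # Yield cached reviews whose publish_date falls within [from_dt, to_dt].
    for review in reviews:
        published = datetime.strptime(review['publish_date'], '%Y-%m-%d')
        if from_dt and published < datetime.strptime(from_dt, '%Y-%m-%d'):
            continue
        if to_dt and published > datetime.strptime(to_dt, '%Y-%m-%d'):
            continue
        yield review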
I upgraded to Django 1.7 so I could get Prefetch objects, but I'm having a hard time getting them to behave as expected.
I have an Employee model like this:
class Employee(Human):
    # ... additional Employee fields ...

    def get_last_activity_date(self):
        try:
            return self.activity_set.all().order_by('-when')[0:1].get().when
        except Activity.DoesNotExist:
            return None
and an Activity model like this:
class Activity(models.Model):
    when = models.DateTimeField()
    employee = models.ForeignKey(Employee, related_name='activity_set')
I want to use prefetch_related to get the last activity date for this employee. I've tried to express this in many ways, but no matter how I do it, it ends up generating another query. My other two prefetch_related parts work as expected, but this one never seems to save me any queries.
I'm using this with Django Rest Framework, so I really need the prefetch_related part to work since I have no way of reaching inside DRF to do the mapping outside of the queryset.
Here is one of the ways that DOES NOT WORK
def get_queryset(self):
    return super(EmployeeViewSet, self).get_queryset()\
        .prefetch_related('phone_number_set', 'email_address_set')\
        .prefetch_related(Prefetch('activity_set', Activity.objects.all().order_by('-when')))\
        .order_by('last_name', 'first_name')
Notice that I also can't slice the activity_set prefetch query to get only the latest entry, which is a concern in terms of how much memory this is going to eat up.
I do actually see the prefetch query take place, but then each employee gets a separate query for that piece of information, meaning I have one bigger wasted query and still get the ~200 queries I'm trying to prevent.
How do you get the prefetch_related to work for me in this case?
I suspect you are missing the point of prefetch_related. The docs state that this is the expected behavior: many queries, and the 'joining' done in Python. If you want fewer queries you should use select_related. Also, I'm not sure it would work with your specific models (not stated in the question), since select_related does not work with many-to-many relations.
UPDATE - the docs:
prefetch_related, on the other hand, does a separate lookup for each relationship, and does the ‘joining’ in Python
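That said, the per-employee query can be avoided by prefetching the ordered activities once and picking the first one in Python, instead of slicing inside get_last_activity_date (slicing creates a new queryset and bypasses the prefetch cache). A sketch using Prefetch's to_attr, available since Django 1.7 (the attribute name is illustrative):

from django.db.models import Prefetch

# Two queries in total: one for the employees, one for all their activities.
employees = Employee.objects.prefetch_related(
    Prefetch('activity_set',
             queryset=Activity.objects.order_by('-when'),
             to_attr='ordered_activities'))

for employee in employees:
    # The prefetched list is already ordered; no extra query per employee.
    last_when = (employee.ordered_activities[0].when
                 if employee.ordered_activities else None)

The trade-off, as noted in the question, is that all of each employee's activities are held in memory, since the prefetch queryset cannot be sliced.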
I have a problem with a Django application. Queries on the model Scope are extremely slow and after some debugging I still have no clue where the problem lies.
When I query the db like scope = Scope.objects.get(pk='Esoterik I'), it takes 5 to 10 seconds. The database has fewer than 10 entries and an index on the primary key, so it is way too slow. When executing an equivalent query on the db, like SELECT * FROM scope WHERE title='Esoterik I';, everything is OK and it takes only about 50ms.
The same problem happens if I query a set of results like scope_list = Scope.objects.filter(members=some_user) and then call print(scope_list) or iterate over the list elements. The query itself takes only a few ms, but printing or iterating over the elements again takes 5 to 10 seconds, even though the set has only two entries.
The database backend is PostgreSQL. The same problem occurs on the local development server and under Apache.
Here the code of the model:
class Scope(models.Model):
    title = models.CharField(primary_key=True, max_length=30)
    ## the semester the scope is linked with
    assoc_semester = models.ForeignKey(Semester, null=True)
    ## the grade of the scope. can be Null if the scope is not a class
    assoc_grade = models.ForeignKey(Grade, null=True)
    ## the timetable of the scope. can be null if the scope is not directly associated with a class
    assoc_timetable = models.ForeignKey(Timetable, null=True)
    ## the associated subject of the scope
    assoc_subject = models.ForeignKey(Subject)
    ## the calendar of the scope
    assoc_calendar = models.ForeignKey(Calendar)
    ## the usergroup of the scope
    assoc_usergroup = models.ForeignKey(Group)
    members = models.ManyToManyField(User)

    unread_count = None
Update:
Here is the output of the python profiler. It seems that query.py was getting called 1.6 million times - a little too much.
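For reference, a profile like that can be reproduced with cProfile from manage.py shell (a sketch; the output file name is arbitrary, and Scope/some_user must already be imported or defined in the shell):

import cProfile
import pstats

# Profile evaluation of the slow queryset; sorting by cumulative time
# shows which internals (e.g. Django's query.py) dominate.
cProfile.run("list(Scope.objects.filter(members=some_user))", 'scope.prof')
pstats.Stats('scope.prof').sort_stats('cumulative').print_stats(20)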
You should first try to isolate the problem. Run manage.py shell and execute the following:
scope = Scope.objects.get(pk='Esoterik I')
print scope
Now, Django queries are lazy: they are not executed until they absolutely have to be. That is to say, if you're experiencing slowness after the first line, the problem is somewhere in the creation of the query, which would suggest problems with the object manager. The next step would be to try executing raw SQL through Django, to make sure the problem is really with the manager and not a bug in Django in general.
If you're experiencing slowness with the second line, the problem is either with the actual execution of the query or with the display/printing of the data. You can force-execute the query without printing it (check the documentation) to find out which one it is.
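For example, wrapping the queryset in list() is one way to force execution without printing:

scope_list = Scope.objects.filter(members=some_user)  # lazy: no SQL yet

results = list(scope_list)  # executes the query, prints nothing
print len(results)          # cheap: no per-object rendering

print scope_list            # only now is each object converted to text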
That's as far as I understand it, but I think the best way to solve this is to break the process down into different parts and find out which part is causing the slowness.
To be sure about the database execution time, it is better to test the queries generated by Django, since a Django-generated query may not be a simple SELECT *.
To see the Django generated query:
somedata = Scope.objects.filter(pk='Esoterik I')  # you must use filter here
print somedata.query  # the complete SQL Django will execute
This will display the complete query generated by Django. Copy it, open a PostgreSQL console, and use PostgreSQL's analysis tools:
EXPLAIN ANALYZE <your django query here>;
like:
EXPLAIN ANALYZE SELECT * FROM someapp_scope WHERE title = 'Esoterik I';
EXPLAIN shows the planner's estimated costs, while adding ANALYZE also executes the query and reports the actual execution times.
In those results you can also see whether PostgreSQL used any index during query execution.
First of all, sorry if this isn't an appropriate question for StackOverflow. I've tried to make it as generalisable as possible.
I want to create a database (MySQL, site running Django) that has users, who can be allocated a certain number of points for various types of action - it's a collaborative game. My requirements are to obtain:
the number of points a user has
the user's ranking compared to all other users
and the overall leaderboard (i.e. all users ranked in order of points)
This is what I have so far, in my Django models.py file:
class SiteUser(models.Model):
    name = models.CharField(max_length=250)
    email = models.EmailField(max_length=250)
    date_added = models.DateTimeField(auto_now_add=True)

    def points_total(self):
        points_added = PointsAdded.objects.filter(user=self)
        points_total = 0
        for point in points_added:
            points_total += point.points()  # call the method, not the attribute
        return points_total

class PointsAdded(models.Model):
    user = models.ForeignKey('SiteUser')
    action = models.ForeignKey('Action')
    date_added = models.DateTimeField(auto_now_add=True)

    def points(self):
        # the point value lives on the related Action
        return self.action.points

class Action(models.Model):
    points = models.IntegerField()
    action = models.CharField(max_length=36)
However it's rapidly becoming clear to me that it's actually quite complex (in Django query terms at least) to figure out the user's ranking and return the leaderboard of users. At least, I'm finding it tough. Is there a more elegant way to do something like this?
This question seems to suggest that I shouldn't even have a separate points table - what do people think? It feels more robust to have separate tables, but I don't have much experience of database design.
This is old, but I'm not sure exactly why you have two separate tables (PointsAdded & Action). It's late, so maybe my mind isn't ticking, but it seems like you just separated one table into two for some reason. It doesn't seem like you get any benefit out of it. It's not like there's a one-to-many relationship in it, right?
So first of all, I would combine those two tables. Secondly, you are probably better off storing points_total as a field in your site_user table. This is what I think Demitry is trying to allude to, but didn't say explicitly. This way, instead of doing the whole additional query (pulling everything a user has done in their history of the site is expensive) plus the loop over actions (going through it is even more expensive), you can just pull the total as one field. It's denormalizing the data for a greater good.
Just be sure to update the value every time you add something that has points. You can use Django's post_save signal to do that, as in the sketch below.
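A sketch of that approach (assuming SiteUser gains a points_total IntegerField; the handler name is illustrative, and the same idea applies whether or not the two tables are merged):

from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=PointsAdded)
def update_points_total(sender, instance, created, **kwargs):
    # Keep the denormalized total in sync whenever points are awarded.
    if created:
        instance.user.points_total += instance.action.points
        instance.user.save()

With the total stored on the user, the leaderboard and a user's rank become simple queries:

leaderboard = SiteUser.objects.order_by('-points_total')
rank = SiteUser.objects.filter(points_total__gt=user.points_total).count() + 1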
It's a bit more difficult to have points saved in the same table, but it's totally worth it. You can do very simple ordering/filtering operations if you have a computed points total on the user model. And you can count totals only when something changes (not every time you want to show them). Just put some validation logic into post_save signals, make sure to cover this logic with tests, and you're good.
P.S. See denormalization on Wikipedia.