Django pagination query duplicated, double the time - django

In my current project I want to do some filtering and ordering on a queryset and show it to the user in a paginated form.
This works fine, however I am not comfortable with the performance.
When I use an order_by clause, either explicitly or implicitly via the model's Meta.ordering, I can see in the Debug Toolbar that essentially the same query is executed twice:
once for the paginator count (without the ORDER BY) and once to fetch the slice of objects (with the ORDER BY).
From my observations, this roughly doubles the total query time.
Is there any way this can be optimized?
Below is a minimal working example; in my actual app I use class-based views.
class Medium(models.Model):
    title = models.CharField(verbose_name=_('title'),
                             max_length=256,
                             null=False, blank=False,
                             db_index=True,
                             )
    offered_by = models.ForeignKey(Institution,
                                   verbose_name=_('Offered by'),
                                   on_delete=models.CASCADE,
                                   )
    quantity = models.IntegerField(verbose_name=_('Quantity'),
                                   validators=[
                                       MinValueValidator(0)
                                   ],
                                   null=False, blank=False,
                                   )
    deleted = models.BooleanField(verbose_name=_('Deleted'),
                                  default=False,
                                  )
def index3(request):
    media = Medium.objects.filter(deleted=False, quantity__gte=0)
    media = media.exclude(offered_by_id=request.user.institution_id)
    media = media.filter(title__icontains="funktion")
    media = media.order_by('title')
    paginator = Paginator(media, 25)
    media = paginator.page(1)
    return render(request, 'media/empty2.html', {'media': media})
[Screenshot: Debug Toolbar SQL timings]

The query is not exactly duplicated: one is a COUNT query, the other fetches the actual objects for the requested page. This is unavoidable, since Django's Paginator needs to know the total number of objects. However, if the media queryset isn't too large, you can optimise by forcing it to be evaluated (just add a len(media) line before you construct the Paginator); the count is then taken from the cached results instead of a separate COUNT query.
But note that if media is very large, you might not want to force it to be evaluated, as you would be loading all the objects into memory.

Related

Django query, annotate a chain of related models

I have the following schema with PostgreSQL.
class Video(models.Model):
    title = models.CharField(max_length=255)
    created_at = models.DateTimeField()
    disabled = models.BooleanField(default=False)
    view_count = models.DecimalField(max_digits=10, decimal_places=0)

class TopVideo(models.Model):
    video = models.OneToOneField(Video, on_delete=models.CASCADE, primary_key=True)

class Comment(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    video = models.ForeignKey(Video, related_name="comments", on_delete=models.CASCADE)
The reason I have a TopVideo model is that I have millions of videos, and querying them takes a long time on a cheap server. So I have a secondary model that is populated by a Celery task (it is flushed and re-populated on each run), which makes the homepage load much faster. The task runs the query you see next and saves the results into the TopVideo model. This way the task may take long to run, but the user no longer has to wait for the expensive query.
Before having the TopVideo model, I ran this query for my homepage:
videos = (
    Video.objects.filter(created_at__range=[start, end])
    .annotate(comment_count=Count("comments"))
    .exclude(disabled=True)
    .order_by("-view_count")[:100]
)
This worked perfectly and I had access to "comment_count" in my template, where I could easily show the number of comments each video had.
But now that I make this query:
top_videos = (
    TopVideo.objects.all()
    .annotate(comment_count=Count("video__comments"))
    .select_related("video")
    .order_by("-video__view_count")[:100]
)
and with a simple for-loop,
videos = []
for video in top_videos:
    videos.append(video.video)
I send the videos to the template to render.
My problem is, I no longer have access to the "comment_count" inside the template, and naturally so; I don't send the queryset anymore. How can I now access the comment_count?
Things I tried:
Sending the TopVideo query to template did not work. They're a bunch of TopVideo objects, not Video objects.
I added this piece of code in my template "{{ video.comments.count }}" but this makes 100 requests to the database, which is not really optimal.
You can set the .comment_count on your Video objects with:
videos = []
for top_video in top_videos:
    video = top_video.video
    video.comment_count = top_video.comment_count
    videos.append(video)
That being said, it is unclear to me why you are querying through TopVideo if you basically strip the TopVideo context from the video anyway.
If you want to obtain the Videos for which there exists a TopVideo object, you can work with:
videos = Video.objects.filter(
    created_at__range=[start, end], topvideo__isnull=False
).annotate(
    comment_count=Count('comments')
).exclude(disabled=True).order_by('-view_count')[:100]
The topvideo__isnull=False will thus filter out Videos that are not TopVideos.

Django - prefetch_related GenericForeignKey results and sort them

I have the below structure, where content modules, which are subclassed from a common model, are attached to pages via a 'page module' model that references them via a GenericForeignKey:
class SitePage(models.Model):
    title = models.CharField()
    # [etc..]

class PageModule(models.Model):
    page = models.ForeignKey(SitePage, db_index=True, on_delete=models.CASCADE)
    module_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    module_id = models.PositiveIntegerField()
    module_object = GenericForeignKey("module_type", "module_id")

class CommonModule(models.Model):
    published_time = models.DateTimeField()

class SingleImage(CommonModule):
    title = models.CharField()
    # [etc..]

class Article(CommonModule):
    title = models.CharField()
    # [etc..]
At the moment, populating pages from this results in a LOT of SQL queries. I want to fetch all the module contents (i.e. all the SingleImage and Article instances) for a given page in the most database-efficient manner.
I can't just do a straight prefetch_related because it "must be restricted to a homogeneous set of results", and I'm fetching multiple content types.
I can get each module type individually:
image_modules = PageModule.objects.filter(
    page=whatever_page,
    module_type=ContentType.objects.get_for_model(SingleImage),
).prefetch_related('module_object')
article_modules = PageModule.objects.filter(
    page=whatever_page,
    module_type=ContentType.objects.get_for_model(Article),
).prefetch_related('module_object')
all_modules = image_modules | article_modules
But I need to sort them:
all_modules.order_by('module_object__published_time')
and I can't because:
"Field 'module_object' does not generate an automatic reverse relation
and therefore cannot be used for reverse querying"
... and I don't think I can add the recommended GenericRelation field to all the content models because there's already content in there.
So... can I do this at all? Or am I stuck?
Following the advice in the comments above, I eventually arrived at this code (from 2012!), which has roughly halved the number of queries:
https://gist.github.com/justinfx/3095246
However, as I noted above, it does so at the expense of some fairly inefficient WHERE pk IN (...) queries, so I've not actually saved much time in total.

Optimise query in a model method

I have a fairly simple model that's part of a double-entry bookkeeping system. Double entry just means that each transaction (journal entry) is made up of multiple LineItems. The LineItems should add up to zero, reflecting the fact that money always comes out of one category (Ledger) and into another. The CR column is for money out, DR is money in (I believe the CR and DR abbreviations come from Latin words and are a standard naming convention in accounting systems).
My JournalEntry model has a method called is_valid() which checks that the line items balance, plus a few other checks. However, the method is very expensive on the database, and when I use it to check many entries at once the database can't cope.
Any suggestions on how I can optimise the queries within this method to reduce database load?
class JournalEntry(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.PROTECT, null=True, blank=True)
    date = models.DateField(null=False, blank=False)

    # Make choiceset global so that it can be accessed in filters.py
    global JOURNALENRTY_TYPE_CHOICES
    JOURNALENRTY_TYPE_CHOICES = (
        ('BP', 'Bank Payment'),
        ('BR', 'Bank Receipt'),
        ('TR', 'Transfer'),
        ('JE', 'Journal Entry'),
        ('YE', 'Year End'),
    )

    type = models.CharField(
        max_length=2,
        choices=JOURNALENRTY_TYPE_CHOICES,
        blank=False,
        null=False,
        default='0'
    )
    description = models.CharField(max_length=255, null=True, blank=True)

    def __str__(self):
        if self.description:
            return self.description
        else:
            return 'Journal Entry ' + str(self.id)

    @property
    def is_valid(self):
        """Checks if Journal Entry has valid data integrity"""
        # NEEDS TO BE OPTIMISED AS PERFORMANCE IS BAD
        cr = LineItem.objects.filter(journal_entry=self.id).aggregate(Sum('cr'))
        dr = LineItem.objects.filter(journal_entry=self.id).aggregate(Sum('dr'))
        if dr['dr__sum'] != cr['cr__sum']:
            return "Line items do not balance"
        if self.lineitem_set.filter(cr__isnull=True, dr__isnull=True).exists():
            return "Empty line item(s)"
        if self.lineitem_set.filter(cr__isnull=False, dr__isnull=False).exists():
            return "CR and DR values present on same lineitem(s)"
        if (self.type == 'BR' or self.type == 'BP' or self.type == 'TR') and len(self.lineitem_set.all()) != 2:
            return 'Incorrect number of line items'
        if len(self.lineitem_set.all()) == 0:
            return 'Has zero line items'
        return True

class LineItem(models.Model):
    journal_entry = models.ForeignKey(JournalEntry, on_delete=models.CASCADE)
    ledger = models.ForeignKey(Ledger, on_delete=models.PROTECT)
    description = models.CharField(max_length=255, null=True, blank=True)
    project = models.ForeignKey(Project, on_delete=models.SET_NULL, null=True, blank=True)
    cr = models.DecimalField(max_digits=8, decimal_places=2, null=True, blank=True)
    dr = models.DecimalField(max_digits=8, decimal_places=2, null=True, blank=True)
    reconciliation_date = models.DateField(null=True, blank=True)

    # def __str__(self):
    #     return self.description

    class Meta(object):
        ordering = ['id']
First things first: if it's an expensive operation, it shouldn't be a property. Not that this will change the execution time or db load, but at least it doesn't break the expectation that an attribute access is (relatively) cheap.
With regard to possible optimisations, part of the cost is in the db roundtrips (including the time spent in the Python code, i.e. the ORM and the db adapter themselves), so the first thing to do is to make as few queries as possible:
1. Replacing len(self.lineitem_set.all()) with self.lineitem_set.count(), and avoiding calling it twice, could already save some time.
2. You could probably combine the first two queries into a single one (not tested...):
crdr = self.lineitem_set.aggregate(Sum('cr'), Sum('dr'))
if crdr['dr__sum'] != crdr['cr__sum']:
    return "Line items do not balance"
And, well, that's about all for the simple, obvious optimisations; I don't think they will really solve your issue.
The next step would probably be a stored procedure that performs the whole validation process: a single roundtrip, and possibly more room for db-level optimisations (depending on your db vendor).
Then, assuming your db schema, settings, server etc. are fully optimised (which is a bit outside this site's on-topic policy), the only solution left is denormalisation, either at the db level (safer) or at the Django level using a local per-instance cache on your model; the issue then is making sure you properly invalidate that cache every time anything that might affect it changes.
NB: I'm actually a bit surprised your db "can't cope" with this, as it doesn't seem _that_ heavy; but it of course depends on how many line items per journal entry you have (on average, and in the worst case) in your production data.
More info about your chosen RDBMS and setup (same server or a distinct one; if distinct, the network connectivity between the servers; available RAM; RDBMS settings; etc.) could probably help too. Even with the most optimised queries at the client level, there are limits to what your RDBMS can do, but then this becomes more of a sysadmin/dbadmin question.
EDIT
"Page load time is now long but it does complete. Yes, 2000 records to list and execute the method on."
You mean you're executing this in a view on a queryset of 2000+ records? Well, I can well understand that it's a bit heavy, and not only on the database, FWIW.
I think you might be able to optimise quite a bit further for this use case. The first option would be to make use of the queryset's select_related, prefetch_related, annotate and extra features; if that's not enough, go for raw SQL.

Django: Distinct on foreign key relationship

I'm working on a Ticket/Issue-tracker in django where I need to log the status of each ticket. This is a simplification of my models.
class Ticket(models.Model):
    assigned_to = models.ForeignKey(User)
    comment = models.TextField(_('comment'), blank=True)
    created = models.DateTimeField(_("created at"), auto_now_add=True)

class TicketStatus(models.Model):
    STATUS_CHOICES = (
        (10, _('Open'),),
        (20, _('Other'),),
        (30, _('Closed'),),
    )
    ticket = models.ForeignKey(Ticket, verbose_name=_('ticket'))
    user = models.ForeignKey(User, verbose_name=_('user'))
    status = models.IntegerField(_('status'), choices=STATUS_CHOICES)
    date = models.DateTimeField(_("created at"), auto_now_add=True)
Now, getting the current status of a ticket is easy: sort by date and retrieve the first row, like this.
ticket = Ticket.objects.get(pk=1)
ticket.ticketstatus_set.order_by('-date')[0].get_status_display()
But then I also want to be able to filter on status in the admin, and those filters have to get the status through a Ticket queryset, which suddenly makes it more complex. How would I get a queryset of all Tickets with a certain status?
I guess you are trying to avoid a loop (asking for each ticket's status) to filter the queryset manually. As far as I know, you cannot avoid that loop. Here are some ideas:
# select_related avoids a lot of hits on the database inside the loop
t_status = TicketStatus.objects.select_related('ticket').filter(status=ID_STATUS)
# this is an array with the result
ticket_array = [ts.ticket for ts in t_status]
Or, since you mention you were looking for a QuerySet, this might be what you are looking for:
# select_related avoids a lot of hits on the database inside the loop
t_status = TicketStatus.objects.select_related('ticket').filter(status=ID_STATUS)
# this is a QuerySet with the result
tickets = Ticket.objects.filter(pk__in=[ts.ticket.pk for ts in t_status])
However, the problem might be in the way you are modelling the data. What you called TicketStatus is more like a TicketStatusLog, because you want to keep track of the user and date that changed the status.
Therefore, a reasonable approach is to add a 'current_status' field to the Ticket model that is updated each time a new TicketStatus is created. This way, (1) you don't have to sort a table each time you ask for a ticket's status, and (2) you can simply do something like Ticket.objects.filter(current_status=ID_STATUS) for what I think you are asking.

Django model manager live-object issues

I am using Django. I am having a few issues with caching of QuerySets for news/category models:
class Category(models.Model):
    title = models.CharField(max_length=60)
    slug = models.SlugField(unique=True)

class PublishedArticlesManager(models.Manager):
    def get_query_set(self):
        return super(PublishedArticlesManager, self).get_query_set() \
            .filter(published__lte=datetime.datetime.now())

class Article(models.Model):
    category = models.ForeignKey(Category)
    title = models.CharField(max_length=60)
    slug = models.SlugField(unique=True)
    story = models.TextField()
    author = models.CharField(max_length=60, blank=True)
    published = models.DateTimeField(
        help_text=_('Set to a date in the future to publish later.'))
    created = models.DateTimeField(auto_now_add=True, editable=False)
    updated = models.DateTimeField(auto_now=True, editable=False)

    live = PublishedArticlesManager()
    objects = models.Manager()
Note - I have removed some fields to save on complexity...
There are a few (related) issues with the above.
Firstly, when I query for LIVE objects in my view via Article.live.all(), if I refresh the page repeatedly I can see (in the MySQL logs) the same database query being made with exactly the same date in the WHERE clause; i.e., datetime.datetime.now() is being evaluated at compile time rather than at runtime. I need the date to be evaluated at runtime.
Secondly, when I use the articles_set method on a Category object, this appears to work correctly: the datetime used in the query changes each time the query is run (again, I can see this in the logs). However, I am not quite sure why this works, since I don't have anything in my code saying that the articles_set query should return LIVE entries only!?
Finally, why is none of this being cached?
Any ideas how to make the correct time be used consistently? Can someone please explain why the latter setup appears to work?
Thanks
Jay
P.S - database queries below, note the date variations.
SELECT LIVE ARTICLES, query #1:
SELECT `news_article`.`id`, `news_article`.`category_id`, `news_article`.`title`, `news_article`.`slug`, `news_article`.`teaser`, `news_article`.`summary`, `news_article`.`story`, `news_article`.`author`, `news_article`.`published`, `news_article`.`created`, `news_article`.`updated` FROM `news_article` WHERE `news_article`.`published` <= '2011-05-17 21:55:41' ORDER BY `news_article`.`published` DESC, `news_article`.`slug` ASC;
SELECT LIVE ARTICLES, query #2:
SELECT `news_article`.`id`, `news_article`.`category_id`, `news_article`.`title`, `news_article`.`slug`, `news_article`.`teaser`, `news_article`.`summary`, `news_article`.`story`, `news_article`.`author`, `news_article`.`published`, `news_article`.`created`, `news_article`.`updated` FROM `news_article` WHERE `news_article`.`published` <= '2011-05-17 21:55:41' ORDER BY `news_article`.`published` DESC, `news_article`.`slug` ASC;
CATEGORY SELECT ARTICLES, query #1:
SELECT `news_article`.`id`, `news_article`.`category_id`, `news_article`.`title`, `news_article`.`slug`, `news_article`.`teaser`, `news_article`.`summary`, `news_article`.`story`, `news_article`.`author`, `news_article`.`published`, `news_article`.`created`, `news_article`.`updated` FROM `news_article` WHERE (`news_article`.`published` <= '2011-05-18 21:21:33' AND `news_article`.`category_id` = 1 ) ORDER BY `news_article`.`published` DESC, `news_article`.`slug` ASC;
CATEGORY SELECT ARTICLES, query #2:
SELECT `news_article`.`id`, `news_article`.`category_id`, `news_article`.`title`, `news_article`.`slug`, `news_article`.`teaser`, `news_article`.`summary`, `news_article`.`story`, `news_article`.`author`, `news_article`.`published`, `news_article`.`created`, `news_article`.`updated` FROM `news_article` WHERE (`news_article`.`published` <= '2011-05-18 21:26:06' AND `news_article`.`category_id` = 1 ) ORDER BY `news_article`.`published` DESC, `news_article`.`slug` ASC;
You should check out conditional view processing.
from django.views.decorators.http import condition

def latest_entry(request, article_id):
    return Article.objects.latest("updated").updated

@condition(last_modified_func=latest_entry)
def view_article(request, article_id):
    # your view code here
    ...
This should cache the page rather than reloading a new version every time.
I suspect that if you want now() to be evaluated at runtime, you could use raw SQL. I think this would solve the compile/runtime issue.
class PublishedArticlesManager(models.Manager):
    def get_query_set(self):
        return super(PublishedArticlesManager, self).get_query_set() \
            .raw("SELECT * FROM news_article WHERE published <= CURRENT_TIMESTAMP")
Note that this returns a RawQuerySet, which may differ a bit from a normal QuerySet.
I have now fixed this issue. It appears the problem was that the queryset returned by Article.live.all() was being cached in my urls.py! I was using function-based generic views:
url(r'^all/$', object_list, {
    'queryset': Article.live.all(),
}, 'news_all'),
I have now changed this to use the class-based approach, as advised in the latest Django documentation:
url(r'^all/$', ListView.as_view(
    model=Article,
), name="news_all"),
This now works as expected: by specifying the model attribute rather than the queryset attribute, the QuerySet is created at runtime (for each request) instead of once at compile time, when urls.py is first imported.