I have to present a very complex page with a lot of data coming from three different tables related by ForeignKey and ManyToManyField. I was able to do what I want, but performance is terrible and I'm stuck trying to find a better approach. Here is the detailed code:
Models:
class CATSegmentCollection(models.Model):
    theFile = models.ForeignKey('app_file.File', related_name='original_file')
    segmentsMT = models.ManyToManyField('app_mt.MachineTransTable', related_name='segMT', blank=True)
    segmentsTM = models.ManyToManyField('app_tm.TMTable', related_name='segTM', blank=True, through='app_cat.TM_Source_quality')
    ...

class TM_Source_quality(models.Model):
    catSeg = models.ForeignKey('app_cat.CATSegmentCollection')
    tmSeg = models.ForeignKey('app_tm.TMTable')
    quality = models.IntegerField()

class MachineTransTable(models.Model):
    mt = models.ForeignKey('app_mt.MT_available', blank=True, null=True)
    ...

class TMTable(models.Model):
    ...
From these models (I just wrote what is relevant to my problem) I present all the CATSegmentCollection entries related to a single file, together with their associated TM and MT segments. In other words, each entry in CATSegmentCollection has zero or more TM segments from the TMTable table and zero or more MT segments from the MachineTransTable table.
This is what I do in the ListView (I use AjaxListView because I'm using infinite-scroll pagination from django-el-pagination):
class CatListView(LoginRequiredMixin, AjaxListView):
    model = CATSegmentCollection
    template_name = 'app_cat/cat.html'
    page_template = 'app_cat/cat_page.html'

    def get_object(self, queryset=None):
        obj = File.objects.get(id=self.kwargs['file_id'])
        return obj

    def get_queryset(self):
        theFile = self.get_object()
        return (CATSegmentCollection.objects
                .filter(theFile=theFile)
                .prefetch_related('segmentsMT')
                .prefetch_related('segmentsTM')
                .order_by('segment_order'))
    def get_context_data(self, **kwargs):
        context = super(CatListView, self).get_context_data(**kwargs)
        contextSegment = []
        myCatCollection = self.get_queryset()
        theFile = self.get_object()
        context['file'] = theFile
        for aSeg in myCatCollection:
            contextTarget = []
            if aSeg.segmentsTM.all():
                for aTargetTM in aSeg.tm_source_quality_set.all():
                    percent_quality = ...
                    contextTarget.append({
                        "source": aTargetTM.tmSeg.source,
                        "target": aTargetTM.tmSeg.target,
                        "quality": str(percent_quality) + '%',
                        "origin": "TM",
                        "orig_name": aTargetTM.tmSeg.tm_client.name,
                        "table_id": aTargetTM.tmSeg.id,
                    })
            if aSeg.segmentsMT.all():
                for aTargetMT in aSeg.segmentsMT.all():
                    contextTarget.append({
                        "target": aTargetMT.target,
                        "quality": "",
                        "origin": "MT",
                        "orig_name": aTargetMT.mt.name,
                        "table_id": aTargetMT.id,
                    })
            contextSegment.append({
                "id": aSeg.id,
                "order": aSeg.segment_order,
                "source": aSeg.source,
                "target": contextTarget,
            })
        context['segments'] = contextSegment
        return context
Everything works but:
I hit the DB each time I call aSeg.segmentsTM.all() and aSeg.segmentsMT.all() because, I guess, the prefetch is not preventing it. This results in hundreds of duplicated queries.
All these queries are repeated each time I load more entries from the pagination (in other words, each time more entries are presented because of scrolling, the full set of entries is requested; I also tried lazy_paginate but nothing changed).
In principle, all the logic I have in get_context_data (there is more, but I presented just the essential code) could be reproduced in the template by passing just the queryset, or on the client with a lot of jQuery/JavaScript code, but I don't think that's a good way to proceed.
So my question is: can I optimize this code, reducing the number of DB hits and the time needed to produce the response? To give you an idea, a relatively small file (300 entries in CATSegmentCollection) loads in 6.5 s with 330 queries (more than 300 of them duplicated) taking 0.4 s. The DJDT time analysis gives:
domainLookup   273 ms
connect        273 ms
request        275 ms
response      9217 ms
domLoading    9225 ms
Any suggestions?
Thanks
Optimizing the number of queries is quite a tricky issue, as pinpointing which exact code triggered an extra query is not obvious. I would suggest commenting out all the code inside that for loop, then uncommenting it line by line while monitoring which exact line causes extra queries, and optimizing gradually.
A few observations:
You need to carefully declare all deep relations you touch inside prefetch_related, like:
.prefetch_related('segmentsTM', 'segmentsTM__tm_source_quality_set', 'segmentsTM__tm_source_quality_set__tmSeg', 'segmentsTM__tm_source_quality_set__tmSeg__tm_client', 'segmentsMT', 'segmentsMT__mt')
No need to check if aSeg.segmentsMT.all(): before looping over it, as an empty relation still returns an empty iterable.
Unrelated note regarding related_name='segMT' in your CATSegmentCollection model: related_name declares how the current model is accessed from the other side of the relation, so you would probably want something like related_name='cATSegmentCollections' for both fields (see the sketch after these observations).
In the end you should be able to optimize it down to somewhere around 10 queries (roughly one per relation). The success criterion is having no numerous WHERE foreign_id = X queries, only WHERE foreign_id IN (X, Y, ...) ones.
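A sketch of that related_name point (the reverse name here is illustrative, not from the original code): with a collection-like related_name, the accessor reads naturally from the other side of the relation.

class CATSegmentCollection(models.Model):
    segmentsMT = models.ManyToManyField(
        'app_mt.MachineTransTable',
        related_name='cATSegmentCollections',  # how MT segments reach back
        blank=True,
    )

# Given an MT segment, the collections that reference it are:
#   some_mt_segment.cATSegmentCollections.all()
# whereas related_name='segMT' would read, confusingly, as:
#   some_mt_segment.segMT.all()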
Following serg's suggestions I started to dig into the problem, and in the end I was able to prefetch all the needed info. I guess that using a through table changes the way prefetching works. Here is the correct queryset:
from django.db.models import Prefetch

all_cat_seg = CATSegmentCollection.objects.filter(theFile=theFile).order_by('segment_order')
all_tm_source_quality_entries = TM_Source_quality.objects.filter(catSeg__in=all_cat_seg).select_related('tmSeg', 'tmSeg__tm_client')
prefetch = Prefetch('tm_source_quality_set', queryset=all_tm_source_quality_entries)
CATSegmentCollection.objects.filter(theFile=theFile).prefetch_related(
    prefetch,
    'segmentsMT',
    'segmentsMT__mt',
).order_by('segment_order')
With this queryset, I was able to reduce the number of queries to 10...
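A quick way to confirm the query count while iterating on this (a sketch; connection.queries is populated only when DEBUG = True):

from django.db import connection, reset_queries

reset_queries()
list(CATSegmentCollection.objects.filter(theFile=theFile).prefetch_related(
    prefetch,
    'segmentsMT',
    'segmentsMT__mt',
).order_by('segment_order'))
print(len(connection.queries))  # should print roughly 10 here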
Related
I have 3 models, and a function that is called many times generates 200-300 SQL queries; I'd like to reduce this number.
I have the following layout:
class Info(models.Model):
    user = models.ForeignKey(User)
    ...

class Forum(models.Model):
    info = models.ForeignKey(Info)
    datum = models.DateTimeField()
    ...

class InfoViewed(models.Model):
    user = models.ForeignKey(User)
    infokom = models.ForeignKey(Info)
    last_seen = models.DateTimeField()
    ...
I need to get the total number of new Forum messages, so just a number. At the moment it works by iterating over all the related Infos and summing up the Forums whose datum is later than the related InfoViewed's last_seen field.
This works, but it results in ~200 queries for 100 Infos.
Is there any way to fetch the same number within a few queries? I tried to play with annotate and django.db.models.Count, but I failed.
Django: 1.11.16
Currently I'm using this:
infos = Info.objects.filter(user_id=x)
return sum(i.number_of_new_forums(self.user) for i in infos)
and number_of_new_forums looks like this:
def number_of_new_forums(self, user):
    info_viewed = self.info_viewed_set.filter(user=user)
    return len([f.id for f in self.get_forums().filter(datum__gt=info_viewed[0].last_seen)])
I managed to come up with a solution from a different perspective:
Forum.objects.filter(info_id__in=info_ids,
                     info__info_viewed__user=request.user,
                     datum__gt=F('info__info_viewed__last_seen')).count()
where info_ids is a list of all the related Infos' ids. However, I'm unsure whether this is a 100% correct solution.
If someone has a different approach, I'd welcome it.
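Putting it together, a minimal sketch of the two-query version (assuming info_ids comes from the same user filter used above, and that info_viewed is the reverse name used in the query):

from django.db.models import F

# All of the user's Info ids (one query).
info_ids = list(Info.objects.filter(user_id=x).values_list('id', flat=True))

# One aggregate query: count Forum rows newer than the viewer's last_seen.
new_forum_count = Forum.objects.filter(
    info_id__in=info_ids,
    info__info_viewed__user=request.user,
    datum__gt=F('info__info_viewed__last_seen'),
).count()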
I want to do a filter in Django that uses the form method.
If the user types the var, it should query the dataset for that var; if it is left blank, it should bring back all the elements.
How can I do that?
I am new to Django.
if request.GET.get('Var'):
    Var = request.GET.get('Var')
else:
    Var = ...  # WHAT SHOULD I PUT HERE TO FILTER ALL THE ELEMENTS IN THE CODE BELOW
models.objects.filter(Var=Var)
It's not a great idea from a security standpoint to allow users to input data directly into search terms (and it should DEFINITELY not be done for raw SQL queries, if you're using any of those).
With that note in mind, you can take advantage of more dynamic filter creation using a dictionary syntax, or revise the queryset as it goes along:
Option 1: Dictionary Syntax
def my_view(request):
    query = {}
    if request.GET.get('Var'):
        query['Var'] = request.GET.get('Var')
    if request.GET.get('OtherVar'):
        query['OtherVar'] = request.GET.get('OtherVar')
    if request.GET.get('thirdVar'):
        # Say you wanted to add in some further processing
        thirdVar = request.GET.get('thirdVar')
        if int(thirdVar) > 10:
            query['thirdVar'] = 10
        else:
            query['thirdVar'] = int(thirdVar)
    if request.GET.get('lessthan'):
        lessthan = request.GET.get('lessthan')
        query['fieldname__lte'] = int(lessthan)
    results = MyModel.objects.filter(**query)
If nothing has been added to the query dictionary and it's empty, that'll be the equivalent of doing MyModel.objects.all()
My security note from above applies if you wanted to try to do something like this (which would be a bad idea):
MyModel.objects.filter(**request.GET)
Django has a good security track record, but this is less safe than anticipating the types of queries your users will have. It could also be a huge issue if your schema is known to a malicious site user, who could adapt their query syntax to make heavy queries along non-indexed fields.
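One way to keep the dynamic convenience while staying safe is to whitelist the parameters you accept; a sketch, reusing the field names from the example above:

# Map public query-string names to concrete ORM lookups; anything not
# listed here is silently ignored.
ALLOWED_FILTERS = {
    'Var': 'Var',
    'OtherVar': 'OtherVar',
    'lessthan': 'fieldname__lte',
}

def my_view(request):
    query = {}
    for param, lookup in ALLOWED_FILTERS.items():
        value = request.GET.get(param)
        if value:
            query[lookup] = value
    results = MyModel.objects.filter(**query)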
Option 2: Revising the Queryset
Alternatively, you can start off with a queryset for everything and then filter accordingly:
def my_view(request):
    results = MyModel.objects.all()
    if request.GET.get('Var'):
        results = results.filter(Var=request.GET.get('Var'))
    if request.GET.get('OtherVar'):
        results = results.filter(OtherVar=request.GET.get('OtherVar'))
    return results
A simpler and more explicit way of doing this would be:
if request.GET.get('Var'):
    data = models.objects.filter(Var=request.GET.get('Var'))
else:
    data = models.objects.all()
I made a previous post related to this problem here, but because this is a related yet new problem, I thought it would be best to make another post for it.
I'm using Django 1.8
I have a User model and a UserAction model. A user has a type. A UserAction has a time, which indicates how long the action took, as well as a start_time, which indicates when the action began. They look like this:
class User(models.Model):
    user_type = models.IntegerField()

class UserAction(models.Model):
    user = models.ForeignKey(User)
    time = models.IntegerField()
    start_time = models.DateTimeField()
Now what I want to do is get all users of a given type and the sum of time of their actions, optionally filtered by the start_time.
What I am doing is something like this:
from datetime import datetime, timedelta
from django.db.models import Sum
from django.db.models.functions import Coalesce

# stubbing in a start time to filter by
start_time = datetime.now() - timedelta(days=2)
# stubbing in a type
type = 2

# this gives me the users and the sum of the time of their actions,
# or 0 if no actions exist
q = User.objects.filter(user_type=type).values('id').annotate(
    total_time=Coalesce(Sum('useraction__time'), 0))

# now I try to add the filter for the start_time of the actions to be
# greater than or equal to start_time
q = q.filter(useraction__start_time__gte=start_time)
Now what this does, of course, is an INNER JOIN on UserAction, thus removing all the users without actions. What I really want is the equivalent of a LEFT JOIN with a WHERE clause, but for the life of me I can't find how to do that. I've looked at the docs and the source but am not finding an answer. I'm (pretty) sure this is something that can be done, I'm just not seeing how. Could anyone point me in the right direction? Any help would be very much appreciated. Thanks much!
I'm having the same kind of problem as you. I haven't found a proper way of solving it yet, but I've found a few workarounds.
One way would be looping through all the users:
q = User.objects.filter(user_type=type)
for u in q:
    u.time_sum = UserAction.objects.filter(
        user=u, start_time__gte=start_time,
    ).aggregate(time_sum=Sum('time'))['time_sum']
This method, however, runs one database query per user. It might do the trick if you don't have many users, but it can get very time-consuming with a large database.
Another way of solving the problem would be using the extra method of the QuerySet API. This is a method that is detailed in this blog post by Timmy O'Mahony.
valid_actions = UserAction.objects.filter(start_time__gte=start_time)
q = User.objects.filter(user_type=type).extra(select={
    "time_sum": """
        SELECT SUM(time)
        FROM userAction
        WHERE userAction.user_id = user.id
        AND userAction.id IN (%s)
    """ % ",".join([str(uAction.id) for uAction in valid_actions])
})
This method, however, relies on calling the database with the raw SQL table names, which is very un-Django: if you change the db_table of one of your models or the db_column of one of its columns, this code will no longer work. It only requires 2 queries though, the first to get the list of valid UserActions and the second to sum them per matching user.
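A third direction worth sketching (my own suggestion rather than the blog post's; it assumes Django 1.8's conditional expressions are available): conditional aggregation keeps every user in the result and moves the start_time condition into the SUM itself, avoiding both the per-user loop and the raw SQL.

from django.db.models import Case, F, IntegerField, Sum, When
from django.db.models.functions import Coalesce

q = User.objects.filter(user_type=type).annotate(
    time_sum=Coalesce(
        Sum(Case(
            # Count time only for actions starting at/after start_time;
            # other rows contribute NULL, which SUM ignores.
            When(useraction__start_time__gte=start_time,
                 then=F('useraction__time')),
            output_field=IntegerField(),
        )),
        0,  # users with no matching actions get 0 instead of NULL
    )
)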
I have a Book model like this:
class Book(models.Model):
    authors = models.ManyToManyField(Author, ...)
    ...
In short:
I'd like to retrieve the books whose authors are strictly equal to a given set of authors. I'm not sure if there is a single query that does it, but any suggestions will be helpful.
In long:
Here is what I tried (it failed to run, raising an AttributeError):
# A sample set of authors
target_authors = set((author_1, author_2))
# To reduce the search space,
# first retrieve those books with just 2 authors.
candidate_books = Book.objects.annotate(c=Count('authors')).filter(c=len(target_authors))
final_books = QuerySet()
for author in target_authors:
    temp_books = candidate_books.filter(authors__in=[author])
    final_books = final_books and temp_books
... and here is what I got:
AttributeError: 'NoneType' object has no attribute '_meta'
In general, how should I query a model with the constraint that its ManyToMany field contains a set of given objects as in my case?
ps: I found some relevant SO questions but couldn't get a clear answer. Any good pointer will be helpful as well. Thanks.
Similar to #goliney's approach, I found a solution. However, I think the efficiency could be improved.
# A sample set of authors
target_authors = set((author_1, author_2))
# To reduce the search space, first retrieve those books with just 2 authors.
candidate_books = Book.objects.annotate(c=Count('authors')).filter(c=len(target_authors))
# In each iteration, we filter out those books which don't contain one of the
# required authors - the instance on the iteration.
for author in target_authors:
    candidate_books = candidate_books.filter(authors=author)
final_books = candidate_books
You can use complex lookups with Q objects
from django.db.models import Q
...
target_authors = set((author_1, author_2))
q = Q()
for author in target_authors:
    q &= Q(authors=author)
Book.objects.annotate(c=Count('authors')).filter(c=len(target_authors)).filter(q)
Q() & Q() is not equal to .filter().filter(). Their raw SQL is different: using Q objects joined with &, the SQL just stacks conditions like WHERE "book"."author" = "author_1" AND "book"."author" = "author_2", which returns an empty result.
The only solution is chaining filter calls, which produces SQL with an inner join on the same table for each condition: ... ON ("author"."id" = "author_book"."author_id") INNER JOIN "author_book" T4 ON ("author"."id" = T4."author_id") WHERE ("author_book"."author_id" = "author_1" AND T4."author_id" = "author_1")
I came across the same problem and reached the same conclusion as iuysal, until I had to do a medium-sized search (1000 records with 150 filters, at which point my request would time out).
In my particular case the search would end up with no records, since the chance that a single record matches ALL 150 filters is very small; you can get around the performance issues by verifying that there are records left in the QuerySet before applying more filters, to save time.
# In each iteration, we filter out those books which don't contain one of the
# required authors - the instance on the iteration.
for author in target_authors:
    if candidate_books.count() > 0:
        candidate_books = candidate_books.filter(authors=author)
For some reason Django applies filters even to empty QuerySets.
For optimization to be applied correctly, though, a prepared QuerySet and correctly applied indexes are both necessary.
I have a model like this:
class Stock(models.Model):
    product = models.ForeignKey(Product)
    place = models.ForeignKey(Place)
    date = models.DateField()
    quantity = models.IntegerField()
I need to get the latest (by date) quantity for every product in every place, with almost 500 products, 100 places and 350,000 stock records in the database.
My current code is below. It worked in testing, but it takes so long with the real data that it's useless:
stocks = Stock.objects.filter(product__in=self.products,
                              place__in=self.places, date__lte=date_at)
stock_values = {}
for prod in self.products:
    for place in self.places:
        key = u'%s%s' % (prod.id, place.id)
        stock = stocks.filter(product=prod, place=place, date=date_at)
        if len(stock) > 0:
            stock_values[key] = stock[0].quantity
        else:
            try:
                stock = stocks.filter(product=prod, place=place).order_by('-date')[0]
            except IndexError:
                stock_values[key] = 0
            else:
                stock_values[key] = stock.quantity
return stock_values
How would you make it faster?
Edit:
I rewrote the code like this:
stock_values = {}
for product in self.products:
    for place in self.places:
        try:
            stock_value = Stock.objects.filter(product=product, place=place, date__lte=date_at)\
                .order_by('-date').values('quantity')[0]['quantity']
        except IndexError:
            stock_value = 0
        stock_values[u'%s%s' % (product.id, place.id)] = stock_value
return stock_values
It works better (from 256 seconds down to 64) but still needs improvement. Maybe some custom SQL? I don't know...
Arthur's right, len(stock) isn't the most efficient way to do that. You could go further along the "easier to ask forgiveness than permission" route with something like this inside the inner loop:
key = u'%s%s' % (prod.id, place.id)
try:
    stock = stocks.filter(product=prod, place=place, date=date_at)[0]
    quantity = stock.quantity
except IndexError:
    try:
        stock = stocks.filter(product=prod, place=place).order_by('-date')[0]
        quantity = stock.quantity
    except IndexError:
        quantity = 0
stock_values[key] = quantity
I'm not sure how much that would improve things compared to just changing the length check, though I think it should at least restrict this part to two queries with LIMIT 1 on them (see Limiting QuerySets).
Mind you, this still performs a lot of database hits, since you could run through that loop almost 50,000 times. Optimize how you're looping and you're in a better position still; see the sketch below.
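For reference, a sketch of that idea: one query ordered so the newest row per (product, place) pair comes first, then a single pass in Python keeps the first row seen for each pair (this assumes the filtered result set fits comfortably in memory):

stock_values = {}
qs = (Stock.objects
      .filter(product__in=self.products, place__in=self.places,
              date__lte=date_at)
      .order_by('product_id', 'place_id', '-date'))
for stock in qs:
    key = u'%s%s' % (stock.product_id, stock.place_id)
    if key not in stock_values:  # first row per pair is the latest by date
        stock_values[key] = stock.quantity
# Pairs with no stock rows at all default to 0.
for prod in self.products:
    for place in self.places:
        stock_values.setdefault(u'%s%s' % (prod.id, place.id), 0)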
Maybe the trick is in that len() method!
From the docs:
Note: Don't use len() on QuerySets if all you want to do is determine the number of records in the set. It's much more efficient to handle a count at the database level, using SQL's SELECT COUNT(*), and Django provides a count() method for precisely this reason. See count() below.
So try changing len() to count() and see if it makes things faster!
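Applied to the code above, the check could look like this (a sketch; exists() is cheaper still when all you need is a yes/no):

stock_at_date = stocks.filter(product=prod, place=place, date=date_at)
if stock_at_date.exists():  # SELECT ... LIMIT 1 instead of fetching every row
    stock_values[key] = stock_at_date[0].quantity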