Running a lot of queries - django

I have the following models with their methods:
class TrendingTopic(models.Model):
    categories = models.ManyToManyField('Category', through='TTCategory', blank=True, null=True)
    location = models.ForeignKey(Location)

    def get_rank(self, t_date=None):
        if t_date:
            ttcs = self.trendingtopiccycle_set.filter(cycle_time__gt=t_date)
        else:
            ttcs = self.trendingtopiccycle_set.all()
        if ttcs:
            return sum([ttc.rank for ttc in ttcs]) / len(ttcs)
        return 0

    def get_day_rank(self, t_date):
        ttcs = self.trendingtopiccycle_set.filter(cycle_time__year=t_date.year,
                                                  cycle_time__month=t_date.month,
                                                  cycle_time__day=t_date.day)
        sum_rank = sum([ttc.day_rank for ttc in ttcs if ttc.day_rank])
        if sum_rank:
            return sum_rank / len(ttcs)
        return 0

class TrendingTopicCycle(models.Model):
    tt = models.ForeignKey(TrendingTopic)
    cycle_time = models.DateTimeField(default=datetime.now)
    from_tt_before = models.BooleanField(default=False)
    rank = models.FloatField(default=0.0)
    day_rank = models.FloatField(default=0.0)
And then I have some functions that are used in the views to retrieve the desired information:
Show the best trending topics of the current day:
def day_topics(tt_date, limit=10):
    tts = [(ttc.tt, ttc.tt.get_day_rank(tt_date)) for ttc in
           TrendingTopicCycle.objects.distinct('tt__name')
                                     .filter(cycle_time__year=tt_date.year,
                                             cycle_time__month=tt_date.month,
                                             cycle_time__day=tt_date.day)]
    sorted_tts = sorted(tts, key=itemgetter(1), reverse=True)[:limit]
    return sorted_tts
Show the best trending topics of a given location (woeid) within a given time span:
def hot_topics(woeid=None, limit=10):
    CYCLE_LIMIT = datetime.now() + relativedelta(hours=-5)
    TT_CYCLES_LIMIT = datetime.now() + relativedelta(days=-2)
    if woeid:
        tts = [ttc.tt for ttc in
               TrendingTopicCycle.objects.filter(tt__location__woeid=woeid)
                                         .distinct('tt__name')
                                         .exclude(cycle_time__lt=CYCLE_LIMIT)]
    else:
        tts = [ttc.tt for ttc in
               TrendingTopicCycle.objects.distinct('tt__name')
                                         .exclude(cycle_time__lt=CYCLE_LIMIT)]
    sorted_tts = sorted(tts, key=lambda tt: tt.get_rank(TT_CYCLES_LIMIT),
                        reverse=True)[:limit]
    return sorted_tts
The problem with the current solution is that it runs really slowly because it issues a lot of queries (hundreds) to retrieve the data. I'm using django-debug-toolbar to help me measure the performance.
Obviously I'm doing something terribly wrong and I'm looking for a solution; any help would be greatly appreciated.
Edit:
Each trending topic has a set of trending topic cycles (ttc). Each ttc has two ranks: the general one (rank) and the day_rank. The trending topic's ranks are calculated by looping through each ttc.

First, be aware that django-debug-toolbar, while fantastic, is itself very slow. If you comment out its middleware, your response times improve dramatically. It's a very useful tool, don't get me wrong. I use it myself, religiously, but the point is you can't benchmark your site on something subjective like "slowness" while it's enabled.
Second, your code is a little confusing, so it's difficult to say exactly what you should do. For example, TrendingTopicCycle has a rank and day_rank field, but you never use them in the posted code. Calling get_day_rank issues a query each time, so it would obviously be more efficient if you could just filter on the day_rank field itself (eliminating the need for that query), but I cannot tell from the code you have here if or when those fields are actually set.
A small improvement you can make to the code as it stands is to use select_related judiciously. For instance, each time ttc.tt.get_day_rank(tt_date) is run in the list comprehension, a query is issued to get tt and then another query is issued in get_day_rank. Simply adding .select_related('tt') to your queryset would at least eliminate the query for tt.
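For instance, a minimal, untested sketch of that suggestion applied to day_topics:
tts = [(ttc.tt, ttc.tt.get_day_rank(tt_date)) for ttc in
       TrendingTopicCycle.objects.select_related('tt')   # tt is fetched in the same query
                                 .distinct('tt__name')
                                 .filter(cycle_time__year=tt_date.year,
                                         cycle_time__month=tt_date.month,
                                         cycle_time__day=tt_date.day)]
get_day_rank still issues one query per topic, so this only removes the per-row lookup of tt itself.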
Also, I'm not sure whether it causes Django to issue a different (and perhaps less efficient) query, but regardless, there's no point in filtering individually on year, month, and day. Note that because cycle_time is a DateTimeField, an exact cycle_time=date filter would only match one precise timestamp; to cover a whole day, use the __date lookup (Django 1.9+) or an explicit range, e.g.:
TrendingTopicCycle.objects.distinct('tt__name') \
                          .filter(cycle_time__date=tt_date)

if ttcs:
    return sum([ttc.rank for ttc in ttcs]) / len(ttcs)
return 0
This can be replaced by a database-level aggregation: https://docs.djangoproject.com/en/dev/topics/db/aggregation/
Something like:
ttcs.aggregate(Sum('rank'))['rank__sum']
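Since that snippet is really computing an average, the whole of get_rank could be pushed into the database with Avg; a minimal, untested sketch:
from django.db.models import Avg

def get_rank(self, t_date=None):
    ttcs = self.trendingtopiccycle_set.all()
    if t_date:
        ttcs = ttcs.filter(cycle_time__gt=t_date)
    # one query; aggregate returns None when there are no cycles, hence the "or 0"
    return ttcs.aggregate(avg_rank=Avg('rank'))['avg_rank'] or 0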

Related

Simplifying Django Query Annotation

I have 3 models and a function that is called many times, but it generates 200-300 SQL queries, and I'd like to reduce this number.
I have the following layout:
class Info(models.Model):
    user = models.ForeignKey(User)
    ...

class Forum(models.Model):
    info = models.ForeignKey(Info)
    datum = models.DateTimeField()
    ...

class InfoViewed(models.Model):
    user = models.ForeignKey(User)
    infokom = models.ForeignKey(Info)
    last_seen = models.DateTimeField()
    ...
I need to get the total number of new Forum messages, so just a single number. At the moment I iterate over all the related Infos and sum up all Forums whose datum is later than the last_seen field of the related InfoViewed.
This works, but it results in ~200 queries for 100 Infos.
Is there any possibility to fetch the same number within a few queries? I tried to play with annotate and django.db.models.Count, but I failed.
Django: 1.11.16
Currently I'm using this:
infos = Info.objects.filter(user_id=x)
return sum(i.number_of_new_forums(self.user) for i in infos)
and the number_of_new_forums looks like this:
info_viewed = self.info_viewed_set.filter(user=user)
return len([f.id for f in self.get_forums()
                          .filter(datum__gt=info_viewed[0].last_seen)])
I managed to come up with some solution from a different perspective:
Forum.objects.filter(info_id__in=info_ids,
                     info__info_viewed__user=request.user,
                     datum__gt=F('info__info_viewed__last_seen')).count()
where info_ids is the list of ids of all related Infos; however, I'm not sure it is a 100% correct solution...
If someone might have a different approach I'd welcome it.
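For completeness, a rough, untested sketch of how the info_ids list could be built and plugged into that single count query (names follow the snippets above):
from django.db.models import F

# ids of all Infos belonging to the user, in one query
info_ids = Info.objects.filter(user_id=x).values_list('id', flat=True)
new_forums = Forum.objects.filter(info_id__in=info_ids,
                                  info__info_viewed__user=request.user,
                                  datum__gt=F('info__info_viewed__last_seen')).count()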

Optimization of big queryset/context in django

I have to present a very complex page with a lot of data coming from 3 different tables related with ForeignKey and ManyToManyField... I was able to do what I want, but performance is terrible and I'm stuck trying to find a better approach. Here is the detailed code:
Models:
class CATSegmentCollection(models.Model):
    theFile = models.ForeignKey('app_file.File', related_name='original_file')
    segmentsMT = models.ManyToManyField('app_mt.MachineTransTable', related_name='segMT', blank=True,)
    segmentsTM = models.ManyToManyField('app_tm.TMTable', related_name='segTM', blank=True, through='app_cat.TM_Source_quality',)
    ...

class TM_Source_quality(models.Model):
    catSeg = models.ForeignKey('app_cat.CATSegmentCollection')
    tmSeg = models.ForeignKey('app_tm.TMTable')
    quality = models.IntegerField()

class MachineTransTable(models.Model):
    mt = models.ForeignKey('app_mt.MT_available', blank=True, null=True, )
    ...

class TMTable(models.Model):
    ...
From these models (I only wrote what is relevant to my problem) I present all the CATSegmentCollection entries related to a single file, together with their associated TM and MT segments. In other words, each entry in CATSegmentCollection has zero or more TM segments from the TMTable table and zero or more MT segments from the MachineTransTable table.
This is what I do in the ListView (I use AjaxListView because I'm using infinite scrolling pagination from django-el-pagination):
class CatListView(LoginRequiredMixin, AjaxListView):
    Model = CATSegmentCollection
    template_name = 'app_cat/cat.html'
    page_template = 'app_cat/cat_page.html'

    def get_object(self, queryset=None):
        obj = File.objects.get(id=self.kwargs['file_id'])
        return obj

    def get_queryset(self):
        theFile = self.get_object()
        return CATSegmentCollection.objects.filter(theFile=theFile).prefetch_related('segmentsMT').prefetch_related('segmentsTM').order_by('segment_order')

    def get_context_data(self, **kwargs):
        context = super(CatListView, self).get_context_data(**kwargs)
        contextSegment = []
        myCatCollection = self.get_queryset()
        theFile = self.get_object()
        context['file'] = theFile
        for aSeg in myCatCollection:
            contextTarget = []
            if aSeg.segmentsTM.all():
                for aTargetTM in aSeg.tm_source_quality_set.all():
                    percent_quality = ...
                    contextTarget.append({
                        "source": aTargetTM.tmSeg.source,
                        "target": aTargetTM.tmSeg.target,
                        "quality": str(percent_quality) + '%',
                        "origin": "TM",
                        "orig_name": aTargetTM.tmSeg.tm_client.name,
                        "table_id": aTargetTM.tmSeg.id,
                    })
            if aSeg.segmentsMT.all():
                for aTargetMT in aSeg.segmentsMT.all():
                    contextTarget.append({
                        "target": aTargetMT.target,
                        "quality": "",
                        "origin": "MT",
                        "orig_name": aTargetMT.mt.name,
                        "table_id": aTargetMT.id,
                    })
            contextSegment.append({
                "id": aSeg.id,
                "order": aSeg.segment_order,
                "source": aSeg.source,
                "target": contextTarget,
            })
        context['segments'] = contextSegment
        return context
Everything works, but:
I hit the DB each time I call aSeg.segmentsTM.all() and aSeg.segmentsMT.all(), because I guess the prefetch is not preventing it... this results in hundreds of duplicated queries.
All these queries are repeated each time I load more entries from the pagination (in other words, each time more entries are presented because of scrolling, the full set of entries is requested... I also tried using lazy_paginate but nothing changes).
In principle, all the logic I have in get_context_data (there is more, but I just presented the essential code) could be reproduced in the template, passing just the queryset, or on the client with a lot of jQuery/JavaScript code, but I don't think it's a good idea to proceed like this...
So my question is: can I optimize this code, reducing the number of DB hits and the time to produce the response? Just to give you an idea, a relatively small file (with 300 entries in the CATSegmentCollection) loads in 6.5 sec with 330 queries (more than 300 of them duplicated) taking 0.4 sec. The DJDT time analysis gives
domainLookup 273 (+0)
connect 273 (+0)
request 275 (+-1475922263356)
response 9217 (+-1475922272298)
domLoading 9225 (+-1475922272306)
Any suggestions?
Thanks
Optimizing the number of queries is quite a tricky issue, as pinpointing which exact code triggered that extra query is not obvious. So I would suggest commenting out all the code inside that for loop and then uncommenting it line by line, monitoring which exact line causes extra queries, and optimizing gradually.
A few observations:
You need to carefully declare all deep relations you touch inside prefetch_related, like:
.prefetch_related('segmentsTM', 'segmentsTM__tm_source_quality_set', 'segmentsTM__tm_source_quality_set__tmSeg', 'segmentsTM__tm_source_quality_set__tmSeg__tm_client', 'segmentsMT', 'segmentsMT__mt')
No need to check if aSeg.segmentsMT.all(): before looping over it, as it would still return an empty iterable.
Unrelated note regarding related_name='segMT' in your CATSegmentCollection model: related_name declares how the current model should be accessed from the other side of the relation, so you would probably want something like related_name='cATSegmentCollections' for both fields.
In the end you should be able to optimize it down to somewhere around 10 queries (roughly one per relation). The success criterion is having no repetitive WHERE foreign_id=X queries, only WHERE foreign_id IN (X,Y,...) type queries.
Following serg's suggestions I started to dig into the problem, and in the end I was able to prefetch all the needed info. I guess that using a through table changes the way prefetching works... Here is the correct queryset:
from django.db.models import Prefetch

all_cat_seg = CATSegmentCollection.objects.filter(theFile=theFile).order_by('segment_order')
all_tm_source_quality_entries = TM_Source_quality.objects.filter(catSeg__in=all_cat_seg).select_related('tmSeg', 'tmSeg__tm_client')
prefetch = Prefetch('tm_source_quality_set', queryset=all_tm_source_quality_entries)
CATSegmentCollection.objects.filter(theFile=theFile).prefetch_related(
    prefetch,
    'segmentsMT',
    'segmentsMT__mt',
).order_by('segment_order')
With this queryset, I was able to reduce the number of queries to 10...
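With that queryset in place, the lookups inside the existing get_context_data loop should be served from the prefetch cache instead of hitting the database, as long as the prefetched relations are not re-filtered inside the loop; roughly:
for aSeg in myCatCollection:
    for aTargetTM in aSeg.tm_source_quality_set.all():  # served by the Prefetch object above
        ...
    for aTargetMT in aSeg.segmentsMT.all():             # served by prefetch_related
        ...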

Django ORM - LEFT JOIN with WHERE clause

I have made a previous post related to this problem here but because this is a related but new problem I thought it would be best to make another post for it.
I'm using Django 1.8
I have a User model and a UserAction model. A user has a type. UserAction has a time, which indicates how long the action took as well as a start_time which indicates when the action began. They look like this:
class User(models.Model):
    user_type = models.IntegerField()

class UserAction(models.Model):
    user = models.ForeignKey(User)
    time = models.IntegerField()
    start_time = models.DateTimeField()
Now what I want to do is get all users of a given type and the sum of time of their actions, optionally filtered by the start_time.
What I am doing is something like this:
from datetime import datetime, timedelta
from django.db.models import Sum
from django.db.models.functions import Coalesce

# stubbing in a start time to filter by
start_time = datetime.now() - timedelta(days=2)
# stubbing in a type
type = 2
# this gives me the users and the sum of the time of their actions, or 0 if no
# actions exist
q = User.objects.filter(user_type=type).values('id').annotate(
    total_time=Coalesce(Sum('useraction__time'), 0))
# now I try to add the filter for start_time of the actions to be greater than or
# equal to start_time
q = q.filter(useraction__start_time__gte=start_time)
Now what this does, of course, is an INNER JOIN on UserAction, thus removing all the users without actions. What I really want is the equivalent of my LEFT JOIN with a WHERE clause, but for the life of me I can't find how to do that. I've looked at the docs and at the source but am not finding an answer. I'm (pretty) sure this is something that can be done, I'm just not seeing how. Could anyone point me in the right direction? Any help would be very much appreciated. Thanks much!
I'm having the same kind of problem as you. I haven't found any proper way of solving the problem yet, but I've found a few fixes.
One way would be looping through all the users:
q = User.objects.filter(user_type=type)
for u in q:
    u.time_sum = UserAction.objects.filter(user=u, start_time__gte=start_time) \
                                   .aggregate(time_sum=Sum('time'))['time_sum']
This method does, however, issue a database query for each user. It might do the trick if you don't have many users, but it can get very time-consuming on a large database.
Another way of solving the problem would be using the extra method of the QuerySet API. This is a method that is detailed in this blog post by Timmy O'Mahony.
valid_actions = UserAction.objects.filter(start_time__gte=start_time)
q = User.objects.filter(user_type=type).extra(select={
    "time_sum": """
        SELECT SUM(time)
        FROM userAction
        WHERE userAction.user_id = user.id
        AND userAction.id IN (%s)
    """ % ",".join([str(uAction.id) for uAction in valid_actions.all()])
})
This method, however, relies on calling the database with the raw SQL table names, which is very un-Django: if you change the db_table of one of your models or the db_column of one of its columns, this code will no longer work. It does only require two queries, though: the first to get the list of valid UserActions and the second to compute the sum for each matching user.
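For what it's worth, on newer Django versions (2.0+, so newer than the 1.8 used in the question) the same LEFT JOIN plus WHERE effect can be obtained with conditional aggregation, i.e. a filter argument on the aggregate; a minimal, untested sketch:
from django.db.models import Q, Sum
from django.db.models.functions import Coalesce

q = (User.objects
     .filter(user_type=type)
     .annotate(total_time=Coalesce(
         # the condition lives inside the aggregate, so users without matching
         # actions are kept and get 0 instead of being filtered out by the join
         Sum('useraction__time', filter=Q(useraction__start_time__gte=start_time)),
         0)))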

Improving Django performance with 350000+ regs and complex query

I have a model like this:
class Stock(models.Model):
    product = models.ForeignKey(Product)
    place = models.ForeignKey(Place)
    date = models.DateField()
    quantity = models.IntegerField()
I need to get the latest (by date) quantity for every product at every place,
with almost 500 products, 100 places and 350,000 stock records in the database.
My current code is like this; it worked in testing, but it takes so long with the real data that it's useless:
stocks = Stock.objects.filter(product__in=self.products,
                              place__in=self.places, date__lt=date_at)
stock_values = {}
for prod in self.products:
    for place in self.places:
        key = u'%s%s' % (prod.id, place.id)
        stock = stocks.filter(product=prod, place=place, date=date_at)
        if len(stock) > 0:
            stock_values[key] = stock[0].quantity
        else:
            try:
                stock = stocks.filter(product=prod, place=place).order_by('-date')[0]
            except IndexError:
                stock_values[key] = 0
            else:
                stock_values[key] = stock.quantity
return stock_values
How would you make it faster?
Edit:
Rewrote the code as this:
stock_values = {}
for product in self.products:
    for place in self.places:
        try:
            stock_value = Stock.objects.filter(product=product, place=place, date__lte=date_at) \
                                       .order_by('-date').values('cant')[0]['cant']
        except IndexError:
            stock_value = 0
        stock_values[u'%s%s' % (product.id, place.id)] = stock_value
return stock_values
It works better (from 256 seconds down to 64) but I still need to improve it. Maybe some custom SQL, I don't know...
Arthur's right, the len(stock) isn't the most efficient way to do that. You could go further along the "easier to ask for forgiveness than permission" route with something like this inside the inner loop:
key = u'%s%s' % (prod.id, place.id)
try:
    stock = stocks.filter(product=prod, place=place, date=date_at)[0]
    quantity = stock.quantity
except IndexError:
    try:
        stock = stocks.filter(product=prod, place=place).order_by('-date')[0]
        quantity = stock.quantity
    except IndexError:
        quantity = 0
stock_values[key] = quantity
I'm not sure how much that would improve it compared to just changing the length check, though I think this should at least restrict it to two queries with LIMIT 1 on them (see Limiting QuerySets).
Mind you, this is still performing a lot of database hits since you could run through that loop almost 50000 times. Optimize how you're looping and you're in a better position still.
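If you want to go further, one possibility (a rough, untested sketch that is not from the original answers) is to pull all candidate rows in a single query, newest first, and keep the first quantity seen per (product, place) pair in Python, trading one big query for the 50,000 small ones:
stocks = (Stock.objects
          .filter(product__in=self.products, place__in=self.places, date__lte=date_at)
          .order_by('-date')
          .values_list('product_id', 'place_id', 'quantity'))

# default every pair to 0, then keep the newest quantity we encounter
stock_values = {u'%s%s' % (prod.id, place.id): 0
                for prod in self.products for place in self.places}
seen = set()
for product_id, place_id, quantity in stocks:
    key = u'%s%s' % (product_id, place_id)
    if key not in seen:          # rows arrive newest first, so the first hit wins
        seen.add(key)
        stock_values[key] = quantity
return stock_values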
Maybe the trick is in that len() method!
Following the docs:
Note: Don't use len() on QuerySets if all you want to do is determine
the number of records in the set. It's much more efficient to handle a
count at the database level, using SQL's SELECT COUNT(*), and Django
provides a count() method for precisely this reason. See count()
below.
So try changing the len() to count() and see if it makes it faster!

App Engine GQL: querying a date range

What would be the App Engine equivalent of this Django statement?
return Post.objects.get(created_at__year=bits[0],
                        created_at__month=bits[1],
                        created_at__day=bits[2],
                        slug__iexact=bits[3])
I've ended up writing this:
Post.gql('WHERE created_at > DATE(:1, :2, :3) AND created_at < DATE(:1, :2, :4) and slug = :5',
int(bit[0]), int(bit[1]), int(bit[2]), int(bit[2]) + 1, bit[3])
But it's pretty horrific compared to Django. Any other more Pythonic/Django-magic way, e.g. with Post.filter() or created_at.day/month/year attributes?
How about
from datetime import datetime, timedelta

created_start = datetime(year, month, day)
created_end = created_start + timedelta(days=1)
slug_value = 'my-slug-value'

posts = Post.all()
posts.filter('created_at >=', created_start)
posts.filter('created_at <', created_end)
posts.filter('slug =', slug_value)
# You can iterate over this query set just like a list
for post in posts:
    print post.key()
You don't need 'relativedelta' - what you describe is a datetime.timedelta. Otherwise, your answer looks good.
As far as processing time goes, the nice thing about App Engine is that nearly all queries have the same cost-per-result - and all of them scale proportionally to the records returned, not the total datastore size. As such, your solution works fine.
Alternately, if you need your one inequality filter for something else, you could add a 'created_day' DateProperty, and do a simple equality check on that.
Ended up using the relativedelta library + chaining the filters in jQuery style, which although not too Pythonic yet, is a tad more comfortable to write and much DRYer. :) Still not sure if it's the best way to do it, as it'll probably require more database processing time?
date = datetime(int(year), int(month), int(day))
...  # then
queryset = Post.objects_published() \
               .filter('created_at >=', date) \
               .filter('created_at <', date + relativedelta(days=+1))
...
and passing slug to the object_detail view or yet another filter.
By the way you could use the datetime.timedelta. That lets you find date ranges or date deltas.
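For instance, a minimal sketch of the same chained filters using the standard-library timedelta (Post.objects_published() is the custom manager from the snippet above):
from datetime import datetime, timedelta

date = datetime(int(year), int(month), int(day))
queryset = Post.objects_published() \
               .filter('created_at >=', date) \
               .filter('created_at <', date + timedelta(days=1))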