I have a model like this:
class Stock(models.Model):
    product = models.ForeignKey(Product)
    place = models.ForeignKey(Place)
    date = models.DateField()
    quantity = models.IntegerField()
I need to get the latest (by date) quantity for every product for every place,
with almost 500 products, 100 places and 350,000 stock records in the database.
My current code is below. It worked in testing, but with the real data it takes so long that it's useless:
stocks = Stock.objects.filter(product__in=self.products,
                              place__in=self.places, date__lt=date_at)
stock_values = {}
for prod in self.products:
    for place in self.places:
        key = u'%s%s' % (prod.id, place.id)
        stock = stocks.filter(product=prod, place=place, date=date_at)
        if len(stock) > 0:
            stock_values[key] = stock[0].quantity
        else:
            try:
                stock = stocks.filter(product=prod, place=place).order_by('-date')[0]
            except IndexError:
                stock_values[key] = 0
            else:
                stock_values[key] = stock.quantity
return stock_values
How would you make it faster?
Edit:
I rewrote the code like this:
stock_values = {}
for product in self.products:
    for place in self.places:
        try:
            stock_value = Stock.objects.filter(product=product, place=place, date__lte=date_at)\
                .order_by('-date').values('quantity')[0]['quantity']
        except IndexError:
            stock_value = 0
        stock_values[u'%s%s' % (product.id, place.id)] = stock_value
return stock_values
It works better (from 256 seconds down to 64), but it still needs to improve. Maybe some custom SQL, I don't know...
Arthur's right: len(stock) isn't the most efficient way to do that. You could go further along the "easier to ask for forgiveness than permission" route with something like this inside the inner loop:
key = u'%s%s' % (prod.id, place.id)
try:
    stock = stocks.filter(product=prod, place=place, date=date_at)[0]
    quantity = stock.quantity
except IndexError:
    try:
        stock = stocks.filter(product=prod, place=place).order_by('-date')[0]
        quantity = stock.quantity
    except IndexError:
        quantity = 0
stock_values[key] = quantity
I'm not sure how much that would improve it compared to just changing the length check, though I think this should at least restrict it to two queries with LIMIT 1 on them (see Limiting QuerySets).
Mind you, this is still performing a lot of database queries, since you could run through that loop almost 50,000 times. Optimize how you're looping and you'll be in an even better position.
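One way to take that idea further is to cut the whole loop down to a single query: fetch every relevant row once, ordered by date, and let the latest row per (product, place) win while filling the dict. This is only a rough sketch using the field names from the model above, and it trades memory (all matching rows are pulled at once) for query count:

stocks = (Stock.objects
          .filter(product__in=self.products,
                  place__in=self.places,
                  date__lte=date_at)
          .order_by('date')
          .values_list('product_id', 'place_id', 'quantity'))

# start everything at 0, then overwrite as rows arrive
stock_values = {u'%s%s' % (prod.id, place.id): 0
                for prod in self.products for place in self.places}
for product_id, place_id, quantity in stocks:
    # rows come back in ascending date order, so the last assignment
    # per key is the latest quantity
    stock_values[u'%s%s' % (product_id, place_id)] = quantity
return stock_values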
Maybe the trick is in that len() method!
From the docs:
Note: Don't use len() on QuerySets if all you want to do is determine
the number of records in the set. It's much more efficient to handle a
count at the database level, using SQL's SELECT COUNT(*), and Django
provides a count() method for precisely this reason. See count()
below.
So try changing the len() to count() and see if it makes things faster!
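For illustration, here is the difference on the queryset from the question: len() pulls every row back and counts in Python, while count() and exists() stay in SQL (exists() being the cheapest when you only need a yes/no answer):

stock = stocks.filter(product=prod, place=place, date=date_at)

if len(stock) > 0:     # evaluates the whole queryset, then counts in Python
    ...
if stock.count() > 0:  # SELECT COUNT(*) in the database
    ...
if stock.exists():     # SELECT ... LIMIT 1, cheapest for a yes/no check
    ...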
Related
I'm receiving financial data for approximately 5000 instruments every 5 seconds, and need to update the respective entries in the database. The model looks as follows:
class Market(models.Model):
    market = models.CharField(max_length=200)
    exchange = models.ForeignKey(Exchange, on_delete=models.CASCADE)
    ask = models.FloatField()
    bid = models.FloatField()
    lastUpdate = models.DateTimeField(default=timezone.now)
What needs to happen is the following:
After new financial data is received, check if an entry exists in the
database.
If the entry exists, update the ask, bid and lastUpdate fields
If the entry does not exist, create a new entry
My code looks as follows:
bi_markets = []
for item in dbMarkets:
    eItem = Market.objects.filter(exchange=item.exchange, market=item.market)
    if len(eItem) > 0:
        eItem.update(ask=item.ask, bid=item.bid)
    else:
        bi_markets.append(item)

# Bulk insert items that do not exist
Market.objects.bulk_create(bi_markets)
However, executing this takes way too long: approximately 30 seconds. I need to reduce the time down to 1 second. I know this can be done, as I do the same with custom SQL code in .NET in under 100 ms. Any idea how to improve the performance in Django?
If it’s this kind of performance you’re going for, I don’t see why you wouldn’t just break out into raw SQL. Bulk creating things that don’t exist yet sounds like the advanced SQL querying that Django isn’t really made for.
https://docs.djangoproject.com/en/2.0/topics/db/sql/
You can also do (sorry on mobile):
bi_markets = []
for item in dbMarkets:
    rows = Market.objects.filter(exchange=item.exchange, market=item.market).update(ask=item.ask, bid=item.bid)
    if rows == 0:
        bi_markets.append(item)

Market.objects.bulk_create(bi_markets)
Maybe that combination will generate some better SQL and it sidesteps the exists() call as well (update returns how many rows it changed).
I've decided to split the update and create functionality. The create only happens when the app starts; from there on I do updates using a custom SQL script. See below. Working great.
updateQ = []
updateQ.append("BEGIN TRANSACTION;")
for dbItem in dbMarkets:
    eItem = tickers[dbItem.market]
    qStr = "UPDATE app_market SET ask = " + str(eItem['ask']) + ",bid = " + str(eItem['bid']) + " WHERE exchange_id = " + str(e.dbExchange.pk) + " AND market = " + '"' + dbItem.market + '";'
    updateQ.append(qStr)
updateQ.append("COMMIT;")
updateQFinal = ''.join(map(str, updateQ))

with connection.cursor() as cursor:
    cursor.executescript(updateQFinal)
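As an aside, building the statement by string concatenation is fragile (quoting) and open to SQL injection. A parameterized executemany covers the same UPDATE; this is just a sketch reusing the names from the snippet above (Django's cursor takes %s placeholders on every backend):

from django.db import connection, transaction

sql = ("UPDATE app_market SET ask = %s, bid = %s "
       "WHERE exchange_id = %s AND market = %s")
params = [(tickers[dbItem.market]['ask'],
           tickers[dbItem.market]['bid'],
           e.dbExchange.pk,
           dbItem.market)
          for dbItem in dbMarkets]

with transaction.atomic(), connection.cursor() as cursor:
    # the driver runs the statement once per parameter tuple,
    # all inside a single transaction
    cursor.executemany(sql, params)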
I have to present a very complex page with a lot of data coming from 3 different tables related with ForeignKey and ManyToManyField... I was able to do what I want, but performance is terrible and I'm stuck trying to find a better approach. Here is the detailed code:
Models:
class CATSegmentCollection(models.Model):
    theFile = models.ForeignKey('app_file.File', related_name='original_file')
    segmentsMT = models.ManyToManyField('app_mt.MachineTransTable', related_name='segMT', blank=True)
    segmentsTM = models.ManyToManyField('app_tm.TMTable', related_name='segTM', blank=True, through='app_cat.TM_Source_quality')
    ...

class TM_Source_quality(models.Model):
    catSeg = models.ForeignKey('app_cat.CATSegmentCollection')
    tmSeg = models.ForeignKey('app_tm.TMTable')
    quality = models.IntegerField()

class MachineTransTable(models.Model):
    mt = models.ForeignKey('app_mt.MT_available', blank=True, null=True)
    ...

class TMTable(models.Model):
    ...
From these models (I just wrote what is relevant to my problem) I present all the CATSegmentCollection entries related to a single file, together with their associated TM and MT segments. In other words, each entry in CATSegmentCollection has zero or more TM segments from the TMTable table and zero or more MT segments from the MachineTransTable table.
This is what I do in the ListView (I use AjaxListView because I'm using infinite scrolling pagination from django-el-pagination):
class CatListView(LoginRequiredMixin, AjaxListView):
    Model = CATSegmentCollection
    template_name = 'app_cat/cat.html'
    page_template = 'app_cat/cat_page.html'

    def get_object(self, queryset=None):
        obj = File.objects.get(id=self.kwargs['file_id'])
        return obj

    def get_queryset(self):
        theFile = self.get_object()
        return CATSegmentCollection.objects.filter(theFile=theFile).prefetch_related('segmentsMT').prefetch_related('segmentsTM').order_by('segment_order')

    def get_context_data(self, **kwargs):
        context = super(CatListView, self).get_context_data(**kwargs)
        contextSegment = []
        myCatCollection = self.get_queryset()
        theFile = self.get_object()
        context['file'] = theFile
        for aSeg in myCatCollection:
            contextTarget = []
            if aSeg.segmentsTM.all():
                for aTargetTM in aSeg.tm_source_quality_set.all():
                    percent_quality = ...
                    contextTarget.append({
                        "source": aTargetTM.tmSeg.source,
                        "target": aTargetTM.tmSeg.target,
                        "quality": str(percent_quality) + '%',
                        "origin": "TM",
                        "orig_name": aTargetTM.tmSeg.tm_client.name,
                        "table_id": aTargetTM.tmSeg.id,
                    })
            if aSeg.segmentsMT.all():
                for aTargetMT in aSeg.segmentsMT.all():
                    contextTarget.append({
                        "target": aTargetMT.target,
                        "quality": "",
                        "origin": "MT",
                        "orig_name": aTargetMT.mt.name,
                        "table_id": aTargetMT.id,
                    })
            contextSegment.append({
                "id": aSeg.id,
                "order": aSeg.segment_order,
                "source": aSeg.source,
                "target": contextTarget,
            })
        context['segments'] = contextSegment
        return context
Everything works but:
I hit the DB each time I call aSeg.segmentsTM.all() and aSeg.segmentsMT.all() because I guess the prefetch is not preventing it... this results in hundreds of duplicated queries.
All these queries are repeated each time I load more entries through the pagination (in other words, each time more entries are presented because of scrolling, the full set of entries is requested again; I also tried using lazy_paginate but nothing changes).
In principle, all the logic I have in get_context_data (there is more but I just presented the essential code) could be reproduced in the template by passing just the queryset, or on the client with a lot of jQuery/JavaScript code, but I don't think it's a good idea to proceed like that...
So my question is: can I optimize this code, reducing the number of DB hits and the time to produce the response? Just to give you an idea, a relatively small file (with 300 entries in the CATSegmentCollection) loads in 6.5 seconds with 330 queries (more than 300 of them duplicated) taking 0.4 seconds. The DJDT time analysis gives:
domainLookup 273 (+0)
connect 273 (+0)
request 275 (+-1475922263356)
response 9217 (+-1475922272298)
domLoading 9225 (+-1475922272306)
Any suggestions?
Thanks
Optimizing the number of queries is a tricky issue, as pinpointing which exact code triggered an extra query is not obvious. So I would suggest commenting out all the code inside that for loop and then uncommenting it line by line while monitoring which exact line causes extra queries, optimizing gradually.
A few observations:
You need to carefully declare all deep relations you touch inside prefetch_related, like:
.prefetch_related('segmentsTM', 'segmentsTM__tm_source_quality_set', 'segmentsTM__tm_source_quality_set__tmSeg', 'segmentsTM__tm_source_quality_set__tmSeg__tm_client', 'segmentsMT', 'segmentsMT__mt')
No need to check if aSeg.segmentsMT.all(): before looping over it as it would still return an empty iterable.
An unrelated note regarding related_name='segMT' in your CATSegmentCollection model: related_name is used to declare how the current model should be accessed from the other side of the relation, so you would probably want something like related_name='cATSegmentCollections' for both fields.
In the end you should be able to optimize it down to somewhere around 10 queries (roughly one per relation). The success criterion is having no swarm of WHERE foreign_id=X queries, and only WHERE foreign_id IN (X, Y, ...) type queries.
Following serg's suggestions I started to dig into the problem, and in the end I was able to prefetch all the needed info. I guess that using a through table changes the way prefetching works... Here is the correct queryset:
all_cat_seg = CATSegmentCollection.objects.filter(theFile=theFile).order_by('segment_order')
all_tm_source_quality_entries = TM_Source_quality.objects.filter(catSeg__in=all_cat_seg).select_related('tmSeg', 'tmSeg__tm_client')
prefetch = Prefetch('tm_source_quality_set', queryset=all_tm_source_quality_entries)

CATSegmentCollection.objects.filter(theFile=theFile).prefetch_related(
    prefetch,
    'segmentsMT',
    'segmentsMT__mt',
).order_by('segment_order')
With this queryset, I was able to reduce the number of queries to 10...
I have a list of "posts" I have to render. For each post, I must do three filter querysets, OR them together, and then count the number of objects. Is this reasonable? What factors might make this slow?
This is roughly my code:
def viewable_posts(request, post):
    private_posts = post.replies.filter(permissions=Post.PRIVATE, author_profile=request.user.user_profile).order_by('-modified_date')
    community_posts = post.replies.filter(permissions=Post.COMMUNITY, author_profile__in=request.user.user_profile.following.all()).order_by('-modified_date')
    public_posts = post.replies.filter(permissions=Post.PUBLIC).order_by('-modified_date')
    mixed_posts = private_posts | community_posts | public_posts
    return mixed_posts

def viewable_posts_count(request, post):
    return viewable_posts(request, post).count()
The biggest factor I can see is that you run filter actions for each post. If possible, you should query the results associated with each post in ONE query. As for the count, it's the most efficient way of getting the number of results from a query, so it's likely not a problem.
Try the following code:
def viewable_posts(request, post):
    private_posts = post.replies.filter(permissions=Post.PRIVATE, author_profile=request.user.user_profile).values_list('id', flat=True)
    community_posts = post.replies.filter(permissions=Post.COMMUNITY, author_profile__in=request.user.user_profile.following.values_list('id', flat=True)).values_list('id', flat=True)
    public_posts = post.replies.filter(permissions=Post.PUBLIC).values_list('id', flat=True)
    Lposts_id = list(private_posts)
    Lposts_id.extend(community_posts)
    Lposts_id.extend(public_posts)
    viewable_posts = post.replies.filter(id__in=Lposts_id).order_by('-modified_date')
    viewable_posts_count = post.replies.filter(id__in=Lposts_id).count()
    return viewable_posts, viewable_posts_count
It should improve the following things:
order_by once, instead of three times
The count method runs on a query with only the index field
Django uses a faster filter with values_list, both for the count and the filtering.
Depending on your database, the DB's own cache may reuse the posts last queried for viewable_posts when computing viewable_posts_count.
Indeed, if you can squeeze the first three filter queries into one, you will save time as well.
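For instance, the three permission filters can be expressed as one queryset with Q objects. This is only a sketch using the names from the question, not tested against the actual models:

from django.db.models import Q

def viewable_posts(request, post):
    profile = request.user.user_profile
    # one query combining the three permission cases with OR
    return (post.replies
            .filter(Q(permissions=Post.PRIVATE, author_profile=profile) |
                    Q(permissions=Post.COMMUNITY,
                      author_profile__in=profile.following.all()) |
                    Q(permissions=Post.PUBLIC))
            .order_by('-modified_date'))

def viewable_posts_count(request, post):
    return viewable_posts(request, post).count()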
I have the following models with its methods:
class TrendingTopic(models.Model):
    categories = models.ManyToManyField('Category', through='TTCategory', blank=True, null=True)
    location = models.ForeignKey(Location)

    def get_rank(self, t_date=None):
        if t_date:
            ttcs = self.trendingtopiccycle_set.filter(cycle_time__gt=t_date)
        else:
            ttcs = self.trendingtopiccycle_set.all()
        if ttcs:
            return sum([ttc.rank for ttc in ttcs]) / len(ttcs)
        return 0

    def get_day_rank(self, t_date):
        ttcs = self.trendingtopiccycle_set.filter(cycle_time__year=t_date.year,
                                                  cycle_time__month=t_date.month,
                                                  cycle_time__day=t_date.day)
        sum_rank = sum([ttc.day_rank for ttc in ttcs if ttc.day_rank])
        if sum_rank:
            return sum_rank / len(ttcs)
        return 0

class TrendingTopicCycle(models.Model):
    tt = models.ForeignKey(TrendingTopic)
    cycle_time = models.DateTimeField(default=datetime.now)
    from_tt_before = models.BooleanField(default=False)
    rank = models.FloatField(default=0.0)
    day_rank = models.FloatField(default=0.0)
And then I have some functions that are used in the views to retrieve the desired information:
Show the best trending topics of the current day:
def day_topics(tt_date, limit=10):
    tts = [(ttc.tt, ttc.tt.get_day_rank(tt_date)) for ttc in
           TrendingTopicCycle.objects.distinct('tt__name')
                                     .filter(cycle_time__year=tt_date.year,
                                             cycle_time__month=tt_date.month,
                                             cycle_time__day=tt_date.day)]
    sorted_tts = sorted(tts, key=itemgetter(1), reverse=True)[:limit]
    return sorted_tts
Show the best trending topics of a given location (woeid) within a given time window:
def hot_topics(woeid=None, limit=10):
    CYCLE_LIMIT = datetime.now() + relativedelta(hours=-5)
    TT_CYCLES_LIMIT = datetime.now() + relativedelta(days=-2)
    if woeid:
        tts = [ttc.tt for ttc in
               TrendingTopicCycle.objects.filter(tt__location__woeid=woeid)
                                         .distinct('tt__name')
                                         .exclude(cycle_time__lt=CYCLE_LIMIT)]
    else:
        tts = [ttc.tt for ttc in
               TrendingTopicCycle.objects.distinct('tt__name')
                                         .exclude(cycle_time__lt=CYCLE_LIMIT)]
    sorted_tts = sorted(tts, key=lambda tt: tt.get_rank(TT_CYCLES_LIMIT), reverse=True)[:limit]
    return sorted_tts
The problem with the current solution is that it runs really slowly because it's doing a lot of queries (hundreds) to retrieve the data. I'm using the Django Debug Toolbar to help me measure the performance.
Obviously I'm doing something terribly wrong and I'm looking for a solution, any help would be greatly appreciated.
Edit:
Each trending topic has a set of trending topic cycles (ttc). Each ttc has two ranks: the general one (rank) and the day_rank. The trending topic's ranks are calculated by looping through each ttc.
First, be aware that django-debug-toolbar, while fantastic, is itself very slow. If you comment out its middleware, your response times improve dramatically. It's a very useful tool, don't get me wrong. I use it myself, religiously, but the point is you can't benchmark your site on something subjective like "slowness" while it's enabled.
Second, your code is a little confusing, so it's difficult to say exactly what you should do. For example, TrendingTopicCycle has a rank and day_rank field, but you never use them in the posted code. Calling get_day_rank issues a query each time, so it would obviously be more efficient if you could just filter on the day_rank field itself (eliminating the need for that query), but I cannot tell from the code you have here if or when those fields are actually set.
A small improvement you can make to the code as-is is to use select_related judiciously. For instance, each time ttc.tt.get_day_rank(tt_date) is run in the list comprehension, a query is issued to get tt and then another query is issued in get_day_rank. Simply adding .select_related('tt') to your queryset would at least eliminate that query for tt.
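Roughly, that would mean something like this in day_topics(); just a sketch of where select_related fits, keeping everything else from the question as-is:

ttcs = (TrendingTopicCycle.objects
        .distinct('tt__name')
        .select_related('tt')          # pulls the related TrendingTopic in the same query
        .filter(cycle_time__year=tt_date.year,
                cycle_time__month=tt_date.month,
                cycle_time__day=tt_date.day))
tts = [(ttc.tt, ttc.tt.get_day_rank(tt_date)) for ttc in ttcs]

Note that get_day_rank() still issues one query per cycle, so this only removes the extra tt lookups.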
Also, I'm not sure if it actually causes Django to issue a different (and perhaps less efficient) query, but regardless, there's no point in filtering individually for year, month, and day; just filter on the full date, i.e.:
TrendingTopicCycle.objects.distinct('tt__name') \
                          .filter(cycle_time__date=date)
if ttcs:
    return sum([ttc.rank for ttc in ttcs]) / len(ttcs)
return 0
This can be replaced by a db query. https://docs.djangoproject.com/en/dev/topics/db/aggregation/
something like:
ttcs.aggregate(Sum('rank'))['rank__sum']
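Since get_rank() really computes an average (sum divided by count), Avg pushes the whole thing into one query. A sketch of the method rewritten that way, assuming the fields above (Avg returns None on an empty queryset, hence the 'or 0'):

from django.db.models import Avg

def get_rank(self, t_date=None):
    ttcs = self.trendingtopiccycle_set.all()
    if t_date:
        ttcs = ttcs.filter(cycle_time__gt=t_date)
    # one aggregate query instead of loading every cycle into Python
    return ttcs.aggregate(avg_rank=Avg('rank'))['avg_rank'] or 0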
I want to achieve something relatively simple:
I want to retrieve all objects from my model given a range of ids
(for example, retrieve lines 5 to 10 of a book's chapter).
Right now in my views.py, I've:
def line_range(request, book_id, chapter_id, line_start, line_end):
    book_name = get_object_or_404(Book, id=book_id)
    chapter_lines = []
    for i in range(int(line_start), int(line_end) + 1):
        chapter_lines.append(Line.objects.get(book=book_id, chapter=chapter_id, line=i))
    return render_to_response('app/book.html', {'bookTitle': book_name, 'lines': chapter_lines})
Now, this is obviously not the most optimized way of doing things, as it would do n database queries, when it could be done in only one.
Is there a way of doing something like:
def line_range(request, book_id, chapter_id, line_start, line_end):
    book_name = get_object_or_404(Book, id=book_id)
    lines_range = range(int(line_start), int(line_end) + 1)
    chapter_lines = get_list_or_404(Line, book=book_id, chapter=chapter_id, line=lines_range)
    return render_to_response('app/book.html', {'bookTitle': book_name, 'lines': chapter_lines})
This would in theory generate a much better database query (1 instead of n) and should be better performance-wise.
Of course, this syntax does not work (the lookup expects an integer, not a list).
Thanks!
I think you want __range:
Range test (inclusive).
Example:
start_date = datetime.date(2005, 1, 1)
end_date = datetime.date(2005, 3, 31)
Entry.objects.filter(pub_date__range=(start_date, end_date))
SQL equivalent:
SELECT ... WHERE pub_date BETWEEN '2005-01-01' and '2005-03-31';
You can use range anywhere you can use BETWEEN in SQL — for dates, numbers and even characters.
So yours would be, I think:
chapter_lines = get_list_or_404(..., line__range=(int(line_start), int(line_end)))
Likewise, you can use __lt, __gt, __lte, __gte for one-sided comparisons.
I would encourage you to always keep a window open with the Django documentation. There is lots of great info there if you just look.
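Putting it together, the view from the question might look roughly like this with __range (keeping the question's names and render_to_response; no +1 is needed because __range is inclusive on both ends):

from django.shortcuts import get_object_or_404, get_list_or_404, render_to_response

def line_range(request, book_id, chapter_id, line_start, line_end):
    book_name = get_object_or_404(Book, id=book_id)
    # one query for all the requested lines; 404 if none are found
    chapter_lines = get_list_or_404(Line, book=book_id, chapter=chapter_id,
                                    line__range=(int(line_start), int(line_end)))
    return render_to_response('app/book.html',
                              {'bookTitle': book_name, 'lines': chapter_lines})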