Django query using large amount of memory - django

I have a query that is causing memory spikes in my application. The code below is designed to show a single record, but occasionally shows 5 to 10 records. The problem is that there are edge cases where the filter matches 100,000 results, all of which end up in the MultipleObjectsReturned branch. I believe this causes the high memory usage. The code is:
try:
    record = record_class.objects.get(**filter_params)
    context["record"] = record
except record_class.MultipleObjectsReturned:
    records = record_class.objects.filter(**filter_params)
    template_path = "record/%(type)s/%(type)s_multiple.html" % {"type": record_type}
    return render(request, template_path, {"records": records}, current_app=record_type)
I thought about adding a slice at the end of the filter query, so it looks like this:
records = record_class.objects.filter(**filter_params)[:20]
But the code still seems slow. Is there a way to limit the results to 20 in a way that does not load the entire query or cause high memory usage?

As the Django documentation says:
Use a subset of Python’s array-slicing syntax to limit your QuerySet to a certain number of results. This is the equivalent of SQL’s LIMIT and OFFSET clauses.
For example, this returns the first 5 objects (LIMIT 5):
Entry.objects.all()[:5]
So it seems that "limiting the results to 20 in a way that does not load the entire query" is being fulfiled .
So your code is slow for some other reason. or maybe you are checking the time complexity in wrong way.
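If you want to verify that, one quick check (a sketch; run it in a Django shell) is to print the SQL the sliced queryset will issue and confirm it carries a LIMIT clause:
records = record_class.objects.filter(**filter_params)[:20]
print(records.query)  # the rendered SQL should end with LIMIT 20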

Related

Django ORM Performance Issues

I am looping over 13,000 in-memory city names and generating queries to filter for something. I've encountered something I cannot explain...
When the loop has a single line:
cities = City.objects.filter(name__iexact=city)
performance is almost 800 items/second
When the loop measures the length of the returned set...
cities = City.objects.filter(name__iexact=city)
num_cities = len(cities)
performance drops to 8 items/second
I can't explain where the performance degradation occurs. Obviously, I'm missing something... Why would counting the number of items in an in-memory array that always holds between 0 and 3 items reduce performance by a factor of 100?
Django querysets are lazy, so QuerySet.filter does not actually evaluate the queryset, i.e. run the query in the database. When you call len() on it, the queryset is evaluated: it fetches all the matching items from the database just to get the count. Hence, counting this way is much slower.
You'll get a far better performance if you run COUNT on the database level:
num_cities = City.objects.filter(name__iexact=city).count()
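To make the difference concrete, here is a minimal sketch of the two variants and what each one asks the database to do:
# len() evaluates the queryset: every matching row is fetched and turned
# into a model instance just so Python can count them.
num_cities = len(City.objects.filter(name__iexact=city))

# .count() pushes the counting to the database (SELECT COUNT(*) ...), so
# no model instances are created at all.
num_cities = City.objects.filter(name__iexact=city).count()
If you only need to know whether any city matched at all, QuerySet.exists() is cheaper still.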

Peculiar QuerySet behaviour when doing an 'AND' query

I have some peculiar behaviour when making a query. When using an "AND" filter it takes some 20-30 seconds after the query has completed to render to the screen.
The following is a test function I have been using to try to isolate the problem.
def reports_question_detail(request, q_id):
    question = get_object_or_404(promo_models.Question, pk=q_id)

    import time
    start_time = time.time()
    question_answers = (promo_models.QuestionAnswers.objects
                        .filter(question=q_id, promotion__workflow_status=6)
                        .values('questionoption')
                        .annotate(Count('id')))
    print(time.time() - start_time)

    return HttpResponse(question_answers)
I have tried swapping the filter query out, checking the SQL generated and timing how long to execute.
filter(question=q_id)
filter(promotion__workflow_status=6)
filter(question=q_id, promotion__workflow_status=6)
I was expecting the third query to take a lot longer, but actually each of the three queries takes almost exactly the same time to run. However, after the execution has completed and the debug print shows the time, the third query takes another 20 seconds or so to render to the screen.
I then wondered if there was something wrong with the returned Queryset and tried ignoring the result by changing the following line:
HttpResponse("Finished!")
... which rendered immediately to the screen for all queries.
Finally, I wondered if there were any differences between the returned Queryset and tried doing a dump of all of the attributes. The Querysets from the first two queries dumped their attributes quickly to the console, but the third one stuttered, taking about 20-30 seconds for each line.
I'm kind of getting out of my depth now. Can anyone suggest how I investigate this further?
QuerySets are lazy. Calling filter does not actually make any calls to the database: those are only made when the queryset is iterated. So your time calls are only measuring the time taken to define a queryset object.
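To time the actual database work you have to force the queryset to evaluate inside the timed block, for example by wrapping it in list() (a sketch reusing the names from the view above):
start_time = time.time()
question_answers = list(
    promo_models.QuestionAnswers.objects
    .filter(question=q_id, promotion__workflow_status=6)
    .values('questionoption')
    .annotate(Count('id'))
)
print(time.time() - start_time)  # now includes executing the query and fetching the rows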

Django queries are 40 times slower than identical Postgres queries?

I am running Django 1.7 with Postgres 9.3, running with runserver. My database has about 200m rows in it or about 80GB of data. I'm trying to debug why the same queries are reasonably fast in Postgres, but slow in Django.
The data structure is like this:
class Chemical(models.Model):
    code = models.CharField(max_length=9, primary_key=True)
    name = models.CharField(max_length=200)

class Prescription(models.Model):
    chemical = models.ForeignKey(Chemical)
    # ... other fields
The database is set up with C collation and suitable indexes:
Table "public.frontend_prescription"
Column | Type | Modifiers
id | integer | not null default nextval('frontend_prescription_id_seq'::regclass)
chemical_id | character varying(9) | not null
Indexes:
"frontend_prescription_pkey" PRIMARY KEY, btree (id)
"frontend_prescription_a69d813a" btree (chemical_id)
"frontend_prescription_chemical_id_4619f68f65c49a8_like" btree (chemical_id varchar_pattern_ops)
This is my view:
def chemical(request, bnf_code):
    c = get_object_or_404(Chemical, bnf_code=bnf_code)
    num_prescriptions = Prescription.objects.filter(chemical=c).count()
    context = {
        'num_prescriptions': num_prescriptions
    }
    return render(request, 'chemical.html', context)
The bottleneck is the .count() call. The Django debug toolbar shows that the time taken on this is 2647ms (under the "Time" heading below), but the EXPLAIN section suggests the time taken should be 621ms (at the bottom):
Even stranger, if I run the same query directly in Postgres it seems to take only 200-300ms:
# explain analyze select count(*) from frontend_prescription where chemical_id='0212000AA';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=279495.79..279495.80 rows=1 width=0) (actual time=296.318..296.318 rows=1 loops=1)
-> Bitmap Heap Scan on frontend_prescription (cost=2104.44..279295.83 rows=79983 width=0) (actual time=162.872..276.439 rows=302389 loops=1)
Recheck Cond: ((chemical_id)::text = '0212000AA'::text)
-> Bitmap Index Scan on frontend_prescription_a69d813a (cost=0.00..2084.44 rows=79983 width=0) (actual time=126.235..126.235 rows=322252 loops=1)
Index Cond: ((chemical_id)::text = '0212000AA'::text)
Total runtime: 296.591 ms
So my question: in the debug toolbar, the EXPLAIN time differs from the actual time Django takes, and even the Django time is slower than running the same query directly in Postgres.
Why is there this discrepancy? And how should I debug this / improve the performance of my Django app?
UPDATE: Here's another random example: 350ms for EXPLAIN, more than 10,000ms to render! Help, this is making my Django app almost unusable.
UPDATE 2: Here's the Profiling panel for another slow (40 seconds in Django, 600ms in EXPLAIN...) query. If I'm reading it right, it suggests that each SQL call from my view is taking 13 seconds... is this the bottleneck?
What's odd is that the profiled calls are only slow for queries that return lots of results, so I don't think the delay is some Django connection overhead that applies to every call.
UPDATE 3: I tried rewriting the view in raw SQL and the performance is now better some of the time, although I'm still seeing slow queries about half the time. (I do have to create and re-create the cursor each time, otherwise I get InterfaceError and a message about the cursor being dead - not sure if this is useful for debugging. I've set CONN_MAX_AGE=1200.) Anyway, this performs OK, though obviously it's vulnerable to injection etc as written:
cursor = connection.cursor()
query = "SELECT * from frontend_chemical WHERE code='%s'" % code
c = cursor.execute(query)
c = cursor.fetchone()
cursor.close()
cursor = connection.cursor()
query = "SELECT count(*) FROM frontend_prescription WHERE chemical_id="
query += "'" + code + "';"
cursor.execute(query)
num_prescriptions = cursor.fetchone()[0]
cursor.close()
context = {
    'chemical': c,
    'num_prescriptions': num_prescriptions
}
return render(request, 'chemical.html', context)
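For reference, the same two queries can be written with query parameters so the database driver handles the quoting, which removes the injection risk noted above (a sketch using the same cursor pattern and table names):
from django.db import connection

cursor = connection.cursor()
cursor.execute("SELECT * FROM frontend_chemical WHERE code = %s", [code])
c = cursor.fetchone()
cursor.close()

cursor = connection.cursor()
cursor.execute(
    "SELECT count(*) FROM frontend_prescription WHERE chemical_id = %s", [code]
)
num_prescriptions = cursor.fetchone()[0]
cursor.close()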
Profiling code on your development machine is not reliable (as was revealed in the comments, all sorts of things are running on your desktop that might be interfering). Examining runtimes with django-debug-toolbar active is also not going to show you real-world performance. If you are interested in how this thing will perform in the wild, you have to run it on your intended infrastructure and measure it with a light touch:
def some_view(request):
    search = get_query_parameters(request)
    before = datetime.datetime.now()
    result = ComplexQuery.objects.filter(**search)
    print "ComplexQuery took", datetime.datetime.now() - before
    return render(request, "template.html", {'result': result})
Then you need to run this several times to warm up caches before you can do any sort of measuring. Results will vary wildly with setups. You could be using connection pooling that takes warming up, Postgres is quicker on subsequent queries of the same sort, and Django might also be set up to have some local cache, all of which need spin-up before you can say for sure that it's that query.
All the profiling tools report times without factoring in their own introspection slow-down, so you have to take a relative approach and use DDT (or my favorite for these problems: django-devserver) to identify hotspots in request handlers that consistently perform badly. One other tool worthy of note: linesman. It's a bit of a hassle to set up and maintain, but really useful.
I have been responsible for fairly large setups (DB size in tens of GB) and haven't seen a simple query like that run aground that badly. First find out if you really have a problem (that it's not just runserver ruining your day), then use the tools to find that hotspot, then optimize.
It is very likely that when Django runs the query, the data needs to be read from disk. But when you check why the query was slow, the data is already in memory due to the earlier query.
The easiest solutions are to buy more memory or a faster I/O system.

Quantity of database queries vs application memory performance using Django

If I need a total count of all objects in a queryset as well as a slice of field values from those objects, which option would be better considering speed and application memory use (I am using a PostgreSQL backend):
Option a:
def get_data():
    queryset = MyObject.objects.all()
    total_objects = queryset.count()
    thumbs = queryset[:5].values_list('thumbnail', flat=True)
    return {'total_objects': total_objects, 'thumbs': thumbs}
Option b:
def get_data():
    objects = list(MyObject.objects.all())
    total_objects = len(objects)
    thumbs = [o.thumbnail for o in objects[:5]]
    return {'total_objects': total_objects, 'thumbs': thumbs}
If I understand things correctly, and certainly correct me if I am wrong:
Option a: It will hit the database two times and will result in only total_objects = integer and thumbs = list of strings in memory.
Option b: It will hit the database one time and will result in a list of all objects and all their field data, plus the option a items, in memory.
Considering these options and that there are potentially millions of instances of MyObject: is the speed of two small database hits (option a) preferable to the memory consumption of a single database hit (option b)?
My priority is for overall speed in returning the data, but I am concerned about the larger memory consumption slowing things down even more than the extra database hit.
Using SQL is the fastest method and will always beat the Python equivalent, even if it hits the database more. The difference is negligible in comparison. Remember, that's what SQL is meant to do - be fast and efficient.
Anyway, running a thousand loops using timeit, these are the results:
In [8]: %timeit get_data1() # Using ORM
1000 loops, best of 3: 628 µs per loop
In [9]: %timeit get_data2() # Using python
1000 loops, best of 3: 1.54 ms per loop
As you can see, the first method takes 628 microseconds per loop, while the second one takes 1.54 milliseconds. That's almost 2.5 times as much! A clear winner.
I used an SQLite database with only 100 objects in it (I used autofixture to spam the models). I'm guessing PostgreSQL will return different results, but I am still in favor of the first one.
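For what it's worth, roughly the same measurement can be reproduced outside IPython with the standard timeit module (a sketch; get_data1 and get_data2 stand for options a and b above):
import timeit

# Total time for 1000 calls of each variant, comparable to the %timeit runs above.
print(timeit.timeit(get_data1, number=1000))
print(timeit.timeit(get_data2, number=1000))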

Deleting large Django QuerySets is causing an Apache Internal Server Error

I am trying to delete roughly 200,000 objects (which all have multiple related objects, totalling roughly 2,000,000 objects) using:
DataRecord.objects.filter(order=self.order).delete()
But I am getting an Internal Server Error (after about 20 minutes or so), and none of the objects are deleted. I have the Apache timeout set to 3600 (1 hour) to give enough time for this operation.
Is there a more efficient way to delete a very large number of objects in bulk?
It seems like the best solution is to use a raw query (see https://docs.djangoproject.com/en/dev/topics/db/sql/#executing-custom-sql-directly ), but note that the pre_delete and post_delete signals will not be fired.
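A sketch of that raw-SQL route, assuming a table named myapp_datarecord with an order_id column (both names are guesses based on the model in the question; adjust them to your schema, and remember that Django-level cascading and signals are bypassed):
from django.db import connection

with connection.cursor() as cursor:
    # Hypothetical table/column names derived from the DataRecord model above.
    cursor.execute("DELETE FROM myapp_datarecord WHERE order_id = %s", [self.order.pk])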
A random ORM idea: is the DataRecord.order column indexed?
edit:
Recognizing whether the column is indexed is easy: see if the field has the db_index option set, i.e.:
class DataRecord(models.Model):
    order = models.IntegerField(_("order"), db_index=True)
An index allows the database to find data fast, without reading the whole table. It's like an index in a book: when you want to find some word, the index helps you locate it without reading the whole book.
Find the number of objects that are going to be deleted and break the deletion down into batches of one thousand or so in a loop.
A simple example:
q = DataRecord.objects.filter(order=self.order)
cnt = q.count()
bucket = 1000

# Delete in batches of `bucket`. Always slice from the start of the queryset:
# each pass deletes the rows it fetched, so the remaining rows shift down.
# Per-object delete() keeps the pre_delete/post_delete signals firing.
for _ in range(0, cnt, bucket):
    for obj in q[:bucket]:
        obj.delete()