Peculiar QuerySet behaviour when doing an 'AND' query - django

I have some peculiar behaviour when making a query. When using an "AND" filter it takes some 20-30 seconds after the query has completed to render to the screen.
The following is a test function I have been using to try to isolate the problem.
def reports_question_detail(request, q_id):
    question = get_object_or_404(promo_models.Question, pk=q_id)
    import time
    start_time = time.time()
    question_answers = (promo_models.QuestionAnswers.objects
                        .filter(question=q_id, promotion__workflow_status=6)
                        .values('questionoption')
                        .annotate(Count('id')))
    print(time.time() - start_time)
    return HttpResponse(question_answers)
I have tried swapping the filter query out, checking the SQL generated and timing how long to execute.
filter(question=q_id)
filter(promotion__workflow_status=6)
filter(question=q_id, promotion__workflow_status=6)
I was expecting the third query to take a lot longer, but actually each of the three queries takes almost exactly the same time to run. However, after the debug print shows the time, the third query takes another 20 seconds or so to render to the screen.
I then wondered if there was something wrong with the returned Queryset and tried ignoring the result by changing the following line:
return HttpResponse("Finished!")
... which rendered immediately to the screen for all queries.
Finally, I wondered if there were any differences between the returned Queryset and tried doing a dump of all of the attributes. The Querysets from the first two queries dumped their attributes quickly to the console, but the third one stuttered, taking about 20-30 seconds for each line.
I'm kind of getting out of my depth now. Can anyone suggest how I investigate this further?

QuerySets are lazy. Calling filter does not actually hit the database: the query only runs when the queryset is iterated. So your time calls are only measuring the time taken to construct the queryset object, not to execute the query.
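The same pitfall can be demonstrated in plain Python with a generator: constructing the lazy object costs nothing, and the expensive work only happens on iteration. Forcing evaluation (e.g. wrapping the queryset in list()) before stopping the clock gives the real query time. A minimal sketch:

```python
import time

def lazy_query():
    # Generator: nothing below runs until someone iterates over it,
    # just as a QuerySet runs no SQL until it is evaluated.
    for i in range(3):
        time.sleep(0.05)           # stand-in for a slow database fetch
        yield i

start = time.time()
qs = lazy_query()                  # "defining the queryset": no work done yet
defined_in = time.time() - start

start = time.time()
rows = list(qs)                    # forcing evaluation, like list(queryset)
evaluated_in = time.time() - start

print(defined_in < evaluated_in)   # True: the cost is paid at iteration time
```

This is why the debug print showed identical times for all three filters: the 20-30 seconds were spent when HttpResponse iterated the queryset.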

Related

Function not producing the same results even when transactions are rolled back

I am testing a new feature that operates on models that need to be in a particular state, which is the product of a costly operation. Re-running that operation every time I want to test the feature is a waste of time, so I figured rolling back the transaction would do.
try:
    with transaction.atomic():
        m = MyModel.objects.get(id=some_id)
        for e in m.element_set.filter(some_filter):
            new_feature(m, e)
            log_state(m)
        raise DatabaseError
except DatabaseError:
    log_state(m)  # this looks like the state was restored
This block is being used straight in a view method, with no other context variables that may be messing around with the results.
The first time everything runs as expected. I would expect the exact same behavior in the following runs, since we are rolling back the changes. However the result is completely different.
What could possibly be the reason for this? If we are reverting the database to the very same state it had before running the function for the first time, how are we getting a different result?

Django query using large amount of memory

I have a query that is causing memory spikes in my application. The below code is designed to show a single record, but occasionally shows 5 to 10 records. The problem is there are edge cases where the lookup matches 100,000 records and falls into the MultipleObjectsReturned branch. I believe this causes the high memory usage. The code is:
try:
    record = record_class.objects.get(**filter_params)
    context["record"] = record
except record_class.MultipleObjectsReturned:
    records = record_class.objects.filter(**filter_params)
    template_path = "record/%(type)s/%(type)s_multiple.html" % {"type": record_type}
    return render(request, template_path, {"records": records}, current_app=record_type)
I thought about adding a slice at the end of the filter query, so it looks like this:
records = record_class.objects.filter(**filter_params)[:20]
But the code still seems slow. Is there a way to limit the results to 20 in a way that does not load the entire query or cause high memory usage?
As the Django documentation says:
Use a subset of Python’s array-slicing syntax to limit your QuerySet to a certain number of results. This is the equivalent of SQL’s LIMIT and OFFSET clauses.
For example, this returns the first 5 objects (LIMIT 5):
Entry.objects.all()[:5]
So it seems that "limiting the results to 20 in a way that does not load the entire query" is already being fulfilled. Your code must be slow for some other reason, or perhaps you are measuring the time in the wrong way.
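A plain-Python analogy of why the slice does not load everything: like queryset[:20], which the ORM compiles into a LIMIT 20 clause, a lazy slice only ever pulls the rows it needs (the row counts here are illustrative):

```python
from itertools import islice

fetched = []

def rows():
    # Stand-in for the database cursor: yields rows one at a time and
    # records how many were actually pulled.
    for i in range(100_000):
        fetched.append(i)
        yield i

# Like record_class.objects.filter(**filter_params)[:20], which becomes
# "... LIMIT 20" in SQL: only the first 20 rows are ever fetched.
first_20 = list(islice(rows(), 20))

print(len(first_20), len(fetched))  # 20 20
```

If memory still spikes, the cost is probably being paid somewhere before the slice is applied, not by the sliced query itself.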

Simple query working for years, then suddenly very slow

I've had a query that has been running fine for about 2 years. The database table has about 50 million rows, and is growing slowly. This last week one of my queries went from returning almost instantly to taking hours to run.
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).latest('id')
I have narrowed the slow query down to the Rank model. It seems to have something to do with using the latest() method. If I just ask for a queryset, it returns an empty queryset right away.
# count() returns 0 and is fast
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).count() == 0
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)) == []  # also very fast
Here are the results of running EXPLAIN. http://explain.depesz.com/s/wPh
And EXPLAIN ANALYZE: http://explain.depesz.com/s/ggi
I tried vacuuming the table, no change. There is already an index on the "site" field (ForeignKey).
Strangely, if I run this same query for another client that already has Rank objects associated with her account, then the query returns very quickly once again. So it seems that this is only a problem when there are no Rank objects for that client.
Any ideas?
Versions:
Postgres 9.1,
Django 1.4 svn trunk rev 17047
Well, you've not shown the actual SQL, so that makes it difficult to be sure. But, the explain output suggests it thinks the quickest way to find a match is by scanning an index on "id" backwards until it finds the client in question.
Since you said it has been fast until recently, this is probably not a silly choice. However, there is always the chance that a particular client's record will be right at the far end of this search.
So - try two things first:
Run an analyze on the table in question, see if that gives the planner enough info.
If not, increase the stats (ALTER TABLE ... SET STATISTICS) on the columns in question and re-analyze. See if that does it.
http://www.postgresql.org/docs/9.1/static/planner-stats.html
If that's still not helping, then consider an index on (client,id), and drop the index on id (if not needed elsewhere). That should give you lightning fast answers.
latest() is normally used for date comparison; maybe you should try ordering by id descending and limiting to one.
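At the SQL level, latest('id') is essentially ORDER BY id DESC LIMIT 1, and the composite (site, id) index suggested above lets the planner satisfy both the filter and the ordering at once. A minimal sketch using sqlite purely for illustration (the table, columns and values are invented):

```python
import sqlite3

# latest('id') boils down to "ORDER BY id DESC LIMIT 1"; a composite index
# on (site_id, id) answers it without walking the whole id index backwards.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rank (id INTEGER PRIMARY KEY, site_id INTEGER)")
conn.execute("CREATE INDEX rank_site_id_idx ON rank (site_id, id)")
conn.executemany("INSERT INTO rank (site_id) VALUES (?)",
                 [(1,), (1,), (1,), (2,), (2,)])

row = conn.execute(
    "SELECT id FROM rank WHERE site_id = ? ORDER BY id DESC LIMIT 1", (1,)
).fetchone()
print(row)  # (3,)
```

When no row matches, the query can return immediately instead of scanning the id index to its far end, which is the pathological case described in the question.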

django pagination not working?

It seems that pagination in Django 1.2.3 is not working. I'm simply trying to run a query, split the results into pages of 200 objects each, and do something with the results from each page. I probably don't need to use the Paginator for this, but I thought it was nice and convenient. However, it seems to be giving random results: some of the objects appear on multiple pages and some do not appear on any page.

I guess that behind the scenes it is running multiple database queries, and because I don't have an order_by statement the results are coming back in a different order each time? I'm not sure why my database (Postgres) would be returning rows in a different order each time (incidentally, the data in the database is not changing). If I add an order_by to the query it seems to fix the problem. If I run this on a test database built using pg_dump/pg_restore I don't see the problem (I guess the test database happens to return data in a consistent order). Incidentally, there is only one "EQIX" row in the database.
secs = Security.objects.filter(current=True)
print 'test1'
p = Paginator(secs, 200)
for pagenumber in p.page_range:
    page = p.page(pagenumber)
    for i, sec in enumerate(page.object_list):
        if sec.ibsymbol == 'EQIX':
            print 'EQIX'
print 'test2'
for sec in secs:
    if sec.ibsymbol == 'EQIX':
        print 'EQIX'
trial run #1 output
test1
test2
EQIX
trial run #2 output
test1
EQIX
EQIX
EQIX
test2
EQIX
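The duplicate/missing behaviour can be reproduced without a database by simulating a query whose row order changes between page fetches; here each "query" returns the same rows rotated by one position, a stand-in for a table read without ORDER BY:

```python
data = list(range(1000))

def fetch_page(page_size, page_number):
    # Each "query" returns the same rows in a different order (rotated by
    # page_number) -- which a database is free to do without an ORDER BY.
    rows = data[page_number:] + data[:page_number]
    start = page_number * page_size
    return rows[start:start + page_size]

seen = []
for page_number in range(5):          # 5 pages x 200 = one pass over the "table"
    seen.extend(fetch_page(200, page_number))

# 1000 rows fetched in total, but only 996 distinct: rows 0-3 show up on
# two pages, while rows 200, 401, 602 and 803 never appear at all.
print(len(seen), len(set(seen)))      # 1000 996
```

Adding an order_by to the queryset, as the question already discovered, makes every page query return rows in the same order, so the offsets line up and each row lands on exactly one page.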

How to optimize use of querysets with lists

I have a model that has a couple million objects. Each object represents a call made/received by a company.
To simplify things, let's say this model, Call, has these fields:
calldate, context, channel.
My goal is to know the average # of calls made and received during each hour of the day of the month (load by hour). The catch is: I need to find this for port1 and port2 separately.
As of now, my code works fine, except that it takes around one whole minute to give me the result for a range of 4 months, and it seems extremely inefficient.
I've done some simple profiling and discovered that the extend is taking around 99% of the processing time:
queryset = Call.objects.filter(calldate__gte='SOME_DATE')
port1, port2 = [], []
port1.extend(queryset.filter(context__icontains="e1-1"))
port2.extend(queryset.filter(context__icontains="e1-2"))
channels_in_port1 = ["Port/%d-2" % x for x in range(1, 32)]
channels_in_port2 = ["Port/%d-2" % x for x in range(32, 63)]
for i in channels_in_port1:
    port1.extend(queryset.filter(channel__icontains=i))
for i in channels_in_port2:
    port2.extend(queryset.filter(channel__icontains=i))
port1 and port2 have around 150k objects combined now.
As soon as I have all calls for port1 and port2, I'm good to go. The rest of the code is basically some for loops for port1 and port2 that sums up and takes the average of calls according to the hour/day/month. Trivial stuff.
I tried to avoid using any "extend" by using itertools.chain and chaining the querysets instead. However, that made the processing time shift to the part where I do the trivial for loops to calculate the load by hour.
Any alternatives? Better ways to filter the queryset?
Thanks very much!!
Have you considered using Django's aggregate functions? http://docs.djangoproject.com/en/dev/topics/db/aggregation/
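In raw SQL, the aggregation approach boils down to a GROUP BY on the hour, which the database computes far faster than Python-side loops over 150k objects. A minimal sqlite sketch (the table and sample rows are invented for illustration; Django's annotate/Count would generate equivalent SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE call (calldate TEXT, context TEXT)")
conn.executemany("INSERT INTO call VALUES (?, ?)", [
    ("2014-01-01 09:15:00", "e1-1"),
    ("2014-01-01 09:45:00", "e1-1"),
    ("2014-01-01 10:05:00", "e1-2"),
])

# Count calls per hour for one port: the grouping happens in the database,
# so only the per-hour totals cross the wire, not every Call row.
rows = conn.execute(
    "SELECT strftime('%H', calldate) AS hour, COUNT(*) "
    "FROM call WHERE context LIKE '%e1-1%' GROUP BY hour ORDER BY hour"
).fetchall()
print(rows)  # [('09', 2)]
```

The "trivial for loops" summing calls per hour/day/month then disappear entirely: the query returns the load-by-hour table directly.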
I presume your problem is with the second set of extends, i.e. those within the for loops, rather than the first. (The first is completely unnecessary, in any case: rather than defining an empty list up front and extending it, you can just do port1 = list(queryset.filter(context__icontains="e1-1")).)
Anyway, to summarize what I think you are trying to do: you want to get all Call objects for a certain date, in two blocks depending on the value of channel: one with channel numbers from 1 to 31, and one with numbers from 32 to 62.
It seems like you could do this with just two queries, without any extending at all:
port1 = queryset.filter(channel__range=["Port/1-2", "Port/31-2"])
port2 = queryset.filter(channel__range=["Port/32-2", "Port/62-2"])
Does that not do what you want?
Edit in response to comment: but that's then just two queries, which you can extend or concatenate. The problem with your code as posted is that you are doing 31 queries and extend operations for each port, which is bound to be expensive. If you just do one query each, plus one extend/concat, that will be much cheaper.
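One caveat worth checking before adopting the range approach: channel is a text column, and a range lookup on strings compares lexicographically (it becomes a SQL BETWEEN on text), so numeric channel names may not fall inside the range you expect. A quick plain-Python check, which mirrors how the database would compare the strings:

```python
# String comparison is character-by-character, not numeric, so names like
# "Port/10-2" do not sort where the channel numbers suggest:
print("Port/10-2" < "Port/2-2")    # True  -- "1" sorts before "2"
print("Port/4-2" < "Port/31-2")    # False -- even though 4 < 31
```

If the channel numbers matter numerically, it may be safer to filter with channel__in=channels_in_port1 (one query per port, using the lists already built above) rather than a string range.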