Django pagination not working?

It seems that pagination in Django 1.2.3 is not working. I'm simply trying to run a query, split the results into pages of 200 objects each, and do something with the results from each page. I probably don't need the Paginator for this, but I thought it was convenient. However, it seems to give random results: some objects appear on multiple pages and some don't appear on any page.

My guess is that behind the scenes the Paginator runs a separate database query for each page, and because I don't have an order_by clause the results come back in a different order each time. I'm not sure why my database (Postgres) would return rows in a different order on each query; the data in the database is not changing. If I add an order_by to the query, the problem goes away. If I run this against a test database built with pg_dump/pg_restore, the problem doesn't occur (I assume that database happens to return rows in a consistent order). Incidentally, there is only one "EQIX" row in the database.
from django.core.paginator import Paginator

secs = Security.objects.filter(current=True)

print 'test1'
# walk the queryset page by page via the Paginator
p = Paginator(secs, 200)
for pagenumber in p.page_range:
    page = p.page(pagenumber)
    for i, sec in enumerate(page.object_list):
        if sec.ibsymbol == 'EQIX':
            print 'EQIX'

print 'test2'
# walk the same queryset directly, without pagination
for sec in secs:
    if sec.ibsymbol == 'EQIX':
        print 'EQIX'
Trial run #1 output:
test1
test2
EQIX

Trial run #2 output:
test1
EQIX
EQIX
EQIX
test2
EQIX
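For what it's worth, the behaviour described above is expected: the Paginator issues a separate LIMIT/OFFSET query for every page, and without an ORDER BY clause PostgreSQL makes no guarantee that rows come back in the same order from one query to the next, so rows can shuffle between pages. A minimal sketch of the fix the question already hints at, ordering on the primary key (any unique, stable column will do):

# ordering on a unique column makes every page query see the same row sequence
secs = Security.objects.filter(current=True).order_by('pk')

With a deterministic order, each row lands on exactly one page.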

Related

Django ORM: Why is exclude() so slow and how to optimize it?

I have the following 3 queries in my CBV:
filtered_content = Article.objects.filter_articles(search_term)
filtered_articles = filtered_content.exclude(source__website=TWITTER)
filtered_tweets = filtered_content.filter(source__website=TWITTER)
Short explanation:
I'm querying my database (PostgreSQL) for all article titles that contain the search term. After that, I separate the results into one variable that contains all articles originating from Twitter and the other variable contains all articles originating from all other websites.
I have two questions about optimizing these queries.
Question 1: The average times for these queries don't make sense to me: filtered_content takes less than 0.001 seconds, filtered_articles takes 0.2 seconds, and filtered_tweets takes 0.04 seconds.
What is the reason for the exclude() statement (filtered_articles) being so slow?
I also tried doing the query in another way, but this was even slower:
filtered_content = Article.objects.filter_articles(search_term)
filtered_tweets = filtered_content.filter(source__website=TWITTER)
filtered_content.exclude(article_id__in=[tweet.article_id for tweet in filtered_tweets])
Question 2: Is there a more elegant way to solve this problem / is there a way to do it in less than 3 separate queries? More specifically, using the Django ORM, is there a way to do a query where all excluded() objects are stored in one variable while all non-excluded objects are stored in another?
I don't know why it is slower; maybe you need to inspect the SQL that exclude() generates. On the other hand, you can try a Q object:
from django.db.models import Q

# call the custom manager method first, then chain the negated filter onto the queryset
Article.objects.filter_articles(search_term).filter(~Q(source__website=TWITTER))
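For Question 2, one option is to drop the second and third queries altogether: fetch the matching rows once and partition them in Python. This is only a sketch, assuming (as the question does) that source is a foreign key with a website field and that the result set is small enough to materialize:

# one database query; split the rows in Python instead of via exclude()/filter()
filtered_content = Article.objects.filter_articles(search_term).select_related('source')
filtered_tweets, filtered_articles = [], []
for article in filtered_content:
    if article.source.website == TWITTER:
        filtered_tweets.append(article)
    else:
        filtered_articles.append(article)

The trade-off is that both results are plain lists rather than querysets, so you can no longer chain further queryset methods onto them.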
Can you share your .filter_articles() method?

How to load a saved search with huge data in an MR script? (NetSuite)

We have a transactional saved search with millions of lines. The saved search fails to load in the UI; is there any way to load such saved searches in a map/reduce script?
I tried using pagination but it still shows an error (ABORT_SEARCH_EXCEEDED_MAX_TIME).
NetSuite may still time out depending on the complexity of the search, but you do not have to run the search yourself in order to send the results to the map stage:
function getInputData(ctx){
    return search.load({id: 'mysearchid'});
}

function map(ctx){
    var ref = JSON.parse(ctx.value);
    const tranRec = record.load({type: ref.recordType, id: ref.id});
    log.debug({
        title: 'map stage with ' + ref.values.tranid, // presumes Document Number was a result column
        details: ctx.value // have a look at the serialized form
    });
}
Instead of getting all rows, it may even be better to fetch only the first N rows (100K or fewer) per MR execution, save the internal id of the last processed row, and start from the next internal id in the next MR script execution.

Peculiar QuerySet behaviour when doing an 'AND' query

I'm seeing some peculiar behaviour when making a query. When using an "AND" filter, it takes some 20-30 seconds after the query has completed for the result to render to the screen.
The following is a test function I have been using to try to isolate the problem.
from django.db.models import Count
from django.http import HttpResponse
from django.shortcuts import get_object_or_404

def reports_question_detail(request, q_id):
    question = get_object_or_404(promo_models.Question, pk=q_id)

    import time
    start_time = time.time()
    question_answers = (promo_models.QuestionAnswers.objects
                        .filter(question=q_id, promotion__workflow_status=6)
                        .values('questionoption')
                        .annotate(Count('id')))
    print (time.time() - start_time)
    return HttpResponse(question_answers)
I have tried swapping the filter query out, checking the generated SQL, and timing how long each takes to execute:
filter(question=q_id)
filter(promotion__workflow_status=6)
filter(question=q_id, promotion__workflow_status=6)
I was expecting the third query to take a lot longer, but each of the three queries takes almost exactly the same time to run. However, after the execution has completed and the debug print shows the time, the third query takes another 20 seconds or so to render to the screen.
I then wondered if there was something wrong with the returned Queryset and tried ignoring the result by changing the following line:
HttpResponse("Finished!")
... which rendered immediately to the screen for all queries.
Finally, I wondered if there were any differences between the returned Queryset and tried doing a dump of all of the attributes. The Querysets from the first two queries dumped their attributes quickly to the console, but the third one stuttered, taking about 20-30 seconds for each line.
I'm kind of getting out of my depth now. Can anyone suggest how I investigate this further?
QuerySets are lazy. Calling filter() does not actually hit the database: the query only runs when the queryset is evaluated, which here happens when HttpResponse iterates question_answers. So your time calls are only measuring the time taken to construct a queryset object, and the 20-30 seconds you see afterwards is the query itself executing.
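To time the query itself rather than queryset construction, force evaluation inside the timed block, for example with list(). A minimal sketch reusing the names from the question:

import time
from django.db.models import Count

start_time = time.time()
question_answers = list(  # list() forces the queryset to execute now
    promo_models.QuestionAnswers.objects
    .filter(question=q_id, promotion__workflow_status=6)
    .values('questionoption')
    .annotate(Count('id'))
)
print (time.time() - start_time)  # now includes the actual database round-trip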

Python Elasticsearch query not returning the expected results when running subsequent calls

I am querying an Elasticsearch index from Python. Issue 1 is that when I change my query and rerun it, my objects in Python don't get refreshed to match the modified query. Issue 2 is that even when I see that I got some hits, no data comes through at all (e.g. I see I've got 85k hits, but when I put them in a dictionary, it is blank).
from elasticsearch import Elasticsearch

es = Elasticsearch("host:port", timeout=600, max_retries=10, revival_delay=0)
origall = es.search('esdata', 'primary', {
    "query": {
        "bool": {
            "must_not": [
                {"term": {"file": "original"}}
            ]
        }
    },
    "size": "0"
})
total_o = origall['hits']['total']
At this stage total_o is 110k, which is correct. Then I rerun the query after changing the size from 0 to 20, but when I try to look at those 20 hits, I get nothing for this:
orig = origall['hits']['hits']
print(orig)
Then I go back to my original query and change the must_not to must. That should give 85k hits, but after rerunning it I still get 110k in total_o.
It is quite random when it works and when it doesn't. Sometimes I get my expected 85k hits, but then it gets stuck, and when I change the query back to the one that should return 110k, it still reports 85k. Sometimes I do get data in orig = origall['hits']['hits'], but then if I change the size in my query back to 0 and rerun it, origall['hits']['hits'] will still give me back the data.
I use Anaconda, but I also tried PyCharm and the default Python IDLE; they behave the same. I tried creating separate ES connections for each of my queries; that doesn't help. I played around with caching, but no luck.
It turned out that this was due to firewall caching settings. I'm not sure what the setting was originally or what we had to change it to, but that is the area worth experimenting with; it is what fixed it for me.
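If a cache between the client and the cluster is suspected, one way to narrow it down is to ask Elasticsearch itself to skip its request cache, so that any remaining staleness must come from the network path. This is only a sketch and assumes Elasticsearch 2.x+ with a matching elasticsearch-py client (earlier versions don't have the request_cache search parameter); body here stands for the query dict shown above:

# bypass the shard request cache so each search is recomputed;
# if results are still stale, the caching happens outside the cluster
resp = es.search(index='esdata', doc_type='primary', body=body, request_cache=False)
print(resp['hits']['total'])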

Simple query working for years, then suddenly very slow

I've had a query that has been running fine for about 2 years. The database table has about 50 million rows, and is growing slowly. This last week one of my queries went from returning almost instantly to taking hours to run.
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).latest('id')
I have narrowed the slow query down to the Rank model. It seems to have something to do with using the latest() method. If I just ask for a queryset, it returns an empty queryset right away.
#count returns 0 and is fast
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).count() == 0
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)) == [] #also very fast
Here are the results of running EXPLAIN. http://explain.depesz.com/s/wPh
And EXPLAIN ANALYZE: http://explain.depesz.com/s/ggi
I tried vacuuming the table, no change. There is already an index on the "site" field (ForeignKey).
Strangely, if I run this same query for another client that already has Rank objects associated with her account, the query returns very quickly once again. So it seems this is only a problem when there are no Rank objects for that client.
Any ideas?
Versions:
Postgres 9.1,
Django 1.4 svn trunk rev 17047
Well, you've not shown the actual SQL, so that makes it difficult to be sure. But the EXPLAIN output suggests the planner thinks the quickest way to find a match is to scan an index on "id" backwards until it finds the client in question.
Since you said it has been fast until recently, this is probably not a silly choice. However, there is always the chance that a particular client's record will be right at the far end of this search.
So, try two things first:
1. Run ANALYZE on the table in question and see if that gives the planner enough information.
2. If not, increase the statistics target (ALTER TABLE ... SET STATISTICS) on the columns in question and re-analyze. See if that does it. (A Django-side sketch of both steps follows the link below.)
http://www.postgresql.org/docs/9.1/static/planner-stats.html
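If you would rather stay inside Django, both steps can be run through a raw cursor. A sketch only: the table and column names (ranks_rank, site_id) are guesses based on the model names in the question, so substitute your real ones, and the statistics target of 1000 is just an example value:

from django.db import connection

cursor = connection.cursor()
# raise the per-column statistics target, then rebuild the planner statistics
cursor.execute("ALTER TABLE ranks_rank ALTER COLUMN site_id SET STATISTICS 1000")
cursor.execute("ANALYZE ranks_rank")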
If that's still not helping, then consider an index on (client,id), and drop the index on id (if not needed elsewhere). That should give you lightning fast answers.
latest() is normally used for date comparisons; maybe you should try ordering by id descending and then limiting to one result, as in the sketch below.
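A hedged sketch of that suggestion, reusing the model names from the question (note that Django 1.4 predates QuerySet.first(), so slice and catch IndexError instead):

site = Site.objects.get(profile__client=client, profile__is_active=False)
try:
    # explicit descending order plus LIMIT 1, instead of .latest('id')
    rank = Rank.objects.filter(site=site).order_by('-id')[0]
except IndexError:
    rank = None  # this client has no Rank rows yet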