Function not producing the same results even when transactions are rolled back - django

I am testing a new feature that runs on certain models. Those models need to be in a particular state that is the product of a costly operation, so running that operation every time I want to test the feature is a waste of time. I figured rolling back the transaction would do.
try:
    with transaction.atomic():
        m = MyModel.objects.get(id=some_id)
        for e in m.element_set.filter(some_filter):
            new_feature(m, e)
        log_state(m)
        raise DatabaseError
except DatabaseError:
    log_state(m)  # this looks like the state was restored
This block is used directly in a view method, with no other context variables that might be interfering with the results.
The first time, everything runs as expected. I would expect exactly the same behavior on subsequent runs, since we are rolling back the changes; however, the result is completely different.
What could the reason for this be? If we are reverting the database to the very same state it was in before the function ran the first time, how can we get a different result?
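One known Django behavior worth ruling out (offered as a possibility, not a diagnosis): `transaction.atomic()` rolls back the database, but not Python objects already loaded into memory, so the `m` passed to `log_state()` in the `except` block can still carry attributes mutated by `new_feature()`. A minimal sketch of that pitfall, with sqlite3 standing in for the ORM:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE m (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO m VALUES (1, 'initial')")
conn.commit()

# Load the row into a plain Python variable, like a model instance.
m_state = conn.execute("SELECT state FROM m WHERE id = 1").fetchone()[0]

# Mutate both the in-memory copy and the database row, then roll back.
m_state = "mutated"
conn.execute("UPDATE m SET state = 'mutated' WHERE id = 1")
conn.rollback()

# The database row is back to 'initial', but the Python variable
# still holds 'mutated': rollback restored the DB, not the object.
db_state = conn.execute("SELECT state FROM m WHERE id = 1").fetchone()[0]
```

If that turns out to be the cause, re-fetching the instance after the rollback (e.g. `MyModel.objects.get(id=some_id)` again) would show the restored state.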

Related

Test request on postman to evaluate performance

I was given a job using the Postman application, which I had never used before today. I have to write queries that are repeated in a loop, so that I get results over different periods of time to check performance (basically, tests). I want to know how to write a test query that checks results in a loop based on duration, date, and size.

First-run of queries are extremely slow

Our Redshift queries are extremely slow during their first execution. Subsequent executions are much faster (e.g., 45 seconds -> 2 seconds). After investigating this problem, the query compilation appears to be the culprit. This is a known issue and is even referenced on the AWS Query Planning And Execution Workflow and Factors Affecting Query Performance pages. Amazon itself is quite tight-lipped about how the query cache works (tl;dr it's a magic black box that you shouldn't worry about).
One of the things that we tried was increasing the number of nodes we had, however we didn't expect it to solve anything seeing as how query compilation is a single-node operation anyway. It did not solve anything but it was a fun diversion for a bit.
As noted, this is a known issue, however, anywhere it is discussed online, the only takeaway is either "this is just something you have to live with using Redshift" or "here's a super kludgy workaround that only works part of the time because we don't know how the query cache works".
Is there anything we can do to speed up the compilation process or otherwise deal with this? So far the best solution that's been found is "pre-run every query you might expect to run in a given day on a schedule", which is... not great, especially given how little we know about how the query cache works.
There are three things to consider:

1. The first run of any query causes it to be "compiled" by Redshift. This can take 2-20 seconds depending on how big the query is. Subsequent executions of the same query use the same compiled code; even if the WHERE clause parameters change, there is no re-compile.

2. Data is marked as "hot" once a query has been run against it, and is cached in Redshift memory. You cannot (reliably) clear this manually in any way except by restarting the cluster.

3. Redshift has a result cache, enabled by default. Depending on your Redshift parameters, it will quickly return the same result for the exact same query if the underlying data has not changed. If your query includes current_timestamp or similar, that will stop it from being cached. The cache can be turned off with SET enable_result_cache_for_session TO OFF;.

Considering your issue, you may need to run some example queries to pre-compile them, or redesign your queries (I guess you have some dynamic query building going on that changes the shape of the query a lot).

In my experience, more nodes will increase the compile time. This process happens on the leader node, not the data nodes, and is made more complex by having more data nodes to consider.
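The "pre-run every query shape" workaround from the question can at least be automated. A sketch of a warm-up job (`warm_compile_cache` and the query list are made up for illustration; `cursor` would be any DB-API cursor, e.g. from psycopg2, pointed at the cluster):

```python
def warm_compile_cache(cursor, queries):
    """Run each query shape once so Redshift compiles it before real
    traffic arrives. Failures are collected instead of raised, so one
    bad query does not abort the rest of the warm-up."""
    failures = []
    for sql in queries:
        try:
            cursor.execute(sql)
            cursor.fetchall()
        except Exception as exc:
            failures.append((sql, exc))
    return failures
```

A scheduler (cron, or a periodic task) would run this once a day with the distinct query shapes you expect, since compilation is keyed on the query shape rather than the WHERE-clause parameters.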
The query is probably not actually running a second time -- rather, Redshift is just returning the same result for the same query.
This can be tested by turning off the cache. Run this command:
SET enable_result_cache_for_session TO OFF;
Then, run the query twice. It should take the same time for each execution.
The result cache is great for repeated queries. Rather than being disappointed that the first execution is 'slow', be happy that subsequent cached queries are 'fast'!
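To put numbers on that test, time the same statement with the result cache disabled. A small driver-agnostic helper (a sketch assuming a DB-API cursor such as psycopg2's; `time_query` is a made-up name):

```python
import time

def time_query(cursor, sql, use_result_cache=True):
    """Execute `sql` once and return (elapsed_seconds, rows).

    With use_result_cache=False, Redshift's result cache is disabled
    for the session first, so a repeat run shows the real execution
    time instead of a cache hit."""
    if not use_result_cache:
        cursor.execute("SET enable_result_cache_for_session TO OFF;")
    start = time.perf_counter()
    cursor.execute(sql)
    rows = cursor.fetchall()
    return time.perf_counter() - start, rows
```

Calling it twice on the same SQL with `use_result_cache=False` should show roughly equal timings once compilation has happened, matching the answer's point.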

Peculiar QuerySet behaviour when doing an 'AND' query

I have some peculiar behaviour when making a query. When using an "AND" filter it takes some 20-30 seconds after the query has completed to render to the screen.
The following is a test function I have been using to try to isolate the problem.
def reports_question_detail(request, q_id):
    question = get_object_or_404(promo_models.Question, pk=q_id)

    import time
    start_time = time.time()
    question_answers = (promo_models.QuestionAnswers.objects
                        .filter(question=q_id, promotion__workflow_status=6)
                        .values('questionoption')
                        .annotate(Count('id')))
    print(time.time() - start_time)

    return HttpResponse(question_answers)
I have tried swapping the filter query out, checking the SQL generated and timing how long to execute.
filter(question=q_id)
filter(promotion__workflow_status=6)
filter(question=q_id, promotion__workflow_status=6)
I was expecting the third query to take a lot longer, but in fact each of the three queries takes almost exactly the same time to run. However, after execution has completed and the debug print shows the time, the third query takes another 20 seconds or so to render to the screen.
I then wondered if there was something wrong with the returned QuerySet and tried ignoring the result by changing the last line to:
return HttpResponse("Finished!")
... which rendered immediately to the screen for all queries.
Finally, I wondered if there were any differences between the returned Queryset and tried doing a dump of all of the attributes. The Querysets from the first two queries dumped their attributes quickly to the console, but the third one stuttered, taking about 20-30 seconds for each line.
I'm kind of getting out of my depth now. Can anyone suggest how I investigate this further?
QuerySets are lazy. Calling filter() does not actually make any calls to the database: those are only made when the queryset is iterated. So your timing calls are only measuring the time taken to define a queryset object, and the real query runs later, when HttpResponse iterates the queryset.
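The same effect can be reproduced without Django at all. A toy stand-in (the `LazyQuery` class and the sleep are illustrative, not ORM API) shows that building the "query" is instant and the cost only appears on iteration:

```python
import time

class LazyQuery:
    """Toy stand-in for a Django QuerySet: constructing it is free,
    the (simulated) database work only happens on iteration."""
    def __init__(self, cost_seconds):
        self.cost = cost_seconds

    def __iter__(self):
        time.sleep(self.cost)          # pretend this is the DB round trip
        return iter([1, 2, 3])

t0 = time.perf_counter()
qs = LazyQuery(0.2)                    # like .filter(): nothing runs yet
build_time = time.perf_counter() - t0

t0 = time.perf_counter()
rows = list(qs)                        # iteration triggers the work
eval_time = time.perf_counter() - t0
```

So to time the real query in the view above, wrap the evaluation (e.g. `list(question_answers)`) rather than the `.filter()` call.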

Designing a timer functionality in VC++

I was implementing some functionality in which I receive a set of queries for a database. We shouldn't lose a query for a certain time, let's say 5 minutes, unless the query has executed fine (this is in case the DB is down, so we don't lose the query). What I was thinking of doing is setting a sort of timer for each query in a separate thread and waiting on it for that time frame; at the end, if the query still exists, remove it from the queue. But I am not happy with this solution, as I would have to create as many threads as there are queries. Is there a better way to design this (the environment is VC++)? If the question is unclear, please let me know and I will try to frame it better.
One thread is enough: it can check, say, every 10 seconds whether that queue of yours contains queries whose due time has been reached and that should therefore be aborted / rolled back.
Queues usually grow from one end and are erased from the other, so you only have to check whether the entry at the end holding the oldest items has reached its due time.
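As a sketch of that design (in Python only to show the shape of the algorithm, since the target environment is VC++): one polling thread wakes periodically, and because the FIFO keeps its oldest entries at the front, it only ever inspects the front.

```python
import time
from collections import deque

def purge_expired(queue, ttl_seconds, now=None):
    """Drop entries older than `ttl_seconds` from a FIFO deque of
    (enqueued_at, query) pairs. The oldest entries sit at the left
    end, so the scan stops at the first still-fresh entry."""
    if now is None:
        now = time.monotonic()
    dropped = []
    while queue and now - queue[0][0] > ttl_seconds:
        dropped.append(queue.popleft())
    return dropped

# A single polling thread would call purge_expired(queue, 300)
# roughly every 10 seconds and roll back whatever it returns.
```

The same idea in VC++ would be a `std::deque` guarded by a mutex, with one worker thread sleeping between scans.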

Simple query working for years, then suddenly very slow

I've had a query that has been running fine for about 2 years. The database table has about 50 million rows, and is growing slowly. This last week one of my queries went from returning almost instantly to taking hours to run.
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).latest('id')
I have narrowed the slow query down to the Rank model. It seems to have something to do with using the latest() method. If I just ask for a queryset, it returns an empty queryset right away.
# count() returns 0 and is fast
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).count() == 0

# also very fast
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)) == []
Here are the results of running EXPLAIN. http://explain.depesz.com/s/wPh
And EXPLAIN ANALYZE: http://explain.depesz.com/s/ggi
I tried vacuuming the table, no change. There is already an index on the "site" field (ForeignKey).
Strangely, if I run this same query for another client that already has Rank objects associated with her account, the query once again returns very quickly. So it seems this is only a problem when there are no Rank objects for that client.
Any ideas?
Versions:
Postgres 9.1,
Django 1.4 svn trunk rev 17047
Well, you've not shown the actual SQL, so that makes it difficult to be sure. But, the explain output suggests it thinks the quickest way to find a match is by scanning an index on "id" backwards until it finds the client in question.
Since you said it has been fast until recently, this is probably not a silly choice. However, there is always the chance that a particular client's record will be right at the far end of this search.
So - try two things first:
Run an analyze on the table in question, see if that gives the planner enough info.
If not, increase the stats (ALTER TABLE ... SET STATISTICS) on the columns in question and re-analyze. See if that does it.
http://www.postgresql.org/docs/9.1/static/planner-stats.html
If that's still not helping, then consider an index on (client,id), and drop the index on id (if not needed elsewhere). That should give you lightning fast answers.
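Steps 1 and 2 above are plain SQL and can be scripted. A sketch (the table and column names below are placeholders for whatever Django generated for the Rank model; `cursor` is a psycopg2-style DB-API cursor; identifiers are interpolated directly, so only trusted values should be passed):

```python
def refresh_planner_stats(cursor, table, column, target=1000):
    """Raise the statistics target for one column, then re-ANALYZE
    the table so the PostgreSQL planner sees finer-grained
    distribution info. Identifiers are assumed trusted (no user
    input), since DDL cannot take bound parameters."""
    cursor.execute(
        "ALTER TABLE {0} ALTER COLUMN {1} SET STATISTICS {2}".format(
            table, column, int(target)))
    cursor.execute("ANALYZE {0}".format(table))
```

After this, re-running EXPLAIN on the slow query shows whether the planner has switched away from the backwards index scan.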
latest() is normally used for date comparison; maybe you should try ordering by id descending and then limiting to one.
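In ORM terms that suggestion is `.order_by('-id')` plus a LIMIT 1, e.g. `Rank.objects.filter(...).order_by('-id')[0]` (a hypothetical rewrite of the query from the question). The underlying logic, with plain Python dicts standing in for rows:

```python
# Hypothetical rows; ids are arbitrary example values.
rows = [{"id": 3}, {"id": 7}, {"id": 5}]

# Equivalent of ORDER BY id DESC LIMIT 1
newest = sorted(rows, key=lambda r: r["id"], reverse=True)[0]
```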