Django/Postgres performance worsening after repeatedly processing the same query - django

I am running Django on Apache. Several client computers call urllib2.urlopen() to send data, which my server processes before immediately sending back a reply. While testing this I found a very tricky issue. I have one client repeatedly send the same data to be processed. The first time it takes about 20 seconds, the second time about 40 seconds, and the third time I get a 504 (gateway timeout) error. If I send more data, further 504 errors pop up at random. I am pretty sure this is an issue with Postgres, as the function that processes the information makes many database calls; however, I do not know why the performance of Postgres would decline so much. I have tried several database optimization tricks, including this one: (http://stackoverflow.com/questions/1125504/django-persistent-database-connection), to no avail.
Thanks in advance.
Edit: The requests are not coming concurrently. They are coming in back to back and each query involves a lot of SELECTs and JOINs, and there are a few INSERTs and UPDATEs as well. The apache error logs show that it is just a simple timeout, where the function to process the client posted data takes over 90 seconds.

If it's really Postgres, then you should turn on the logging of slow statements in the Postgres configuration to find out which statement exactly is taking so much time.
This can be done by setting the configuration property log_min_duration_statement.
Details are in the manual:
http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-MIN-DURATION-STATEMENT
You say the function makes "many database calls" so I'd start with a very low number, or even 0 to log the duration of all statements, then you might be able to identify the slow ones.
It could also be a locking issue. Maybe the first call does not end its transaction properly and subsequent calls run into a timeout when waiting for a resource.
You can verify this by checking the system view pg_locks after the first call.
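Once statement durations are being logged, the slow statements can be picked out of the log with a few lines of Python. This is only a minimal sketch: it assumes each logged statement appears on one line in the form "duration: <n> ms  statement: <sql>", which depends on your logging settings, and the function name slowest_statements is made up for illustration.

```python
import re
from collections import defaultdict

# Assumed line shape (depends on the server's logging settings):
#   ... LOG:  duration: 1234.567 ms  statement: SELECT ...
DURATION_RE = re.compile(
    r"duration: (?P<ms>\d+(?:\.\d+)?) ms\s+statement: (?P<sql>.*)"
)


def slowest_statements(log_lines, top=5):
    """Return (total_ms, calls, sql) tuples, largest total time first."""
    totals = defaultdict(lambda: [0.0, 0])
    for line in log_lines:
        match = DURATION_RE.search(line)
        if match is None:
            continue  # not a duration line
        entry = totals[match.group("sql").strip()]
        entry[0] += float(match.group("ms"))
        entry[1] += 1
    ranked = sorted(
        ((ms, calls, sql) for sql, (ms, calls) in totals.items()),
        reverse=True,
    )
    return ranked[:top]
```

Grouping by statement text shows both a single very slow statement and a cheap statement that is simply executed very often. Multi-line statements are not handled by this sketch.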

Have you checked the Apache error_logs? Have you set django DEBUG = True or ADMINS = ('email#addr.com',) so you can get a detailed error report about what the actual cause of the issue is? If so, how about pasting some information here.
Why are you certain that it's postgres? Have you done diagnostics to come to that conclusion? If so, please let us know.
Are you running apache with mod_wsgi? How many processes and threads have you allocated to your django application?
Also, 20 seconds to process the first transaction is a huge amount of time. Perhaps you could show us the view code that is causing the time out. We may be able to help there.
I sincerely doubt that it's going to be postgres alone that is causing the issue. It probably has something to do with application code, or server configuration.
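One cheap way to find out where the 20 seconds actually go is to time the individual stages of the processing function. The following is a rough sketch using only the standard library; the helper name timed and the stage labels are hypothetical, not part of Django.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, store):
    """Add the wall-clock time spent inside the block to store[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        store[label] = store.get(label, 0.0) + elapsed


# Hypothetical usage inside the view that handles the posted data:
#   timings = {}
#   with timed("parse", timings):
#       ...parse the request body...
#   with timed("database", timings):
#       ...run the queries...
#   then log `timings` and compare the stages across repeated requests.
```

If the "database" share grows from request to request while the other stages stay flat, that points at the database; if not, the application code or server configuration is the better place to look.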

Related

SWF Activity is not completing even though the computation has finished

I'm testing a new SWF workflow, and I've got some activity that makes a RESTful call out to another service. Problem is, I can see through logging that the actual call takes less than a second to complete, but the Activity always times out in SWF (START_TO_CLOSE of 5 mins). Being more specific, the RESTful call is a list call, and when I limit the batch size to a small number, the Activity completes and moves on very quickly. But at some seemingly arbitrary threshold, it chokes completely.
Does anyone have any insight into this? I've read that SWF calls have a size limitation of 1 MB, does anyone know how to find the size of data my workers are trying to pass SWF?
After some remote debugging, it turns out the response from the task is too big and the activity is failing silently. The failure occurs when the framework tries to report the response back to SWF, and the SDK calls RespondActivityTaskCompleted. That API has a length restriction on the internal result param:
Length Constraints: Maximum length of 32768.
This is a validation error that throws an uncaught exception and is swallowed internally until the Activity times out.
I wouldn't recommend using activity input and output parameters for passing large data sets. SWF is an orchestration technology, not a data-passing one. The standard workarounds are:
Storing result in a separate store (S3 for example) and passing reference to it.
Caching result locally on a machine and route all following activities to the same host for them to have access to the cached result. See fileprocessing sample for the details of routing approach.
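The first workaround can be sketched roughly as follows. This is an illustration under assumptions, not SDK code: a plain dict stands in for the external store (S3 or similar), the names wrap_result and unwrap_result are made up, and the 32768 limit quoted above is treated as a character count.

```python
import json
import uuid

MAX_RESULT_CHARS = 32768  # length limit quoted above for the result parameter


def wrap_result(payload, store, limit=MAX_RESULT_CHARS):
    """Return a JSON string small enough to hand back as the activity result."""
    inline = json.dumps({"inline": payload})
    if len(inline) <= limit:
        return inline
    # Too large: park the data elsewhere and return only a reference to it.
    key = "results/" + uuid.uuid4().hex
    store[key] = json.dumps(payload)  # stand-in for an upload to S3 or similar
    return json.dumps({"ref": key})


def unwrap_result(result, store):
    """Inverse of wrap_result, for the activity that consumes the data."""
    envelope = json.loads(result)
    if "inline" in envelope:
        return envelope["inline"]
    return json.loads(store[envelope["ref"]])
```

Checking the size before responding also turns the silent timeout into something visible, since the worker can log whenever it falls back to a reference.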
BTW. Have you checked out Cadence which is an open source version of SWF with much better client side libraries?

"Zombie Requests" CFQUERY tags get stuck and are unkillable

Coldfusion 2016
Microsoft Server 2012
Oracle 12
ODBC connection
I turned on profiling and monitoring and now I can see that there are requests that are stuck and cannot be terminated by the CF monitor; Some are over 200k seconds.
I know I can increase the number of simultaneous requests but I want to solve the underlying problem. As I read the stack traces of these “zombie requests” they are getting stuck on and some are in but some are not. I ran the query in my oracle client and they resolve instantly.
Is there a way to terminate these requests or prevent this from happening at all?
EDIT: The server monitor does not treat these requests as slow or hung; the alerts are not triggering for any of them. Honestly, they should have been going off constantly considering how many of these there are.
Also, the execution time is a mere .003 seconds so what happened? Why doesn't ColdFusion know this?
An example of a "zombie"
The active query that is stuck
We have a similar situation with a different database engine - redbrick, which runs on a unix server. We solved it as follows.
We set up a cron job on the database server to run every 5 minutes. This job uses a combination of unix and awk commands.
This job runs a query against the system table that looks for queries that have been running for more than 120 seconds, where the database account is the one used by ColdFusion. Records are outputted to a file. Something like this:
print "alter system cancel user command userName process " $1 ";"
$1 comes from the query and is the process Id we want to stop.
Then we run the file, which executes all those alter system commands.
With a different database engine, and possibly a different OS for the database server, the details would be different, but the approach should work.
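For illustration, the statement-generating step of that job could look like this in Python instead of awk. It is a sketch under assumptions: the rows are taken to be (process id, seconds running) pairs already fetched from the system table for the ColdFusion account, and user_name is a placeholder for that account. The statement text follows the redbrick example above and would differ on Oracle.

```python
def cancel_statements(rows, user_name, max_seconds=120):
    """Build one cancel statement per query running longer than max_seconds.

    rows: iterable of (process_id, running_seconds) tuples.
    """
    return [
        f"alter system cancel user command {user_name} process {pid};"
        for pid, seconds in rows
        if seconds > max_seconds
    ]
```

The resulting lines would be written to a file and executed against the database, as described above.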
Edit Starts Here
To prevent recurrence, look at the pages that call the ones with the long running queries. If impatient users are able to repeatedly click something because nothing is happening, do something about that. You can use javascript to make the link/button go away. Alternatively, you can go to an intermediate page with a display for the user and something that carries them through to the real page.

Appfabric Cache Perfmon Errors

We have a critical system that is highly dependent on Appfabric Caching. The setup we use is three nodes which serve around 2000 simultaneous connections and 150-200 requests/second.
Configurations are the default ones. We receive maybe 5-10 "ErrorCode:SubStatus" each day, which is unacceptable.
I have added some performance counters but I can't see anything weird, except that we sometimes see values on "Total Failure Exceptions / sec" and "Total Failure Exceptions" is increasing, but only 2-3 times a day.
I would like to see where these errors come from but I can't find them in any logs in the Event Viewer (enabled them all according to documentation). Does anyone know if these errors could be logged somewhere and/or if it is possible to see them in any other way?
We receive maybe 5-10 "ErrorCode:SubStatus" each day, which is
unacceptable.
Between 5 and 10 errors per day, at 150 requests/sec? That is quite anecdotal. Your cache client always has to handle caching errors properly. A network failure can always occur.
5-10 "ErrorCode:SubStatus" is quite obscure. There are more than 50 error codes in AppFabric Caching. Try to get the exact error codes. See full list here.
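As a language-neutral illustration of "the client must handle caching errors", here is a small Python sketch of a cache read that retries once, falls back to the source of record, and hands the error back so the exact code can be logged. CacheError, cache_get and load_from_source are hypothetical stand-ins, not AppFabric APIs.

```python
class CacheError(Exception):
    """Stand-in for whatever exception the real cache client raises."""


def get_with_fallback(cache_get, load_from_source, key, retries=1):
    """Return (value, last_cache_error); the error is None if the cache behaved."""
    last_error = None
    for _ in range(retries + 1):
        try:
            value = cache_get(key)
        except CacheError as exc:
            last_error = exc  # keep it so the caller can log the exact code
            continue
        if value is not None:
            return value, last_error
        break  # plain cache miss, no point retrying
    # Cache unavailable or empty: fall back to the source of record.
    return load_from_source(key), last_error
```

With something like this in place, a handful of transient failures per day costs a few slower requests instead of user-visible errors, and the logged exceptions show which error codes are actually occurring.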
would like to see where these errors come from but I can't find them
in any logs in the Event Viewer (enabled them all according to
documentation). Does anyone know if these errors could be logged
somewhere and/or if it is possible to see them in any other way?
The only documentation available is here. The event viewer is useful to regularly monitor the health of the cache cluster. However, when troubleshooting an error, it is possible to get an even more detailed log of the cache cluster activities. I'm not sure this will help you much, because it is sometimes too specific.

Sustain an http connection while django processes a big request (20mins+)

I've got a django site that is producing a csv download. The content of the csv is dictated by user defined parameters. It's possible that users will set parameters that require significant thinking time on the server. I need a way of sustaining the http connection so the browser doesn't kick up an error message. I heard that it's possible to send intermittent http headers to do this. Can anyone point me in the right direction to set this up on a django site?
(unfortunately I'm stuck with the possibility of slow reports - improving my sql won't mitigate this)
Don't do it online. Trigger an offline task, use a bit of Javascript to repeatedly call a view that checks if the task has finished, and redirect to the finished file when it's ready.
Instead of blocking the user and their browser for 20 minutes (which is not a good idea), do the time-consuming task in the background. When the task finishes and has generated the result, simply notify the user, who then just needs to download the finished result.
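The start-then-poll pattern described in both answers can be sketched in plain Python as below. This is a single-process illustration only: the names start_report and report_status are made up, and under Apache with several worker processes the in-memory jobs dict would not be shared, so a real setup would more likely use a task queue (Celery, for example) plus a shared store.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
jobs = {}  # job id -> Future; in-memory, so valid for one process only


def start_report(build_report, params):
    """Kick off the slow work and return an id the browser can poll with."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(build_report, params)
    return job_id


def report_status(job_id):
    """What a polling view would return (as JSON) for the given job."""
    future = jobs[job_id]
    if not future.done():
        return {"state": "pending"}
    error = future.exception()
    if error is not None:
        return {"state": "failed", "error": str(error)}
    return {"state": "done", "result": future.result()}
```

One view would call start_report and return the job id immediately; the JavaScript then polls a second view backed by report_status and redirects to the finished file once the state is "done".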

How to generate Concurrent User load in Jmeter

I have a test where users log in, enter a search keyword in the search field, and get the results. Finally, they log out.
Now I want to test concurrency using Jmeter. So this is what I came up with.
Test plan
Thread group
+ Login request
+ Synchronizing Controller
+ Search string
+ Synchronizing Controller
+ Logout
I have set the number of threads to 10 and the Synchronizing Controller to 5. So when I run the test, will I get a concurrency of 5 users? Will the remaining 5 users be simultaneous users?
I also have dependent requests when the login page loads. So to achieve concurrency on login, I have put all the requests in a Transaction Controller and added the Synchronizing Controller as a child of the Transaction Controller. Please let me know if I am doing it right.
Also, please let me know if there is another way to achieve concurrency for a specific action (e.g. 5 users hitting the login button at the same time).
First off, you should try to distinguish between 'concurrent' and 'simultaneous'. They are normally very similar terms but in load testing they have different meanings. Simultaneous means two or more requests at the same time. Concurrent is two or more threads (scripts) running in parallel.
So, what you are talking about is trying to configure JMeter to simulate multiple simultaneous requests. But actually, there's a much, much better approach than this. Instead of focusing on trying to hit the same request at the same time, which is fiddly in JMeter, you should set up your test to be a realistic representation of the sort of load you want your application to support. If you do that well, using random wait times, throughput controllers and a realistic number of threads, then you will automatically be testing concurrency and at the same time running genuine, valid and useful performance tests too.
So, basically, drop the synchronising timer, use a constant throughput timer instead, configure wait times and then calculate the correct number of threads to generate the desired load.
The added bonus to this approach is you will be much less likely to raise false negatives. For example, if you hit your server with 5 simultaneous login requests then you might find that this call is single-threaded and the response times increase. But maybe this doesn't matter, maybe the chances of two login calls at the same time are so small that it is not worth spending time changing the code. This is a very, very important concept in load testing - perhaps the most important - you must have realistic objectives, without these you could be running tests, finding false bugs and generally wasting time forever.
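The "calculate the correct number of threads" step mentioned above is usually done with Little's law: the number of users in the system equals the arrival rate times the time each iteration takes. A small sketch, where the function name and the example figures are assumptions rather than values from the question:

```python
import math


def threads_needed(target_rps, avg_response_s, think_time_s):
    """Threads required so that, on average, target_rps requests are issued.

    Each thread completes one request every (response time + think time)
    seconds, so N threads produce N / (response + think) requests per second.
    """
    return math.ceil(target_rps * (avg_response_s + think_time_s))
```

For example, a target of 10 requests per second with a 0.5 s average response time and 4.5 s of think time needs threads_needed(10, 0.5, 4.5) threads, i.e. 10 x 5 = 50.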