Django sites' response time variance

I'm developing a web service that measures site response time. It's a Django app, and as an initial test I pointed it at a couple of my other Django sites on the same VPS. The response time was small (~5 ms) most of the time but, on a fairly regular (5- or 10-minute) schedule, jumped to a much higher value (up to 400 ms), despite no heavy DB load or cache involvement.
Suspicious of my own timing methodology, I pointed it at a static site on the same VPS and got a consistently quick response. I then used nginx's $request_time and $upstream_response_time logging to confirm that it really was my Django apps that were producing the response time variance.
I then pointed the app at a few other Django sites around the web and found similar results: there's a fast "baseline" response with fairly regular spikes to one or two remarkably repeatable slower times. For example, one has a 25ms baseline with ~200ms and ~400ms "steps".
Using the Django Debug Toolbar's Time tab on one of my sites, I can see this behaviour. Most F5 reloads are quick, but there's an occasional (fairly consistently one in ten) slow one, with all the time spent in the "request" section.
Any ideas?

Solved it: it was gunicorn's "max-requests" setting, which spawns a new worker after every X requests. When X is ten and I'm hitting a "quiet" site, the long response time shows up on every tenth hit.
That's why I only noticed the multi-modal times on some Django sites: the others were presumably busier, or not using gunicorn.
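For reference, a minimal sketch of where that setting lives. Gunicorn's config file is plain Python; everything below other than max_requests itself (the bind address, worker count, and jitter value) is an illustrative assumption, not taken from the question:

# gunicorn.conf.py -- illustrative sketch, not the poster's actual config
bind = "127.0.0.1:8000"    # hypothetical bind address
workers = 2                # hypothetical worker count

# Recycle each worker after this many requests; on a quiet site every
# tenth hit then pays the cost of spawning a fresh worker.
max_requests = 10

# Randomises the recycle point per worker so restarts don't all line up.
max_requests_jitter = 5

Raising max_requests (or adding jitter) spreads the restart cost out, and setting it to 0 disables worker recycling entirely.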

Related

Best wsgi service for handling webhooks with few resources

I'm currently working on a virtual server with 2 CPUs and 4 GB of RAM, running Flask + uwsgi + nginx to host the web server. The server needs to accept roughly 10 out of the ~2,500 requests it receives a day. The requests that don't pass average about 2 ms, yet the queue is consistently backed up. The issue I've been encountering lately is both speed and duplication when it does work: the accepted webhooks are forwarded to another server, and I either get duplicates or miss a bunch entirely.
[uwsgi]
module = wsgi          # entry point: wsgi.py
master = true          # master process supervises the workers
processes = 4          # four worker processes
enable-threads = true  # allow Python threads in workers
threads = 2            # two threads per worker
socket = API.sock      # unix socket that nginx proxies to
chmod-socket = 660
vacuum = true          # remove the socket on exit
harakiri = 10          # kill any worker stuck on one request for >10s
die-on-term = true     # SIGTERM shuts down instead of reloading
This is my current .ini file. I have messed around with harakiri and read through the uwsgi documentation for countless hours trying different things; it is unbelievably frustrating.
[Screenshot: systemctl status of the API service]
The check for it looks similar to this (some info redacted):
from flask import Flask, request  # imports needed to run this handler

app = Flask(__name__)

@app.route('/api', methods=['POST'])
def handle_callee():
    authorization = request.headers.get('authorization')
    if authorization == SECRET and check_callee(request.json):  # SECRET and check_callee redacted
        data = request.json
        name = data["payload"]["object"]["caller"]["name"]
        create_json(name, data)  # redacted helper
        return 'success', 200
    else:
        # Note: a 204 response discards its body, so 'failure' is never sent.
        return 'failure', 204
The JSON is then parsed through a number of functions. This is my first time deploying a WSGI service, and I don't know if my configuration is incorrect; I've poured hours of research into trying to fix this. Should I try switching to gunicorn? I asked this question differently a couple of days ago, but to no avail, so I'm trying to add more context in hopes someone can point me in the right direction. I don't even know, in the systemctl status, whether "| req: 12/31" shows how many requests that PID has done so far versus what's queued for it. I've been unable to fix this for about two weeks of trying different configs: increasing workers and processes, messing with harakiri, disabling logging. None of it has gotten requests to process at the speed I need.
Thank you to anyone who took the time to read this. I am still learning and have tried to add as much context as possible; if you need more, I will gladly respond. I just can't wrap my head around this issue.
You would need to take a systematic approach to figuring out:
How many requests per second you can handle
What your app's bottlenecks and scaling factors are
Cloudbees have written a great article on performance tuning for uwsgi + flask + nginx.
To give an overview, the steps to tune your service might look like this (a benchmarking sketch follows the list):
1. Make sure you have the required tooling, particularly a benchmarking tool like Apache Bench, k6, etc.
2. Establish a baseline. This means configuring your application with the minimum setup needed to run, i.e. a single process and a single thread, no multi-threading. Run the benchmark and record the results.
3. Start tweaking the setup: add threads, processes, etc.
4. Benchmark again after the tweaks.
5. Repeat steps 3 and 4 until you find the upper limits and understand the service's characteristics: are you CPU- or IO-bound?
6. Consider changing the hardware/VM, as some offerings come with performance penalties due to CPUs shared with other tenants, bandwidth limits, etc.
Tip: Run the benchmark tool from a different system than the one you are benchmarking, since the tool itself consumes resources and loads the system further.
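As an illustration of what "run the benchmark and record the results" can look like, here is a minimal Python sketch; the URL, request count, and the choice of the requests library are assumptions for the example, not details from the question:

import statistics
import time
import requests  # third-party: pip install requests

URL = "http://127.0.0.1:8000/api"  # hypothetical endpoint under test

def benchmark(n=200):
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        requests.get(URL, timeout=10)
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds
    latencies.sort()
    # Percentiles expose multi-modal behaviour that an average hides.
    print(f"p50={statistics.median(latencies):.1f} ms  "
          f"p95={latencies[int(0.95 * n)]:.1f} ms  "
          f"max={latencies[-1]:.1f} ms")

if __name__ == "__main__":
    benchmark()

Recording percentiles rather than a single average makes it obvious when a config change shifts the slow tail rather than the typical case.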
In your code sample you have two methods, create_json(name, data) and check_callee(request.json); do you know their performance?
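If not, one low-effort way to find out is to wrap them with a timing decorator. A sketch (the decorator is hypothetical; only the two function names come from the question):

import functools
import logging
import time

def timed(func):
    # Logs how long each call takes, so slow helpers show up in the logs.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logging.info("%s took %.1f ms", func.__name__, elapsed_ms)
    return wrapper

# e.g. decorate the question's helpers:
# @timed
# def create_json(name, data): ...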
Note: Can't comment so had to write this as an answer.

How to bring down maximum response time in Django?

I am using New Relic to get some info on response times, and I have also been doing load tests using Blitz. I can see on New Relic that a lot of the API endpoints average around 300 ms (which I am totally happy with; these are geospatial queries, btw). The only issue is that the maximum is 55k ms, and some users are complaining about certain things taking a while to load.
How can I make these endpoints more reliably take 300 ms rather than 55k ms?
edit:
The main question is why do these responses sometimes take 55k ms? Is this user connection speed or the code?

How to generate concurrent user load in JMeter

I have a test where users log in, enter a search keyword in the search field, get the results, and finally log out.
Now I want to test concurrency using JMeter, so this is what I came up with:
Test plan
Thread group
+ Login request
+ Synchronizing Controller
+ Search string
+ Synchronizing Controller
+ Logout
I have set the number of threads to 10 and entered 5 in the Synchronizing Controller. So when I run the test, will I get a concurrency of 5 users? Will the remaining 5 users be simultaneous users?
I also have dependent requests when the login page loads, so to achieve concurrency on login I have put all the requests in a Transaction Controller and added the Synchronizing Controller as a child of the Transaction Controller. Please let me know if I am doing this right.
Also, please let me know if there is another way to achieve concurrency for a specific action (e.g. 5 users hitting the login button at the same time).
First off, you should distinguish between 'concurrent' and 'simultaneous'. They are normally very similar terms, but in load testing they have different meanings: simultaneous means two or more requests at the same time; concurrent means two or more threads (scripts) running in parallel.
So, what you are talking about is configuring JMeter to simulate multiple simultaneous requests. But actually, there's a much better approach. Instead of trying to hit the same request at exactly the same time, which is fiddly in JMeter, set up your test to be a realistic representation of the sort of load you want your application to support. If you do that well, using random wait times, throughput controllers, and a realistic number of threads, then you will automatically be testing concurrency while also running genuine, valid, and useful performance tests.
So, basically: drop the synchronizing timer, use a Constant Throughput Timer instead, configure wait times, and then calculate the number of threads needed to generate the desired load (see the sketch below).
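For that last calculation, the usual back-of-the-envelope formula is Little's Law: threads = target throughput x time per iteration (response time plus think time). A sketch with made-up numbers:

# Little's Law sketch for sizing a thread group; every number here is hypothetical.
target_throughput = 10.0   # desired requests per second
avg_response_s = 0.5       # measured average response time, seconds
avg_think_s = 4.5          # configured wait/think time between requests, seconds

iteration_s = avg_response_s + avg_think_s    # time one thread spends per loop
threads = target_throughput * iteration_s     # Little's Law: N = X * R
print(f"threads needed: {threads:.0f}")       # -> 50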
The added bonus of this approach is that you will be much less likely to raise false alarms. For example, if you hit your server with 5 simultaneous login requests, you might find that this call is single-threaded and the response times increase. But maybe this doesn't matter: maybe the chance of two login calls arriving at the same time is so small that it is not worth spending time changing the code. This is a very important concept in load testing, perhaps the most important: you must have realistic objectives. Without them you could be running tests, finding false bugs, and generally wasting time forever.

Django/Postgres performance worsening after repeatedly processing the same query

I am running Django on Apache. I have several client computers which call urllib2.urlopen() and send over some data; my server processes it and immediately sends back a reply. While testing this, however, I found a very tricky issue. I had one client repeatedly send the same data to be processed. The first time, it takes around ~20 seconds; the second time, about 40 seconds; the third time I get a 504 (gateway timeout) error, and if I keep sending data, more 504 errors randomly pop up. I am pretty sure this is an issue with Postgres, as the function that processes the information makes many database calls, but I do not know why Postgres's performance would decline so much. I have tried several database optimization tricks, including this one (http://stackoverflow.com/questions/1125504/django-persistent-database-connection), to no avail.
Thanks in advance.
Edit: The requests are not coming in concurrently. They come in back to back, and each query involves a lot of SELECTs and JOINs, plus a few INSERTs and UPDATEs. The Apache error logs show that it is just a simple timeout: the function that processes the client-posted data takes over 90 seconds.
If it's really Postgres, then you should turn on logging of slow statements in the Postgres configuration to find out exactly which statement is taking so much time.
This can be done by setting the configuration property log_min_duration_statement.
Details are in the manual:
http://www.postgresql.org/docs/current/static/runtime-config-logging.html#GUC-LOG-MIN-DURATION-STATEMENT
You say the function makes "many database calls" so I'd start with a very low number, or even 0 to log the duration of all statements, then you might be able to identify the slow ones.
It could also be a locking issue. Maybe the first call does not end its transaction properly, and subsequent calls run into a timeout while waiting for a resource.
You can verify this by checking the system view pg_locks after the first call.
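As an illustration, a minimal Python sketch of that check (the connection string is a placeholder, and the query assumes a reasonably recent Postgres where pg_stat_activity exposes a query column):

import psycopg2  # third-party: pip install psycopg2-binary

# Placeholder DSN; point it at the application's database.
conn = psycopg2.connect("dbname=appdb user=postgres")
cur = conn.cursor()

# Any rows here mean a backend is waiting on a lock held by another session.
cur.execute("""
    SELECT l.pid, l.locktype, l.mode, a.query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE NOT l.granted
""")
for pid, locktype, mode, query in cur.fetchall():
    print(pid, locktype, mode, query)

Run this while the first call is still in flight (or stuck) to see who is blocked on what.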
Have you checked the Apache error logs? Have you set Django's DEBUG = True or ADMINS = ('email@addr.com',) so you can get a detailed error report about the actual cause of the issue? If so, how about pasting some of that information here?
Why are you certain that it's postgres? Have you done diagnostics to come to that conclusion? If so, please let us know.
Are you running apache with mod_wsgi? How many processes and threads have you allocated to your django application?
Also, 20 seconds to process the first transaction is a huge amount of time. Perhaps you could show us the view code that is causing the time out. We may be able to help there.
I sincerely doubt that it's going to be postgres alone that is causing the issue. It probably has something to do with application code, or server configuration.

Filemaker XSL 20sec Query Latency

I have an ASP frontend that loads data from a Filemaker database using XSL to perform simple queries. The problem is that the first page load takes 20 seconds +/- 200ms, then the next few page refreshes within a minute of the first request take <200ms, then the cycle starts over again.
Each page load makes only 2 XSL queries, and they execute quickly after the first page load, so what is causing the delay on the first load? I have caching turned up with a 100% hit rate and the number of connections at 100. I've tried with XSL database sessions on and off, and with session times anywhere from 1 to 60 minutes, without any change.
The XSL loads from ASP use a GET request and add a Basic Authorization header to authenticate each time.
During fast page requests, the fmserver.exe and fmswpc.exe processes don't even flinch, but during a 20-second holdup I see fmserver jump to 30% CPU with a 3 MB I/O read a few seconds into the request, and occasionally fmswpc jump to 60% CPU.
If you're accessing the FileMaker server on the same machine, be sure to use '127.0.0.1' instead of 'localhost'.
Found the problem: for some reason it was the Authorization header that caused the lag. If I gave the guest account full access and removed that header, every request was fast. Go figure.
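For anyone hitting something similar, a quick outside-in way to confirm this kind of diagnosis is to time the same request with and without the header. A Python sketch (the URL and credentials are placeholders):

import time
import requests  # third-party: pip install requests

URL = "http://filemaker-host/example.xsl"  # placeholder URL

def time_request(auth=None):
    # Times a single GET, optionally with HTTP Basic authentication.
    start = time.perf_counter()
    requests.get(URL, auth=auth, timeout=60)
    return time.perf_counter() - start

print("no auth:    %.2f s" % time_request())
print("basic auth: %.2f s" % time_request(auth=("user", "password")))  # placeholders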