Why do uwsgi workers sit idle while nginx shows a lot of timeouts? - django

Stack: nginx, uwsgi, django
uwsgitop and top both showed that the uWSGI workers were idle, while the nginx error log said the upstream timed out.
I thought some requests needed a lot of resources, such as waiting on the DB or cache, while others did not. But after checking the timed-out requests, most of them were not resource-hungry; every kind of request had been timing out.
So why didn't nginx send requests to the idle workers if the others were really busy? And why would the uWSGI master keep some workers busy and the others idle?

I'd like to answer my own question.
The fix was to change the kernel parameter net.ipv4.ip_conntrack_max from 65560 to 6556000.
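For reference, a change like that is usually inspected and applied with sysctl; on newer kernels the equivalent key is net.netfilter.nf_conntrack_max, so use whichever name your system exposes:
sysctl net.ipv4.ip_conntrack_max                               # inspect the current limit
sysctl -w net.ipv4.ip_conntrack_max=6556000                    # raise it for the running kernel
echo "net.ipv4.ip_conntrack_max = 6556000" >> /etc/sysctl.conf # persist the change across reboots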
Here is the full story of how we found the answer:
Users said: slow, slow, slow.
The nginx error log was flooded with "upstream connection timed out".
I checked the uWSGI log, found some errors and fixed them; found more, fixed more, and this loop lasted for days. By yesterday I was convinced the problem had nothing to do with uWSGI, memcached, the DB, redis, or anything else in the backend, because the uWSGI workers were idle.
So I thought nginx must have been the problem: reload, restart, check connections, workers, proxy_read_timeout, etc. No luck.
I checked ulimit -n, which reported 1024, the default. I have 8 nginx workers, so connections could reach 1024 * 8; I figured that was probably fine, since nginx never complained about too many open files. Anyway, I raised it to 4096. No luck.
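For context, this is roughly what that check and change look like (worker_rlimit_nofile is the nginx directive that overrides the inherited limit):
ulimit -n         # open-file limit inherited by each nginx worker (1024 here)
ulimit -n 4096    # raise it for the shell that (re)starts nginx
# nginx can also raise its own limit with worker_rlimit_nofile 4096; in nginx.conf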
Then I checked the number of connections and their states, and the problem appeared: the upstream connections were almost all stuck in SYN_SENT until the timeout happened. Only 2 or 3 of the 300 connections were in ESTABLISHED state. We wanted to know why. One of my friends told me to use tcpdump, the magic tool I had never dared to try.
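One quick way to see that breakdown (a sketch; port 8000 is an assumed upstream port, not taken from the original post):
# count upstream connections by TCP state; most sat in SYN-SENT instead of ESTAB
ss -tan 'dport = :8000' | awk 'NR>1 {print $1}' | sort | uniq -c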
Then we went to syslog, found the error that explained everything (the connection-tracking table was full and packets were being dropped), and finally resolved the problem.

I had a similar issue where my listen queue was growing and the RPS was low despite all the workers idling.
Samuel found one of the causes, but there are a few more potential causes of this behavior:
net.core.somaxconn not high enough
uwsgi --listen not high enough
nginx worker processes too low
nginx worker connections too low
If none of these help, check in your logs that inbound requests to uWSGI are arriving over HTTP/1.1 rather than HTTP/1.0, and then use --http11-socket.
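As a rough illustration of where these knobs live (the numbers are placeholders, not recommendations):
sysctl -w net.core.somaxconn=4096                              # kernel cap on any socket's listen backlog
uwsgi --ini app.ini --listen 4096 --http11-socket 0.0.0.0:8080 # uWSGI backlog plus an HTTP/1.1 keep-alive socket (app.ini is a placeholder)
# in nginx.conf: worker_processes auto; and events { worker_connections 4096; }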
Here are some of my findings when I wrestled with this issue: https://wontonst.blogspot.com/2019/06/squishing-performance-bug-in.html
The nginx tuning page also has some other configurations that may or may not be useful in solving this issue:
https://www.nginx.com/blog/tuning-nginx/

Related

504 timeout on AWS with nginx and gunicorn

I am running a Python Django app on an AWS EC2 instance. It uses gunicorn and nginx to serve the app, and the EC2 instance sits behind an application load balancer. Occasionally I get a 504 error where the entire EC2 instance becomes unreachable for everyone (including via SSH, which I use all the time otherwise). I then need to restart everything, which takes time.
I can replicate the error by overloading the app (e.g. uploading and processing a very large image). In that case a gunicorn worker times out (I see the timeout message in the logs), the 504 error appears, and the instance becomes unreachable. I set my gunicorn timeout to 5 minutes (300 seconds), but it falls over quicker than that. There is nothing really useful in the CloudWatch logs.
I am looking for a way to resolve this for all current and future cases. That is, if the site gets overloaded, I want it to return an error message rather than become completely unreachable for everyone. Is there a way to do that?
There are many things to consider and test here in order to find the reason, but I think it is OOM (out of memory), mainly because you have to restart the instance even to log in over SSH.
Nginx uses an event-driven approach to handling requests, so a single nginx worker can handle thousands of requests simultaneously. Gunicorn, on the other hand, by default uses sync workers, which means a request occupies a worker until it has been processed.
When you send a large request, the machine keeps processing it until memory is exhausted, and that mostly goes undetected by any service running inside the machine. Try monitoring memory with a monitoring tool in AWS, or just SSH in and run htop before calling the API.
In most cases with Django/gunicorn the culprit is OOM.
Edit:
AFAIK you cannot catch an OOM as it happens; the only thing you can do is look at the aftermath, i.e. after the system restarts, check /var/log/syslog. As I said, monitor memory with the AWS monitoring tools (I don't have much experience with AWS).
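For example, once the instance is reachable again, OOM-killer traces can usually be found like this (log paths vary by distribution):
dmesg -T | grep -i -E 'killed process|out of memory'
grep -i 'oom' /var/log/syslog    # /var/log/messages on some distributions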
And regarding a solution:
First, increase the memory of your EC2 instance until the error stops, to see how big the problem is.
Then optimise your application by profiling which part is actually using that much memory. I haven't used a memory profiler myself, so maybe you can tell me afterwards which one is better.
In the end the only real fix is to optimise your application: see the common gotchas, best practices, query optimisations, etc.
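As one concrete option for that profiling step (just an illustration; the links below cover the topic more broadly), the memory-profiler package can record memory over time while you replay the heavy request:
pip install memory-profiler
mprof run python manage.py runserver   # reproduce the heavy upload locally while memory usage is recorded
mprof plot                             # needs matplotlib; shows where the spike happens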
https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
https://www.pluralsight.com/blog/tutorials/how-to-profile-memory-usage-in-python

uwsgi + nginx: status 502 on heavy load

I have this nginx + uwsgi + django application which works pretty well.
But when the load on the server gets a bit heavier, I start getting 502 errors on random requests.
Here's the relevant section of uwsgi.ini:
master = true
#more processes, more computing power
processes = 32
threads = 3
post-buffering = 1
# this one must match net.core.somaxconn
listen = 8000
I use this large number of processes to handle more simultaneous requests - and I even added some threads on top of that - but I must admit I expected the listen directive to take care of the overflow. In reality, the 502 errors seem to start the moment I exceed 32 * 3 simultaneous requests.
The way I understood it, listen should allow up to 8000 (in this case) connections to queue up, waiting to be served, but it doesn't seem to have any real effect.
The uWSGI log clearly says the setting is active:
your server socket listen backlog is limited to 8000 connections
But I seem to be misunderstanding the effect of the setting.
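One way to sanity-check how much backlog the kernel actually granted (a sketch; it only covers TCP sockets, not UNIX sockets):
sysctl net.core.somaxconn    # the kernel caps every listen backlog at this value
ss -ltn                      # for LISTEN sockets, the Send-Q column shows the backlog actually granted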
Anyway, I'd just like to get rid of the 502s, so I'm open to suggestions in any direction.
I had a similar issue where, under load, I started getting 502s back from uWSGI.
Long story short: check that you are on HTTP/1.1, that you are using --http11-socket, and that you are on uWSGI 2.0.16 or later.
HTTP/1.1 uses persistent connections by default. I found that the 502s occur when the listen queue grows and connections start to get dropped. Have a look at my article for the whole story:
https://wontonst.blogspot.com/2019/06/squishing-performance-bug-in.html

Gunicorn + Gevent : Debugging workers stuck state/ WORKER TIMEOUT cause

I'm running a very simple web server using Django on Gunicorn with gevent workers, which talks to MySQL for simple CRUD-type operations. All of this sits behind nginx and is hosted on AWS. I'm running the app server with the following command:
gunicorn --logger-class=simple --timeout 30 -b :3000 -w 5 -k gevent my_app.wsgi:application
However, sometimes the workers just get stuck (sometimes when the number of requests increases, sometimes even without that), TPS drops, and nginx starts returning HTTP 499. Sometimes workers start getting killed (WORKER TIMEOUT) and their requests are dropped.
I'm unable to find a way to debug where the workers are getting stuck. I've checked the MySQL slow query log, and that is not the problem here.
In Java I could take a jstack dump to see the thread states, or use something like Takipi, which captures the state of the threads at the moment an exception occurs.
To all the people out there who can help, I call upon you to help me find a way to see the internal state of a hosted Python web server, i.e.:
the workers' state at a given point
the threads' state at a given point
which requests a particular gevent worker has started processing, and when it gets stuck or killed, where exactly it is stuck
which requests got terminated because a worker was killed
etc.
I've been searching and have found many people facing similar issues, but their solutions seem to be trial and error, and nowhere are the steps laid out for how to dig into this.
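For what it's worth, the closest Python analogue to jstack that I'm aware of is py-spy, which can dump the current stacks of a running worker without restarting it (a sketch; the PID is whichever gunicorn worker looks stuck):
pip install py-spy
ps aux | grep gunicorn            # find the PID of the stuck worker
sudo py-spy dump --pid <PID>      # print the Python stack of every thread in that process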

Very slow: ActiveRecord::QueryCache#call

I have an app on Heroku, running on Puma:
workers 2
threads_count 3
pool 5
It looks like some requests get stuck in the middleware, and it makes the app very slow (VERY!).
I have seen other people's threads about this problem but no solution so far.
Please let me know if you have any hint.
I work for Heroku support, and Middleware/Rack/ActiveRecord::QueryCache#call is commonly reported as a problem by New Relic. Unfortunately, it's usually a red herring, as each time the source of the problem lies elsewhere.
QueryCache is where Rails first tries to check out a connection for use, so any problem with a connection will show up here as a request getting 'stuck' waiting. This doesn't necessarily mean the database server is out of connections (if you have Librato charts for Postgres, they will show this). More likely, something is causing certain database connections to enter a bad state, and new requests for a connection are left waiting. This can occur in older versions of Puma where multiple threads are used and reaping_frequency is set: if some connections get into a bad state and the others are reaped, problems follow.
Some high-level suggestions are as follows:
Upgrade Ruby & Puma
If using the rack-timeout gem, upgrade that too
These upgrades often help. If not, there are other options to look into, such as switching from threads to worker-based processes, or using a Postgres connection pooler such as PgBouncer. We have more suggestions on configuring concurrent web servers for use with Postgres here: https://devcenter.heroku.com/articles/concurrency-and-database-connections
I will answer my own question:
I simply had to check all the queries to my DB. One of them was taking a VERY long time, and even though it was not run often, it would slow down the whole server for quite a while afterwards (even after the query finished, there was a sort of "traffic jam" on the server).
Solution:
Check all the queries to your database and fix the slowest ones (that might simply mean breaking a query down into a few steps, or running it at night when there is no traffic, etc.).
Once these queries are fixed, everything should go back to normal.
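On Heroku Postgres, one way to find those slow queries (an illustration; it assumes the pg-extras plugin and the pg_stat_statements extension are available) is:
heroku plugins:install heroku-pg-extras
heroku pg:outliers -a your-app-name   # queries ranked by total execution time; your-app-name is a placeholder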
I recently started seeing a spike in time spent in ActiveRecord::QueryCache#call. After looking at the source, I decided to try clearing said cache using ActiveRecord::Base.connection.clear_query_cache from a Rails console attached to the production environment. The error I got back was PG::ConnectionBad: could not fork new process for connection: Cannot allocate memory, which led me to another SO question: Heroku Rails could not fork new process for connection: Cannot allocate memory

uWSGI + nginx for django app avoids pylibmc multi-thread concurrency issue?

Introduction
I encountered this very interesting issue this week; better to start with some facts:
pylibmc is not thread safe: when used as the Django memcached backend, starting multiple Django instances directly in a shell leads to crashes under concurrent requests.
when deployed with nginx + uWSGI, this pylibmc problem magically disappears.
switching the Django cache backend to python-memcached also solves the problem, but this question isn't about that.
Elaboration
Starting with the first fact, this is how I reproduced the pylibmc issue:
The failure of pylibmc
I have a Django app which does a lot of memcached reading and writing, and my deployment strategy was to start multiple Django processes in a shell, bound to different ports (8001, 8002), and use nginx to do the load balancing.
I initiated two separate load tests against these two Django instances using locust, and this is what happened:
Both instances crashed and reported exactly the same issue, something like this:
Assertion "ptr->query_id == query_id +1" failed for function "memcached_get_by_key" likely for "Programmer error, the query_id was not incremented.", at libmemcached/get.cc:107
uWSGI to the rescue
So from the above case we learned that multi-threaded concurrent requests to memcached via pylibmc can cause problems, yet this somehow doesn't bother uWSGI with multiple worker processes.
To prove that, I started uWSGI with the following settings included:
master = true
processes = 2
This tells uWSGI to start two worker processes. I then told nginx to serve the Django static files and route non-static requests to uWSGI, to see what happens. With the server started, I launched the same locust test against Django on localhost, making sure there were enough requests per second to cause concurrent requests against memcached. Here's the result:
In the uWSGI console there was no sign of dead worker processes and no worker was re-spawned, yet the locust stats showed there certainly were concurrent requests (5.6 req/s).
The question
I'm extremely curious how uWSGI makes this go away, and I couldn't figure it out from their documentation. To recap, the question is:
How does uWSGI manage its worker processes so that multi-threaded memcached requests don't cause Django to crash?
In fact, I'm not even sure whether it's the way uWSGI manages worker processes that avoids the issue, or some other magic that comes with uWSGI that's doing the trick. I've seen something called a memcached router in their documentation that I didn't quite understand; is that related?
Isn't it because you actually have two separate processes managed by uWSGI? Since you are setting the processes option (rather than the workers option), you actually have multiple uWSGI processes (I'm assuming a master plus two workers, given the config you used). Each of those processes loads its own pylibmc, so there is no state shared between threads (you haven't configured any threads in uWSGI, after all).
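A minimal sketch of the setup being described, assuming a module path of mysite.wsgi (not taken from the original post): each worker is a full OS process with its own interpreter and its own pylibmc client, and no threads option is set, so there is no in-process concurrency for pylibmc to trip over.
uwsgi --master --processes 2 --module mysite.wsgi:application --socket 127.0.0.1:8001
# adding --threads N (with --enable-threads) would reintroduce threads sharing one pylibmc client per worker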