Gunicorn + gevent: debugging stuck workers / the cause of WORKER TIMEOUT - Django

I'm running a very simple web server using Django on Gunicorn with gevent workers that communicate with MySQL for simple CRUD-type operations. All of this is behind nginx and hosted on AWS. I'm running my app server using the following config:
gunicorn --logger-class=simple --timeout 30 -b :3000 -w 5 -k gevent my_app.wsgi:application
However, sometimes the workers just get stuck (sometimes when the number of requests increases, sometimes even without that) and the TPS drops, with nginx returning HTTP 499. Sometimes workers start getting killed (WORKER TIMEOUT) and those requests are dropped.
I'm unable to find a way to debug where the workers are getting stuck. I've checked the slow logs of MySQL, and that is not the problem here.
In Java, I can take a jstack dump to see the state of the threads, or use tools like Takipi, which captures the state of the threads at the moment an exception occurs.
To all the people out there who can help: I'm looking for a way to see the internal state of a hosted Python web server, i.e.
the state of the workers at a given point
the state of the threads at a given point
which requests a particular gevent worker has started processing, and, when it gets stuck or killed, where exactly it is stuck
which requests were terminated because a worker got killed
etc.
I've been searching for this and have found many people facing similar issues, but their solutions seem to be trial and error, and nowhere are the steps described for actually digging into this.
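One way to get that kind of visibility with gevent workers is to dump every greenlet's stack on demand. A rough sketch, written as a gunicorn config file passed with -c (gunicorn's post_fork hook is real; the signal choice and helper name are purely illustrative):

import gc
import signal
import traceback

import greenlet


def _dump_greenlet_stacks(signum, frame):
    # Print the current stack of every suspended greenlet in this worker,
    # so you can see which line each in-flight request is blocked on.
    for obj in gc.get_objects():
        if isinstance(obj, greenlet.greenlet) and obj.gr_frame is not None:
            print("--- greenlet %r ---" % obj)
            traceback.print_stack(obj.gr_frame)


def post_fork(server, worker):
    # Gunicorn server hook: register the dump handler in every worker process.
    # Pick a signal that does not clash with gunicorn's own signal handling.
    signal.signal(signal.SIGUSR2, _dump_greenlet_stacks)

With that in place, sending the chosen signal to a worker pid prints each greenlet's stack to the worker's stdout/log. Externally, py-spy dump --pid <worker pid> gives a similar view without any code changes. If the in-process handler never fires, the worker is probably blocked inside C code (for example a blocking MySQL driver call), which is itself a strong clue about where it is stuck.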

Related

Why is my Flask backend unstable on Heroku?

I created a small backend API for a game. When a user creates a game (a request is made to the API), Python creates a new instance of this game (to be more precise, I add a game to a dict). The user gets the game id in the response and can now play (the frontend calls several routes to update the state of this game).
It works perfectly locally; however, on Heroku it is very unstable: I use polling, and approximately 50% of the requests fail because the game id cannot be found.
I can't figure out why the backend sometimes finds the game and sometimes doesn't.
Does anybody have an idea of what went wrong?
Thank you very much.
This sounds like it could be due to the way you've implemented in-memory storage. If it's not thread-safe, the app might work fully in development, but when deployed with a WSGI server like gunicorn with several worker processes/threads, each with their own memory, it can lead to the strange behaviour you describe.
What's more, Heroku is quirky.
Here's the output of gunicorn --help when installed on a typical system through pip; it defaults to 1 worker if the -w flag is not provided:
-w INT, --workers INT
The number of worker processes for handling requests. [1]
However when executed via the Heroku console, notice that it defaults to 2:
-w INT, --workers INT
The number of worker processes for handling requests. [2]
Heroku appear to have customised their gunicorn build for some reason (edit: figured out how), so the following Procfile launches with 2 workers:
web: gunicorn some:app
Whereas on a non-Heroku system this would launch with a single worker.
You'll probably find the following Procfile will solve your issue:
web: gunicorn --workers 1 some:app
This is, of course, only suitable if it's a small project that doesn't need to scale to several workers. To fix the issue properly and still scale the application, you'll need to move the shared state into a separate storage backend (e.g. Redis) rather than keeping it in a single process's memory.
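As a minimal sketch of what that change could look like in Flask (the route names, key layout, and Redis location are assumptions, not your code):

import json
import uuid

import redis
from flask import Flask, jsonify

app = Flask(__name__)
store = redis.Redis(host="localhost", port=6379, db=0)


@app.route("/games", methods=["POST"])
def create_game():
    game_id = uuid.uuid4().hex
    state = {"players": [], "moves": []}  # illustrative initial state
    store.set("game:" + game_id, json.dumps(state))
    return jsonify({"game_id": game_id})


@app.route("/games/<game_id>", methods=["GET"])
def get_game(game_id):
    raw = store.get("game:" + game_id)
    if raw is None:
        return jsonify({"error": "game not found"}), 404
    return jsonify(json.loads(raw))

With the state in Redis, every worker (and every dyno) reads and writes the same game, so the number of workers no longer affects correctness.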

Why does uWSGI + nginx for a Django app avoid the pylibmc multi-thread concurrency issue?

Introduction
I encountered this very interesting issue this week; it's better to start with some facts:
pylibmc is not thread safe; when used as the Django memcached backend, starting multiple Django instances directly in a shell makes them crash when hit with concurrent requests.
if deployed with nginx + uWSGI, this pylibmc problem magically disappears.
if you switch the Django cache backend to python-memcached, that also solves the problem, but this question isn't about that.
Elaboration
Starting with the first fact, this is how I reproduced the pylibmc issue:
The failure of pylibmc
I have a Django app that does a lot of memcached reading and writing, and my deployment strategy is to start multiple Django processes in a shell, bound to different ports (8001, 8002), with nginx doing the load balancing.
I initiated two separate load tests against these two Django instances using Locust, and this is what happened:
They both crashed and reported exactly the same issue, something like this:
Assertion "ptr->query_id == query_id +1" failed for function "memcached_get_by_key" likely for "Programmer error, the query_id was not incremented.", at libmemcached/get.cc:107
uWSGI to the rescue
So from the above case, we learned that multi-threaded concurrent requests towards memcached via pylibmc can cause problems, yet this somehow doesn't bother uWSGI with multiple worker processes.
To prove that, I started uWSGI with the following settings included:
master = true
processes = 2
This tells uWSGI to start two worker processes. I then told nginx to serve any Django static files and route non-static requests to uWSGI, to see what happens. With the server started, I launched the same Locust test against Django on localhost, making sure there were enough requests per second to cause concurrent requests against memcached. Here's the result:
In the uWSGI console there was no sign of dead worker processes, and no worker was re-spawned, but there certainly were concurrent requests (5.6 req/s).
The question
I'm extremely curious about how uWSGI makes this go away, and I couldn't learn that from their documentation. To recap, the question is:
How does uWSGI manage worker processes so that multi-threaded memcached requests don't cause Django to crash?
In fact, I'm not even sure whether it's the way uWSGI manages worker processes that avoids this issue, or some other magic that comes with uWSGI that's doing the trick. I've seen something called a memcached router in their documentation that I didn't quite understand; is that related?
Isn't it because you actually have two separate processes managed by uWSGI? As you are setting the processes option, you should actually have multiple uWSGI worker processes (I'm assuming a master plus two workers because of the config you used). Each of those processes has its own copy of pylibmc loaded, so there is no state sharing between threads (you haven't configured any threads in uWSGI, after all).
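For illustration, a minimal sketch of the distinction (assuming a local memcached on 127.0.0.1:11211): one client per worker process is what your processes-only setup gives you implicitly, and if you ever do enable threads, pylibmc's own pooling helper keeps each thread on its own clone of the client instead of sharing one.

import pylibmc

# One client per worker process is safe: uWSGI workers are separate processes,
# each with its own libmemcached state, which is what processes = 2 gives you.
mc = pylibmc.Client(["127.0.0.1"], binary=True)

# If threads are enabled inside a worker, don't share the raw client between
# them; ThreadMappedPool hands each thread its own clone of the client.
pool = pylibmc.ThreadMappedPool(mc)
with pool.reserve() as client:
    client.set("some_key", "some value")
    print(client.get("some_key"))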

Remote Django application sending messages to RabbitMQ

I'm starting to get familiar with the RabbitMQ lingo, so I'll try my best to explain. I'll be going into a public beta test in a few weeks, and this is the setup I am hoping to achieve. I would like Django to be the producer, producing messages to a remote RabbitMQ box, with another Celery box listening on the RabbitMQ queue for tasks. So in total there would be three boxes: Django, RabbitMQ and Celery. So far, following the Celery docs, I have successfully been able to run Django and Celery together on one box and RabbitMQ on another machine. Django simply calls the task in the view:
add.delay(3, 3)
And the message is sent over to RabbitMQ. RabbitMQ sends it back to the same machine that the task was sent from (since Django and celery share the same box) and celery processes the task.
This is great for development purposes. However, having Django and Celery running on the same box isn't a great idea since both will have to compete for memory and CPU. The whole goal here is to get clients in and out of the HTTP Request cycle and have celery workers process the tasks. But the machine will slow down considerably if it is accepting HTTP requests and also processing tasks.
So I was wondering whether there is a way to keep these all separate from one another: have Django send the tasks, RabbitMQ forward them, and Celery process them (producer, broker, consumer).
How can I go about doing this? Really simple examples would help!
What you need to do is deploy your application's code to the third machine as well, and on that machine run only the worker command that handles the tasks. The Celery box needs a copy of the code because the workers import and execute your task functions.
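A rough sketch of that layout (hostnames, credentials, and the project name are placeholders): both the Django box and the Celery box carry the same project code and point at the remote RabbitMQ broker; only the Celery box runs the worker.

# settings.py, shared by the Django (producer) box and the Celery (consumer) box
CELERY_BROKER_URL = "amqp://myuser:mypassword@rabbitmq.example.com:5672//"

# my_app/celery.py, the usual Celery/Django wiring
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "my_app.settings")

app = Celery("my_app")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()

# on the Celery box only, start the worker:
#   celery -A my_app worker --loglevel=info

The Django views keep calling add.delay(3, 3) exactly as before; the only differences are that the broker URL points at the RabbitMQ box and the worker command runs on the third machine.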

Asynchronous celery task is blocking main application (maybe)

I have a django application running behind varnish and nginx.
There is a periodic task running every two minutes that accesses a locally running JSON-RPC daemon and updates a Django model with the result.
Sometimes the Django app stops responding, ending in an nginx gateway error. Looking through the logs, it seems that when this happens, the background task accessing the JSON-RPC daemon is also timing out.
The task itself is pretty simple: a value is requested from the JSON-RPC daemon and saved in a Django model, either updating an existing entry or creating a new one. I don't think any database deadlock is involved here.
I am a bit lost as to how to track this down. To start, I don't know whether the timeout of the task is causing the overall site timeout, OR whether some other problem is causing BOTH timeouts. After all, a timeout in the asynchronous task should not have any influence on the website's response, should it?
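For illustration, a task of that shape might look roughly like the sketch below (the model name, daemon address, and the explicit timeout are illustrative assumptions, not the actual code); giving the JSON-RPC call a timeout makes a hung daemon fail fast instead of holding the task open.

import requests

from myapp.models import Measurement  # hypothetical model


def refresh_value():
    resp = requests.post(
        "http://127.0.0.1:8332/",  # placeholder daemon address
        json={"jsonrpc": "2.0", "method": "getvalue", "params": [], "id": 1},
        timeout=10,  # fail fast instead of letting the task hang indefinitely
    )
    resp.raise_for_status()
    value = resp.json()["result"]
    # update the existing entry or create a new one, keeping the DB write short
    Measurement.objects.update_or_create(name="value", defaults={"current": value})

Separating the two failure modes this way at least tells you whether the hung task is the cause of the site timeout or just another symptom of a shared problem.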

Why are uWSGI workers idle while nginx shows a lot of timeouts?

Stack: nginx, uwsgi, django
uwsgitop and top both showed that the uWSGI workers were idle, while the nginx error log said the upstream timed out.
I thought some requests needed a lot of resources, such as waiting for the DB or cache, while others did not. After checking the timed-out requests, most of them were not expensive; every kind of request had timed out.
So why didn't nginx send the requests to the idle workers if the others were really busy? Why does the uWSGI master keep some workers busy and the others idle?
I'd like to answer my own question.
Change the kernel parameter net.ipv4.ip_conntrack_max from 65560 to 6556000.
I have a full story on how we found the answer:
Users said the site was slow, slow, slow.
nginx was flooded with "upstream connection timed out".
I checked the uWSGI log, found some errors and fixed them; found more, fixed more, and this loop lasted for days. By yesterday I thought the problem had nothing to do with uWSGI, memcached, the DB, Redis, or anything in the backend, because uWSGI was idle.
So I thought nginx must have had something wrong: reload, restart, check connections, workers, proxy_read_timeout, etc. No luck.
Checked ulimit -n, which reported 1024, the default. I have 8 nginx workers, so connections should be able to reach 1024 * 8; I thought that should be fine, as nginx never said "too many open files". Anyway, I changed it to 4096. No luck.
Checked the number of connections and their state, and then the problem appeared: the upstream connections were almost all in the SYN_SENT state, and then the timeouts happened. Only 2 or 3 of 300 connections were in the ESTABLISHED state. We wanted to know why. One of my friends told me to use tcpdump, the magic tool I had never dared to try.
Then we went to syslog, found the relevant kernel error, and finally resolved the problem.
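For reference, applying and persisting that kind of change usually looks something like the commands below (the exact parameter name varies by kernel; newer kernels expose it as net.netfilter.nf_conntrack_max):
sysctl -w net.ipv4.ip_conntrack_max=6556000
echo "net.ipv4.ip_conntrack_max = 6556000" >> /etc/sysctl.conf
sysctl -p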
I had a similar issue where my listen queue was growing and the RPS was low despite all the workers idling.
samuel found one of the causes, but there are a few more potential causes for this behavior (a rough sketch of where these settings live follows below):
net.core.somaxconn not high enough
uwsgi --listen not high enough
nginx worker_processes too low
nginx worker_connections too low
If none of these work, then you'll want to check your logs to confirm that inbound requests to uWSGI arrive over HTTP/1.1 and not HTTP/1.0, and then use --http11-socket.
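As a rough illustration of where those knobs live (values are arbitrary examples, not recommendations): on the kernel side, sysctl -w net.core.somaxconn=4096; on the uWSGI side, the backlog and keep-alive options look something like this in the ini file:
master = true
processes = 4
listen = 4096
http11-socket = 0.0.0.0:8080
On the nginx side, worker_processes and worker_connections are raised in nginx.conf. Note that the listen backlog cannot usefully exceed net.core.somaxconn, so the two are normally raised together.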
Here are some of my findings when I wrestled with this issue: https://wontonst.blogspot.com/2019/06/squishing-performance-bug-in.html
The nginx tuning page also has some other configurations that may or may not be useful in solving this issue:
https://www.nginx.com/blog/tuning-nginx/