Why is my Flask backend unstable on Heroku? [duplicate]

This question already has answers here:
Are global variables thread-safe in Flask? How do I share data between requests?
(4 answers)
Closed 2 years ago.
I created a small backend API for a game. When a user creates a game (a request is made to the API), Python creates a new instance of this game (to be more precise, I add the game to a dict). The user gets the game id in the response and can then play (the frontend calls several routes to update the state of this game).
It works perfectly locally; on Heroku, however, it is very unstable: I use polling, and approximately 50% of the requests fail because the game id cannot be found.
I can't figure out why the backend sometimes finds the game and sometimes not.
Does anybody have an idea of what went wrong?
Thank you very much.

This sounds like it could be due to the way you've implemented in-memory storage. If it's not thread-safe, the app might work fine in development, but when deployed with a WSGI server like gunicorn running several worker processes/threads, each with its own memory, it can lead to strange behaviour like you describe.
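As a minimal sketch of the pattern that usually breaks (the routes, dict and field names here are illustrative assumptions, not taken from your code), a module-level dict in a Flask app lives separately in every worker process:

import uuid
from flask import Flask, jsonify

app = Flask(__name__)

# Module-level storage: every gunicorn worker process gets its OWN copy of this dict.
games = {}

@app.route("/games", methods=["POST"])
def create_game():
    game_id = str(uuid.uuid4())
    games[game_id] = {"state": "new"}  # kept only in the memory of the worker that handled this request
    return jsonify({"id": game_id})

@app.route("/games/<game_id>")
def get_game(game_id):
    # If a different worker receives this request, the id simply isn't in its copy of the dict.
    game = games.get(game_id)
    if game is None:
        return jsonify({"error": "game not found"}), 404
    return jsonify(game)

With two workers, roughly half of the polling requests land on the worker that never saw the game being created, which would match the ~50% failure rate you're seeing.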
What's more, Heroku is quirky.
Here's the output of gunicorn --help when installed on an ordinary system through pip; it defaults to 1 worker if the -w flag is not provided:
-w INT, --workers INT
The number of worker processes for handling requests. [1]
However when executed via the Heroku console, notice that it defaults to 2:
-w INT, --workers INT
The number of worker processes for handling requests. [2]
Heroku appear to have customised their gunicorn build for some reason (edit: figured out how), so the following Procfile launches with 2 workers:
web: gunicorn some:app
Whereas on a non-Heroku system this would launch with a single worker.
You'll probably find the following Procfile will solve your issue:
web: gunicorn --workers 1 some:app
This is, of course, only suitable if it's a small project which doesn't need to scale to several workers. To fix the issue properly and still scale the application, you may need to make code changes so the app keeps shared state in a separate storage backend (e.g. Redis), as sketched below.
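As a rough sketch of what that could look like (the key names, TTL and REDIS_URL variable are assumptions, not from the question), the game state gets serialised into Redis so that every worker sees the same data:

import json
import os
import uuid

import redis
from flask import Flask, jsonify

app = Flask(__name__)
# On Heroku, a Redis add-on typically exposes its connection string in an environment variable.
store = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))

@app.route("/games", methods=["POST"])
def create_game():
    game_id = str(uuid.uuid4())
    # Stored in Redis, so the game is visible to every worker process.
    store.set(f"game:{game_id}", json.dumps({"state": "new"}), ex=3600)
    return jsonify({"id": game_id})

@app.route("/games/<game_id>")
def get_game(game_id):
    raw = store.get(f"game:{game_id}")
    if raw is None:
        return jsonify({"error": "game not found"}), 404
    return jsonify(json.loads(raw))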

Related

Gunicorn + Gevent: Debugging workers' stuck state / WORKER TIMEOUT cause

I'm running a very simple web server using Django on Gunicorn with Gevent workers, which communicate with MySQL for simple CRUD-type operations. All of this is behind nginx and hosted on AWS. I'm running my app server using the following config:
gunicorn --logger-class=simple --timeout 30 -b :3000 -w 5 -k gevent my_app.wsgi:application
However, sometimes the workers just get stuck (sometimes when the number of requests increases, sometimes even without that) and the TPS drops, with nginx returning HTTP error code 499. Sometimes workers start getting killed (WORKER TIMEOUT) and the requests are dropped.
I'm unable to find a way to debug where the workers are getting stuck. I've checked the slow logs of MySQL, and that is not the problem here.
In Java, I could take a jstack dump to see the thread states, or use a mechanism like Takipi, which captures the state of the threads at the moment an exception occurs.
To anyone out there who can help: I'm looking for a way to see the internal state of a hosted Python web server, i.e.
the workers' state at a given point
the threads' state at a given point
which requests a particular gevent worker has started processing, and when it gets stuck/killed, where exactly it is stuck
which requests got terminated because a worker was killed
etc.
I've been looking for this and have found many people facing similar issues, but their solutions seem to be trial and error, and nowhere are the steps laid out for how to dig deeper into this.
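This isn't from the question, but one low-effort way to get part of this in CPython is the standard-library faulthandler module, which can dump every OS thread's Python traceback when it receives a signal (note that with gevent workers this only shows whichever greenlet happens to be running on each thread, not every greenlet):

# Sketch: register this near the WSGI entry point so it runs inside each worker.
# Then `kill -USR2 <worker pid>` prints all thread tracebacks to stderr, which
# gunicorn forwards to its error log. Gunicorn workers already use some signals
# themselves (e.g. USR1 for log re-opening), so pick one they leave alone.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR2, all_threads=True)

External tools such as py-spy can take a similar stack dump from outside the process without any code changes.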

uWSGI + nginx for django app avoids pylibmc multi-thread concurrency issue?

Introduction
I encountered this very interesting issue this week; better to start with some facts:
pylibmc is not thread-safe: when used as the django memcached backend, starting multiple django instances directly in the shell would crash under concurrent requests.
if deployed with nginx + uWSGI, this problem with pylibmc magically disappears.
if you switch the django cache backend to python-memcached, that also solves the problem, but this question isn't about that.
Elaboration
Starting with the first fact, this is how I reproduced the pylibmc issue:
The failure of pylibmc
I have a django app which does a lot of memcached reading and writing, and there's this deployment strategy: I start multiple django processes in the shell, binding to different ports (8001, 8002), and use nginx to do the load balancing.
I initiated two separate load tests against these two django instances, using locust, and this is what happened:
They both crashed and reported exactly the same issue, something like this:
Assertion "ptr->query_id == query_id +1" failed for function "memcached_get_by_key" likely for "Programmer error, the query_id was not incremented.", at libmemcached/get.cc:107
uWSGI to the rescue
So from the above case we learned that multi-threaded concurrent requests towards memcached via pylibmc can cause issues, yet this somehow doesn't bother uWSGI with multiple worker processes.
To prove that, I started uWSGI with the following settings included:
master = true
processes = 2
This tells uWSGI to start two worker processes. I then tell nginx to serve any django static files and route non-static requests to uWSGI, to see what happens. With the server started, I launch the same locust test against django on localhost, making sure there are enough requests per second to cause concurrent requests against memcached; here's the result:
In the uWSGI console there's no sign of dead worker processes, and no worker has been re-spawned, but looking at the upper part of the screenshot there certainly have been concurrent requests (5.6 req/s).
The question
I'm extremely curious about how uWSGI makes this go away, and I couldn't learn that from their documentation. To recap, the question is:
How does uWSGI manage worker processes so that multi-threaded memcached requests don't cause django to crash?
In fact, I'm not even sure whether it's the way uWSGI manages worker processes that avoids this issue, or some other magic that comes with uWSGI that's doing the trick. I've seen something called a memcached router in their documentation that I didn't quite understand; is that related?
Isn't it because you actually have two separate processes managed by uWSGI? Since you are setting the processes option (instead of the workers option), you should actually have multiple uWSGI processes (I'm assuming a master + two workers because of the config you used). Each of those processes has its own copy of pylibmc loaded, so there is no state shared between threads (you haven't configured threads on uWSGI, after all).
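A toy illustration of that point (none of this is from the question; the names are made up): module-level state such as an import-time cache client ends up as a private copy inside each worker process rather than something shared.

import os

from django.http import HttpResponse

class FakeCacheClient:
    """Stand-in for a pylibmc client held at module level by a cache backend."""
    def __init__(self):
        self.hits = 0

client = FakeCacheClient()

def debug_view(request):
    # Each uWSGI worker process has its own copy of `client` in its own address
    # space (whether it was created after the fork or duplicated by it), so with
    # processes = 2 and no threads configured, no two threads ever touch the same client.
    client.hits += 1
    return HttpResponse(f"pid={os.getpid()} has served {client.hits} hits with its private client")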

Django and Celery - How to Distribute?

I'm trying to distribute Django and Celery.
I've created a small project with Django and Celery. Django will ask a Celery worker to work on some data in the database. Then the data is passed back to Django.
My idea is that:
Django stack installed on one server
Message queue (RabbitMQ) on one server
Celery worker on one server
Hence, 3 servers in total.
However, the problem is that Celery has to use some code from Django, for example the models, because it accesses them. Hence, it would also require the settings.py file to know what the servers are.
Does this mean that for #3, I would need to install both Django and Celery on the server, but disable Django and only run Celery? For example celery -A PROJECT_NAME worker -l INFO, but without an Apache server for Django?
If you want your celery workers to operate on a different server, you need to make sure that all the resources required by the worker are accessible from that server.
For example, if you have a simple task, you can copy only the code required for that task to the server. If your worker needs any other resources, like other code, files, or a database, you need to make sure it has access to them.
Really, if you want to have two servers working on the same tasks, you will have to use a simple web interface (such as Flask) to communicate between the servers (and extend the functionality of your queue). Then, you will have to ensure they are both using the same data source.
Consider hosting your database remotely, or have the remote server access the database remotely. Either way, any workers running on a server will need access to the database and all source code necessary to complete the task. Then, you must simply have the two servers share a messaging queue.
Source: how to configure and run celery worker on remote system
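To make that concrete, here is a rough sketch (the project name, hosts and settings values are placeholders) of the Celery app module that both the Django server and the worker-only server would share, pointing at the RabbitMQ box:

# PROJECT_NAME/celery.py -- imported by Django on server 1 and by the worker on server 3
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "PROJECT_NAME.settings")

app = Celery("PROJECT_NAME")
# The broker lives on server 2; both machines point at the same URL,
# e.g. CELERY_BROKER_URL = "amqp://user:password@rabbitmq-host:5672//" in settings.py.
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()  # finds tasks.py modules in the installed Django apps

The worker box then just runs the command already quoted in the question (celery -A PROJECT_NAME worker -l INFO), with its settings.py pointing at the same broker and the shared database; no Apache or other web server is needed there.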

How to improve the number of requests Django on Heroku can handle at a time?

In my project, I use Django and deploy it on Heroku. On Heroku, I use the uWSGI server (in asynchronous mode); the database is MySQL (on AWS RDS). I use 7 dynos to scale the django app.
When I run a stress test at 600 requests/second with a 30-second timeout, my server returns timeouts for more than 50% of the requests.
Any ideas to help me improve my server's performance?
If your async setup is right (and this is the hardest part), then your only solution is adding more dynos. If you are not sure about django + async (or if you have not done any particular customization to make them work together), you probably have a screwed-up setup (no concurrency at all).
Take into account that uWSGI async mode can mean dozens of different setups (gevent, ugreen, callbacks, greenlets...), so some detail on your configuration would help.

concurrent requests on dotcloud with django

I have a django app I want to migrate to dotcloud.
Many actions in Django internals and in my app are not asynchronous, i.e. they block the thread until they finish.
When I was using Apache, that didn't pose a problem, since a different thread is opened for every request. But that doesn't seem to be the case with the nginx/uWSGI stack that dotCloud uses.
Seemingly, uWSGI has --enable-threads and --threads options that can be used for multithreading, but:
It is not clear which version of uWSGI dotCloud uses, and whether it supports these features
Since I have found no one else asking about this, I was wondering whether this is really the right way to get concurrent requests running (using threads)
You could run Django with Gunicorn. Gunicorn, in turn, supports multiple worker classes, and people have reported success running gunicorn+gevent+django together [1][2].
To use that on dotCloud, you will probably have to use dotCloud's custom service. If that's something you want to try, I would personally start with dotCloud's reimplementation of the python service using the custom service, and replace uWSGI with gunicorn in it.
I came here looking for some leads, which I found, thanks!
There was a fair amount of leg work left to actually get stuff working, though.
Here is an example app on github that uses gunicorn, gevent, and socketio on dotcloud:
https://github.com/t1m0thy/django-tictactoe/tree/dotcloud
Threads are a problem in Python: the GIL doesn't allow them to execute Python code in parallel.
So multiprocessing is an answer.
Or you may take a look at gevent. Gevent is admittedly a kind of hack (monkey-patching of the Python standard library), but it allows you to launch green threads.
I'm not sure if gevent can be combined with django, but google knows ;)
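For what it's worth, the usual pattern (sketched here with a placeholder project name) is to monkey-patch at the very top of wsgi.py, before Django and its database/socket libraries are imported, and then serve the app with a gevent-capable server such as gunicorn's gevent worker mentioned above:

# myproject/wsgi.py -- "myproject" is a placeholder
from gevent import monkey
monkey.patch_all()  # must run before Django (and the socket/db libraries it uses) are imported

import os

from django.core.wsgi import get_wsgi_application

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
application = get_wsgi_application()

This would then be launched with something like gunicorn -k gevent myproject.wsgi:application.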