I have a django application running behind varnish and nginx.
There is a periodic task running every two minutes, accessing a locally running jsonrpc daemon and updating a django model with the result.
Sometimes the django app stops responding and nginx returns a gateway failed message. Looking through the logs, it seems that whenever this happens the backend task accessing the jsonrpc daemon is also timing out.
The task itself is pretty simple: A value is requested from jsonrpc daemon and saved in a django model, either updating an existing entry or creating a new one. I don't think that any database deadlock is involved here.
I am a bit lost as to how to track this down. To start, I don't know whether the timeout of the task is causing the overall site timeout or whether some other problem is causing both timeouts. After all, a timeout in the asynchronous task should not have any influence on the website's response, should it?
Related
I'm running a very simple web server using Django on Gunicorn with Gevent workers, which communicate with MySQL for simple CRUD-type operations. All of this is behind nginx and hosted on AWS. I'm running my app server with the following config:
gunicorn --logger-class=simple --timeout 30 -b :3000 -w 5 -k gevent my_app.wsgi:application
However, the workers sometimes just get stuck (sometimes when the number of requests increases, sometimes even without that), and the TPS drops, with nginx returning the 499 HTTP error code. Sometimes workers also get killed (WORKER TIMEOUT) and their requests are dropped.
I'm unable to find a way to debug where the workers are getting stuck. I've checked the slow logs of MySQL, and that is not the problem here.
In Java, I can take a jstack dump to see the state of the threads, or use other tools such as Takipi, which captures the state of the threads at the moment an exception occurs.
To everyone out there who can help: how can I see the internal state of a hosted Python web server, i.e.
workers state at a given point
threads state at a given point
which requests a particular gevent worker has started processing and, when it gets stuck or killed, where exactly it is stuck
which requests were terminated because a worker was killed
etc
I've been looking into this and have found many people facing similar issues, but their solutions seem to be trial and error, and nowhere are the steps for digging deeper into this described.
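As a rough analogue of jstack, the Python standard library can report the current stack of every OS thread in a process. The following is a minimal sketch, not a complete answer to the list above: the helper names `format_thread_stacks` and `install_dump_on_signal` are made up for illustration, and the signal to use is left as a parameter because it must not clash with signals the server already handles. Note that gevent greenlets are not OS threads, so a dump like this only shows the greenlet that is running in each thread at that moment.

```python
import signal
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a text dump of the current stack of every OS thread in this process."""
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    # sys._current_frames() maps thread id -> topmost frame of that thread.
    for thread_id, frame in sys._current_frames().items():
        chunks.append("Thread %s (%s)" % (thread_id, names.get(thread_id, "unknown")))
        chunks.append("".join(traceback.format_stack(frame)))
    return "\n".join(chunks)


def install_dump_on_signal(signum):
    """Write the stack dump to stderr whenever the process receives signum.

    Must be called from the main thread of the process to be inspected.
    """
    def handler(_signum, _frame):
        sys.stderr.write(format_thread_stacks())

    signal.signal(signum, handler)


if __name__ == "__main__":
    print(format_thread_stacks())
```

Assuming the hook is installed inside each worker process, sending the chosen signal to a worker's PID would then print where that worker currently is. The standard library's `faulthandler.register(signum, all_threads=True)` offers a similar dump on Unix without custom code.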
Introduction
I encountered a very interesting issue this week. Let me start with some facts:
pylibmc is not thread-safe: when it is used as the Django memcached backend, Django instances started directly from the shell crash when hit with concurrent requests.
If I deploy with nginx + uWSGI, this problem with pylibmc magically disappears.
If you switch the Django cache backend to python-memcached, that also solves the problem, but this question isn't about that.
Elaboration
Starting with the first fact, this is how I reproduced the pylibmc issue:
The failure of pylibmc
I have a Django app that does a lot of memcached reading and writing. The deployment strategy is to start multiple Django processes from the shell, bind them to different ports (8001, 8002), and use nginx to balance between them.
I ran two separate load tests against these two Django instances using locust, and this is what happened:
In the above screenshot, both instances crashed and reported exactly the same issue, something like this:
Assertion "ptr->query_id == query_id +1" failed for function "memcached_get_by_key" likely for "Programmer error, the query_id was not incremented.", at libmemcached/get.cc:107
uWSGI to the rescue
So from the above case we learned that multi-threaded concurrent requests to memcached via pylibmc can cause problems. Somehow this doesn't bother uWSGI running multiple worker processes.
To prove that, I started uWSGI with the following settings included:
master = true
processes = 2
This tells uWSGI to start two worker processes. I then told nginx to serve any Django static files and route non-static requests to uWSGI, to see what would happen. With the server started, I launched the same locust test against Django on localhost, making sure there were enough requests per second to cause concurrent requests against memcached. Here's the result:
In the uWSGI console there is no sign of dead worker processes, and no worker has been respawned, but looking at the upper part of the screenshot, there certainly have been concurrent requests (5.6 req/s).
The question
I'm extremely curious about how uWSGI makes this go away, and I couldn't find the answer in its documentation. To recap, the question is:
How does uWSGI manage worker processes so that multi-threaded memcached requests don't cause Django to crash?
In fact, I'm not even sure whether it's the way uWSGI manages worker processes that avoids this issue, or some other magic that comes with uWSGI. I've seen something called a memcached router in the documentation that I didn't quite understand. Is that related?
Isn't it because you actually have two separate processes managed by uWSGI? Since you set the processes option and did not configure threads, you should have multiple uWSGI processes (I'm assuming a master plus two workers, given the config you used). Each of those processes has its own loaded copy of pylibmc, so no state is shared between threads (you haven't configured threads on uWSGI, after all).
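The same "no shared client" property can be illustrated within a single process by giving each thread its own client object. This is only a sketch of the isolation idea, not what uWSGI does internally: `FakeClient` is a hypothetical stand-in for a client that is not thread-safe, and `get_client` and `collect_clients` are names invented for this example.

```python
import threading


class FakeClient:
    """Hypothetical stand-in for a client that must not be shared between threads."""

    def __init__(self):
        self.owner = threading.get_ident()


_local = threading.local()


def get_client():
    """Return a client private to the calling thread, creating it on first use."""
    client = getattr(_local, "client", None)
    if client is None:
        client = FakeClient()
        _local.client = client
    return client


def collect_clients(n_threads=4):
    """Run n_threads threads and return the client object each one received."""
    seen = []
    lock = threading.Lock()

    def work():
        client = get_client()
        with lock:
            seen.append(client)

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return seen


if __name__ == "__main__":
    clients = collect_clients()
    print(len({id(c) for c in clients}), "distinct clients for", len(clients), "threads")
```

Separate worker processes get this isolation for free, because each process loads its own copy of the module and never sees another process's objects.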
I'm running apache with django and mod_wsgi enabled in 2 different processes.
I read that the second process is an on-change listener for reloading code when it changes, but for some reason the ready() function of my AppConfig class is being executed twice. This function should only run once.
I understand that running django runserver with the --noreload flag resolves the problem in development mode, but I cannot find a solution for production mode on my apache webserver.
I have two questions:
How can I run with only one process in production, or at least make only one process run the ready() function?
Is there a way to make the ready() function run eagerly rather than lazily? By this I mean that it should execute only once, on server startup, not on the first request.
For further explanation, I am experiencing a scenario as follows:
The ready() function creates a folder listener such as pyinotify. That listener will listen on a folder on my server and enqueue a task on any changes.
I am seeing this listener executed twice on any changes to a single file in the monitored directory. This leads me to believe that both processes are running my listener.
No, the second process is not an onchange listener - I don't know where you read that. That happens with the dev server, not with mod_wsgi.
You should not try to prevent Apache from serving multiple processes. If you do, the speed of your site will be massively reduced: it will only be able to serve a single request at a time, with others queued until the first finishes. That's no good for anything other than a toy site.
Instead, you should fix your AppConfig. Rather than blindly spawning a listener, you should check to see if it has already been created before starting a new one.
You shouldn't prevent multiple processes from being spawned, because it's a good thing, especially in a production environment. You should consider using an external tool, separate from django, or add a check for whether the folder listener is already running (for example, by checking for the existence of a PID file and its contents).
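The "check before starting" idea from both answers could be sketched with an exclusively created PID file, so that only the first process to get there starts the listener. This is a minimal sketch under stated assumptions: the function name `try_claim_listener` and the file location are hypothetical, and stale files left behind by a crashed process are not handled.

```python
import os
import tempfile


def try_claim_listener(pid_path):
    """Return True if this process created the PID file and should start the listener.

    O_CREAT | O_EXCL makes creation atomic: only one process can succeed.
    """
    try:
        fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another process already claimed the listener.
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    return True


if __name__ == "__main__":
    # Hypothetical location; a real deployment would use a fixed, writable path.
    path = os.path.join(tempfile.mkdtemp(), "listener.pid")
    print(try_claim_listener(path))  # first caller
    print(try_claim_listener(path))  # any later caller
```

In ready(), the listener would then be started only when the call returns True. A production version would also need to remove the file on shutdown, or verify that the PID stored in it still belongs to a live process.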
I'm starting to get familiar with the RabbitMQ lingo, so I'll try my best to explain. I'll be going into a public beta test in a few weeks, and this is the setup I am hoping to achieve. I would like Django to be the producer, sending messages to a remote RabbitMQ box, with another Celery box listening on the RabbitMQ queue for tasks. So in total there would be three boxes: Django, RabbitMQ and Celery. So far, following the Celery docs, I have successfully been able to run Django and Celery together on one machine and RabbitMQ on another. Django simply calls the task in the view:
add.delay(3, 3)
And the message is sent over to RabbitMQ. RabbitMQ sends it back to the same machine that the task was sent from (since Django and celery share the same box) and celery processes the task.
This is great for development purposes. However, having Django and Celery running on the same box isn't a great idea since both will have to compete for memory and CPU. The whole goal here is to get clients in and out of the HTTP Request cycle and have celery workers process the tasks. But the machine will slow down considerably if it is accepting HTTP requests and also processing tasks.
So I was wondering if there is a way to separate all of these from one another: have Django send the tasks, RabbitMQ forward them, and Celery process them (Producer, Broker, Consumer).
How can I go about doing this? Really simple examples would help!
What you need to do is deploy your application's code on the third machine as well, and run there only the command that handles the tasks (the Celery worker). The worker needs the code on its own machine in order to execute the tasks.
I have a Django server running on Apache via mod_wsgi. I have a massive background task, triggered via an API call, that searches emails (it generally takes a few hours).
To facilitate debugging, since exceptions and everything else happen in the background, I created an API call that runs the task in blocking mode. So the browser actually blocks for those hours and then receives the results.
In localhost this is fine. However, in the real Apache environment, after about 30 minutes I get a 504 Gateway Timeout error.
How do I change the settings so that Apache allows - just in this debug phase - for the HTTP request to block for a few hours without returning a 504 Gateway Timeout?
I'm assuming this can be changed in the Apache configuration.
You should not be doing long running tasks within Apache processes, nor even waiting for them. Use a background task queueing system such as Celery to run them. Have any web request return as soon as it is queued and implement some sort of polling mechanism as necessary to see if the job is complete and results can be obtained.
Also, are you sure the 504 isn't coming from some front-end proxy (explicit or transparent) or load balancer? Apache has no default timeout of 30 minutes.